Genetic association studies provide near-unbiased screens for associations of common and rare variants with complex traits. Genome-wide association (GWA) studies highlight distinct loci, and thereby reduced, yet sizable, sets of genes among which to search for likely causal candidates (1). Complex trait-based exome chip analyses (2) and exome sequencing studies (3) highlight coding mutations within specific genes, but generally lack statistical power to establish significant associations. Therefore, association studies and rare variant analyses typically rely on downstream bioinformatics analysis to further reduce their shortlisted candidate genes to numbers that allow in-depth experimental follow-up studies.
Genetic alterations may trigger a downstream cascade of changes in cellular states (4). Consequently, analyses of genetic variation data have been augmented by integration with complementary data sets, including differential or tissue-specific gene expression data (5), protein–protein interaction data (6) and existing literature-based knowledge (7). Although there are highly specialized tools that facilitate gene prioritization in chromosomal regions [e.g. Endeavour (8) or Prioritizer (9)] or GWA loci [e.g. GRAIL (10) or DAPPLE (11)], only a limited number of tools allow researchers to combine their in-house portfolio of genomics data sets with relevant publicly available data sets [see (12) for an in-depth review of existing gene prioritization methods]. One of these approaches is MetaRanker 1.0 (13), our previously published approach, which augments genetic analyses by prioritizing the genome in relation to a specific phenotype of interest through integration of heterogeneous and complementary data sources. MetaRanker facilitates integration of the following data types:
- Single nucleotide polymorphism (SNP) to phenotype associations from GWA studies, which represent a rapidly growing resource of unbiased common variant associations.
- High-confidence protein–protein interaction networks centred on proteins encoded by user-defined phenotype-related susceptibility genes, which may contribute non-obvious pathway-based information.
- Data from linkage studies capturing co-segregation of chromosomal regions and disease-specific phenotypes, thereby highlighting chromosomal intervals likely to harbour causal genes.
- Quantitative data on disease similarities, which may add information that exploits overlaps in disease definitions.
- Tissue-specific or differential gene expression data from microarray or sequencing-based studies.
These data sources are treated as evidence layers that can be used in any combination and are collapsed into an integrative meta-layer. We validated MetaRanker 1.0 by discovering a novel bipolar disorder susceptibility locus (rs1049583, near YWHAH), which we replicated through genotyping in independent cohorts. Another tool that allows prioritization of disease genes by integrating various data types is CANDID (14). We benchmarked MetaRanker successfully against this method.
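The general idea of collapsing freely selectable evidence layers into one meta-layer can be illustrated with a minimal sketch. It is not MetaRanker's published scoring scheme: the rank-sum combination, the function names, the layer names and the gene identifiers below are all assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Layer = Dict[str, float]  # gene ID -> raw evidence score (higher = stronger)


def rank_scores(layer: Layer, genes: List[str]) -> Dict[str, float]:
    """Convert one layer's raw scores to rank-based scores in (0, 1].

    Genes without a score in this layer receive 0. Tied raw scores are
    ordered by their position in `genes` (a simplification).
    """
    scored = sorted((g for g in genes if g in layer), key=lambda g: layer[g])
    n = len(scored)
    ranks = {g: (i + 1) / n for i, g in enumerate(scored)}
    return {g: ranks.get(g, 0.0) for g in genes}


def meta_layer(
    layers: Dict[str, Layer], selected: List[str]
) -> Tuple[List[str], Dict[str, float]]:
    """Combine any subset of layers by summing their rank-based scores.

    ASSUMPTION: an unweighted rank sum stands in for the real integration.
    """
    genes = sorted({g for name in selected for g in layers[name]})
    normalized = [rank_scores(layers[name], genes) for name in selected]
    combined = {g: sum(layer[g] for layer in normalized) for g in genes}
    # Best gene first; alphabetical order breaks ties deterministically.
    ranking = sorted(genes, key=lambda g: (-combined[g], g))
    return ranking, combined


# Hypothetical toy layers; the scores are invented, not real data.
layers = {
    "gwa": {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.5},
    "ppi": {"GENE_A": 0.2, "GENE_B": 0.8},
    "expression": {"GENE_C": 3.0, "GENE_A": 1.0},
}

ranking, combined = meta_layer(layers, ["gwa", "ppi"])
for gene in ranking:
    print(gene, round(combined[gene], 3))
```

Converting each layer to ranks before summing keeps layers with very different score scales (P-values, network scores, expression fold changes) comparable, which is one common reason to use rank-based integration in a sketch like this.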
In this article, we describe MetaRanker 2.0, which extends our original approach in several significant ways:
- Integration of new user-specified data sets, such as data from next-generation sequencing studies, or additional gene expression experiments. (User input: Gene IDs and gene-based scores).
- Integration of copy-number variation data. (User input: Chromosomal regions).
- Improved gene ranking based on large-scale text-mining. (User input: Key words).
- Improved GWA data-based scoring of genes.
- Improved usability of the web server.