Despite the success of genome-wide association (GWA) studies in identifying common single nucleotide variants (SNVs) that contribute to complex diseases1, the vast majority of genetic variants contributing to disease susceptibility are yet to be discovered. In fact, it has been argued that these variants are not likely to be captured in current GWA study paradigms that focus on common SNVs.2 It is now widely believed that many genetic and epigenetic factors are likely to contribute to common complex diseases, including multiple rare SNVs (defined by convention as those that have frequencies <1%), copy number variations (CNVs), and other forms of structural variation.3–12 Irrespective of how one might define ‘rare variant’ (which, although we have adopted the convention of <1% frequency, might range from <0.1% to <0.01% depending on the context13), it is essential to recognize that such variants likely contribute to phenotypic expression in conjunction with, or over and above, common variants. This consideration has important implications when designing a study or choosing a statistical method for analyzing associations involving rare variants.
There are many reasons to believe that multiple rare variants, both within the same gene and across different genes, collectively influence the expression and prevalence of traits and diseases in the population at large. First, it has been argued that population phenomena, such as the recent expansion of the human population, are likely to have resulted in a large number of segregating, functionally-relevant, rare variants that mediate phenotypic variation.14, 15 Second, the discovery of rare independent somatic mutations within and across genes contributing to tumorigenesis may parallel the functional effects of inherited variants contributing to congenital disease.11, 16, 17 Third, the identification of multiple rare variants within the same gene contributing to largely monogenic disorders such as Cystic Fibrosis and BRCA1 and BRCA2-associated breast cancer18, 19 suggests that rare variants might also influence common complex traits and diseases. Fourth, the identification of multiple functional variants within the same gene and the association of these variants with both in vitro and clinical phenotypes indicates that multiple rare variants could influence general clinical phenotypic expression.20 Fifth, and importantly, sequencing studies focusing on specific genes have shown that collections of rare variants can indeed associate with particular phenotypes.
Recent Studies Pursuing Rare Variant Association Analyses
To comprehensively characterize the contribution of rare variants to phenotypic expression, one could either sequence genomic regions of interest using high-throughput DNA sequencing technologies21 or genotype common and rare variants identified in previous sequencing studies using custom genotyping chips. There are a number of ways to approach association studies involving rare variants, which are independent of sequencing or genotyping technology. For example, one could: focus on candidate disease genes22; focus on genomic regions implicated in linkage or genome-wide association studies, under the assumption that phenotypically-relevant rare variants also exist in those regions; consider multiple functional genomic regions, such as exons23; or study entire genomes.12, 24
The sampling framework for such studies is also extremely important, as one could focus on: cases and controls, possibly in DNA pools22 or with oversampling of controls to achieve greater power in studies of rare diseases; individuals phenotyped for a particular quantitative trait; individuals with ‘extreme’ phenotype values in order to increase efficiency25, 26; or families in order to exploit parent-offspring transmission patterns.12, 24
In addition to a sequencing technology and an appropriate sampling and study design, bioinformatic methods for analyzing the potentially massive amounts of sequence data likely to be generated in a study are needed, as are algorithms for accurately identifying rare variants and assigning genotypes to individuals from sequence data.12, 27 Importantly, statistical analysis methods for relating rare variants to phenotypes of interest are also needed. Association analyses involving rare variants are not as straightforward as analyses involving common variants, since the power to detect an association with a single rare variant is low even in very large samples.14, 28, 29 Therefore, researchers have begun to develop data analysis strategies that assess the collective effects of multiple rare variants within and across genomic regions.13, 28, 30 This challenge of statistical analysis is the focus of this Review.
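The loss of power for single rare variants can be illustrated with a rough calculation. The sketch below is not taken from the studies cited above; it assumes a simple two-sided allelic z-test under a normal approximation, a fixed allelic odds ratio, and a genome-wide significance threshold of 5e-8, all of which are illustrative choices. The function names are hypothetical.

```python
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a control frequency and an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def single_variant_power(control_freq, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic z-test for one variant.

    Assumptions (illustrative only): each individual contributes two
    independent alleles, the normal approximation holds, and the
    contribution of the opposite rejection tail is ignored.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Standard error of the difference in allele frequencies (unpooled).
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_alt = abs(p1 - p0) / se
    return NormalDist().cdf(z_alt - z_crit)


if __name__ == "__main__":
    # Same odds ratio and sample size, decreasing allele frequency.
    for freq in (0.2, 0.05, 0.01, 0.001):
        power = single_variant_power(freq, odds_ratio=2.0, n_cases=5000, n_controls=5000)
        print(f"control allele frequency {freq}: approximate power {power:.4f}")
```

Under these assumptions, holding the odds ratio and sample size fixed while lowering the allele frequency drives the approximate power toward zero, which is the motivation for analyzing rare variants collectively.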
Sample size requirements and statistical power for variants of different frequencies
There are many settings in which a collection of rare variants might exhibit an association with a trait. Of the many different methods that could be used for testing associations, not all of them are likely to work well in each of these settings. Here, we consider the rationales behind different data analysis methods, pointing out their limitations and advantages. We also outline areas for further research. As noted, appropriately sophisticated methods for identifying variants, assigning genotypes, and sampling individuals are crucial for rare variant analyses, but we do not discuss them here. There are, however, a few additional issues that researchers need to consider in any association study involving rare variants, as briefly described in Box 1. Finally, although we focus on the analysis of rare SNVs, aspects of the analytical methods discussed can be used with other forms of variation including rare CNVs, although certain caveats apply, which we mention briefly.
Box 1. Issues Impacting the Interpretation of Rare-Variant Association Studies
There are a number of statistical analysis issues that go beyond the choice of an association test statistic in studies of rare variants. These are outlined briefly below.
Sequencing and Genotyping errors
It has been shown that differential genotyping error rates can have a substantial impact on common-variant based GWA studies.89 Given that current sequencing protocols have inherent error rates, more research is needed to understand how false positive variant calls and nucleotide misassignments in sequence-based association studies of rare variants will impact inferences.
Rare variant effects can manifest as compound heterozygosity,90 the ‘unmasking’ of deleterious variants via deletions on a homologous chromosome12, and other haplotype context-dependent phenomena. Thus, leveraging phase information in an association study of rare variants may be crucial, but obtaining phase from sequence data alone is not trivial.24, 91–93
The potential for false positive associations due to population stratification is large in studies involving rare variants, since specific rare variants are more likely to be unique to a particular geoethnic group. Thus, even if the focus of a rare variant study is a particular gene or genomic region, it is important to genotype the individuals in the study on enough additional markers to assess and control for stratification using standard strategies.94, 95
The Use of In Silico Controls
The practice of identifying and quantifying allele frequencies in a group of individuals and comparing them with historical or publicly available ‘control’ sets in studies involving rare variants is highly problematic due to the potential for stratification and sampling variation effects.96 In order to avoid this, either sophisticated genetic background matching strategies or de novo sequencing of a case and control group are recommended, but more work in this area is needed.
Genomic Units of Analysis
Different strategies for testing a genomic region for association involving rare variants exist. For example, one could: test all the variants in a region (depending on its size) for collective frequency differences between, e.g., cases and controls; define particular regions of interest, such as exons or transcription factor binding sites (Box 2); or pursue a ‘moving window’ analysis in which variants in contiguous, possibly overlapping, subregions are tested. Each of these strategies impacts the number and nature of multiple testing problems.
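As an illustration of how a ‘moving window’ strategy determines the number of tests, the following minimal sketch groups variant positions into fixed-width windows. The function name, the half-open window convention, and the choice to skip empty windows are assumptions made for this example rather than features of any particular published method.

```python
from bisect import bisect_left


def moving_windows(positions, width, step):
    """Group sorted variant positions into windows of a fixed width.

    Each window covers the half-open interval [start, start + width).
    A step smaller than the width gives overlapping windows; a step equal
    to the width gives contiguous, non-overlapping windows.
    Returns a list of (start, end, variant_indices) for non-empty windows.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    if not positions:
        return []
    windows = []
    start = positions[0]
    last = positions[-1]
    while start <= last:
        end = start + width
        lo = bisect_left(positions, start)  # first variant at or after start
        hi = bisect_left(positions, end)    # first variant at or after end
        if hi > lo:                         # skip windows containing no variants
            windows.append((start, end, list(range(lo, hi))))
        start += step
    return windows


if __name__ == "__main__":
    positions = [100, 150, 400, 420, 900]  # hypothetical base-pair coordinates
    overlapping = moving_windows(positions, width=200, step=100)
    contiguous = moving_windows(positions, width=200, step=200)
    print(len(overlapping), "overlapping windows;", len(contiguous), "contiguous windows")
```

With overlapping windows the same variants enter several tests, so the resulting test statistics are correlated, which bears on how multiple comparisons should be handled.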
Box 2. In Silico Functional Assessment of Sequence Variations
Identifying groups of variants that reside in genomic regions known or likely to be of functional significance, such as exons, promoters, and enhancers, can be pursued through the use of genome browsers such as the UCSC Genome Browser. One can also assess the more specific functional potential of individual sequence variants given their sequence contexts and incorporate this information into an association analysis (e.g., by weighting them more heavily in test statistics). The table below lists web resources for such assessments. Finally, one could identify variants that participate in common multigene pathways and processes and assess their collective effects on a phenotype.
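One simple way to picture the weighting idea mentioned above is a per-individual weighted sum of rare allele counts. The sketch below is only an illustration: the weights are hypothetical functional scores, the function names are invented for this example, and the comparison of group means is a placeholder for whatever test statistic an analysis actually uses.

```python
def weighted_burden(genotypes, weights):
    """Per-individual weighted sum of rare allele counts.

    genotypes: one row per individual, one column per variant, with
               entries 0, 1 or 2 (number of rare alleles carried).
    weights:   one non-negative weight per variant, e.g. a hypothetical
               functional-impact score (larger means more likely functional).
    """
    for row in genotypes:
        if len(row) != len(weights):
            raise ValueError("each genotype row needs one entry per weight")
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def mean_score_difference(scores, is_case):
    """Mean burden score in cases minus mean burden score in controls."""
    case_scores = [s for s, c in zip(scores, is_case) if c]
    control_scores = [s for s, c in zip(scores, is_case) if not c]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    return sum(case_scores) / len(case_scores) - sum(control_scores) / len(control_scores)


if __name__ == "__main__":
    genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]]  # toy data
    weights = [1.0, 0.5, 2.0]                                  # hypothetical scores
    is_case = [True, True, False, False]
    scores = weighted_burden(genotypes, weights)
    print(scores, mean_score_difference(scores, is_case))
```

Setting all weights to 1 reduces the score to an unweighted count of rare alleles, so the weights only change how much each variant contributes to the collective signal.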
Functional Element Annotation
Beyond the basic annotations presented in the UCSC Genome Browser, numerous prediction methods exist for transcription factor binding sites (TFsearch, Consite100), enhancers (VISTA Enhancer Browser), microRNAs (miRBase103), microRNA binding sites (Targetscan104), intronic splice sites105, exonic splicing enhancers106, 107, silencers108, 109, and regulatory elements110–112. Epigenetic and/or regulatory factors derived from the ENCODE project113, such as histone binding/methylation/acylation, CpG islands, nuclease accessible sites, transcription start sites, and others, are also available through the UCSC Genome Browser.114
Integrative Web-Servers for Variant Annotation
Pathway and Process Assessment
There are numerous resources for pathway information and analysis. Open source databases that include pathway information, but not necessarily analysis of datasets, include Reactome115, BioCarta and the Kyoto Encyclopedia of Genes and Genomes (KEGG)116, as well as a biological process resource, The Gene Ontology (GO) database117. Publicly available pathway analysis tools that link to these databases include, but are not limited to, Cytoscape118, GenMAPP119, and the DAVID Bioinformatics Resource120. Commercially available tools that build off these databases and include proprietary pathway information include Ingenuity Pathway Analysis and GeneGo by MetaCore. For a more complete review of pathway analysis tools, see Suderman and Hallett.121
Functional Impact Prediction Modeling
Functional predictions often leverage various types of information, including but not limited to protein structure information, sequence conservation, motif conservation, etc., in order to build models that generate a probability that a particular variant is functionally important. Some of these methods, and many integrative web servers for this purpose, have been reviewed.122–124
Functional predictions for non-coding variants are generally limited to scoring the deviation of a polymorphism from known regulatory factor motifs. Examples are few but include MaxEntScan for splicing prediction105 and RAVEN for regulatory regions.125
Generality of Annotators
A number of web servers and algorithms attempt to integrate the various functionally-relevant genomic features in order to explicitly weight or prioritize variants investigated in an association study. A subset of the tools attempts to prioritize SNPs based upon scores returned from the various functional impact predictors, while many simply present the functional elements and leave it up to users to draw their own conclusions about ultimate functionality. A few tools, such as SeattleSeq and Sequence Variant Analyzer, integrate various types of biological data in order to annotate novel sequence variants, whereas Trait-O-Matic annotates variants with respect to the overt phenotypic features with which they have been associated.
There is a great deal of precedent for using imputation methods to assign genotypes at common variants to individuals who have not been sequenced or genotyped at a specific locus, based on available neighboring-locus genotype information and linkage disequilibrium patterns.97 Although highly problematic in situations involving de novo or even moderately rare variants (<1%), imputation methods involving rare variants have begun to receive attention and could be extremely useful in future association studies.98
Accommodating Multiple Comparisons
Controlling for false positive findings due to multiple testing is necessary. Pre-specified Bonferroni-like corrections on association p-values are not likely to be appropriate given possible correlations between defined groups of rare variants and/or overlapping windows to be tested. Such correlations will also impact false discovery rate (FDR) procedures for accommodating multiple testing a posteriori. Simulation studies and permutation testing that consider the entire set of tests performed (e.g., all windows and groups of variants across all genomic regions considered) to get a global false positive rate are the most appropriate given their flexibility and sound theoretical bases, but will likely be very computationally intensive.75
More work in this area is also sorely needed.
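A permutation procedure that considers the entire set of tests can be sketched as follows. This is a generic maximum-statistic permutation scheme, not a reproduction of the approach in the cited reference; the per-window statistic (absolute difference in carrier proportions), the function names, and the default number of permutations are illustrative assumptions.

```python
import random


def window_stats(carriers, labels):
    """Absolute case-control difference in carrier proportion for each window.

    carriers: one row per individual, one column per window or variant group;
              1 if the individual carries any rare variant there, else 0.
    labels:   True for cases, False for controls.
    """
    n_case = sum(1 for y in labels if y)
    n_ctrl = len(labels) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    stats = []
    for k in range(len(carriers[0])):
        case_count = sum(row[k] for row, y in zip(carriers, labels) if y)
        ctrl_count = sum(row[k] for row, y in zip(carriers, labels) if not y)
        stats.append(abs(case_count / n_case - ctrl_count / n_ctrl))
    return stats


def max_stat_permutation(carriers, labels, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum statistic.

    Case-control labels are shuffled as a whole, so correlations between
    overlapping windows are preserved within every permutation.
    """
    rng = random.Random(seed)
    observed = window_stats(carriers, labels)
    exceed = [0] * len(observed)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm_max = max(window_stats(carriers, shuffled))
        for k, obs in enumerate(observed):
            if perm_max >= obs:
                exceed[k] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return [(1 + e) / (1 + n_perm) for e in exceed]


if __name__ == "__main__":
    # Toy data: window 0 is carried only by cases, window 1 by nobody.
    carriers = [[1, 0]] * 20 + [[0, 0]] * 20
    labels = [True] * 20 + [False] * 20
    print(max_stat_permutation(carriers, labels, n_perm=200))
```

Because every permutation recomputes all window statistics, the cost grows with the number of permutations times the number of windows, which reflects the computational burden noted above.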