The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technology has enabled studies of rare variants, but will not be sufficient for their success since appropriate analytical methods are also needed. We consider data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts.
Despite the success of genome-wide association (GWA) studies in identifying common single nucleotide variants (SNVs) that contribute to complex diseases1, the vast majority of genetic variants contributing to disease susceptibility are yet to be discovered. In fact, it has been argued that these variants are not likely to be captured in current GWA study paradigms that focus on common SNVs.2 It is now widely believed that many genetic and epigenetic factors are likely to contribute to common complex diseases, including multiple rare SNVs (defined by convention as those that have frequencies < 1%), copy number variations (CNVs), and other forms of structural variation.3–12 Irrespective of how one might define ‘rare variant’ (which, although we have adopted the convention of <1% frequency, might range from <0.1% to <0.01% depending on the context13), it is essential to recognize that such variants likely contribute to phenotypic expression in conjunction with, or over-and-above, common variants. This consideration has important implications when designing a study or choosing a statistical method for analyzing associations involving rare variants.
There are many reasons to believe that multiple rare variants, both within the same gene and across different genes, collectively influence the expression and prevalence of traits and diseases in the population at large. First, it has been argued that population phenomena, such as the recent expansion of the human population, are likely to have resulted in a large number of segregating, functionally-relevant, rare variants that mediate phenotypic variation.14, 15 Second, the discovery of rare independent somatic mutations within and across genes contributing to tumorigenesis may parallel the functional effects of inherited variants contributing to congenital disease.11, 16, 17 Third, the identification of multiple rare variants within the same gene contributing to largely monogenic disorders such as Cystic Fibrosis and BRCA1 and BRCA2-associated breast cancer18, 19 suggests that rare variants might also influence common complex traits and diseases. Fourth, the identification of multiple functional variants within the same gene and the association of these variants with both in vitro and clinical phenotypes indicates that multiple rare variants could influence general clinical phenotypic expression20. Fifth, importantly, sequencing studies focusing on specific genes have shown that collections of rare variants can indeed associate with particular phenotypes (Table 1).
To comprehensively characterize the contribution of rare variants to phenotypic expression, one could either sequence genomic regions of interest using high-throughput DNA sequencing technologies21 or genotype common and rare variants identified in previous sequencing studies using custom genotyping chips. There are a number of ways to approach association studies involving rare variants, which are independent of sequencing or genotyping technology. For example, one could: focus on candidate disease genes 22; focus on genomic regions implicated in linkage or genome-wide association studies, under the assumption that phenotypically-relevant rare variants also exist in those regions; consider multiple functional genomic regions, such as exons 23; or study entire genomes.12, 24 The sampling framework for such studies is also extremely important as one could focus on: cases and controls, possibly in DNA pools22 or with oversampling of controls to achieve greater power in studies of rare diseases; individuals phenotyped for a particular quantitative trait; individuals with ‘extreme’ phenotype values in order to increase efficiency25, 26; or families in order to exploit parent-offspring transmission patterns.12, 24
In addition to a sequencing technology and an appropriate sampling and study design, bioinformatic methods for analyzing the potentially massive amounts of sequence data likely to be generated in a study are needed, as are algorithms for accurately identifying rare variants and assigning genotypes to individuals from sequence data12, 27. Importantly, statistical analysis methods for relating rare variants to phenotypes of interest are needed. Association analyses involving rare variants are not as straightforward as analyses involving common variants since the power to detect an association between a single rare variant and a phenotype is low even in very large samples (Figure 1).14, 28, 29 Therefore, researchers have begun to develop data analysis strategies that assess the collective effects of multiple rare variants within and across genomic regions 13, 28, 30. This challenge of statistical analysis is the focus of this Review.
There are many settings in which a collection of rare variants might exhibit an association with a trait. Of the many different methods that could be used for testing associations, not all of them are likely to work well in each of these settings. Here, we consider the rationales behind different data analysis methods, pointing out their limitations and advantages. We also outline areas for further research. As noted, appropriately sophisticated methods for identifying variants, assigning genotypes, and sampling individuals are crucial for rare variant analyses, but we do not discuss them here. There are, however, a few additional issues that researchers need to consider in any association study involving rare variants, as briefly described in Box 1. Finally, although we focus on the analysis of rare SNVs, aspects of the analytical methods discussed can be used with other forms of variation including rare CNVs, although certain caveats apply, which we mention briefly.
There are a number of statistical analysis issues that go beyond the choice of an association test statistic in studies of rare variants. These are outlined briefly below.
It has been shown that differential genotyping error rate can have substantial impact on common-variant based GWA studies.89 Given that current sequencing protocols have inherent error rates, more research is needed to understand how false positive variant calls and nucleotide misassignments in sequence-based association studies of rare variants will impact inferences.
Rare variant effects can manifest as compound heterozygosity,90 the ‘unmasking’ of deleterious variants via deletions on a homologous chromosome12, and other haplotype context-dependent phenomena. Thus, leveraging phase information in an association study of rare variants may be crucial, but obtaining phase from sequence data alone is not trivial.24, 91–93
The potential for false positive associations due to population stratification is large in studies involving rare variants since specific rare variants are more likely to be unique to a particular geoethnic group. Thus, even if focus in a rare variant study is on a particular gene or genomic region, it is important to genotype the individuals in the study on enough additional markers to assess and control for stratification using standard strategies.94, 95
The practice of identifying and quantifying allele frequencies in a group of individuals and comparing them with historical or publicly available ‘control’ sets in studies involving rare variants is highly problematic due to the potential for stratification and sampling variation effects.96 In order to avoid this, either sophisticated genetic background matching strategies or de novo sequencing of a case and control group are recommended, but more work in this area is needed.
Different strategies for testing a genomic region for association involving rare variants exist. For example, one could test all the variants in a region (depending on its size) for collective frequency differences between, e.g., cases and controls, define particular regions of interest, such as exons or transcription factor binding sites (Box 2), or pursue a ‘moving window’ analysis in which variants in contiguous, possibly overlapping, subregions are tested. Each of these strategies impacts the number and nature of multiple testing problems.
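The ‘moving window’ strategy can be sketched as follows. This is a minimal illustration only: the function name `sliding_windows`, the half-open window convention, and the example coordinates are assumptions made for the sketch and are not taken from any of the cited methods.

```python
def sliding_windows(positions, window_size, step):
    """Group variant positions into contiguous, possibly overlapping windows.

    positions: sorted base-pair coordinates of the variants (hypothetical input).
    Returns a list of (start, end, [indices of variants in the window]);
    windows that contain no variant are skipped.
    """
    if not positions:
        return []
    windows = []
    start = positions[0]
    last = positions[-1]
    while start <= last:
        end = start + window_size  # half-open interval [start, end)
        members = [i for i, p in enumerate(positions) if start <= p < end]
        if members:
            windows.append((start, end, members))
        start += step
    return windows


# Six variants, 500-bp windows moved in 250-bp steps (so windows overlap).
positions = [100, 150, 420, 900, 950, 1300]
windows = sliding_windows(positions, window_size=500, step=250)
for start, end, members in windows:
    print(start, end, members)
```

Each window returned corresponds to one test, and overlapping windows share variants, so the resulting tests are correlated; both facts bear on the multiple testing problem noted above.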
Identifying groups of variants that reside in genomic regions known or likely to be of functional significance, such as exons, promoters, enhancers, etc., can be pursued through the use of genome browsers such as the UCSC genome browser. One can also assess the more specific functional potential of individual sequence variants given their sequence contexts and incorporate this information into an association analysis (e.g., by weighting them more heavily in test statistics). The table below lists web resources for such assessments. Finally, one could identify variants that participate in common multigene pathways and processes and assess their collective effects on a phenotype.
Beyond the basic annotations presented in the UCSC genome browser, numerous prediction methods exist for transcription factor binding sites (TFsearch, Consite:100, TRANSFAC:101), enhancers (VISTA Enhancer Browser:102), microRNAs (miRBase:103), microRNA binding sites (Targetscan:104), intronic splice sites105, exonic splicing enhancers106, 107, silencers108, 109, and regulatory elements110–112 (Table B.1). Epigenetic and/or regulatory factors derived from the ENCODE project113, such as histone binding/methylation/acylation, CpG islands, nuclease accessible sites, transcription start sites, and others are also available through the UCSC Genome Browser114.
There are numerous resources for pathway information and analysis. Open source databases that include pathway information, but not necessarily analysis of datasets, include Reactome 115, BioCarta and the Kyoto Encyclopedia of Genes and Genomes (KEGG) 116, as well as a biological process resource, The Gene Ontology (GO) database117. Publicly available pathway analysis tools that link to these databases include, but are not limited to, Cytoscape 118, GenMAPP 119, and the DAVID Bioinformatics Resource 120. Commercially available tools that build off these databases and include proprietary pathway information include Ingenuity Pathway Analysis and GeneGo by MetaCore. For a more complete review of pathway analysis tools, see Suderman and Hallet.121
Functional predictions often leverage various types of information, including but not limited to protein structure information, sequence conservation, motif conservation, etc., in order to build models that generate a probability that a particular variant is functionally important. Some of these methods, and many integrative web servers for this purpose, have been reviewed.122–124 Functional predictions for non-coding variants are generally limited to scoring the deviation of a polymorphism from known regulatory factor motifs; examples are few but include MaxEntScan for splicing prediction105 and RAVEN for regulatory regions.125
A number of webservers and algorithms attempt to integrate the various functionally-relevant genomic features in order to explicitly weight or prioritize variants investigated in an association study. A subset of the tools attempt to prioritize SNPs based upon scores returned from the various functional impact predictors while many simply present the functional elements and leave it up to the user to draw their own conclusions about ultimate functionality. A few tools, such as SeattleSeq and Sequence Variant Analyzer integrate various types of biological data in order to annotate novel sequence variants, whereas Trait-O-Matic annotates variations with respect to overt phenotypic features that they have been associated with.
There is a great deal of precedent for assigning individuals who have not been sequenced or genotyped at a specific locus common genotypes based on available neighboring locus genotype information and linkage disequilibrium patterns via imputation methods.97 Although highly problematic in situations involving de novo or even moderately rare variants (<1%), imputation methods involving rare variants have begun to receive attention and could be extremely useful in future association studies.98
Controlling for false positive findings due to multiple testing is necessary. Pre-specified Bonferroni-like corrections on association p-values are not likely to be appropriate given possible correlations between defined groups of rare variants and/or overlapping windows to be tested. Such correlations will also impact false discovery rate (FDR) procedures for accommodating multiple testing a posteriori.99 Simulation studies and permutation testing that consider the entire set of tests performed (e.g., all windows and groups of variants across all genomic regions considered) to get a global false positive rate are the most appropriate given their flexibility and sound theoretical bases, but will likely be very computationally intensive.75 More work in this area is also sorely needed.
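One way to make the idea of a permutation test over the entire set of tests concrete is to record, for each permutation of the phenotype labels, the maximum test statistic over all windows or variant groups. The sketch below assumes a deliberately simple statistic (the absolute difference in carrier frequency between cases and controls); the function names and the toy data are hypothetical and serve only to illustrate the mechanics.

```python
import random


def carrier_diff(carriers, labels):
    """Absolute difference in carrier frequency between cases (1) and controls (0)."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    case_carriers = sum(c for c, y in zip(carriers, labels) if y == 1)
    ctrl_carriers = sum(c for c, y in zip(carriers, labels) if y == 0)
    return abs(case_carriers / n_case - ctrl_carriers / n_ctrl)


def max_stat_permutation(window_carriers, labels, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum
    statistic taken over all windows.

    window_carriers: one 0/1 carrier list per window, same individual order.
    """
    rng = random.Random(seed)
    observed = [carrier_diff(w, labels) for w in window_carriers]
    exceed = [0] * len(observed)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        perm_max = max(carrier_diff(w, shuffled) for w in window_carriers)
        for k, obs in enumerate(observed):
            if perm_max >= obs:
                exceed[k] += 1
    # The add-one correction keeps every p-value strictly positive.
    return [(e + 1) / (n_perm + 1) for e in exceed]


# Toy data: 10 cases followed by 10 controls, two windows.
labels = [1] * 10 + [0] * 10
w1 = [1] * 6 + [0] * 4 + [0] * 10            # carriers in 6 cases, 0 controls
w2 = [1, 1] + [0] * 8 + [1, 1] + [0] * 8     # carriers in 2 cases, 2 controls
adj_p = max_stat_permutation([w1, w2], labels, n_perm=500)
print(adj_p)
```

Because the same shuffled labels are applied to every window within a permutation, correlations among overlapping windows are preserved in the null distribution, which is the property that Bonferroni-like corrections lack. The cost is that every test must be recomputed for every permutation.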
As noted, rare variants are likely to influence a trait along with common variants.4, 14 In addition, just as interaction effects involving either genetic or environmental factors must be considered in standard GWA studies9, they are also likely to be important in association studies involving rare variants. With these facts in mind, there are a number of different settings in which rare variants within a defined genomic region could influence a phenotype. Figure 2 provides a few contrasting examples, including situations in which: a common variant is associated with a phenotype; rare variants influence a phenotype independently of one another; rare variants, along with variants with more moderate or common frequencies, act synergistically to influence a phenotype; or only a subset of the rare variants influences the phenotype due to their locations in a functional element within the region of interest.
Of these possible settings, the one receiving the most attention by statistical geneticists is the ‘extreme allelic heterogeneity (EAH)’ setting in which single or small subgroups of individuals with a particular phenotype or disease possess any one, or some subset, of a larger set of rare variants that all independently perturb a single relevant gene in a similar way.12, 31 Although conceptually easier to accommodate in statistical analysis models, there is no reason to believe that the EAH setting is the rule rather than the exception with respect to rare variant influences on phenotypic expression. Statistical analysis models and methods for rare variant association studies should therefore be developed and tested in settings that go beyond the EAH model, such as settings implicating synergistic effects of rare (and common) variants within (and across) genomic regions.
The simplest approach to testing rare variants for association with a trait is to test them individually using standard contingency table and regression methods of the sort implemented in widely used genetic data analysis packages such as PLINK.32 This strategy is highly problematic given, for example, the poor power that such statistical tests have to detect small rare variant frequency differences between diagnostic or phenotypic groups (figures 1A and 1B).14, 28, 29 In order to overcome power issues associated with testing rare variants individually, one could ‘collapse’ sets of rare variants into a single group and test their collective frequency differences between cases and controls.28, 30 In its simplest form this strategy could involve counting individuals possessing a rare variant at any position in the genomic region of interest, calculating the frequencies of these individuals, for example in case and control groups, and then testing the two groups for frequency differences. This strategy forms the basis for most of the statistical models described in this review and variations of it have been considered in many studies involving rare variants (Table 1). To make this collapsing strategy more biologically appealing, elaborate ways of leveraging functional elements and annotations in a genomic region to collapse the variants together can be exploited (see below and Box 2). The effect of collapsing variants and testing their collective frequency differences on power can be substantial, as depicted in Figure 1B.
Regression-based collapsed variant and conditional tests can greatly enhance association studies involving rare variants. Consider Figure 1C which plots the power to detect the effect of a variant on a quantitative trait for 1000 individuals as a function of the fraction of variation of the quantitative trait explained by that variant. If a set of rare variants each individually explain only a small fraction of the variation of the trait, they could be combined into a single predictor variable, perhaps by creating a dummy variable which equals 1 if an individual possesses any of the variants and 0 otherwise.33 This strategy should increase the fraction of variation explained by the variants as a whole and hence increase the power to detect their collective, rather than individual, effects. In addition, if one included other factors in a regression model - such as covariate effects, the effects of previously identified common variants, or other collapsed sets of rare variants - then the power to detect the association involving rare variants could increase substantially (Figure 1C). Not all analysis methods proposed for rare variant studies, however, can accommodate additional factors in their formulations and hence leverage conditioning effects. In addition, not all models can accommodate quantitative trait analysis unless the phenotype is broken into quantiles and stratified analysis is pursued (Table 2).
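A minimal sketch of the collapsed-indicator regression described above, assuming genotypes coded as minor allele counts and a single additional covariate. The names `collapse_indicator` and `ols` are illustrative, the data are synthetic, and the small normal-equation solver is used only to keep the example free of external dependencies; a real analysis would rely on an established regression package.

```python
def collapse_indicator(genotypes):
    """1 if the individual carries a minor allele at any variant in the set, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination; adequate for a handful of predictors."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Six individuals, three rare variants (rows: individuals, columns: variants).
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
covariate = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

collapsed = collapse_indicator(genotypes)
# Synthetic, noise-free trait: intercept 1, collapsed-set effect 2, covariate effect 0.5.
trait = [1.0 + 2.0 * c + 0.5 * x for c, x in zip(collapsed, covariate)]

design = [[1.0, float(c), x] for c, x in zip(collapsed, covariate)]
beta = ols(design, trait)
print(beta)  # intercept, collapsed-set effect, covariate effect
```

No single variant is carried by more than one individual here, yet the collapsed predictor is carried by three of six, which is the sense in which collapsing can raise the fraction of trait variation attributable to the set. Additional columns in `design` play the role of the conditioning factors mentioned above.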
The collapsing strategy makes important assumptions. First, some formulations of collapsed tests assume that each subject is likely to have only a single rare variant. This may be true given the low frequency of the variants, but could in theory be untrue if the variants interact with one another or large genomic regions are tested.20, 33 Second, if one collapses variants by counting individuals possessing rare variants, then if either the frequency of those variants is large enough or if there are many of them, the percentage of individuals possessing any one of them could reach 100%. Therefore, ways of circumscribing the variants to be collapsed, such as leveraging functional information (Box 2), or weighting the variants in some way,34, 35 are important. Alternatively, one could employ statistics that do not rely on simple counting. For example, one could tally the number of variants within a collapsed set possessed by each individual.33
Although there are a number of ways to leverage functional annotations to guide the collapsing of rare variants in association studies, their use will only be as good as the science behind those annotations. It is also possible that different functional ‘levels’ of annotation can be used to define collapsed sets of rare variants. For example, one could define a set of variants as ‘genic’ if they reside in the open reading frame associated with a gene; as ‘exonic’ if they reside in coding regions within that frame; as ‘non-synonymous coding variants’ if they perturb an encoded amino acid; and as ‘non-synonymous coding within an active site of the encoded protein’ if a variant impacts a residue within the active site of the encoded protein. With this in mind, one could perhaps test hierarchies of hypotheses about collections of variants and their biological impact on a phenotype.
It is important to note the distinction between leveraging functional annotations to collapse a set of rare variants based on their location versus predictions that the variants themselves have a functional effect (Box 2).35 In fact, two recent papers23, 36 suggest that leveraging functional annotations and computational methods for predicting the consequences of specific rare variants can be used to great advantage in the identification of disease-predisposing variants, at least for rare monogenic conditions. Functional annotations for rare CNVs and other forms of structural variation can also be leveraged in collapsed or group-wise analyses. However, many of these forms of variation are thought to exert or manifest their effects throughout the genome and not necessarily as a group of variants in a singular region of the genome. Thus, pathway-based (Box 2) and other higher-order approaches to collapsing or summarizing rare CNV effects have been proposed, especially in the context of neuropsychiatric disease.3, 37
There are a number of statistical analysis strategies that can be used to test the hypothesis that specific collections of rare variants are associated with a particular trait or disease. Some of these methods have been developed in contexts beyond human association studies, such as assessing genetic differentiation between human geoethnic groups or pathogen sequences. In addition, some methods are more or less agnostic to variant frequencies. In order to facilitate their descriptions, we have grouped various methods together in three broad and somewhat arbitrary categories: tests based on the use of group summary information on variant frequencies compared between, for example, case and control groups; tests based on the similarity or diversity of unique DNA sequences possessed by different individuals; and regression models that consider collapsed sets of variants and other factors as predictors of a phenotype. We consider each of these three categories separately below, although Table 2 provides brief summaries of representative methods from each category. Each of the methods discussed can leverage functional annotations to define collapsed variant sets or can be used in a moving window setting (Box 2).
Morgenthaler and Thilly30 were the first to describe a version of the collapsing approach in which the frequency of individuals carrying any one of a number of rare variants is contrasted between case and control groups. They termed this approach the ‘cohort allelic sums test’ or ‘CAST’ method and suggested the use of standard contingency table-based Chi-square or Fisher’s exact tests for obtaining p-values. The method as first proposed does not easily accommodate covariates, cannot be used with quantitative phenotypes, and does not consider weighting of the variants using, for example, variant frequency or functional annotations. Li and Leal considered an extension of the CAST method, which they termed the ‘Combined Multivariate and Collapsing (CMC)’ method.28 Here, rare variants are collapsed, as in the CAST method, and treated as a single set of variants whose frequency differences are then tested between groups. This testing could potentially be done simultaneously with frequency differences at other individual loci or among other collapsed sets using a summary distance-based Hotelling’s T-Squared statistic.28, 38 The CMC statistic has desirable properties in that it appropriately controls type I error rates even when non-functional variants are included in the set of variants to be tested, and has better power than the standard CAST method. In addition, the CMC statistic can be implemented in a regression modeling framework as discussed later.
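The basic CAST-style comparison can be sketched as a 2x2 table of carriers versus non-carriers in cases and controls, followed by Fisher’s exact test. The code below is an illustration under simple assumptions (genotypes coded as minor allele counts, a two-sided test that sums the probabilities of all tables no more likely than the observed one); the function names and data are hypothetical, and the sketch does not include the multivariate component of the CMC method.

```python
from math import comb


def cast_table(genotypes, labels):
    """2x2 counts: carriers and non-carriers of any rare variant, by case status."""
    a = b = c = d = 0
    for row, y in zip(genotypes, labels):
        carrier = any(g > 0 for g in row)
        if y == 1:
            a, b = a + carrier, b + (not carrier)
        else:
            c, d = c + carrier, d + (not carrier)
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of hypergeometric probabilities of
    all tables with the observed margins that are no more likely than the
    observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Eight cases (five carry one of four different rare variants) and eight
# controls (one carrier).
cases = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0],
         [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
controls = [[0, 1, 0, 0]] + [[0, 0, 0, 0] for _ in range(7)]
genotypes = cases + controls
labels = [1] * 8 + [0] * 8

table = cast_table(genotypes, labels)
p_value = fisher_exact_two_sided(*table)
print(table, round(p_value, 4))  # (5, 3, 1, 7) 0.1189
```

In this toy data set no individual variant is seen more than twice among the cases, so a per-variant test would have little to work with; the collapsed table is what carries the signal.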
Madsen and Browning proposed a statistic for testing a prespecified collapsed set of variants that leverages weighting of each variant by its frequency, thus allowing one to include variants of any frequency in the collapsed set.34 A score is calculated for each individual using that individual’s genotypes and the frequency-determined weights. The sum of ranks of the scores among the cases is then used as a summary statistic to be compared to the same statistic computed among the controls using permutation methods, in a manner analogous to the Wilcoxon rank test.39 Madsen and Browning showed that their proposed statistic is more powerful than either the CAST or CMC methods in a number of settings, but more work in this area is needed to clarify the advantages, if any, of each.34 Other strategies for testing groupwise frequency differences of genetic variations between cases and controls in an analogous manner to the CAST method have been proposed, although many have only been implemented in settings involving common variants.34, 40–42
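A frequency-weighted rank-sum statistic of this general kind can be sketched as follows. The specific weight used here, the square root of n q (1 - q) with q a smoothed minor allele frequency estimated in controls, is an assumption of this sketch and should be checked against the original publication; the text above states only that weights are determined by variant frequency. The empirical one-sided permutation p-value, the function names, and the data are likewise illustrative.

```python
import random
from math import sqrt


def weighted_scores(genotypes, labels):
    """Per-individual score: minor allele counts summed over variants, each
    divided by a weight that grows with the variant's frequency in controls."""
    n = len(genotypes)
    n_ctrl = labels.count(0)
    weights = []
    for i in range(len(genotypes[0])):
        m_ctrl = sum(row[i] for row, y in zip(genotypes, labels) if y == 0)
        q = (m_ctrl + 1) / (2 * n_ctrl + 2)  # smoothed control allele frequency
        weights.append(sqrt(n * q * (1 - q)))
    return [sum(g / w for g, w in zip(row, weights)) for row in genotypes]


def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def case_rank_sum(genotypes, labels):
    ranks = midranks(weighted_scores(genotypes, labels))
    return sum(r for r, y in zip(ranks, labels) if y == 1)


def weighted_sum_test(genotypes, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value: is the case rank sum unusually large?
    Weights are recomputed for every permutation because they depend on
    which individuals are labelled as controls."""
    rng = random.Random(seed)
    observed = case_rank_sum(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if case_rank_sum(genotypes, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Eight cases (five carriers) followed by eight controls (one carrier).
genotypes = ([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
             + [[0, 0, 0, 0] for _ in range(3)]
             + [[0, 1, 0, 0]] + [[0, 0, 0, 0] for _ in range(7)])
labels = [1] * 8 + [0] * 8
observed, p_value = weighted_sum_test(genotypes, labels, n_perm=500)
print(observed, p_value)
```

Dividing by the weight means that a variant absent from controls contributes more to an individual’s score than one that is also seen in controls, which is how variants of any frequency can be admitted to the set without common variants dominating the statistic.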
Recently, Price et al.35 implemented a method for testing rare coding variants that considers optimal or variable weighting of the variants in a procedure resembling Madsen and Browning’s.34 Price et al.35 showed that their method is more powerful than approaches that consider fixed weights. In addition, they argued that the use of the predicted functional impact of each individual non-synonymous coding variant could be leveraged in their model. Finally, Han and Pan40 recently devised a method that cleverly considers the direction of the effect of the implicated variants (e.g., protective or deleterious) which can be implemented in a regression model framework (see below). Other summary statistic methods essentially ignore direction of effect and hence may be problematic in settings in which rare variants are not necessarily more frequent in disease or certain a priori defined phenotypic states.
Another way of exploiting summary statistics for rare variant analysis involves comparing haplotype frequencies between, for example, case and control groups, as opposed to genotype or single variant carrier status frequencies.43–45 Haplotype analyses require phase information, which is not trivial to obtain for genotyped rare variants or variants derived from sequence data (Box 1). In addition, if enough rare variants are studied, each individual in a sample of cases and controls may have their own unique haplotypes, making summary statistic approaches impossible. A recently proposed two-stage approach to haplotype analysis of rare variants could alleviate this problem since it collapses haplotypes into groups and eliminates variants not likely to be relevant prior to contrasting haplotype frequencies.46
Other potential methods that leverage summary statistics to test multiple variant frequency differences across groups include classical DNA sequence diversity measures such as nucleotide polymorphism, θ, and nucleotide diversity, π47, as well as traditional measures of population differentiation such as the statistics referred to as Fst and Gst.48, 49 These methods are more or less agnostic to allele frequencies, but can provide insight into differences between groups over many rare variants. However, their utility and power have not been assessed in association analysis settings. In addition, flaws with measures such as Fst and Gst have been pointed out that may not allow them to reliably capture diversity, differences in diversity, or population differentiation in general in some of the most trivial settings, given their focus on heterozygosity.50 Jost50 discusses alternatives to traditional Fst, Gst and related DNA sequence population differentiation measures, but these measures still require assumptions about the best way to apply them in any one particular setting. Interestingly, the methods described by Jost can be easily adjusted to assess group differences attributable to many rare variants (see Box 3).50
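The kind of flaw attributed to heterozygosity-based measures can be illustrated with a small calculation. The sketch below assumes the standard definition Gst = (Ht - Hs) / Ht with equally weighted groups, where Hs is the mean within-group expected heterozygosity and Ht is the expected heterozygosity of the pooled frequencies; the function names and the toy frequencies are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Gst = (Ht - Hs) / Ht for equally weighted groups.

    pop_freqs: list of dicts mapping allele (or haplotype) label -> frequency.
    """
    alleles = set().union(*pop_freqs)
    hs = sum(heterozygosity(p.values()) for p in pop_freqs) / len(pop_freqs)
    pooled = [sum(p.get(a, 0.0) for p in pop_freqs) / len(pop_freqs) for a in alleles]
    ht = heterozygosity(pooled)
    return (ht - hs) / ht


# Two groups that share no alleles at all: ten private alleles each, at 10%.
cases = {f"case_allele_{i}": 0.1 for i in range(10)}
controls = {f"control_allele_{i}": 0.1 for i in range(10)}
print(round(gst([cases, controls]), 4))  # 0.0526
```

The two groups are completely differentiated, yet Gst is close to zero because within-group heterozygosity is already high (Hs = 0.9, Ht = 0.95). A setting with many distinct low-frequency alleles per group is exactly the one that rare variant studies are concerned with.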
Exploiting sequence similarity or diversity in genetic association studies can be problematic due to the fact that the choice of a similarity or diversity measure can impact the interpretation of the results. This issue is well-documented in the cluster analysis literature59, 126 but has been shown to influence the interpretation of genomic studies as well. For example, the determination of phylogenetic patterns among different species based on DNA sequences requires the choice of a DNA sequence alignment method in order to identify patterns of orthology, and it has been shown that, depending on how DNA similarities are defined and the alignments are determined, different conclusions can be drawn about the phylogenetic, and hence evolutionary, relationships between species.127
For within-species studies assessing the ancestral relationships between populations based on DNA sequence, it has been shown that the choice of a distance measure can impact the interpretation of the results50, 128. Measures of nucleotide similarity for the comparison of DNA sequences between pairs of individuals within a species are also problematic for this reason. This issue is no less problematic for the assessment of the difference in the diversity of DNA sequences obtained from two or more groups of individuals when summary allele frequency measures are used.50 For example, consider the classical general formula for diversity measures129, 130 for a single population:
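The displayed formula is not reproduced here. Assuming it is the diversity of order q (the form discussed by Jost50 and consistent with the symbols defined in the next sentence), it can be written as:

```latex
\Delta = \left( \sum_{i=1}^{k} p_i^{\,q} \right)^{1/(1-q)}
```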
Where p_i is the frequency of the ith allele out of a total of k (i=1,…, k) and the exponent q determines the Δ measure’s sensitivity to the frequency of the alleles. Thus, the use of q values less than 1.0 produces a measure that emphasizes rare variants and the use of q values greater than 1.0 produces a measure that emphasizes common variants.50, 129 The use of different q values in the construction of Δ measures for the comparison of the genetic diversities of two (or more) populations will have the same effect50, 130, 131: small q values emphasize differences in rare variants and large q values emphasize differences in common variants132. Since a genomic region may harbor common, moderately common, and rare variants, some of which may influence phenotypic expression, the choice of a q value for association studies based on diversity indices may be problematic.
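The sensitivity to q can be illustrated numerically. The sketch below assumes that the diversity measure takes the form (sum of p_i^q) raised to the power 1/(1-q), with the q = 1 case taken as its limit, the exponential of the Shannon entropy; the function name and the example frequencies are hypothetical.

```python
from math import exp, log


def diversity(freqs, q):
    """Diversity of order q (an effective number of alleles).

    Computes (sum p_i^q)^(1/(1-q)); at q = 1 the limit exp(Shannon entropy)
    is used because the exponent is undefined there.
    """
    freqs = [p for p in freqs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return exp(-sum(p * log(p) for p in freqs))
    return sum(p ** q for p in freqs) ** (1.0 / (1.0 - q))


# One common allele and nine rare alleles at 1% each.
freqs = [0.91] + [0.01] * 9
for q in (0, 0.5, 1, 2):
    print(q, round(diversity(freqs, q), 3))
```

With q = 0 every allele counts equally and the measure is simply the number of alleles (10), whereas with q = 2 the nine rare alleles contribute almost nothing and the value falls toward 1. Two groups that differ only in their rare alleles would therefore look different at small q and nearly identical at large q.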
Instead of constructing statistics based on the frequencies of individual or collapsed variants, statistics that reflect the similarity of the unique DNA sequences possessed by individuals can be constructed. Such statistics have their roots in the assessment of cross-species orthology, protein family determination, phylogeny construction and a number of other molecular genetic analyses based on DNA sequence similarity and are more or less agnostic to the frequencies of the variants being considered.20, 51 The main motivation for similarity-based approaches to assessing rare variant associations is that the general nucleotide background or context within which a rare variant can influence a phenotype may be important. Thus, such approaches assume some form of interaction among variants or at least a simple shaping of gene function by the balance of variations an individual possesses.
Many recent papers have described flexible strategies for testing genetic associations that leverage individual sequence similarity information,20, 52–57 and it has been shown that such strategies can be as powerful, if not more so, than some traditional tests of association in many settings involving common variations.58 However, the performance of these methods when many rare variants and no common variants are considered is an open question. In addition, a limitation of these methods is that a specific DNA similarity or distance measure or metric must be chosen and this can be problematic (Box 3).59 For example, a number of approaches have described DNA sequence similarity metrics that consider the origins or phylogenetic relationships between sequences.60–62 In addition, other approaches, some of which have their roots in comparing pathogen sequences, consider weighting individual nucleotides by their frequency or putative functional effects.54, 63, 64
The problem of choosing a DNA sequence similarity measure based purely on nucleotide content matching or genealogical or cladistic distance is rooted in the fact that, ultimately, functional nucleotide content (i.e., what nucleotides and nucleotide combinations an individual possesses that impact function) determines gene activity, rather than the phylogenetic origins of those nucleotides. Thus, in theory, similarity measures that build off the functional features and functional capacities of impacted genes associated with DNA sequence (Box 2) – as shaped by particular nucleotides and nucleotide combinations – are likely to be more appropriate for association studies than measures based on either phylogenetic relationships between sequences or the mere equality of aligned nucleotides.
Statistics that exploit pairwise sequence similarity can also be used65 in place of classical summary statistic measures of sequence diversity differences between groups. Such statistics would be highly appropriate in situations, such as the EAH situation, in which a group of individuals (e.g., cases) is hypothesized to simply possess more unique variants or more unique combinations of variants than another group of individuals (e.g., controls) in a defined genomic region.
In the absence of knowledge of which rare variants to collapse or consider as a set, one could potentially search for a subset of variants that maximally discriminates between, for example cases and controls, based on the distances between the sequences in the two groups.66 Permutation methods could be used to derive p-values for discriminative ability. Searches for optimal sets of variations in this manner have parallels to the approach underlying logic regression67 and the method of Han and Pan40, which are discussed later in the section on regression methods. Although intuitively appealing, such methods are problematic in that the determination of an optimal subset of variants based on group differences can be computationally-intensive. In addition, if a large enough genomic region is considered, then one could merely ‘collapse’ all variants unique to each case and then unique to each control, resulting in a set of variants that completely and perfectly discriminate cases from controls. The possibility of this phenomenon emphasizes a need for considering functional annotations in relevant data analyses or other ways of circumscribing rare variants to be considered as a collapsed set.
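A permutation test of distance-based discrimination can be sketched as follows. This is a minimal illustration rather than the procedure used in the cited work: it assumes Hamming distance between 0/1 variant vectors, uses the mean between-group distance minus the mean within-group distance as the test statistic, and the names `hamming`, `separation` and `permutation_pvalue` as well as the toy sequences are hypothetical.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of sites at which two variant vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def separation(seqs, labels):
    """Mean between-group distance minus mean within-group distance."""
    between, within = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        (between if labels[i] != labels[j] else within).append(d)
    return sum(between) / len(between) - sum(within) / len(within)

def permutation_pvalue(seqs, labels, n_perm=999, seed=0):
    """One-sided p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = separation(seqs, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if separation(seqs, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data: four cases carrying rare alleles at sites 0-2, four controls.
cases = [(1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 0), (1, 1, 0, 0)]
controls = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0), (0, 0, 0, 0)]
seqs = cases + controls
labels = [1] * 4 + [0] * 4
print(separation(seqs, labels), permutation_pvalue(seqs, labels))
```

If the subset of sites entering `hamming` were itself chosen to maximize the statistic, the same search would have to be repeated inside every permutation for the p-value to remain valid, which is one source of the computational burden noted above.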
Finally, traditional family-based linkage analyses consider the consistency of within-family sharing of specific transmitted chromosomal segments among affected family members rather than the consistency or similarity of the nucleotide content of those segments across different families. As a result, such methods are fairly robust to allelic heterogeneity.68 However, not all approaches to linkage analysis are very powerful, and this is especially true for non-parametric approaches involving small families69, 70, although transmission/disequilibrium tests may have merit in the analysis of rare variants.71 In addition, linkage analysis approaches not only come with the often difficult and expensive need to sample family members, but many phenotypes may not exhibit familial aggregation, undermining the motivation to consider family-based studies10.
Regression models treat the phenotype as a dependent variable and collapsed sets of variants as independent or predictor variables. Such methods provide a flexible framework for assessing the contribution of collections of rare variants to a phenotype.28, 33 Such models can accommodate a number of additional predictor variables, including common variants, covariates such as gender and age, and interaction terms. Recently, Morris and Zeggini33 assessed the power of simple regression methods for testing collapsed sets of rare variants for association with a quantitative trait and found that such approaches are indeed intuitive, flexible and powerful. The authors compared the use of a simple tally of the number of rare variants possessed by an individual across a large region as a predictor of a phenotype against the use of a simple indicator of the possession of any rare variant. They found that the use of a tally may be more powerful.33 However, they did not consider conditioning effects (Figure 1C) or problems associated with analyses involving many correlated predictor variables.33
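The two collapsed predictors contrasted above, a tally of rare alleles and an indicator of carrying any rare allele, can be written down directly. The sketch below is illustrative only: the function names, the single-predictor least-squares fit, and the toy genotypes and trait values are assumptions, and it omits the covariates and additional predictors that a realistic analysis would include.

```python
import math

def tally(genotype_row):
    """Number of rare alleles an individual carries across the region."""
    return sum(genotype_row)

def indicator(genotype_row):
    """1 if the individual carries any rare allele in the region, else 0."""
    return 1 if any(genotype_row) else 0

def simple_ols(x, y):
    """Slope, intercept and t-statistic for the model y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return b, a, (b / se if se > 0 else math.inf)

# Toy data: rows are individuals, columns are rare-variant sites (0/1).
genotypes = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (1, 1, 0),
             (0, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
trait = [0.1, 0.9, -0.2, 2.1, 0.3, 1.8, 3.2, 0.0]

for name, collapse in (("tally", tally), ("indicator", indicator)):
    slope, intercept, t = simple_ols([collapse(g) for g in genotypes], trait)
    print(f"{name}: slope={slope:.3f}, t={t:.2f}")
```

The tally retains dosage information that the indicator discards, which is one intuitive reason it may be the more powerful predictor when each additional rare allele shifts the trait further.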
Multiple regression models have been applied in many standard GWA studies in an effort to identify the most likely causal variants in a particular genomic region harboring many associated variants72, 73. However, their direct application via simple extensions of the methods described by Morris and Zeggini33 to the analysis of multiple individual rare variants or collapsed sets of variants may be problematic. For example, collapsed sets of variants might be correlated due to LD with an additional common variant included in the model, or due to the hierarchical manner in which different subsets of variants are collapsed based on functional annotations, as discussed previously. Furthermore, strong multicollinearity is known to cause numerical and interpretation issues in traditional linear regression analysis. In addition, there will likely be many potential predictor variables to choose from if many individual common and rare variants, as well as collapsed sets of variants, are considered. Having many independent variables, or more independent variables than subjects, creates enormous potential for numerical instabilities and overfitting in standard linear regression models.
Newer regression techniques that make use of regularization and shrinkage parameters to control for collinearity and overfitting can be used to overcome these problems. Two such techniques, ridge regression74 and the LASSO75,76, have been considered in genetic association analysis contexts, and other methods have also been proposed.77–80, 81 Tibshirani82 compared the relative merits of standard stepwise regression, ridge regression, and the LASSO in different non-genetic contexts and concluded that each method seems best suited to different specific settings, depending on the number and effect sizes of the predictors. This is problematic in the context of genetic association analyses since one will not necessarily know a priori how many common, rare, or collapsed sets of variants might influence a phenotype, nor what kind of effects those variants have. One possible solution to this problem is to devise methods that combine elements of many different regression procedures, such as the ‘bridge (GPS)’ regression procedure of Friedman83 that exploits constructs forming the basis for both ridge and LASSO-based regression. Alternatively, ‘ensemble’ methods or ‘super learners’ that combine the results of different regression and prediction methods84 could be used. However, it is not clear that such methods will pick out functional or causal variants in an association study involving a large number of variants or collapsed sets of variants over those that may, due to LD, merely act as strong predictors of the phenotype.
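The way a shrinkage penalty stabilizes a regression with collinear predictors can be seen in a small example. The sketch below solves the ridge normal equations (X'X + λI)b = X'y by Gaussian elimination; it assumes no intercept and unstandardized predictors for brevity, and the function name `ridge`, the penalty value, and the toy data (two perfectly correlated collapsed sets) are illustrative assumptions.

```python
def ridge(X, y, lam):
    """Solve (X'X + lam*I) b = X'y by Gaussian elimination (no intercept)."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Forward elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-12:
            raise ValueError("singular system; a positive penalty is needed")
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    # Back substitution.
    coef = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = sum(A[r][k] * coef[k] for k in range(r + 1, p))
        coef[r] = (b[r] - s) / A[r][r]
    return coef

# Two perfectly correlated predictors, e.g. nested collapsed sets in full LD.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

try:
    ridge(X, y, lam=0.0)       # ordinary least squares: no unique solution
except ValueError as err:
    print("lam=0:", err)
print("lam=1:", ridge(X, y, lam=1.0))
```

With the penalty in place the effect is split evenly between the two predictors and shrunk slightly toward zero. The fit is stable, but the coefficients alone give no indication of which of the two correlated predictors, if either, is causal.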
Logic regression67 may be a particularly attractive regression-based approach, at least in theory, for the analysis of rare variants. Logic regression, which is similar in some ways to the method proposed by Han and Pan,40 was initially proposed for analyzing sequence data and does not assume that variants have been collapsed a priori. Instead, it constructs, and then tests for association, combinations of variants held together through the creation of dummy independent variables. These variables are constructed from logical operators such as ‘AND’ and ‘OR’ that connect and combine sets of variants into potential predictors of the phenotype. There are many issues with logic regression and related approaches that are similar to the issues discussed previously in the context of selecting an optimal subset of rare variants40, 66. These include: computational burden; difficulty in obtaining p-values for each potential independent variable (or individual rare variant, as opposed to a collapsed group of rare variants); and the identification of the optimal, and hence the biologically most-plausible, set of genetic predictors. The development of regression analysis methods for rare variant association analyses is an important area of research, however, as the flexibility, conditioning strategies, and ability to accommodate many effects make them particularly appealing.
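The construction of dummy variables from logical operators can be illustrated with a toy example. The sketch below is not the logic regression search procedure itself: it assumes an exhaustive enumeration of pairwise 'AND' and 'OR' combinations only, and scores each by the difference in carrier fraction between cases and controls. All function names and data are hypothetical.

```python
from itertools import combinations

def candidate_predictors(n_variants):
    """Yield (name, function) pairs for all pairwise AND / OR combinations."""
    for i, j in combinations(range(n_variants), 2):
        yield f"v{i} AND v{j}", (lambda g, i=i, j=j: int(g[i] and g[j]))
        yield f"v{i} OR v{j}", (lambda g, i=i, j=j: int(g[i] or g[j]))

def score(pred, genotypes, status):
    """Carrier fraction in cases minus carrier fraction in controls."""
    cases = [pred(g) for g, s in zip(genotypes, status) if s == 1]
    ctrls = [pred(g) for g, s in zip(genotypes, status) if s == 0]
    return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

def best_predictor(genotypes, status):
    """Return the (name, score) of the highest-scoring logical combination."""
    n = len(genotypes[0])
    return max(((name, score(f, genotypes, status))
                for name, f in candidate_predictors(n)),
               key=lambda t: t[1])

# Toy data: every case carries v0 or v2; some controls carry v1.
genotypes = [(1, 0, 0), (0, 0, 1), (1, 0, 0), (0, 0, 1),
             (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0)]
status = [1, 1, 1, 1, 0, 0, 0, 0]
print(best_predictor(genotypes, status))
```

Even in this restricted form the number of candidates grows quadratically with the number of variants, and allowing deeper logical expressions makes the search space grow far faster, which is the computational burden referred to above.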
Most studies assessing associations between rare variants and a phenotype have relied on rather simple collapsing strategies (Table 1). The advantages of more sophisticated data analysis methods are therefore unclear from a practical and implementation standpoint. However, power studies comparing newer methods with more simplistic methods for rare variant analysis have been pursued (Table 3). The studies we list in Table 3 are in no way exhaustive, but their consideration can provide insight into the limitations of the different strategies and, therefore, motivation for further studies. For example, almost all such studies consider comparisons between a proposed novel method and simple single locus analyses, which is an obvious comparison at some level, but does not reflect the sophistication and utility of the proposed method. In addition, almost all of the studies considered simulations under some version of the EAH model of rare variant effects and do not consider other scenarios (Figure 2) or the influence of LD structure among multiple common and rare variants (of the type that might create ‘synthetic associations’85). In addition, studies so far have not considered tests within a hierarchical collapsing framework that leverages functional annotations of genomic regions to separate truly causal variants from collections of rare variants that merely contain causal variants.
Other obvious issues with the current assessments of the power and other properties of rare variant analysis methods concern the simple fact that not enough time has elapsed since their introduction for someone to compare them all in a large study. In addition, some methods are clearly nuanced and are unlikely to work in situations other than those for which they were designed. For example, some methods do not take into account the possible direction of a rare variant effect, such as the methods described by Li and Leal28 and Madsen and Browning34 whereas other methods are designed to handle these situations40. Finally, although many such published power studies simulate data assuming a population genetics model for the propagation of rare variants, the appropriateness of the assumptions of these models is unclear. We believe that the best approach will be to take real sequence data obtained from many individuals (e.g., the 1000 Genomes Project data) and simulate phenotypes based on variants in those sequences, making assumptions only about phenotypic effect sizes and interactions between variants.
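The simulation strategy suggested above, in which genotypes are held fixed and only phenotypes are generated, can be sketched briefly. In the sketch below the genotype matrix is a small placeholder standing in for real sequence data, and the additive effect sizes, the pairwise interaction term, the noise level and the function name `simulate_phenotypes` are all assumptions chosen for illustration.

```python
import random

def simulate_phenotypes(genotypes, effects, interactions=None,
                        noise_sd=1.0, seed=0):
    """Quantitative trait = additive effects + pairwise interactions + noise."""
    rng = random.Random(seed)
    interactions = interactions or {}
    phenotypes = []
    for row in genotypes:
        value = sum(g * b for g, b in zip(row, effects))
        # An interaction effect applies only when both variants are carried.
        value += sum(b for (i, j), b in interactions.items()
                     if row[i] and row[j])
        phenotypes.append(value + rng.gauss(0.0, noise_sd))
    return phenotypes

# Placeholder genotypes (rows: individuals, columns: variant sites, 0/1).
genotypes = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
effects = [0.5, 0.0, -0.3]        # hypothetical per-variant effects
interactions = {(0, 1): 1.0}      # hypothetical synergy between sites 0 and 1

print(simulate_phenotypes(genotypes, effects, interactions, noise_sd=0.5))
```

Because the genotypes are taken as given, the LD structure and allele frequency spectrum are whatever the real data contain, and a method's power can be estimated by repeating the simulation over many seeds and effect-size scenarios.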
In this light, Bansal et al.86 recently considered the analysis of sequencing data obtained on two genes, FAAH and MGLL, thought to be associated with morbid obesity, among 142 morbidly obese and 147 control subjects discussed in a previous study66. They applied 11 of the methods described in this review plus 9 high-dimensional regression procedures, and showed that the methods do not consistently agree on the most strongly associated regions of the genes or the most likely causal variants. Their results emphasize the need for simulation and theoretical studies of different methodologies.
The identification and characterization of the effects of collections of rare variants on common complex disease susceptibility and general phenotypic expression will play prominent roles in future genetic studies. Appropriate data analysis methods for associating rare variants to a phenotype are therefore needed. A number of rare variant association analysis methods have been proposed that build off the notion of collapsing variants into groups based on either functional annotations of the genomic regions they reside in or on their location in a defined genomic region or ‘window.’ The power and robustness of these models need to be assessed in a wide variety of contexts. In addition, future studies of rare variants will likely be pursued in the context of a broader understanding of the genetic and environmental factors contributing to a particular common complex disease, making it unlikely that an exclusive focus on the influence of rare variants would be appropriate. Furthermore, as DNA sequencing and other genomic technology costs decrease, the frequency and functional impact of different forms of variation beyond SNPs will also be better understood. In this context merely finding that a set of rare variants appears to be collectively associated with a phenotype in no way suggests that all those variants are indeed functional or causally related to the phenotype. Thus, the problem of assigning causality to rare variants in a set may be more pronounced than it is in assigning causality to a single common variant.
A better understanding of the genetic architecture of disease, as well as a better appreciation of the forms and functions of DNA sequence variation, will undoubtedly impact the choice of a statistical method for rare variant association studies. Thus, for example, methods which can accommodate covariates, previously identified genetic factors, allelic heterogeneity, and different sets of collapsed variants simultaneously, such as regression-based methods, are clearly advantageous. However, methods which can account for subtle synergistic effects of many loci within a defined region and/or different forms of variation that might contribute to gene function, such as those rooted in sequence or functional similarity56, 57, 87, 88 are also likely to be appropriate. It is arguable that, in general, variants or groups of variants should always be studied in a more comprehensive regression model that includes covariates and other confounding variables no matter how the collapsed set was initially identified. Such an approach might mitigate a range of concerns, for example about accommodating confounding variables and the functional assessment of variants.
This work was supported in part by the following research grants: U19 AG023122-05; R01 MH078151-03; N01 MH22005; U01 DA024417-01; P50 MH081755-01; R01 AG030474-02; N01 MH022005; R01 HL089655-02; R01 MH080134-03; U54 CA143906-01; UL1 RR025774-03 as well as the Price Foundation and Scripps Genomic Medicine. Ondrej Libiger is also supported by a grant from Charles University: GAUK #134609. The authors would like to thank the reviewers for their comments on previous versions of the review as well as Drs. Eric Topol, Sarah Murray, Sam Levy and the entire team at the STSI for support.
Vikas Bansal, Ph.D. Vikas Bansal received his Ph.D. in Computer Science from the University of California, San Diego, USA. He is currently a Research Scientist at the Scripps Translational Science Institute in La Jolla, California. His current research interests include developing computational methods for the detection of human genetic variation using high-throughput sequencing technologies, reconstructing diploid human genomes, and statistical methods for enabling sequencing-based disease association studies.
Ondrej Libiger, Ph.Dc. Ondrej Libiger holds a Master of Science degree in Computer Science from Charles University in the Czech Republic, and is currently finishing his doctoral degree in biomedical sciences there. Ondrej worked in the laboratory of Dr. Nicholas Schork as a research programmer at the University of California, San Diego between 2003 and 2007. Since 2007, he has been a research programmer at The Scripps Translational Institute and The Scripps Research Institute. His interests include applying multivariate statistics and data mining techniques to data generated by various types of genomic technology with the aim of improving health care.
Ali Torkamani, Ph.D. Ali Torkamani received his Ph.D. training in the laboratory of Dr. Nicholas Schork in the School of Medicine at the University of California at San Diego. He then joined the Scripps Translational Sciences Institute as a Research Scientist before being appointed as an Assistant Professor of Molecular Medicine at The Scripps Translational Science Institute and The Scripps Research Institute. His research interests are in developing computational and analytical methods for understanding the functional impact of inherited and somatically-acquired DNA sequence variation from multiple perspectives, including sequence and structure based analyses as well as systems biology approaches.
Nicholas J. Schork, Ph.D. Nicholas J. Schork is currently Professor, Molecular and Experimental Medicine, The Scripps Research Institute (TSRI) and Director of Biostatistics and Bioinformatics at the Scripps Translational Science Institute (STSI). Prior to joining TSRI and the STSI, Dr. Schork held faculty positions at the University of California, San Diego and Case Western Reserve University. His professional interests are in statistical genetics and integrated biomedical research. He received an M.A. in Philosophy, an M.A. in Statistics, and a Ph.D. in Epidemiology under Drs. Michael Boehnke and Patricia Peyser from the University of Michigan in Ann Arbor.