Analysis of rare variants requires statistical methods that are fundamentally different from association statistics used for testing common variants. There are two reasons for this. First, rare variants have to be combined in a gene (or pathway) for an association test to reach sufficient power57
. For example, a causal SNP at a frequency of 1 in 500 and genotype relative risk of 10 in a sample of 200 cases and 200 controls, has 0.2% power to be detected at a conventional significance threshold for GWAS (P < 5 × 10−8
). Second, functional and population genetics information can be added to the testing approach because exome sequencing comprehensively captures variation that can be annotated with such information.
Early candidate-gene sequencing studies for complex traits were based on the comparison of numbers of non-synonymous alleles exclusive to cases or controls (or samples at the extremes of the trait distribution)21, 26
. This approach has limited power because it ignores common and low-frequency polymorphisms, as most such variants would be present in cases and controls. Recently, a number of statistical tests have been designed for rare-variant analysis. The Combined Multivariate and Collapsing (CMC) test58
jointly assesses the role of rare and common variation. For the common variants, traditional regression-based association is applied. For rare variation, an individual's predictor in a regression model is defined as 1 if the individual possesses at least one rare variant in the region (e.g., gene) and 0 otherwise. The Weighted-Sum Statistic (WSS) test59
, creates a composite genotype score for all individuals. This score is the sum of alternate alleles weighted by the inverse of the binomial variance. A rank sum test is then performed on the genotype scores between phenotypic groups. The Kernel-Based Adaptive Cluster (KBAC) test60
also uses a weighting scheme that reflects apparent effect sizes of individual variants. An alternative approach to combine rare variants into a single test is to select an allele frequency threshold based on the observed data. This variable threshold (VT) approach61
was motivated by population genetic simulations that showed that there is no single optimal weighting scheme or allele frequency threshold. There are numerous other statistical tests for rare variants in complex traits (reviewed in refs.62–65
In simulation studies64
, most tests behave similarly in many situations. However, the results may depend on assumptions used in simulated data. The relative power to detect association depends on factors such as the number and proportion of causal variants, their population frequency, and their effect sizes, as well as directionality of effects, the number of genes contributing to the trait, and fraction of causal genetic variation located in the exome. Statistical tests were developed with various combinations of these factors in mind and therefore are likely to be sensitive to different disease architectures. For example, the simulation framework used in development of the WSS test assumes effect size proportional to 1/x(1 − x)
is the population frequency of the causal allele), while Sequence Kernel Association Test (SKAT)66
simulation framework uses effect size proportional to −log(x)
, and the VT test simulations uses a demographic history model with a range of possible values of strength of selection leading to different relationships between effect size and x
. These simulations were designed to demonstrate the strengths of each methods under different effect-size distributions: the WSS is designed for effect sizes proportional to 1/x(1 − x)
, SKAT is designed for effect sizes proportional to beta density β(x; a1, a2)
for prespecified a1
, the C-alpha67
test is designed for effects going in opposite directions in the same region, while the VT test makes no assumptions about effect-size distributions.
When combining rare variants, all functional variants may either be assumed to influence the trait in the same direction, or some may be allowed to have opposite directions of effects. A biochemical argument can be made that most of non-synonymous variants are loss-of-function hypomorphs, while gain-of-function variants are infrequent. However, some genes (e.g., PCSK968
) have variants of both kinds. Several tests allow for rare variants to have opposite effects on the trait (e.g., Step-up69
, C-alpha, replication-based test70
, SKAT). These tests are based either on the analysis of over-dispersion or on explicit linear models that determine the contribution of a variant to a score based on the direction of effect observed in the data.
Rare-variant tests can benefit from stratifying or weighting rare alleles by functional significance, as evidenced by simulations and sequencing studies of candidate genes61,64, 71–73.
The power of rare-variant tests is strongly influenced by the fraction of causal variants among all variants analyzed and using functional information is an effective way to give greater weight to likely causal variants. For example, nonsense variants should be prioritized higher than nonconserved missense variants. Similarly, missense variants should be prioritized higher than synonymous variants. Functional consequences of variants can be predicted by examining effects of amino-acid changes using comparative sequence analysis and protein structure analysis. Many computational prediction and conservation74, 75
methods are available (reviewed in refs.76–79
). The accuracy of those methods is around 80%80
and it is likely highest for rare variants (truly functional variants are most likely deleterious and kept at low frequencies by purifying selection, and so common variants are most likely neutral and nonfunctional). Therefore, using prediction methods enriches for functional variants and thus boosts the power of association tests. Because such predictions are not perfect, however, they should be used quantitatively by weighing variants, rather than qualitatively by filtering out variants. A number of tests allow including prediction scores into test statistics, e.g., VT test, KBAC, SKAT, Rare variant Weighted Aggregate Statistic (RWAS)72
, Likelihood Ratio Test (LRT)73
. The PLINK/SEQ suite includes precomputed PolyPhen-246
prediction scores for all possible missense changes in humans, which makes these scores readily applicable.
An important consideration for exome sequencing studies is selecting the significance threshold that accounts for multiple testing. A simple way is to adopt a Bonferroni correction for 20,000 independent tests (one test per each gene), which, for an experiment- wide significance of 0.05 gives a p-value threshold of 2.5 × 10−6 per gene. However, such a threshold may be overly conservative because it assumes that each tested gene has sufficient variation to achieve the asymptotic properties for the test statistic. For example, if only 2 individuals carry non-synonymous variants in a given gene, the difference between cases and controls never exceeds 2 total observations, and so the most significant p-value that can be achieved is around 0.25 assuming that these 2 variants are independent. Therefore, unless the study is large, association p-values will be generally less significant than expected under the null hypothesis. demonstrates this effect on the 438 whole exomes. The PLINK/SEQ suite computes from data the so-called i-stat, which is an estimate of the minimal achievable p-value for a gene. The i-stat can be used by setting a threshold (e.g., 10−3) and only correcting for the number of genes that have the i-stat below the threshold following the idea that for the genes with i-stat above the threshold there is no power to find an association. Another way to correct for multiple testing is to compute an experiment-wide significance threshold by permutations of phenotype labels, create the empirical distribution of minimal p-values for all genes across permutations, and compare the minimal p-value from the real data to that distribution (). This approach efficiently controls Type-I error and is less conservative than the Bonferroni correction. Importantly, the p-value threshold computed by permutations is dependent on both the study and on the statistical test. However, the experiment-wide correction via permutation is not robust to confounding and it is essential to assess the quality of the distribution of test statistics, for those genes that have i-stats less than the threshold, to ensure appropriate calibration of the distribution. Nevertheless, with increasing sample sizes, the dimensionality of the tests will also increase, and studies will be assessing close to 20,000 tests. Therefore, for large studies we consider the Bonferroni threshold to be preferable.
Figure 2 Association analysis. (a) Q-Q plot of association p-values under the null hypothesis. (b) Distributions of lowest p-values under whole-exome permutations. The histograms show the distributions of the lowest p-values across permutations for the T5 test. (more ...)