Multiple rare variants have been suggested as accounting for some of the associations with common single nucleotide polymorphisms identified in genome-wide association studies or possibly some of the as yet undiscovered heritability. We consider the power of various approaches to designing substudies aimed at using next-generation sequencing technologies to discover novel variants and to select some subsets that are possibly causal for genotyping in the original case-control study and testing for association using various weighted sum indices. We find that the selection of variants based on the statistical significance of the case-control difference in the subsample yields good power for testing rare variant indices in the main study, and that multivariate models including both the summary index of rare variants and the associated common single nucleotide polymorphisms can distinguish which is the causal factor. By simulation, we explore the effects of varying the size of the discovery subsample, choice of index, and true causal model.
The current crop of genome-wide association scans (GWAS) for various complex diseases has yielded a number of important insights. First, the relative risks for most replicated associations have been small, typically less than 1.5. Given the high degree of linkage disequilibrium between all known common variants and those on commercial genotyping chips, it has always been understood that most discovered single nucleotide polymorphism (SNP) associations are unlikely to be directly causal, but merely surrogates for truly causal variants in the region, whose effects could be stronger than the induced associations discovered. Second, even in the aggregate, the overall effect of all discovered associations explains only a small portion of the estimated heritability for most complex traits. Various hypotheses about the remaining dark matter have been proposed [2,3], one of which is multiple rare variants that are not well tagged by HapMap SNPs [4,5,6,7]. Because of their rarity, and because the penalty for multiple comparisons is even larger than for common SNPs, detecting the effect of any single such variant is nearly impossible, but the aggregate effect of multiple variants within a region may be detectable. Resequencing of an associated region to discover novel variants that could be subject to such an analysis is thus a necessary first step in this process. Because of the high cost of sequencing, even with NextGen technologies [10,11], it is generally feasible to do this only on a targeted subsample. In this paper, we consider the design of such studies with the aim of selecting some subsets of variants for direct genotyping in the original GWAS samples and testing regional association with rare variants.
Various approaches have been suggested for testing association with multiple rare variants. Morgenthaler and Thilly described a cohort allelic sums test (CAST), which contrasts the frequency of individuals carrying any mutant alleles in sequenced exons between cases and controls. Li and Leal proposed the combined multivariate and collapsing (CMC) method, which collapses rare variants using an indicator variable for the presence of any rare alleles. This set of rare variants can be tested simultaneously in a multivariate framework with other groups of variants collapsed in the same fashion, where the groups are formed according to predefined criteria such as allele frequencies. Such an approach has been applied, for example, to variants of unknown significance in candidate genes for breast cancer. Madsen and Browning suggested refining this approach by weighting alleles inversely by their population frequency. Li and Leal considered the design of resequencing studies and pointed out the serious inflation of type I error rates that can arise if the same sample is used both to select variants and to test association with the selected variants. This problem can be avoided simply by using disjoint samples for variant discovery and testing, but the optimal design of such studies in terms of the allocation of sample sizes and sampling schemes between discovery and testing phases, depth of sequencing, criteria for selecting variants for genotyping in the main study, and analysis methods remains unresolved.
We considered study designs in which random samples of cases and controls would be fully sequenced in a region carrying one or more causal variants, typically regions identified by an association with one or more common SNPs. Based on univariate analyses of the association of every polymorphic site with disease status in these samples, a subset of potentially causal variants would then be genotyped in a much larger, independent case-control study and their association with risk tested using various types of summary indices. We considered several criteria for selecting discovered variants for genotyping and several forms of summary indices, and we studied the effects of varying sample sizes for discovery and testing with each, as well as the dependence of the optimal design and analysis on the true distribution of allele frequencies of causal alleles and their relative risks. As a benchmark for comparison with feasible indices, we used similarly constructed summary indices that included all truly causal variants and excluded all truly non-causal variants. By simulation, we demonstrated that our tests of regional association have the nominal type I error rate, and we compared methods on the basis of their power to detect regional association. Finally, for each instance where regional association with the summary index was deemed significant, we assessed the false discovery and false non-discovery rates for the specific variants included in the index.
A haplotype pool of size 20,000 for a 100-kb region was simulated using the coalescent simulator ms, in which the scaled mutation rate θ = 4N0μ was set to 100 and the scaled crossover rate ρ = 4N0r was set to 50 to mimic a human population. Two independent case-control samples were drawn from the haplotype pool: a discovery panel for variant detection via sequencing, and a larger main study sample in which a regional association test was conducted. To generate a case-control sample, the genotype of each individual was formed by sampling a random pair of haplotypes with replacement from the pool, and disease status was determined by a specified disease model. For 1,000 Genomes Project controls, 1,000 individuals were randomly sampled from the same haplotype pool from which the cases were sampled, without determining their disease status.
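This sampling scheme can be sketched as follows. The function and variable names are our own, and a random sparse binary matrix stands in for the ms-generated haplotype pool; only the pair-sampling step mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an ms-generated pool: 20,000 haplotypes at 50 variant sites,
# with rare minor alleles (the 1% allele probability is purely illustrative).
n_hap, n_sites = 20_000, 50
pool = (rng.random((n_hap, n_sites)) < 0.01).astype(np.int8)

def draw_genotypes(pool, n_individuals, rng):
    """Form each genotype by summing a random pair of haplotypes drawn
    with replacement from the pool (additive 0/1/2 coding)."""
    idx = rng.integers(0, pool.shape[0], size=(n_individuals, 2))
    return pool[idx[:, 0]] + pool[idx[:, 1]]

genotypes = draw_genotypes(pool, 1_000, rng)
```

Disease status would then be assigned to each row of `genotypes` via the logistic disease model, and 1,000 Genomes-style controls would be drawn the same way without applying that model.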
A logistic model with an additive coding of genotypes was used to determine binary disease status:

logit Pr(Yi = 1) = β0 + Σj βj Gij      (equ. 1)
where Gij is the number of mutations in individual i at variant j. The probability of variant j being causal was calculated as

Pr(βj ≠ 0) = min[1, π0 (q/qj)^π1]      (equ. 2)
where qj is the population minor allele frequency (MAF) for variant j, q is a scaling constant set to 0.01, and the parameters π0 and π1 determine the overall proportion of causal variants and their allele frequency spectrum relative to all variants. When βj is set to be non-zero in equ. 2, its value, interpreted as the genotypic relative risk (RR) on a log scale, was simulated as

βj = [α0 + α1 ln(q/qj)] εj      (equ. 3)
where εj was sampled from either a Normal(0,1) distribution, to allow both deleterious and protective minor alleles (the mixed causal variants, MCV, model), or a Normal(0,1) distribution truncated to positive values (the deleterious causal variants, DCV, model). The parameters π1 and α1 that specify the dependence on MAF were both set to non-zero values in the MAF-dependent model and both set to 0 in the MAF-independent model. For simplicity, we describe the distribution of effect sizes by the 95th percentile of the genotypic RR for a variant with MAF = 0.01 (primarily controlled by α0), denoted, e.g., as RR95 = 2 or 3. The corresponding α1 was set so that the 95th percentile of the genotypic RR for a variant with the lowest possible MAF (1/20,000) is 20 or 30. We present results for proportions of all rare variants being causal ranging from 1 to 5%, in the belief that most non-coding variants are likely to be neutral with respect to the trait of interest.
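The causal-status and effect-size model can be sketched as below. The functional forms (a power-law causal probability in MAF, and an effect size log-linear in ln(q/qj)) are one plausible reading of the text, not the paper's exact formulas, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_effects(maf, pi0, pi1, alpha0, alpha1, model="DCV", q=0.01, rng=rng):
    """Assign causal status and log-relative-risks to variants.
    Causal probability: min(1, pi0 * (q/maf)**pi1); effect size:
    (alpha0 + alpha1 * ln(q/maf)) * eps, eps ~ N(0,1) (MCV) or |N(0,1)| (DCV)."""
    maf = np.asarray(maf, dtype=float)
    p_causal = np.clip(pi0 * (q / maf) ** pi1, 0.0, 1.0)
    causal = rng.random(maf.size) < p_causal
    eps = rng.standard_normal(maf.size)
    if model == "DCV":
        # truncated standard normal on positive values == half-normal == |Z|
        eps = np.abs(eps)
    beta = np.where(causal, (alpha0 + alpha1 * np.log(q / maf)) * eps, 0.0)
    return causal, beta

maf = rng.uniform(1 / 20_000, 0.5, size=2_000)
# alpha0 chosen so RR95 = 2 at MAF = q (95th percentile of |Z| is 1.645)
causal, beta = simulate_effects(maf, pi0=0.025, pi1=0.3,
                                alpha0=np.log(2) / 1.645, alpha1=0.0)
```

Setting `pi1=0` and `alpha1=0` recovers the MAF-independent model, in which neither causality nor effect size depends on allele frequency.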
An indicator variable D was assigned to each variant to record the direction of its apparent effect in the discovery panel: D = +1 for an apparently deleterious variant and D = −1 for an apparently protective variant. A subset of the variants identified in the discovery panel was then selected by various criteria, among which we focused on two types of thresholds: (1) p value thresholds, where a marginal two-sided p value was calculated for each variant and those with p values below various thresholds were included; (2) frequency thresholds, where the MAF of each variant was calculated from the entire discovery panel and those with MAF below various thresholds were included. All indices include variants with D = +1; some also include those with D = −1. For comparison purposes, we also considered indices that include all discovered variants, or variants found only in cases.
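A minimal implementation of these two selection criteria might look like the following. A two-sided normal-approximation test on the case-control allele-frequency difference stands in for the marginal test, which the text does not specify; all names are illustrative.

```python
import math
import numpy as np

def marginal_p(case_alt, n_case, ctrl_alt, n_ctrl):
    """Two-sided normal-approximation p value for the difference in allele
    frequencies between cases and controls at one variant."""
    p1, p2 = case_alt / (2 * n_case), ctrl_alt / (2 * n_ctrl)
    pooled = (case_alt + ctrl_alt) / (2 * (n_case + n_ctrl))
    se = math.sqrt(pooled * (1 - pooled) * (1 / (2 * n_case) + 1 / (2 * n_ctrl)))
    if se == 0.0:
        return 1.0  # monomorphic in the pooled sample: no evidence
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def select_variants(case_geno, ctrl_geno, p_cut=None, maf_cut=None):
    """Apply a p value threshold and/or a discovery-panel MAF threshold,
    and record the apparent direction D of each variant's effect."""
    n_case, n_ctrl = case_geno.shape[0], ctrl_geno.shape[0]
    case_alt, ctrl_alt = case_geno.sum(axis=0), ctrl_geno.sum(axis=0)
    pvals = np.array([marginal_p(a, n_case, b, n_ctrl)
                      for a, b in zip(case_alt, ctrl_alt)])
    maf = (case_alt + ctrl_alt) / (2 * (n_case + n_ctrl))
    keep = np.ones(case_geno.shape[1], dtype=bool)
    if p_cut is not None:
        keep &= pvals < p_cut
    if maf_cut is not None:
        keep &= maf < maf_cut
    # D = +1 if the minor allele is at least as frequent in cases, else -1
    D = np.where(case_alt / (2 * n_case) >= ctrl_alt / (2 * n_ctrl), 1, -1)
    return keep, D, pvals
```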
The summary genetic indices for individual i were calculated as

Xi = Σj∈S Dj wj Gij      (equ. 4)

where S is the selected subset of variants, Dj is the direction indicator defined above, and wj is a weight: either the inverse of the estimated binomial standard deviation, 1/√(njqj(1 − qj)), as in the Madsen and Browning weighted sum method, or 1 for an unweighted index. Gij is the genotype for individual i at variant j under additive coding, obtained by genotyping the variant set S in an independent main study sample. Regional association was then assessed by Armitage's trend test for each variant selection criterion and each choice of wj. This procedure was repeated for 1,000 simulated datasets over a range of parameter sets, and power was estimated as the proportion of significant replicates.
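The index and the test can be sketched as below. Reducing the trend statistic to N·r², with the 1-df chi-square tail probability computed as erfc(√(T/2)), is a standard identity for a score test of a binary outcome against a single score; the study's exact test-statistic implementation is not specified, so treat this as an assumption. All names are ours.

```python
import math
import numpy as np

def weighted_sum_index(G, D, n, q, weighted=True):
    """Per-individual summary index over the selected variant set S:
    X_i = sum_j D_j * w_j * G_ij, with Madsen-Browning-style weights
    w_j = 1 / sqrt(n_j q_j (1 - q_j)), or w_j = 1 for an unweighted index."""
    q = np.asarray(q, dtype=float)
    w = 1.0 / np.sqrt(n * q * (1.0 - q)) if weighted else np.ones_like(q)
    return G @ (np.asarray(D) * w)

def trend_test(index, status):
    """Score (trend) test of binary disease status against the index:
    T = N * r^2 is chi-square on 1 df, so p = erfc(sqrt(T / 2))."""
    r = np.corrcoef(index, status)[0, 1]
    T = len(status) * r * r
    return T, math.erfc(math.sqrt(T / 2.0))
```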
Since many association studies involve sequencing candidate regions around GWAS hits, we simulated associations with common variants by scanning all variants with MAF ≥ 0.05 in the main study sample, retaining replicates with at least one hit with a p value <0.05/N, where N is the number of common variants tested. Power calculated from these replicates is denoted ‘conditional power’. We also computed power over all replicates, with or without common variant signals, denoted simply ‘power’. This two-stage analytical scheme also extends to exome sequencing studies, in which the sequenced individuals in each stratum (cases or controls, for example) are randomly split into two disjoint samples, one for variant selection and the other for testing.
A question for the multiple rare variants hypothesis is whether the SNP signals detected by a GWAS can be explained by synthetic associations with multiple rare variants. The reverse question is whether, if a summary index of a group of rare variants shows significant association, the signal could actually be due to causal effects of common SNPs. To answer these questions, we performed simulations under models where either a single common SNP or multiple rare variants were the causal factors. All rare variants (MAF < 0.05 in controls) discovered in 200 cases and 200 controls were collapsed into a weighted sum index and tested for association in 1,000 cases and 1,000 controls. The test was adjusted, in a multiple logistic regression, for all SNPs that were significant in single-marker tests with Bonferroni correction. The power of such a test is denoted ‘adjusted power’.
We also considered a one-stage design, where variant selection and tests for association were performed using the same sample. For this design, the sampling distribution of the t statistic testing a summary index was obtained through permutation. For each random permutation of case-control labels, we selected a subset S of variants for the particular criterion, calculated the index for each individual, and computed the t statistic for the permutation. For N permutations, the p value was assessed as (n + 1)/(N + 1), where n is the number of permutations that generated more extreme t statistics. Ties were resolved randomly. N = 1,000 was used for each test in our simulation study.
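The key point of the one-stage permutation scheme is that variant selection is redone inside every permutation, so the optimism induced by selecting on the phenotype is built into the null distribution. A sketch, where `demo_score` is an illustrative stand-in for a particular selection criterion and t statistic:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_stage_permutation_p(G, status, select_and_score, n_perm=199, rng=rng):
    """p value (n + 1) / (N + 1), where n counts permutations with a more
    extreme statistic; selection is repeated within each permutation."""
    t_obs = select_and_score(G, status)
    n_extreme = 0
    for _ in range(n_perm):
        t_perm = select_and_score(G, rng.permutation(status))
        if t_perm > t_obs:
            n_extreme += 1
        elif t_perm == t_obs and rng.random() < 0.5:
            n_extreme += 1  # ties resolved randomly
    return (n_extreme + 1) / (n_perm + 1)

def demo_score(G, status):
    """Toy select-and-score: keep variants more frequent in apparent cases,
    then take the case-control difference in the unweighted sum index."""
    case, ctrl = G[status == 1], G[status == 0]
    keep = case.mean(axis=0) > ctrl.mean(axis=0)
    if not keep.any():
        return 0.0
    index = G[:, keep].sum(axis=1)
    return index[status == 1].mean() - index[status == 0].mean()
```

Running selection only once and permuting labels afterwards would understate the null variability and inflate type I error, which is exactly the pitfall this design avoids.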
The majority of the simulated alleles were rare: across 10,000 haplotype pools, 28.1% were common (MAF > 0.05), 15.7% uncommon (0.01 < MAF ≤ 0.05), and 56.2% rare (MAF ≤ 0.01). The corresponding numbers for the combined case-control samples were 33.6% (MAF > 0.05), 18.8% (0.01 < MAF ≤ 0.05), and 47.6% (MAF ≤ 0.01). The distributions of all discovered variants were almost identical for MAF-dependent and MAF-independent models, whether or not at least one common variant signal was required. Under MAF-dependent models, however, the discovered causal variants were dominated by rare variants. For example, discovered causal variants were distributed as 8.7% (MAF > 0.05), 9.7% (0.01 < MAF ≤ 0.05), and 81.6% (MAF ≤ 0.01) over 10,000 replicates. When at least one common variant signal was required, as arises in a follow-up of a GWAS hit, these numbers were 11.9% (MAF > 0.05), 10.0% (0.01 < MAF ≤ 0.05), and 78.1% (MAF ≤ 0.01).
The statistical power to detect regional association using an unweighted or weighted sum index collapsing all discovered variants was highly dependent on whether the probability of a variant being causal and the distribution of effect sizes depend upon MAF, and on whether there are protective variants (table 1). The Madsen and Browning weighting scheme generally improves power, but the advantage is less pronounced in MCV models than in DCV models because the weighting does not treat deleterious and protective variants differently.
We first demonstrated that our regional association tests have the correct type I error rate using a ‘global-null’ model in which no causal variants were simulated (data not shown). We then simulated a ‘pseudo-null’ model for testing the aggregate of rare variants, in which a single common SNP (population MAF ≥ 0.05) is designated as the only causal variant in the region, and examined the type I error rate for testing association with rare variants. Table 2 shows that testing a weighted index collapsing all discovered rare variants in an independent sample has an inflated type I error under this pseudo-null model. The inflation becomes more serious if the region is required to have common variant signals (e.g. in a follow-up of GWAS hits). Thus, false positive associations with aggregate indices of rare variants can be induced by common SNPs in linkage disequilibrium with them. Adjusting for the common SNPs that show region-wide significance can control this inflation, as illustrated in the last column of table 2; this remains true even for pseudo-null models with more common causal SNPs and larger marginal effect sizes (data not shown). Since we made no assumptions about the frequency of the underlying causal variants, both common and rare variants could be causal in the following simulations, and summary indices could therefore include both common and rare variants, depending on the variant selection criteria. Tests of these indices have correct type I error rates (at α = 0.05) under the true null model in which no causal variants are simulated.
The motivation for including a subset of variants in a summary index is two-fold: (1) to increase statistical power by improving the signal-to-noise ratio and (2) to minimize the genotyping cost for the main study by excluding apparently irrelevant variants. Under models in which both rare and common variants can be disease causing, we considered various p value thresholds and frequency thresholds, with the restriction that the frequency in cases should be higher than in controls. For indices based on frequency thresholds, power consistently increased with higher frequency thresholds, but eventually the gain became marginal, as rare variants were simulated to have higher probability of being causal in MAF-dependent models (fig. 1b, d). Thus, if a fixed frequency threshold (e.g. 5%) is used in studies that focus on rare variants, the statistical power of the study is suboptimal. p value thresholds generally outperformed frequency thresholds (fig. 1a, c), while using fewer variants. Since power as a function of p value cut-offs levels off rapidly when the p value is >0.1, one can achieve good power using less than 20% of the discovered variants by applying a fixed 0.1 p value threshold. A possible disadvantage of the p value threshold is that it does not necessarily prioritize rare variants, while common variants are likely to have already been discovered and tested.
Under models based on the multiple rare variants (MRV) hypothesis, where no common variants (MAF > 0.05) are causal, we observed that a frequency threshold of 5% achieved the highest statistical power (data not shown), as expected; the power then decreased as a greater frequency threshold beyond 5% was applied. The power curve for p value thresholds plateaued beyond 0.01, a similar pattern as in the general models above. Although p value thresholds achieve slightly lower power than a fixed frequency threshold of 5%, their performance is at least comparable to other frequency thresholds and is consistent across different scenarios.
We considered three aspects in selecting an appropriate sample size for the discovery panel (table 3). The first is the proportion of all variants (causal or not) that are discovered: approximately 63, 69, and 78% for sample sizes of 100, 200, and 500, respectively. These proportions were fairly constant across simulation model parameter settings. The second is the proportion of causal variants discovered, which increased slightly with larger effect sizes and more causal variants. The third is the power for detecting regional association. For a main study of 1,000 cases and 1,000 controls, we found that a small discovery panel is useful only when a considerable proportion of variants are causal and they have large effect sizes. For example, at least 500 cases and 500 controls are needed for discovery to achieve 70% power when 2.5% of all variants are causal and RR95 is 2.0–3.0.
We compared two cost-equivalent designs for obtaining control data: the first uses study controls; the second uses 1,000 Genomes Project data, assuming they represent the source population of the cases, so that 2N cases are sequenced for variant discovery instead of N cases and N study controls. The results show that using 400 cases with 1,000 Genomes Project controls for discovery yields higher power than using 200 cases and 200 controls (table 4). However, the former requires somewhat more variants to be genotyped in the association testing stage (typically fewer than twice as many), increasing genotyping costs. The relative cost-efficiency of the two designs would thus depend upon the relative costs of sequencing and genotyping.
When the constraint is instead on the number of available case samples, one could vary their allocation between two subsamples, one for discovery and one for testing. Under MAF-dependent models, we found that allocating 40% (DCV) or 50% (MCV) of the sample to variant discovery and selection, reserving the remainder for association testing, generally achieved the best power (fig. 2). Results under MAF-independent models exhibited a distinct pattern: power levels off as long as the discovery sample size is within a reasonable range (100 to 900 out of 1,000). An exception is the method using an index that includes all discovered variants, for which smaller discovery samples are preferable across models. Again, the most cost-efficient choice would depend upon the relative costs of sequencing and genotyping.
Sequencing is generally directed at the discovery of rare deleterious variants, as associations with rare protective variants are more difficult to detect. However, some variants may be truly protective, and sequencing studies will inevitably identify some variants as apparently protective. We found that the existence of truly protective alleles reduces the statistical power for detecting an association when using a simple summary index that counts only apparently deleterious alleles (table 5). For example, using an index that includes only apparently deleterious variants, we had 52.1% power under a simulated DCV model but only 38.4% under the corresponding MCV model (in which 50% of causal variants are protective). Taking potentially protective alleles into account by including the D factor in the index consistently improved power in MCV models across all numbers of causal variants and effect sizes; for example, it increases the probability of detecting the association from 38.4 to 45.2% in the situation above. However, almost twice as many variants must then be genotyped, although one can certainly employ different thresholds for potentially deleterious and protective alleles to reduce cost. When there truly are no protective alleles, this approach will not improve power and could even compromise it; for example, adding protective alleles to the index in the DCV model described above reduces power from 52.1 to 49.9%.
We tabulated the false discovery rate (FDR) and false non-discovery rate (FNDR) for various thresholds and causal models (table 6), where the FDR was estimated by the proportion of included variants that are non-causal, and the FNDR by the proportion of excluded variants that are causal. Although the majority of included variants are non-causal (high FDR), we have shown above that these indices nevertheless yield good power to detect regional association. Also tabulated are sensitivity and specificity, estimated by the proportion of causal variants that are included and the proportion of non-causal variants that are excluded, respectively. The relatively low sensitivity and the low causal:non-causal ratio (1:39 for table 6) jointly explain the universally high FDR. Both p value and frequency thresholds achieve lower FDR, lower FNDR, and higher specificity than indices that include all discovered variants. A frequency threshold of 0.05 performs better overall on these four measures than the other criteria in MAF-dependent models, but not in MAF-independent models. This reflects the fact that more rare variants were simulated to be causal in MAF-dependent models, suggesting that a frequency threshold enriches for causal variants only if causality is correlated with MAF.
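These four operating characteristics can be computed directly from per-variant inclusion and causal indicators; a small helper, with names of our own choosing:

```python
def selection_error_rates(included, causal):
    """Per-variant operating characteristics of a selection rule.
    `included` and `causal` are boolean sequences over discovered variants."""
    inc = [bool(i) for i in included]
    cau = [bool(c) for c in causal]
    tp = sum(i and c for i, c in zip(inc, cau))
    fp = sum(i and not c for i, c in zip(inc, cau))
    fn = sum((not i) and c for i, c in zip(inc, cau))
    tn = sum((not i) and (not c) for i, c in zip(inc, cau))
    return {
        "FDR":  fp / (fp + tp) if fp + tp else 0.0,   # non-causal among included
        "FNDR": fn / (fn + tn) if fn + tn else 0.0,   # causal among excluded
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

With a 1:39 causal:non-causal ratio, even a rule with decent sensitivity and specificity yields a high FDR, which is why high FDR and useful regional power can coexist.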
The two-stage design achieves at least comparable power to a one-stage design with a permutation test (table 7). Since the various p value and frequency thresholds behave differently in the two designs, we show only the thresholds that achieve the highest power for each design, along with the criteria that achieve that optimum. For example, with 400 cases and 400 controls for discovery and 600 cases and 600 controls for testing, the two-stage design achieves its best power at a p value threshold of 0.1, whereas a one-stage design with 1,000 cases and 1,000 controls is optimized at a p value threshold of 0.01. In this comparison, the one-stage design outperforms the two-stage design only in MAF-dependent models with high marginal effect sizes (RR95 = 3.0). The two-stage design is thus not only cost-efficient in reducing sequencing costs, but also computationally more efficient when a permutation test is needed to control type I error. Since an analytical framework that selects variants in part of the sample and tests the constructed indices in the remainder applies even when both subsamples are sequenced, this method has the potential to serve as a general analytical framework without requiring a two-stage design.
Our simulation studies have demonstrated that these simple two-stage designs provide unbiased tests of regional association with multiple rare variants and a reasonable balance of true and false discoveries for the specific variants identified in this way, which would be candidates for further replication or molecular characterization studies. We also found that sample sizes of 100–500 subjects were sufficient to yield good power for discovering a substantial portion of causal variants for inclusion in these summary indices, depending upon the overall risk attributable to rare variants. If rare variants are truly more likely than common variants to be causally associated or to have larger risks, then weighted indices like that proposed by Madsen and Browning provide more powerful tests than unweighted tests, while there is little loss of power if the probability of being causal or the relative risks are unrelated to allele frequency. For rare diseases, there appears to be little to be gained by random sampling of controls if comparable general population data (e.g. the 1,000 Genomes Project) are available and no strong environmental risk factor is suspected, but further work is needed to assess whether this remains true if cases derive from populations that are not well represented in the shared control data or if the disease is common. When a strong environmental factor is present for the trait, using matched controls rather than general population controls is expected to substantially improve power. From a practical perspective, differences in sequencing procedures and data preprocessing between case data and shared control data need to be dealt with carefully.
Although some have argued that a high proportion of rare coding variants could be deleterious [5,21] and the simulations of Li and Leal [13,16] have assumed proportions of causal variants starting at 20%, we followed instead the lead of Dickson et al., who considered models with only 1–9 causal variants in a 10-Mb region.
Permutation testing is one solution to the type I error inflation that arises when phenotype information is involved in variant discovery and/or summary index construction. One merit of our two-stage design over a one-stage design is that it retains the correct type I error rate without requiring a permutation test, which significantly reduces the computational burden. We found that our two-stage design achieves power comparable to currently available statistical tools, although both designs have yet to be optimized with more sophisticated variant selection methods and sampling schemes. It has also been shown that the naïve permutation test does not guarantee an unbiased test when there is population stratification. Further investigation of methods for controlling for population stratification in a one-stage framework would be helpful.
A promising extension of the two-stage method is a joint analysis of both the discovery panel and the genotyped sample, which increases the sample size for the selected variants and leverages the information available in both phases. Skol et al. have shown that joint analysis achieves higher power than treating the second-phase data as replication, in a framework where a subset of SNPs selected from a first-phase genotyping study is genotyped in an independent second-phase sample.
In on-going work, we are exploring the use of two-phase case-control designs [23,24] that would allow reuse of a subsample of cases and controls from an existing study for both sequencing and testing using a joint analysis of the full sequence data on the subset and only SNP data on the parent study. More importantly, such designs permit subsampling jointly on disease status and the associated SNPs in a given region from the original GWAS analysis by taking the biased sampling into account in the analysis . Thus, rather than simply using random samples of cases and controls, the selection of samples for sequencing could be targeted to the individuals most likely to be informative for each associated region separately. If gene-environment interactions are considered, one might also stratify on environmental risk factors. Such designs could use either imputation of the untyped variants in the main study using full sequence data from the subsample or from the 1,000 Genomes Project, or could re-genotype selected variants in the main study as considered here; the relative cost-efficiency of these alternatives is still being explored. Appropriate analysis for two-phase case-control samples can overcome the biases from using the same data twice and from biased sampling, but not the bias noted by Li and Leal  from using the same data both for constructing a summary index and for testing association with it.
One might also wish to stratify on family history. Additional information to support a judgment of causality could then be obtained by studying the co-segregation of the disease within families with specific rare variants. Family-based designs would also allow selection of distantly related case pairs for sequencing in families showing evidence of linkage to better identify likely causal variants.
Alternative methods for aggregating rare variants could be integrated into this framework. The variable threshold method described by Price et al. and the step-up approach described by Hoffman et al. could be used for selecting an optimal subset of variants from a discovery panel. The coding scheme for protective variants proposed by Han and Pan results in an index equivalent to the one using the direction factor D that we introduced here.
More sophisticated analysis methods based on Bayesian hierarchical models are also possible. Rather than weighting all variants equally or using weights that depend on a priori fixed functions of allele frequency or other attributes (e.g. coding or not), weights could be estimated empirically by assuming either a simple exchangeable model, a mixture of null and causal distributions, or regression models on prior covariates [29,30]. For example, one could include a latent variable taking the values +1, 0, or −1 for deleterious, null, or protective associations, using MCMC methods to average over the probability distributions of these weights .
The analysis of raw sequence data itself poses challenges in alignment and allele calling. Depending upon the depth of sequencing, there can be substantial uncertainty in allele calls that should also be taken into account in any analysis. We have not attempted to simulate raw sequence fragment data here, basing our simulations instead on the true haplotypes generated by the ms program. For variant discovery, Ionita-Laza and Laird showed that, for a fixed study cost, lower coverage with more samples is preferable to higher coverage with fewer samples. Trade-offs between sample size, depth of sequencing, and region size are among the issues still under investigation in our on-going simulations.
DNA pooling methods attempt to capture more sequence information at the same cost. Wang et al. developed a test for single variants in pooled data, accounting for the influence of high sequencing error rates. Kim et al. proposed a likelihood ratio statistic and observed in simulation that prioritizing pool size over depth of coverage achieved higher power.
In summary, appropriate design and analysis of resequencing studies can provide a valid and cost-efficient way of testing for the existence of rare variants that could account for associations with SNPs discovered in a GWAS. Once such potentially causal variants have been detected and their aggregate association with disease tested, much remains to be done to replicate the associations of the specific variants in other samples and characterize their effects in functional studies. Meanwhile, further research on the optimal design and analysis of such studies would be useful.
This work was supported by NIH grants ES15090 and GM 069890. The authors are grateful to Drs. Paul Marjoram and David Conti for many helpful discussions.