Our simulation studies have demonstrated that these simple two-stage designs provide unbiased tests of regional association with multiple rare variants, together with a reasonable balance of true and false discoveries among the specific variants identified, which would be candidates for further replication or molecular characterization studies. We also found that sample sizes of 100–500 subjects were sufficient to yield good power for discovering a substantial portion of causal variants for inclusion in these summary indices, depending upon the overall risk attributable to rare variants. If rare variants are truly more likely than common variants to be causally associated, or tend to have larger risks, then weighted indices like that proposed by Madsen and Browning [15] provide more powerful tests than unweighted ones, while little power is lost if the probability of being causal or the relative risks are unrelated to allele frequency. For rare diseases, there appears to be little to gain from random sampling of controls if comparable general population data (e.g. the 1,000 Genomes Project [20]) are available and no strong environmental risk factor is suspected, but further work is needed to assess whether this remains true if cases derive from populations that are not well represented in the shared control data or if the disease is common. When a strong environmental risk factor is present for the trait, using matched controls rather than general population controls is expected to substantially improve power. From a practical perspective, differences in sequencing procedures and data preprocessing steps between case data and shared control data need to be dealt with carefully.
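As a concrete illustration of such a weighted index, the following sketch (in Python, on hypothetical toy genotypes; not the code used in our simulations) computes Madsen and Browning-style weights from the control sample and compares per-subject burden scores between cases and controls with a rank-sum statistic.

```python
import numpy as np

def mb_weights(geno_ctrl):
    # Madsen-Browning-style weights, 1 / sqrt(n * q * (1 - q)), where q is
    # the control allele frequency smoothed by a pseudo-count so that
    # unobserved alleles do not produce infinite weights.
    n = geno_ctrl.shape[0]
    q = (geno_ctrl.sum(axis=0) + 1) / (2 * n + 2)
    return 1.0 / np.sqrt(n * q * (1 - q))

def weighted_burden(geno, weights):
    # Per-subject summary index: weighted count of rare alleles.
    return geno @ weights

# toy data: rows = subjects, columns = variants, entries = 0/1/2
rng = np.random.default_rng(0)
cases = rng.binomial(2, 0.02, size=(200, 30))
ctrls = rng.binomial(2, 0.01, size=(200, 30))

w = mb_weights(ctrls)
scores = np.concatenate([weighted_burden(cases, w),
                         weighted_burden(ctrls, w)])
# rank-sum comparison of case vs control scores (normal approximation)
ranks = scores.argsort().argsort() + 1
n1, n2 = len(cases), len(ctrls)
u = ranks[:n1].sum() - n1 * (n1 + 1) / 2
z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print("rank-sum z =", z)
```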
Although some have argued that a high proportion of rare coding variants could be deleterious [5] and the simulations of Li and Leal [13] assumed proportions of causal variants starting at 20%, we followed instead the lead of Dickson et al. [19], who considered models with only 1–9 causal variants in a 10-Mb region.
Permutation testing is one solution to the inflation of type I error that arises when phenotype information is used in variant discovery and/or construction of the summary index. One merit of our two-stage design over a one-stage design is that it retains the correct type I error without requiring a permutation test, which substantially reduces the computational burden. We found that our two-stage design achieves power comparable to currently available statistical tools, although both designs have yet to be optimized with more sophisticated variant selection methods and sampling schemes. It has also been shown that the naïve permutation test does not guarantee an unbiased test when there is population stratification; further investigation of methods for controlling for population stratification within a one-stage framework would be helpful.
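To make the source of the inflation concrete, the hypothetical Python sketch below repeats the entire selection-plus-testing pipeline for every permutation of the phenotype labels; permuting only at the final testing stage, after variants have been selected on the observed phenotype, would leave the test anti-conservative.

```python
import numpy as np

def select_and_test(geno, pheno):
    # One-stage pipeline: pick variants enriched in cases, then test the
    # resulting burden score. Because selection uses the phenotype, the
    # naive p-value for the final test alone is anti-conservative.
    case_f = geno[pheno == 1].mean(axis=0)
    ctrl_f = geno[pheno == 0].mean(axis=0)
    chosen = case_f > ctrl_f                    # phenotype-driven selection
    burden = geno[:, chosen].sum(axis=1)
    return burden[pheno == 1].mean() - burden[pheno == 0].mean()

def permutation_pvalue(geno, pheno, n_perm=1000, seed=1):
    # Re-run the whole selection-plus-test pipeline on permuted labels so
    # the null distribution reflects the selection step as well.
    rng = np.random.default_rng(seed)
    obs = select_and_test(geno, pheno)
    null = [select_and_test(geno, rng.permutation(pheno))
            for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (n_perm + 1)

# toy usage with null data: the p-value is approximately uniform
rng = np.random.default_rng(0)
geno = rng.binomial(2, 0.02, size=(400, 50))
pheno = np.repeat([1, 0], 200)
print(permutation_pvalue(geno, pheno, n_perm=200))
```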
A promising extension of the two-stage method is a joint analysis of both the discovery panel and the genotyped sample, which increases the sample size for the selected variants and leverages the available information in both phases. Skol et al. [22] have shown that joint analysis achieves higher power than treating the second-phase data as replication, in a framework where a subset of SNPs selected from a first-phase genotyping study is genotyped in an independent second-phase sample.
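A minimal sketch of the combination step, assuming independent standard-normal statistics from the two stages and sample-size-proportional weights (Skol et al. additionally adjust the significance threshold for the stage-1 selection, which is omitted here):

```python
from math import sqrt

def joint_z(z1, n1, z2, n2):
    # Combine independent stage-1 and stage-2 z statistics with
    # sample-size weights; under the null the combination is again
    # standard normal because w1**2 + w2**2 == 1.
    w1 = sqrt(n1 / (n1 + n2))
    w2 = sqrt(n2 / (n1 + n2))
    return w1 * z1 + w2 * z2

# a replication-style analysis would use only z2; the joint statistic
# also recovers the information in the discovery panel
print(joint_z(1.8, 400, 2.1, 1600))  # hypothetical inputs
```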
In on-going work, we are exploring the use of two-phase case-control designs [23] that would allow a subsample of cases and controls from an existing study to be reused for both sequencing and testing, using a joint analysis of the full sequence data on the subset and only the SNP data on the parent study. More importantly, such designs permit subsampling jointly on disease status and the associated SNPs in a given region from the original GWAS, with the biased sampling taken into account in the analysis [25]. Thus, rather than simply using random samples of cases and controls, the selection of samples for sequencing could be targeted to the individuals most likely to be informative for each associated region separately. If gene-environment interactions are considered, one might also stratify on environmental risk factors. Such designs could either impute the untyped variants in the main study, using full sequence data from the subsample or from the 1,000 Genomes Project, or re-genotype selected variants in the main study as considered here; the relative cost-efficiency of these alternatives is still being explored. Appropriate analysis of two-phase case-control samples can overcome the biases from using the same data twice and from biased sampling, but not the bias noted by Li and Leal [16] from using the same data both for constructing a summary index and for testing association with it.
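The key correction for biased subsampling can be illustrated with inverse-probability (Horvitz-Thompson-style) weighting; the toy Python sketch below assumes the stratum-specific sampling fractions are known by design. Real two-phase analyses use more efficient likelihood-based estimators [23, 25].

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
pheno = rng.binomial(1, 0.3, N)             # phase-1 case/control status
snp = rng.binomial(2, 0.2, N)               # associated GWAS SNP
# hypothetical design: oversample cases carrying the risk allele
p_sel = np.where((pheno == 1) & (snp > 0), 0.8, 0.1)
sampled = rng.random(N) < p_sel
# rare-variant carriage, observed by sequencing only where sampled (toy)
carrier = rng.binomial(1, 0.02 + 0.03 * pheno)

w = 1.0 / p_sel                             # inverse-probability weights
for y in (1, 0):
    m = sampled & (pheno == y)
    freq = np.sum(w[m] * carrier[m]) / np.sum(w[m])
    print("weighted carrier frequency, pheno =", y, ":", round(freq, 4))
```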
One might also wish to stratify on family history. Additional information to support a judgment of causality could then be obtained by studying the co-segregation of the disease within families with specific rare variants. Family-based designs would also allow selection of distantly related case pairs for sequencing in families showing evidence of linkage to better identify likely causal variants.
Alternative methods for aggregating rare variants could be integrated into this framework. The variable threshold method described by Price et al. [26] and the step-up approach described by Hoffman et al. [27] could be used for selecting an optimal subset of variants from a discovery panel. The coding scheme for protective variants proposed by Han and Pan [28] results in an index equivalent to the one using a (−1)^D term that we introduced here.
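In code, the sign flip amounts to the following sketch (Python, with a hypothetical rule that sets the indicator D_j from discovery-panel frequencies):

```python
import numpy as np

def signed_burden(geno, disc_cases, disc_ctrls):
    # D_j = 1 if variant j looks protective in the discovery panel
    # (more frequent in controls than cases), else 0; the (-1)**D factor
    # flips its sign so risk and protective variants do not cancel in
    # the summary index.
    d = disc_ctrls.mean(axis=0) > disc_cases.mean(axis=0)
    signs = (-1.0) ** d
    return geno @ signs
```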
More sophisticated analysis methods based on Bayesian hierarchical models are also possible. Rather than weighting all variants equally or using weights that depend on a priori fixed functions of allele frequency or other attributes (e.g. coding or not), weights could be estimated empirically by assuming a simple exchangeable model, a mixture of null and causal distributions, or regression models on prior covariates [29]. For example, one could include a latent variable taking the values +1, 0, or −1 for deleterious, null, or protective associations, using MCMC methods to average over the probability distributions of these weights [31].
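As a toy illustration of the latent-label idea (not the hierarchical models of [29] or [31]), the sketch below Gibbs-samples a label z_j in {−1, 0, +1} per variant under a logistic disease model, holding the intercept and effect size fixed purely to keep it short; a full analysis would sample those parameters as well.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_latent_labels(geno, y, a=-2.0, b=0.5,
                        prior=(0.25, 0.5, 0.25), n_iter=500, seed=3):
    # z_j in {-1, 0, +1} codes protective / null / deleterious; the
    # intercept a, effect size b, and prior are fixed hyperparameters
    # here, which is a deliberate simplification.
    rng = np.random.default_rng(seed)
    p = geno.shape[1]
    z = np.zeros(p)
    values = np.array([-1.0, 0.0, 1.0])
    log_prior = np.log(prior)
    draws = np.zeros((n_iter, p))
    for t in range(n_iter):
        for j in range(p):
            logp = np.empty(3)
            for k, v in enumerate(values):
                z[j] = v
                mu = expit(a + b * (geno @ z))
                logp[k] = log_prior[k] + np.sum(
                    y * np.log(mu) + (1 - y) * np.log(1 - mu))
            probs = np.exp(logp - logp.max())
            z[j] = rng.choice(values, p=probs / probs.sum())
        draws[t] = z
    return draws.mean(axis=0)   # posterior mean label per variant

# toy usage: variant 0 deleterious, variant 1 protective by construction
rng = np.random.default_rng(0)
g = rng.binomial(2, 0.02, size=(300, 8)).astype(float)
y = rng.binomial(1, expit(-2.0 + 0.5 * (g[:, 0] - g[:, 1])))
print(gibbs_latent_labels(g, y, n_iter=200))
```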
The analysis of raw sequence data itself poses difficult challenges in alignment and allele calling. Depending upon the depth of sequencing used, there can be substantial uncertainty in allele calls that should also be taken into account in any analysis. We have not attempted to simulate raw sequence fragment data here, basing our simulations instead on the true haplotypes generated by the ms program. For variant discovery purposes, Ionita-Laza and Laird [32] showed that lower coverage with more samples is preferable to higher coverage with fewer samples for a fixed study cost. Trade-offs between sample size, depth of sequencing, and region size are among the issues still under investigation in our on-going simulations.
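One simple way to propagate allele-call uncertainty into the summary index is to replace hard genotype calls with expected dosages computed from the caller's genotype posteriors, as in this hypothetical sketch:

```python
import numpy as np

def expected_dosage(geno_probs):
    # geno_probs: array of shape (subjects, variants, 3) holding the
    # caller's posterior probabilities for genotypes 0/1/2; the dosage
    # E[G] carries the call uncertainty forward instead of hard-calling.
    return geno_probs @ np.array([0.0, 1.0, 2.0])

def burden_from_dosage(geno_probs, weights):
    # Weighted burden score built on expected dosages.
    return expected_dosage(geno_probs) @ weights
```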
DNA pooling methods attempt to capture more sequence information at the same cost. Wang et al. [33] developed a single-variant test for pooled data that accounts for high sequencing error rates. Kim et al. [34] proposed a likelihood ratio statistic and observed in simulations that prioritizing pool size over depth of coverage achieved higher power.
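The flavor of such a likelihood ratio test can be sketched with a simplified binomial read-count model with a single error rate (in the spirit of, but not identical to, the statistic of Kim et al. [34]):

```python
from math import lgamma, log

def binom_logpmf(k, n, p):
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def pool_lrt(k, reads, n_pool, err=0.01):
    # Likelihood-ratio statistic for "some alternate allele in the pool"
    # vs "none", allowing each read to be miscalled with probability err.
    # m = number of alternate chromosomes among the 2 * n_pool in the pool.
    def read_p(m):
        f = m / (2 * n_pool)
        return f * (1 - err) + (1 - f) * err
    ll0 = binom_logpmf(k, reads, read_p(0))
    ll1 = max(binom_logpmf(k, reads, read_p(m))
              for m in range(0, 2 * n_pool + 1))
    return 2 * (ll1 - ll0)

# e.g. 12 alternate reads out of 300 in a pool of 20 individuals
print(pool_lrt(12, 300, 20))
```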
In summary, appropriate design and analysis of resequencing studies can provide a valid and cost-efficient way of testing for the existence of rare variants that could account for associations with SNPs discovered in a GWAS. Once such potentially causal variants have been detected and their aggregate association with disease tested, much remains to be done to replicate the associations of the specific variants in other samples and characterize their effects in functional studies. Meanwhile, further research on the optimal design and analysis of such studies would be useful.