|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Whole-genome sequencing (WGS) allows direct interrogation of previously undetected uncommon or rare variants, which potentially contribute to the missing heritability of human disease. However, cost of sequencing large numbers of samples limits its application in case–control association studies. Here, we describe theoretical and empirical design considerations for such sequencing studies, aimed at maximizing the power of detecting association under the constraint of study-wide cost.
Results: We consider two cost regimes. First, assuming cost is proportional to the total amount of base pairs to be sequenced across all samples, which is a practical model for whole-genome sequencing, we explored the tradeoff in terms of study power between increasing the number of subjects and increasing depth coverage. We demonstrate that the optimal power of detecting association is achieved at medium depth coverage under a wide range of realistic conditions for case-only sequencing designs. Second, if cost is fixed per sample, which is approximately the case in exome sequencing, we show that in a simple case+control sequencing study, the optimal design should include cases totaling 1/e of all subjects.
Availability: A web tool implementing the methods is available at http://www.cs.columbia.edu/~itsik/OPERA/.
Contact: yshen/at/c2b2.columbia.edu; itsik/at/cs.columbia.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome-wide association studies (GWAS) have greatly improved our understanding of the genetics of human diseases (Altshuler et al., 2008). However, for the majority of common diseases, substantial fractions of heritability remain to be explained. Rare variants, which are under the radar of GWAS, are suggested to have significant contributions to missing heritability (Eichler et al., 2010; Frazer et al., 2009; Yang et al., 2010).
Complete ascertainment of rare variants requires genome-wide resequencing, which is now feasible thanks to the development of next-generation sequencing technologies (Bentley et al., 2008). Exome sequencing (Ng et al., 2009) and whole-genome sequencing (WGS) (Lupski et al., 2010) have been successfully applied to identify rare causal variants of Mendelian diseases. In these studies, a causal gene was implicated if it harbors rare functional mutations that are shared among cases but absent in controls or single nucleotide polymorphism (SNP) databases. Extending this approach to association studies of complex diseases requires more than a handful of samples, because the causal variants of such conditions only have statistical effect (Cohen et al., 2006). Therefore, it is necessary to carefully consider the design of such studies under the constraint of limited resources.
A critical technical attribute of WGS data is the number of times each site is observed in a sequencing read, usually termed the depth-of-coverage. Depth-of-coverage is a key determinant of quality for sequencing information, and in particular, w.r.t. rare variants. Accuracy and completeness of detection and calling such variants from sequencing depend on high depth-of-coverage, due to the randomness of read placement (Lander and Waterman, 1998) and non-negligible error rate (Mckernan et al., 2009). Higher depth coverage for variant discovery and genotyping is attainable by using more sequencing resources (e.g. Illumina lanes) per sequenced sample. However, given a fixed budget, more sequencing per sample is in conflict with analyzing a large sample size, which is required for adequate statistical power of detecting associations. Here, we propose to resolve this tradeoff by finding the design that maximizes the power for detecting associations of rare variants. We define rare variants as the ones with minor allele frequency (MAF) <1% in the population, and uncommon variants as the ones with MAF between 1 and 5%. Here, we use the term rare variants to include both rare and uncommon variants.
We first assume a cost regime where the overhead cost of a subject is negligible comparing to that of the sequenced bases. Under fixed total cost and therefore affordable total bases (T) is fixed, genome-wide average depth coverage λ is determined by the number of cases to be sequenced (N1) : λ=T/N1. For a rare variant, we assume homozygous carriers are rare enough to be ignored. The power of correctly calling a heterozygous carrier from sequencing data is a function of λ : R(λ ), usually with an approximate sigmoid. It can be determined from empirical data (Supplementary Fig. S1) or approximated analytically (Wendl and Wilson, 2009; Wheeler et al., 2008).
Allelic heterogeneity is likely to be ubiquitous among genes that harbor causal rare variants (Bodmer and Bonilla, 2008; Pritchard, 2001). Grouping rare variants according to genes or pathways they disrupt for association testing (Price et al., 2010) is necessary to increase power. For simplicity, we consider a gene as disrupted in a sample if it harbors any of the causal variants, and use the collapsing method (Li and Leal, 2008) to test the association.
In the presence of allelic heterogeneity, we assume there are M rare causal variants within a gene (or pathway). These alleles are independently distributed in the population, with respective carrier frequencies h1,…, hM among cases. The carrier status for any of these rare variants is a Bernoulli variable with compound carrier frequency p=R(λ)(1−i=1m(1−hi)) in cases. Let F1∑i=1Mhi, then p≈F1R(λ) when hi<<1, and the number of observed carriers K1 among N1 cases follows a binomial distribution with parameters (N1, F1R(λ)). An economic design is to sequence cases only and utilize publicly available data for additional N0 samples as controls. For example, we use N0=400, which represents a lower bound of the sample size from one major population of the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2010). Given T, N0, F1 and R(λ), we provide an online tool (OPERA) that can calculate the power of detecting association from different values of N1 (and λ) under certain Type I error cutoff (Q). As an example, we show the results with resources of T=1000× genomes under a simplistic assumption that the number of carriers in controls (K0) is 0 (Fig. 1). The power curve takes the unusual shape of a sawtooth wave that reflects discrete artifacts (Supplementary Material) of the test statistics for association of rare variants (Supplementary Figs S2 and S3), superimposed on a smoothed concave curve through the respective tips of the sawteeth. Moving along the x-axis, the smoothed curve initially increases with N1 when N1 is relatively small, reflecting the fact that when N1 is small, λ is large, therefore R(λ) is close to 1 and changes very little as N1 increases. For larger sample sizes and lower coverage, where R(λ) is away from 1, increasing N1 further reduces λ and R(λ) sufficiently to negate the benefit of increasing N1, thus power starts to decrease from one sawtooth tip to the next. We present similar results with K0=1 (Supplementary Fig. S4a), which can represent singleton variants in controls sequenced at low coverage. In general, for a gene with compound carrier frequency F0 of rare functional variants in controls, the expected power is the sum of power under possible K0 values weighted by the probability of observing K0. We show the power curve with hypothetical F0=0.001 (Supplementary Fig. S4b).
The other cost regime is where per-sample overhead is large enough to be approximated as fixed cost regardless of coverage, which is approximately the case for practical exome sequencing. The budget then translates to a total number of subjects to be sequenced. The remaining decision for the investigator is what fraction of these subjects should be cases versus controls. While variants may be partially penetrant, either causal or protective (Neale et al., 2011; Yi and Zhi, 2011), a simplifying assumption of gene disruptions to be fully penetrant and causal is helpful both in practical exome sequencing (Ng et al., 2009) as well as for theoretical analysis: we prove (Supplementary Material) that in such scenarios the optimal ratio of cases to all samples is 1 : e (e is the base of the natural logarithm). If disruptions are to be sought in both cases and control without bias, this asymmetry breaks, and naturally one needs to sequence a similar number of subjects from either group.
We discussed optimal designs of association studies focused on rare variants using high-throughput resequencing under budget constraint. Under a scenario where the cost is proportional to the total number of sequenced bases and only sequencing cases, there are two important characteristics of the optimal design under a constant budget. First, the curve of power versus the number of cases is not smooth, but resembles a sawtooth. Second and more importantly, smoothing the sawtooth tips, the maximum power of detecting associations is achieved at a medium coverage (λ) where the power of calling heterozygous variants R(λ) from sequencing data is suboptimal. Under a second scenario where the cost is proportional to the number of subjects, when assuming sequencing both cases and controls for detecting case-only disruptions, the optimal fraction of cases among sequenced subjects is 1/e.
Our analysis of the optimal design assumes using simple collapsing method to test the association of rare variants in the presence of allelic heterogeneity. Although some other methods are better powered under certain conditions (Madsen et al., 2009; Neale et al., 2011; Price et al., 2010; Yi and Zhi, 2011), the collapsing method is often the first statistical test researchers conduct after getting new data. Moreover, it has been showed to be empirically adequate in practice when combined with appropriate functional assessment of the variants (Cohen et al., 2006; Surolia et al., 2010). Our power calculator assumes this test following to traditional calculations that typically focus on a basic scenario and test, rather than a full probabilistic model.
We have implemented a web tool for computing the power under a flexible set of assumptions considering the general cost regime, where both a fixed, per-sample cost and a variable, per coverage price are accrued. This tool could provide investigators means to plan experiments in the practical regimes of budgets and technologies they are facing within their institutions.
Funding: National Science Foundation CAREER 0845677 (to I.P.); National Institute of Health U54CA121852 (to I.P.); International Serious Adverse Event Consortium (to Y.S.).
Conflict of Interest: none declared.