Next Generation Sequencing has emerged as a powerful tool for detecting rare variants and their associations with human diseases and traits. In this manuscript, we discussed how to choose the coverage depth for a study using NGS technology. Many of the statistical techniques used here were originally developed to examine the extent of overlap and coverage needed when genomes were being mapped by “fingerprinting,” where overlapping clones from recombinant libraries were needed to piece together the genome [Lander and Waterman, 1988; Siegel et al., 2000; Wendl and Barbazuk, 2005; Wendl and Waterston, 2002]. We started by showing that the depth of coverage varies greatly across the genome, especially when performing targeted sequencing. Therefore, even when the average depth is high, a large number of positions can still have relatively low coverage. To avoid sparse coverage in difficult-to-sequence regions, we need to raise the average above that suggested by the simpler models. Statistically, we suggested that the depth of coverage follows a negative binomial distribution, as opposed to the simpler Poisson distribution. The extent of deviation from the Poisson ideal provides a measure of the quality of the sequencing. In this regard, the shape parameter, which describes that deviation, is an important characteristic to consider when deciding between NGS technologies.
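As a rough illustration of why the deviation from the Poisson ideal matters, the sketch below (not from the manuscript; the mean depth, shape parameter k, and low-coverage threshold are illustrative assumptions) compares the fraction of low-coverage bases under a Poisson model and a negative binomial model with the same mean.

```python
# Minimal sketch: fraction of bases with low coverage under a Poisson model
# versus a negative binomial model with the same mean depth.
# The mean depth, shape parameter k, and threshold are illustrative assumptions.
from scipy.stats import poisson, nbinom

mean_depth = 30.0   # assumed average depth of coverage
k = 2.0             # assumed shape (dispersion) parameter; smaller k = more variable depth
threshold = 10      # call a base "low coverage" if fewer than 10 reads align to it

# Negative binomial with mean mu and shape k: size = k, p = k / (k + mu)
nb = nbinom(k, k / (k + mean_depth))

print("P(depth < %d), Poisson:           %.4f" % (threshold, poisson.cdf(threshold - 1, mean_depth)))
print("P(depth < %d), negative binomial: %.4f" % (threshold, nb.cdf(threshold - 1)))
```

With the same average depth, the over-dispersed negative binomial model leaves a far larger fraction of positions below the threshold, which is the reason the average must be raised above what the Poisson calculation alone would suggest.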
The coverage depth needed to identify a variant allele with high specificity was surprisingly large. The exact choice of depth depends on multiple considerations, such as the desired α-level, the quality of the technology, and the read error rate. If we are only looking for variants at a pre-specified set of loci, perhaps those positions already known to be polymorphic, we might allow a relatively large α-level, 1/1000 or even 1/100. At such rates, a coverage depth around 10 might be sufficient when using the best technologies. Another consideration is the quality of the technology: when the shape parameter is small and depth is highly variable, we would need to increase the coverage depth. Our analyses were for stand-alone calling algorithms. Other algorithms, specifically those that consider linkage disequilibrium, will likely require lower depth. Results from the 1000 Genomes Project data should help show how such advanced methods can augment power.
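To make the depth requirement concrete, the following sketch (an illustrative calculation, not the manuscript's calling rule; the error rate and α-level are assumptions) finds, for a given depth, the smallest minor-allele read count that keeps the false-positive rate at or below α, and then reports the power to detect a true heterozygote at that depth.

```python
# Minimal sketch: for a fixed depth and per-read error rate e, find the smallest
# number of minor-allele reads c such that calling a variant whenever >= c such
# reads are seen keeps the false-positive rate at or below alpha, then report
# the power to detect a true heterozygote (minor-allele read fraction ~ 0.5).
from scipy.stats import binom

def het_detection_power(depth, error_rate=0.01, alpha=1e-3):
    for c in range(depth + 1):
        # false-positive rate of the rule "call if >= c minor-allele reads"
        if binom.sf(c - 1, depth, error_rate) <= alpha:
            # power for a true heterozygote: P(Binom(depth, 0.5) >= c)
            return c, binom.sf(c - 1, depth, 0.5)
    return None, 0.0  # no threshold controls the error rate at this depth

for depth in (10, 20, 30):
    c, power = het_detection_power(depth)
    print(f"depth {depth}: call threshold {c}, heterozygote power {power:.3f}")
```

At an α-level of 1/1000 and a 1% read error rate, a depth near 10 already gives high power to detect a heterozygote at a pre-specified locus, consistent with the rough figure quoted above; a stricter α-level or a higher error rate pushes the required depth upward.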
When trying to detect a new rare variant within a population, the desirable coverage depth, with realistic parameters, ranges between 2 and 8 reads, with the exact depth depending heavily on the acceptable false-positive rate. With too few reads, it could be difficult to determine whether a few variant alleles, scattered across all subjects, occurred by error. With too few individuals, there would be a non-negligible chance that none of the individuals carried the rare variant. Therefore, the depths that perform well balance these two extremes. Our suggested coverage depth is similar to the depth, 4–6×, chosen for the 1000 Genomes Project and to Wendl’s approximation of 3.6.
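The trade-off can be sketched as follows (a deliberately simplified model with assumed values for the read budget, allele frequency, and calling threshold, and ignoring read errors): under a fixed total number of reads, very low depth risks missing the variant in the carriers who are sequenced, while very high depth leaves too few subjects to include a carrier at all.

```python
# Minimal sketch, assumed simple model (no read errors): under a fixed total
# number of reads, trade per-individual depth against sample size when
# screening for a rare variant with population allele frequency f.
from scipy.stats import binom

total_reads = 4000      # assumed fixed sequencing budget (reads per base, summed over subjects)
f = 0.005               # assumed frequency of the rare variant allele
min_reads_to_call = 2   # require at least 2 minor-allele reads to call a carrier

for depth in (2, 4, 8, 16):
    n_subjects = total_reads // depth
    # a heterozygous carrier yields a minor-allele read with probability ~0.5 per read
    p_detect_carrier = binom.sf(min_reads_to_call - 1, depth, 0.5)
    p_het = 2 * f * (1 - f)                     # probability a subject is a heterozygote
    p_per_subject = p_het * p_detect_carrier    # subject is a detected carrier
    p_discovery = 1 - (1 - p_per_subject) ** n_subjects
    print(f"depth {depth:2d}, subjects {n_subjects:4d}: discovery probability {p_discovery:.3f}")
```

Under these assumed parameters the discovery probability peaks at intermediate depths and falls off once the depth is raised at the expense of sample size, mirroring the balance described above.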
In contrast with the test for rare variants, the power of the association test is maximized by including as many cases as possible. In studies with secondary analyses, adjustment for confounders, or a planned follow-up, increasing the depth to identify heterozygous individuals may be necessary. However, the consequences of such increases can be severe if they require a reduction in sample size. Li and Leal also discuss the need for large sample sizes when detecting rare haplotypes with frequency <1% [Li and Leal, 2008].
An association test requires genotype calling optimized for that purpose. Although it is an extreme example, consider using a calling algorithm that requires 50 reads before making a call. Obviously, with such a rule, a low average depth will perform poorly. Therefore, for a fair comparison, we should use the optimal calling rule, or alternatively a rule that maximizes the power of the subsequent association test. As no such rule had been discussed previously in the literature, we derived it, in a general and broadly applicable form, in the Appendix.
To understand why maximizing the number of subjects is optimal for association studies, consider the simple case where there are no read errors and all SNPs with at least one read of the minor allele are called as Ĝij = 1. As a standard rule, power is determined by the number of events, which, in this case, is the number of called heterozygotes. This number is higher for 2n subjects with 1 read per base than for n subjects with 2 reads per base. Let p be the true proportion of heterozygotes. With 1 read per base, only 50% of the Gij = 1 individuals have a read containing the minor allele, so we would expect 2n × p × 0.5 individuals to have Ĝij = 1. In the alternative design, with 2 reads per base, 75% of those individuals will have at least one read with the minor allele, so we would expect n × p × 0.75 individuals to have Ĝij = 1. Note that the gain in accuracy does not offset the loss in the number of subjects, since n × p × 0.75 < 2n × p × 0.5. See the supplementary material for additional discussion.
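A short calculation makes the comparison explicit (the values of n and p below are illustrative, not estimates from the manuscript):

```python
# Minimal sketch of the worked example above, assuming p is the true proportion
# of heterozygotes and ignoring read errors: expected number of called
# heterozygotes for 2n subjects at 1 read per base versus n subjects at 2 reads.
n = 1000    # assumed sample size for the 2-read design (the 1-read design has 2n subjects)
p = 0.05    # assumed proportion of heterozygotes

# with 1 read, a heterozygote shows the minor allele with probability 0.5;
# with 2 reads, at least one minor-allele read appears with probability 0.75
called_1x = 2 * n * p * 0.5    # = n * p
called_2x = n * p * 0.75       # = 0.75 * n * p

print(f"expected called heterozygotes, 2n subjects at 1 read: {called_1x:.1f}")
print(f"expected called heterozygotes,  n subjects at 2 reads: {called_2x:.1f}")
```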
There are important limitations to our conclusions. First, we assume a constant error rate, regardless of MAF or the number of minor alleles detected. Because calling algorithms often use linkage disequilibrium to aid calling, per-base error rates can decrease as the sample size increases. One consequence is that the depth needed to detect a heterozygous allele within an individual actually depends on the total sample size. Second, we have focused only on SNPs. Because of the increased difficulty in detecting structural variants, such as copy number variation, insertions, inversions and translocations [Feuk et al., 2006], the optimal depth for detecting this type of variation may need to exceed 50 when using the discordant read pairs method [Wendl and Wilson, 2009a]. However, as technology improves, reads become longer, and read error rates decrease, the necessary depth should decrease. Third, when considering the error rate at a single location, our model assumes that all errors produce the same allele. If errors were instead random, then the potential for observing enough of any single allele to call a specific variant allele would decrease. Fourth, we only take sequencing costs into consideration. Other costs, such as data collection and sample preparation, were omitted from our cost structure.
As the cost of NGS falls, its use will continue to increase, and thus it will be necessary to optimize designs to efficiently discover and validate variants that map to human diseases and traits. Sequencing at too low a depth can negatively impact variant discovery, while sequencing too few individuals can negatively impact the detection of associations. Overall, it is important to balance sample size and coverage depth in the context of the available resources for NGS studies.