Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3291790

Formats

Article sections

Authors

Related links

Genet Epidemiol. Author manuscript; available in PMC 2012 September 2.

Published in final edited form as:

PMCID: PMC3291790

NIHMSID: NIHMS330774

Joshua Sampson,^{1,}^{*} Kevin Jacobs,^{2} Meredith Yeager,^{2} Stephen Chanock,^{2} and Nilanjan Chatterjee^{1}

The publisher's final edited version of this article is available at Genet Epidemiol

See other articles in PMC that cite the published article.

Next Generation Sequencing represents a powerful tool for detecting genetic variation associated with human disease. Because of the high cost of this technology, it is critical that we develop efficient study designs that consider the trade-off between the number of subjects (*n*) and the coverage depth (μ). How we divide our resources between the two can greatly impact study success, particularly in pilot studies. We propose a strategy for selecting the optimal combination of *n* and μ for studies aimed at detecting rare variants and for studies aimed at detecting associations between rare or uncommon variants and disease. For detecting rare variants, we find the optimal coverage depth to be between 2 and 8 reads when using the likelihood ratio test. For association studies, we find the strategy of sequencing all available subjects to be preferable. In deriving these combinations, we provide a detailed analysis describing the distribution of depth across a genome and the depth needed to identify a minor allele in an individual. The optimal coverage depth depends on the aims of the study, and the chosen depth can have a large impact on study success.

In order to capitalize on the rapid advancement of Next Generation Sequencing (NGS) technologies [Metzker, 2010; Shendure and Ji, 2008], investigators have initiated a range of studies designed to capture the spectrum of genetic variants underlying human diseases and traits. In particular, there has been an emphasis on exploring rare variants (minor allele frequency (MAF) <1%), and uncommon variants (MAF between 1 and 10%). As expected, initial reports herald the improvements in mapping classical Mendelian disorders within pedigrees and studying more complex, oligogenic diseases. So far, NGS studies have already identified DNA variants responsible for Kabuki syndrome [Ng et al., 2010b], Miller Syndrome [Ng et al., 2010a], acute myeloid leukaemia [Ley et al., 2008] and metachondromatosis [Sobreira et al., 2010]. As these findings exemplify the power of the technology, we must now consider the refinements needed to efficiently design and conduct future studies.

NGS technologies represent major advances in the scope of sequencing, but are still susceptible to platform-specific chemistry, imaging and PCR incorporation errors. As a result, the error rate per sequenced base is often as high as 1–2%. Between this high error rate and the need to identify the alleles on both chromosomes, each position must be covered by multiple reads to ensure that variation is accurately classified [Harismendy et al., 2009]. As the cost of the study will increase with the total number of reads, the minimal coverage depth necessary for identifying variant, or rare, alleles should be determined. Moreover, when the choice is between increasing coverage depth and increasing the number of individuals, it is important to determine the depth that maximizes power.

We start by examining the relationship between the average coverage depth and the proportion of rare alleles that can be detected within an individual. Since common practice is to determine average depth, as opposed to the depth at each base, the proportion detected depends heavily on the distribution of coverage across the genome. We discuss this distribution, using data from the 1000 Genomes Project [Kuehn, 2008; The 1000 Genome Project Consortium, 2010], and show that a single *shape* parameter can provide a new measure of quality for NGS technology. Our discussion is applicable to currently available calling methods [Bansal et al., 2010; Horner et al., 2010; Quinlan et al., 2008; Shen et al., 2010].

Next, we consider the goal of identifying rare variants within a sample of unrelated individuals from a given population. Using an illustrative model, in which the sequencing cost can be considered as a product of the number of subjects and the average coverage depth, we discuss the trade-off between the two. This model has simplified the cost structure by ignoring other elements such as DNA extraction, library preparation, and bioinformatic analysis, but keeps the essential point, that increasing either sample size size or depth often must be accompanied by a decrease in the other. This trade-off has also been considered by other groups, including the 1000 Genomes project [Kaiser, 2008; Kuehn, 2008; Siva, 2008], Bhangale et al. [2008] and Wendl and Wilson [2008, 2009a,b]. The effects of changing the parameters of the study, such as read error rate, shape parameters and statistical test, are explored when trying to identify a suitable combination of sample size and coverage depth to detect variants. When designing a case/control study to identify associations between SNPs and a disease, sample size must also be balanced against coverage depth. Therefore, we offer a method for determining a combination that can maximize the power for detecting influential SNPs. This discussion extends previous work that identified the sample sizes needed to detect rare variants for association studies without considering error [Li and Leal, 2010].

Let individual *i* have *K _{ij}* reads covering position

- A1The number of reads,
*K*, is distributed as a poisson variable with mean μλ_{ij}_{j}. - A2λ
_{j}is distributed as a gamma variable with shape parameter ζand scale parameter 1/ζ. - A3Given an individual’s genotype, the probability of observing a variant allele on a read covering
*j*follows the bernoulli distribution with mean*p*_{MA}(often,*p*_{MA}= 0.5) when*G*= 1,_{ij}*r*, the read error rate, when*G*= 0, and 1−_{ij}*r*when*G*= 2._{ij}$$P({\widehat{A}}_{\mathit{\text{ijk}}}=1|{G}_{\mathit{\text{ij}}})=\{\begin{array}{cc}r\hfill & \text{if}{G}_{\mathit{\text{ij}}}=0,\hfill \\ {p}_{\mathit{\text{MA}}}\hfill & \text{if}{G}_{\mathit{\text{ij}}}=1,\hfill \\ 1-r\hfill & \text{if}{G}_{\mathit{\text{ij}}}=2.\hfill \end{array}$$(1) - A4Any two reads, even those within the same individual, are independent conditional on the genotype:
*Â*_{ijk1}*Â*_{ijk2}|*G*._{ij} - A5The frequencies of
*G*follow Hardy-Weinberg Equilibrium._{ij}

Our first goal is to find the best model fit for the distribution of read depth. Low coverage sequencing runs for Caucasian individuals from the 1000 genomes project are the first examples. BAM files from samples sequenced at the Sanger Center on the ILLUMINA platform and samples from Baylor College of Medicine on the ABI Solid platform were downloaded from ncbi.nlm.nih.gov: 1000genomes/ftp/data/ using Aspera. The pileup command in SAMTOOLS was then used to obtain the depth of coverage for the first position of all dbSNP entries on chromosome 1. Versions of the sequence and alignment indices were from 3/11/2010. For a second example, we downloaded read information for the YH genome, the first genome sequenced as part of the YanHuang project, from http://yh.genomics.org.cn.

For each sample, we fit the distribution of read depth by a negative binomial distribution. A value of ζ is selected for each platform. For samples within a platform, the MLE of the sample means, μ_{1},…, μ_{n}, are calculated based on only those positions with at least one read and the poisson assumption. We defined the shape parameter to be the value of ζ that minimized

$$\sum _{i}}{\displaystyle \sum _{d=1}^{100}}{\widehat{f}}_{i}(d){({\widehat{f}}_{i}(d)-f(d|\mathrm{\zeta},{\widehat{\mathrm{\mu}}}_{i}))}^{2},$$

(2)

where *f*(*d*|ζ, μ_{i}) is the density of a truncated negative binomial distribution at depth *d*, * _{i}*(

We calculate the power to detect a rare variant within an individual under scenarios where we vary *r*, ζ, *p*_{MA}, μ and α_{IND}, the allowed false-positive rate. Our method for calculating power is best described by simulation. For each scenario, we simulate λ_{j} for *S* SNPs from a gamma distribution with shape parameter ζ and scale parameter 1/ζ. For each SNP, we then simulate *K _{ij}* from a poisson distribution with mean μλ

$${\text{LRS}}_{\mathit{\text{ij}}}=2\text{log}({L}_{\mathit{\text{ij}}})=2(\text{log}({0.5}^{{K}_{\mathit{\text{ij}}}})-\text{log}({r}^{{v}_{\mathit{\text{ij}}}}{(1-r)}^{{K}_{\mathit{\text{ij}}}-{v}_{\mathit{\text{ij}}}})),$$

(3)

and *L _{ij}* is the likelihood ratio

$${L}_{\mathit{\text{ij}}}=\frac{P({\widehat{A}}_{\mathit{\text{ij}}\xb7}|{K}_{\mathit{\text{ij}}},{G}_{\mathit{\text{ij}}}=1)}{P({\widehat{A}}_{\mathit{\text{ij}}\xb7}|{K}_{\mathit{\text{ij}}},{G}_{\mathit{\text{ij}}}=0)}.$$

(4)

The α_{IND} threshold, *t*_{α}, is the (1− α_{IND}) quantile of the *S* values of LRS_{ij}. To calculate power, we simulate the number of rare variants at each SNP from a binomial (*K _{ij}*,0.5), calculate LRS

We calculate the power to detect a rare variant within a population under scenarios where we vary *r*, ζ, μ, α, θ and *n*. Power calculations start by simulating a set of *S* datasets, *S* = 5,000,000 for the null distribution and *S* = 5,000,000 for the alternative distribution. For each dataset, *n* values of (*K _{ij}*,

$${\text{LRS}}_{j}^{*}=2({\mathit{*}}^{(}$$

(5)

where

$${\mathit{*}}^{(}$$

(6)

and _{MLE} is the maximum likelihood estimator. To estimate the null distribution, each dataset is presumed to include *n* individuals with *G _{ij}* = 0 so

We calculate the power to detect a rare variant within a population under scenarios where we vary *r*, ζ, μ, α, θ, *n* and RR, the relative risk attributable to the variant allele under the additive model. Power is calculated directly, without simulation, as follows. We assume a disease prevalence of 0.1 and calculate *M*_{1}, the 2 × 3 matrix containing the probabilities of each genotype in controls and cases. Then for each possible genotype, we calculate the distribution of *L _{ij}* as discussed in the first section of the Appendix and the resulting distribution of the called genotypes, where

Given a total of *R* reads, the ideal distribution spreads them uniformly over the genome so that every base is covered by the same number of reads. Both stochastic and experimental limitations prevent such uniform coverage, and specification is limited to the average, μ = *T*/*N*, number of reads, where *T* is the total number of bases in the *R* reads (i.e. *R* × average read length) and *N* is the number of bases to be sequenced within an individual (all notation is listed in Table I). If the efficiency and quality of sequencing were constant across the genome, the distribution of the number of reads at any given position (i.e. *K _{ij}*), or depth of coverage, would likely share similar characteristics to that of a poisson variable with mean μ. Unfortunately, because of variation in local content, such as the proportion GC and extent of segment duplication, some regions in the genome can, for certain NGS technologies, be more difficult to sequence and align [Kidd et al., 2010]. Therefore, it is unlikely that the overall distribution of reads follows a poisson distribution, and evidence from our own data, full genome sequencing [Wang et al., 2008], and the 1000 genomes project suggests that the distribution appears to be most similar to a negative binomial (Fig. 1). To understand the origin of such a distribution, consider that each base, defined by its location in the genome, has its own, intrinsic, sequencing inclination λ

Comparing the observed distribution of coverage depth to the distribution estimated by the model. Graphs are for the three subjects, with the lowest, median, and highest, coverage depth, among the 116 individuals genotyped at the Sanger Center. The black **...**

The shape parameter (ζ), which describes how the λ_{j} are distributed, is another important measure of sequencing performance. With the goal being consistent performance across the genome, all values of λ_{j} would ideally be identical and equal to 1. As larger ζ imply more tightly distributed λ_{j}, better technologies will, in general, have larger ζ, a quantity that can be estimated by the data. Specifically, for each individual, we fit the observed distribution of coverage depth to a truncated negative binomial distribution that considers only positions with at least one read. Therefore, interpretation of ζ as a measure of quality is only meaningful when also considering the total number of reads and the total number of loci covered.

We estimated ζ values from four sets of samples. Two sets of samples were from the 1000 genomes project. For the distributions of reads corresponding to variants in dbSNP along chromosome 1 in 116 samples analyzed with the ILLUMINA platform at the Sanger Center, we estimated ζ = 4 (Fig. 1, see supplementary material for all 116 fits). For the distributions from 20 samples analyzed by the ABI SOLID platform at the Baylor College of Medicine, we estimated ζ = 5.5. In addition to platform, study center also affected ζ as the shape parameter appears slightly higher from data collected on the ILLUMINA platform at ILLUMINA. Furthermore, we examined the distribution from a study with higher coverage. For the distribution of reads along chromosome 1 in the YH individual [Wang et al., 2008] sequenced in 2008 by Illumina Genome Analysers, we estimate ζ = 7. Finally, we estimated ζ in an exome sequencing project (unpublished data), where base performance now depends on the ease of sequencing and ease of capture. Because the capture step is far more heterogenous, it was not surprising to find ζ = 2.2. As Figure 2 shows that such a low ζ would noticeably reduce power to detect rare variants. As most upcoming studies focus on targeted regions, these studies need to expect small values of ζ and should choose their depth accordingly.

Consider testing whether an individual has a variant allele at a given location. In hypothesis testing language, our null hypothesis will be that the individual does not have a variant allele. Power (pow_{IND}) is then defined to be the probability of correctly rejecting *H*_{0} when the subject is truly heterozygous (*G _{ij}* = 1). Power at a single location and percentage of variant alleles that are discovered are identical quantities. The false-positive rate, α

Figure 2A shows the power to detect a heterozygote as a function of μ. If the desired false-positive rate is α_{IND} = 10^{−5} (i.e. 1/10,000 homozygous positions is misclassified as heterozygous), then the proportion of variant alleles expected to be discovered across the genome are 80, 90, 95 and 99% when the average depth is 12, 17, 23 and 42, respectively. If the acceptable false-positive rate is lowered to α_{IND} = 10^{−6}, then the required depths would be 14, 20, 27 and 49. If the acceptable false-positive rate is raised to α_{IND} = 10^{−4}, the required depths would be 10, 14, 19 and 25. The cost, in depth, of raising power by 1% increases dramatically with power.

Figure 2B shows the dependence of power on the shape parameter, ζ. As the shape parameter shrinks, and the distribution of *K _{ij}* diverges farther from poisson, the mean depth needed to ensure an acceptable level of power will increase. When λ

Furthermore, there is concern that the variant allele may be more difficult to copy and therefore occur at a lower frequency (i.e. *p*_{MA}, defined in assumption 3 in the earlier section, may be below 0.5). If only 40% of the reads at a heterozygous position are expected to be the variant allele, the proportion detected shrinks by ~6% (α_{IND} = 10^{−5}, ζ = 4, μ = 23: 89 vs. 95%). Figure 2C shows that, if true, read bias can result in a large loss in power.

Consider testing the null hypothesis that all individuals are homozygous at locus *j*, *G*_{·j} = 0˗, in a random sample of unrelated individuals from a given population. We consider the likelihood ratio test, defined in the earlier section, as it performs better than tests based on individuals’ assigned genotypes.

The power to detect a rare allele increases with *n* as the likelihood of collecting an individual with a rare-variant increases with *n*. For any given MAF, however, there is an *n* at which the power plateaus as the chance that none of the subjects have the rare variant is negligible. Power also increases with coverage depth and for any given MAF, there will be a μ for which the likelihood of incorrectly classifying a true heterozygote is negligible. Figure 3A illustrates power as a function of μ and *n* when MAF = 0.005, *r* = 0.001, α = 0.0001 and λ follows a poisson distribution. Darker colors indicate higher probabilities of detection.

If we fix the total number of reads, letting μ_{big} = μ × *n*, our proxy for cost, there will be a specific combination of *n* and μ that maximize the number of rare variants detected. In the above example, setting μ_{big} = 500, we could choose any combination along the 500/μ contour (Fig. 3A). To better illustrate the options, we plot power as a function of μ with the total cost fixed (Fig. 3B). Note, similar to previous findings [Wendl and Wilson, 2009a,b], choosing a λ less than the optimal value reduces power far more than choosing a λ greater than the optimal value.

As illustrated in Figure 3C, the optimal λ depends most strongly on *r*, and greater depth is required as the read error rate increases. When we reduce our tolerance for false positives (i.e. decrease α), we similarly benefit from increasing depth. Changing other parameters has less effect on the optimal λ, as previously suggested in Wendl and Wilson [2009a,b]. Reducing the shape parameter and/or increasing μ_{big} may slightly reduce the optimal μ. Overall, depending on the parameters of the experiment, we find the optimal μ to be between 2 and 8.

Consider testing for the association between a rare or uncommon variant and a disease in a case/control study with individuals split equally between the two groups. Here, we present results from a genotype-based test, where we first genotype each individual and then perform a χ^{2} test of association, a standard test for the case/control analysis. In contrast to a test for rare variants, the association test will have maximum power if *n* is as large as possible when the total cost, as defined by μ_{big}, is fixed (Fig. 4). We review the intuition for this observation in the discussion section. In this section, we examine the rate at which increasing coverage decreases power. The decrease in power from raising μ = 1 to μ = 2, with the corresponding decrease in sample size, primarily depends on the overall power of the association test when μ = 1. Power is extremely sensitive to changes in sample size. As a specific example, consider the scenario where the total number of reads is 20,000, *r* = 0.01, MAF = 0.05, and α = 0.0001. Note, MAF is set at a higher frequency than previous examples to achieve detectable power. By varying the relative risk from 1.2 to 2.0, we vary the power at μ = 1 from 0.03 to 0.999. For relative risks of 1.4, 1.6 and 1.8, increasing μ from 1 to 2 decreases the power from 0.42, 0.88 and 0.99 to 0.20, 0.64 and 0.90. Note that the loss in power can be relatively large for even small increases in μ when the power is low. Holding power constant, changing other variables such as the total number of reads, MAF or *r*, has little, if any, effect on the rate of decrease. Changing α only has a slight effect, in that allowing for a higher false-positive rate dampens the rate of decrease.

For fixed cost, the power to detect an association decreases with μ. The sharpness of the decline depends on the relative risk attributable to the SNP. The relationship between power and μ is illustrated for four different values of the **...**

Again, the genotyping test is not the most powerful test and the preferable test would be the likelihood ratio test. Here, the general performance of the tests are similar. In contrast to rare variant detection, the advantage of the likelihood ratio test appears truly minimal. If we consider the scenario where each base is covered by exactly one read, the likelihood ratio test and genotype-based are essentially equivalent, with both comparing the proportion of individuals with a variant allele in the two groups. For details of the likelihood ratio statistic, see the methods section. Note that this discussion assumed sequencing was performed solely for the purposes of an association test with a specified outcome and no covariates, so there was no interest in identifying any individual’s specific genotype.

Next Generation Sequencing has emerged as a powerful tool for detecting rare variants and their associations with human diseases and traits. In this manuscript, we discussed how to choose the coverage depth for a study using NGS technology. Many of the statistical techniques used here were originally developed to examine the extent of overlap and coverage needed when genomes were being mapped by “fingerprinting,” where overlapping clones from recombinant libraries were needed to piece together the genome [Lander and Waterman, 1988; Siegel et al., 2000; Wendl and Barbazuk, 2005; Wendl and Waterston, 2002]. We started by showing that the depth of coverage ranged greatly across the genome, especially when performing targeted sequencing. Therefore, even when the average depth is high, a large number of positions can still have relatively low coverage. To avoid sparse coverage in difficult-to-sequence regions, we need to raise the average above that suggested by the simpler models. Statistically, we suggested that the depth of coverage followed a negative binomial distribution, as opposed to the simpler poisson distribution. A measure of the extent of deviation from the poisson ideal provides a measure of the quality of the sequencing. In this regard, the shape parameter, which describes that deviation, is an important characteristic to consider when deciding between NGS technologies.

The coverage depth needed to identify a variant allele with high specificity was surprisingly large. The exact choice of depth depends on multiple considerations, such as desired α-level, quality of technology and read error rate. If we are only looking for variants at a pre-specified set of loci, perhaps those positions already known to be polymorphic, we might allow a relatively large α-level, 1/1000 or even 1/100. At such rates, a coverage depth around 10 might be sufficient when using the best technologies. Another consideration is the quality of the technology. When the shape parameter is small and depth is highly variable, we would need to increase coverage depth. Our analyses were for stand-alone calling algorithms. Other algorithms, specifically those that consider linkage disequilibrium, will likely require lower depth. Results from the 1000 genome data should help show how such advanced methods can augment power.

When trying to detect a new rare variant within a population, the desirable coverage depth, with realistic parameters, ranges between 2 and 8 reads, with the exact depth depending heavily on the acceptable false-positive rate. With too few reads, it could be difficult to determine whether a few variant alleles, scattered across all subjects, occurred by error. With too few individuals, there would be a non-negligible chance that none of the individuals carried the rare variant. Therefore, the depths that perform well balance between these extremes. Our suggested coverage depth is similar to the depth, 4 – 6 ×, chosen for the 1000 Genomes project and Wendl’s approximation of 3.6.

In contrast with the test for rare variants, the association test is maximized for power by including as many cases as possible. In studies with secondary analyses, adjustment for confounders, or a planned follow-up, increasing depth to identify heterozygous individuals may be necessary. However, the consequences of such increases can be severe, if they require a reduction in sample size. Li and Leal also discuss the need for large sample sizes when detecting rare haplotypes with frequency <1% [Li and Leal, 2008].

An association test requires genotype calling optimized for that purpose. Although it is an extreme example, consider using a calling algorithm that requires 50 reads before making a call. Obviously, with such a rule, using a low average depth will perform poorly. Therefore, for a fair comparison, we should use the optimal calling rule, or alternatively a rule which maximizes the power of the following association test. As no such rule had been discussed previously in the literature, we derived it in a general, and broadly applicable, form in the appendix.

To understand why maximizing the number of subjects is optimal for association studies, consider the simple case where there are no read errors and all SNPs with at least one read of the minor allele are called as *Ĝ _{ij}* = 1. As a standard rule, power is determined by the number of events, which, in this case, is the number of called heterozygotes. Note that $\sum _{i}}{\widehat{G}}_{i$ is higher for 2

There are important limitations for our conclusions. First, we assume a constant error rate, regardless of MAF or the number of minor alleles detected. Because calling algorithms often use linkage disequilibrium to aid calling, per base error rates can decrease as sample increases. One of the consequences is that the depth needed to detect a heterozygous allele within an individual actually depends on the total sample size. Second, we have focused only on SNPs. Because of an increased difficulty in detecting structural variants, such as copy number variation, insertions, inversions and translocations [Feuk et al., 2006], the optimal depth for detecting this type of variation may require a depth greater than 50 when using the discordant read pairs method [Wendl and Wilson, 2009a,b]. However, as technology improves, reads become longer, and read error rates decrease, the necessary depth should decrease. Third, when considering the error rate at a single location, our model assumes that all errors produce the same allele. If errors were random, then the potential for observing enough of any single allele to call a specific variant allele would decrease. Fourth, we only take into consideration sequencing costs. Other costs, such as collecting data and sample preparation, have been ignored from our cost structure.

As the cost of NGS falls, its use will continue to increase, and thus it will be necessary to optimize designs to efficiently discover and validate variants that map to human diseases and traits. Sequencing with too low a depth can negatively impact discovery and with too few individuals can negatively impact detecting associations. Overall, it is important to balance sample size and coverage depth in the context of available resources for NGS studies.

We greatly appreciate the Genetic and Epidemiology Branch, DCEG, NCI for allowing us access to the data from their exome studies. K.J., M.Y. and S.C. identified the problem. J.S. and N.C. developed the statistical framework. J.S. ran simulations and drafted the paper. All authors revised the paper.

We calculate power, β, using a numerical approximation of

$$\mathrm{\beta}={\displaystyle {\int}_{{\mathrm{\lambda}}_{j}}}\phantom{\rule{thinmathspace}{0ex}}[{\displaystyle \sum _{{K}_{\mathit{\text{ij}}}=1}^{\mathrm{\infty}}}\left({\displaystyle \sum _{{v}_{\mathit{\text{ij}}}=0}^{{K}_{\mathit{\text{ij}}}}}P({K}_{\mathit{\text{ij}}}|\mathrm{\mu}{\mathrm{\lambda}}_{j})P\phantom{\rule{thinmathspace}{0ex}}({\displaystyle \sum _{k}}{\widehat{A}}_{\mathit{\text{ijk}}}={v}_{\mathit{\text{ij}}}|{G}_{\mathit{\text{ij}}}=1)\phantom{\rule{thinmathspace}{0ex}}1({L}_{\mathit{\text{ij}}}\ge {t}_{\mathrm{\alpha}})\right)]\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dF}}({\mathrm{\lambda}}_{j}),$$

(A1)

where we choose *t*_{α} so

$$P({\mathit{\mathit{\text{ij}}}}_{}$$

(A2)

We need only find _{U}, *A* and *C*, those estimates that maximize the log-likelihood for controls alone, cases alone and cases/controls combined. The log likelihood for each group can be defined by summing over their respective members

$${\mathit{X}}_{(}$$

(A3)

and the likelihood ratio statistic

$${\text{LRS}}_{j}^{*}=2({\mathit{U}}_{(}$$

(A4)

When testing for association, the true difficulty is determining the optimal calling algorithm when genotyping individuals. Here, optimal implies maximizing the power of the upcoming association test. Although the threshold for calling *Ĝ _{ij}* = 0 can be set to 1 without worry,

$$\text{MAF}\ge 0.5\frac{P({L}_{\mathit{\text{ij}}}=t|{G}_{\mathit{\text{ij}}}=0)}{P({L}_{\mathit{\text{ij}}}=t|{G}_{\mathit{\text{ij}}}=1)}-\frac{P({L}_{\mathit{\text{ij}}}\ge t|{G}_{\mathit{\text{ij}}}=1)}{P({L}_{\mathit{\text{ij}}}\ge t|{G}_{\mathit{\text{ij}}}=0)}.$$

(A5)

Equation (A5) shows two key properties of the optimal threshold

- The optimal threshold will be higher for alleles with low MAF, as the right-hand side of Equation (A5) is decreasing with
*t*. When the MAF is low, we need to keep*t*_{1}high to prevent false positives from over-whelming those few true heterozygotes. - The optimal threshold will be higher when the error rate is high. When the read error is high or μ is low, we need to be extra vigilant about preventing false positives.

Equation (A5) is based on performing a χ^{2} test on a 2 × 2 contingency table. This discussion is completely general, and this threshold will be optimal whenever a continuous variable is being converted to a 0/1 variable. In the following 2 × 2 contingency table, let *a*, *b*, *c* and *d* be the expected cell counts.

$$[\begin{array}{cc}a\hfill & b\hfill \\ c\hfill & d\hfill \end{array}]\phantom{\rule{thinmathspace}{0ex}}.$$

Then, the ncp from the χ^{2} test for equal proportions can be written as

$$\eta \frac{{(\mathit{\text{ac}}-\mathit{\text{bd}})}^{2}}{(a+b)(c+d)(a+c)(b+d)}.$$

(A6)

When we allow for errors and no-calls, the expected cell counts, and consequently the ncp, change. Assume, we have adopted the policy to call any locus with a *L*≤1 to be “*AA*” (i.e. *Ĝ _{ij}*= 0), any locus with

$$[\begin{array}{c}a-\mathit{\text{aP}}(L>1|\mathit{\text{AA}})\mathit{\text{bP(L\u2265t|AB)+aP(L\u2265t|AA)}}\hfill & c-\mathit{\text{cP}}(L1|\mathit{\text{AA}})\mathit{\text{dP(L\u2265t|AB)+cP(L\u2265t|AA)}}\hfill \\ ]\\ \phantom{\rule{thinmathspace}{0ex}}.\end{array}$$

For our calculations, we use the following notation

$$kP(L\ge t|\mathit{\text{AB}}),$$

(A7)

$$uP(L\ge t|\mathit{\text{AA}}).$$

(A8)

Therefore, the expected cell counts can be abbreviated as

$$[\begin{array}{c}a-\mathit{\text{aP}}(L>1|\mathit{\text{AA}})\mathit{\text{ak+bu}}\hfill & c-\mathit{\text{cP}}(L1|\mathit{\text{AA}})\mathit{\text{ck+du}}\hfill \\ ]\\ \phantom{\rule{thinmathspace}{0ex}}.\end{array}$$

We can greatly simplify calculating the ncp if we make the following approximations which we believe to be reasonable

*a*−*aP*(*L*>1|*AA*) ≈*a*,*c*−*cP*(*L*>1|*AA*) ≈*c*,*a*−*aP*(*L*>1|*AA*)+*ak*+*bu*≈*a*+*b*,*c*−*cP*(*L*>1|*AA*)+*ck*+*du*≈*c*+*d*.

Then, the ncp can be approximated by

$$\eta \approx \frac{{(\mathit{\text{ad}}-\mathit{\text{bc}})}^{2}{u}^{2}}{{(a+c)}^{2}(a+b)(c+d)k+(a+c)(a+b)(c+d)(b+d)u}.$$

(A9)

By the chain rule, we know

$$\frac{d\eta}{\mathit{\text{dt}}}=\frac{d\eta}{\mathit{\text{dk}}}\frac{\mathit{\text{dk}}}{\mathit{\text{dt}}}+\frac{d\eta}{\mathit{\text{du}}}\frac{\mathit{\text{du}}}{\mathit{\text{dt}}}.$$

(A10)

Therefore, for the derivative to be less than 0, we need

$$\frac{\frac{d\eta}{\mathit{\text{du}}}}{-\frac{d\eta}{\mathit{\text{dk}}}}<\frac{\frac{\mathit{\text{dk}}}{\mathit{\text{dt}}}}{\frac{\mathit{\text{du}}}{\mathit{\text{dt}}}}.$$

(A11)

Straightforward calculations yield

$$\frac{\mathit{\text{du}}}{\mathit{\text{dt}}}=P(L=t|\mathit{\text{AB}}),$$

(A12)

$$\frac{\mathit{\text{dk}}}{\mathit{\text{dt}}}=P(L=t|\mathit{\text{AA}}),$$

(A13)

$$\frac{\frac{d\eta}{\mathit{\text{du}}}}{-\frac{d\eta}{\mathit{\text{dk}}}}=2\frac{P(L\ge t|\mathit{\text{AB}})}{P(L\ge t|\mathit{\text{AA}})}+\frac{b+d}{a+c}\approx 2\phantom{\rule{thinmathspace}{0ex}}(\frac{P(L\ge t|\mathit{\text{AB}})}{P(L\ge t|\mathit{\text{AA}})}+\text{MAF})\phantom{\rule{thinmathspace}{0ex}}.$$

(A14)

Then we should continue to increase the threshold so long as

$$\text{MAF}<0.5\frac{P(L=t|\mathit{\text{AA}})}{P(L=t|\mathit{\text{AB}})}-\frac{P(L\ge t|\mathit{\text{AB}})}{P(L\ge t|\mathit{\text{AA}})}.$$

(A15)

Additional Supporting Information may be found in the online version of the article.

- Bansal V, Harismendy O, Tewhey R, Murray SS, Schork NJ, Topol EJ, Frazer KA. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 2010;20:537–545. [PubMed]
- Bhangale TR, Rieder MJ, Nickerson DA. Estimating coverage and power for genetic association studies using near-complete variation data. Nat Genet. 2008;40:841–843. [PubMed]
- Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet. 2006;7:85–97. [PubMed]
- Harismendy O, Ng P, Strausberg R, Wang X, Stockwell T, Beeson K, Schork N, Murray S, Topol E, Levy S, Frazer K. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol. 2009;10:R32. [PMC free article] [PubMed]
- Horner DS, Pavesi G, Castrignano T, De Meo PD, Liuni S, Sammeth M, Picardi E, Pesole G. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform. 2010;11:181–197. [PubMed]
- Kaiser J. A plan to capture human diversity in 1000 genomes. Science. 2008;319:395. [PubMed]
- Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G, Kallicki J, Anderson P, Tsalenko A, Yamada NA, Tsang P, Kaul R, Wilson RK, Bruhn L, Eichler EE. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Meth. 2010;7:365–371. [PMC free article] [PubMed]
- Kuehn B. 1000 genomes project promises closer look at variation in human genome. JAMA. 2008;300:2715. [PubMed]
- Lander E, Waterman M. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2:231–239. [PubMed]
- Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling DD, Dunford-Shore BH, McGrath SM, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott SA, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla AC, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich JI, Heath SH, Shannon WDS, Nagarajan RN, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK. DNA sequencing of a cytogenetically normal acute myseloid leukaemia genome. Nature. 2008;456:66–69. [PMC free article] [PubMed]
- Li B, Leal S. Discovery of rare variants via sequencing: implications for association studies [abstract] Genet Epidemiol. 2008;32:702.
- Li B, Leal SM. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet. 2010;5:e1000481. [PMC free article] [PubMed]
- Metzker M. Sequencing technologies—the next generation. Nat Rev Genet. 2010;1:31–46. [PubMed]
- Ng SB, Buckingham KJ, Lee C, Bigham A, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad M. Exome sequencing identifies the cause of a mendelian disorder. Nature Genetics. 2010a;42:30–35. [PMC free article] [PubMed]
- Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC, Lee C, Turner EH, Smith JD, Rieder MJ, Yoshiura K-i, Matsumoto N, Ohta T, Niikawa N, Nickerson DA, Bamshad MJ, Shendure J. Exome sequencing identifies mll2 mutations as a cause of kabuki syndrome. Nat Genet, advance online publication. 2010b [PMC free article] [PubMed]
- Quinlan AR, Stewart DA, Stromberg MP, Marth GT. Pyrobayes: an improved base caller for snp discovery in pyrosequences. Nat Meth. 2008;5:179–181. [PubMed]
- Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, Yu F. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res. 2010;20:273–280. [PubMed]
- Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotech. 2008;26:1135–1145. [PubMed]
- Siegel AF, van den Engh G, Hood L, Trask B, Roach JC. Modeling the feasibility of whole genome shotgun sequencing using a pairwise end strategy. Genomics. 2000;68:237–246. [PubMed]
- Siva N. 1000 genomes project. Nat Biotechnol. 2008;26:256. [PubMed]
- Sobreira NLM, Cirulli ET, Avramopoulos D, Wohler E, Oswald GL, Stevens EL, Ge D, Shianna KV, Smith JP, Maia JM, Gumbs CE, Pevsner J, Thomas G, Valle D, Hoover-Fong JE, Goldstein DB. Whole-genome sequencing of a single proband together with linkage analysis identifies a mendelian disease gene. PLoS Genet. 2010;6:e1000991. [PMC free article] [PubMed]
- The 1000 Genome Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
- Wang J, Wang J, Li W, Li R, Tian Y, Goodman G, Fan L, Zhang W, Li J, Zhang J, Guo J, Feng Y, Li B, Lu H, Fang Y, Liang X, Du H, Li Z, Zhao D, Hu Y, Yang Y, Zheng Z, Hellmann H, Inouye I, Pool M, Yi J, Zhao X, Duan J, Zhou J, Qin Y, Ma J, Li L, Yang G, Zhang Z, Yang G, Yu B, Liang C, Li F, Li W, Li S, Ni D, Ruan P, Li J, Zhu Q, Liu H, Lu D, Li Z, Guo N, Zhang G, Ye J, Fang J, Hao L, Chen Q, Liang Q, Su Y, San Y, Ping A, Yang C, Chen S, Li F, Zhou L, Zheng K, Ren H, Yang Y, Gao L, Yang Y, Li G, Feng Z, Kristiansen X, Wong K, Nielsen GKS, Durbin R, Bolund R, Zhang L, Li X, Yang S, Wang H, Jian The diploid genome sequence of an Asian individual. Nature. 2008;456:60–65. [PMC free article] [PubMed]
- Wendl M, Barbazuk W. Extension of Lander-Waterman theory for sequencing filtered DNA libraries. BMC Bioinform. 2005;6:245. [PMC free article] [PubMed]
- Wendl M, Waterston R. Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. Genome Res. 2002;12:1943–1949. [PubMed]
- Wendl M, Wilson R. Aspects of coverage in medical DNA sequencing. BMC Bioinform. 2008;9:239. [PMC free article] [PubMed]
- Wendl M, Wilson R. Statistical aspects of discerning indel-type structural variation via DNA sequence alignment. BMC Genomics. 2009a;10:359. [PMC free article] [PubMed]
- Wendl M, Wilson R. The theory of discovering rare variants via DNA sequencing. BMC Genomics. 2009b;10:485. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |