One challenge in analyzing samples of DNA sequences is to account for the nonnegligible polymorphisms produced by error when the sequencing error rate is high or the sample size is large. Specifically, those artificial sequence variations will bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may bias the estimators of the population mutation rate for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate θ, given an SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, the error rate ε can be either known or unknown. In the latter case, ε can be estimated given an estimate of θ. Using coalescent simulation, we compared our estimators with other estimators of θ. The results showed that the GLS estimators are more efficient than other θ estimators with error and that the estimate of ε is usable in practice when θ per bp is small. We demonstrate the application of the estimators with a 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing θ estimators with error.
Sequencing errors can cause many problems for evolutionary or population genetic analyses of samples of DNA sequences (Clark and Whittam 1992; Johnson and Slatkin 2008). Given a random sample of sequences from a population, artificial polymorphisms caused by sequencing error will skew both the number and the frequency spectrum of the observed single nucleotide polymorphisms (SNPs). This will further skew any estimations or tests based on the number and/or frequency spectrum of the SNPs if errors are not properly taken into account. The bias will be more prominent with increased sample size because sequencing error accumulates linearly with sample size, whereas θ increases more slowly, as implied by coalescent theory. Johnson and Slatkin (2008) suggested a rule of thumb that an uncorrected estimate will be biased significantly if nLε > 10θ, where n is the sample size (i.e., the number of sequences), ε is the average error rate per site, L is the sequence length of the given locus, and θ is the population mutation rate of the locus. θ is equal to 4Nμ for diploids and 2Nμ for haploids, where N is the effective population size and μ is the mutation rate for the given locus. θ per bp typically ranges from very small values for species with extremely low diversity, such as the bog turtle (Rosenbaum et al. 2007), to very large values for species with extremely high diversity, such as human immunodeficiency virus (Achaz et al. 2004). For human populations, θ per bp is approximately 10^-3 (Crawford et al. 2005). Therefore, with a typical sequencing error rate of 10^-4 on the Sanger sequencing platform (Zwick 2005), uncorrected estimates will have problems with samples larger than 50 individuals (100 chromosomes) in human populations. If next-generation sequencing technologies are used, their higher sequencing error rates (on the order of 10^-3) (Mackay et al. 2008; Shendure and Ji 2008) may further reduce the upper limit of sample size to only 10 chromosomes or less.
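As a quick arithmetic check of this rule of thumb, the sketch below compares the expected number of error-derived polymorphic sites, nLε, with θ of the locus. The numeric values are the approximate ones quoted above (human θ per bp of 10^-3, Sanger error of 10^-4, next-generation error of 10^-3) and are assumptions for illustration:

```python
# Sketch: compare expected error-derived polymorphic sites (n*L*eps)
# with theta of the locus for the sample sizes discussed in the text.
# Assumed values: theta per bp ~1e-3 (human), Sanger error ~1e-4,
# next-generation error ~1e-3.

def error_to_theta_ratio(n, L, eps, theta_per_bp=1e-3):
    """Expected error-derived sites divided by theta of the locus."""
    theta = theta_per_bp * L
    return n * L * eps / theta   # L cancels: equals n*eps/theta_per_bp

for label, n, eps in [("Sanger, 100 chromosomes", 100, 1e-4),
                      ("NGS, 10 chromosomes", 10, 1e-3)]:
    r = error_to_theta_ratio(n, 10_000, eps)
    print(f"{label}: nL*eps / theta = {r:.1f}")
```

Both cases sit at the same ratio, which is why the sample-size limit shrinks roughly tenfold when the error rate grows tenfold.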
In response to the problem, researchers have been developing new unbiased estimators and tests for population samples with sequencing error incorporated into the analysis. Johnson and Slatkin (2006) proposed maximum likelihood estimators for θ and the scaled exponential growth rate using the SNP frequency spectrum while accounting for sequencing error via Phred quality scores. They also studied the effects of sequencing error on two uncorrected estimators of θ, Tajima's (Tajima 1983) and Watterson's (Watterson 1975), and proposed two unbiased estimators of θ assuming a known ε (see discussions below) (Johnson and Slatkin 2008). Hellmann et al. (2008) proposed a method to correct a Watterson-type estimator, assuming a known ε. It is similar to Johnson and Slatkin (2008)'s θ estimator based on the total number of polymorphic sites but takes account of the uncertainty of chromosome sampling in shotgun sequencing of mixed diploid individuals. Lynch (2008) proposed several methods to correct θ estimates for high-coverage shotgun genomic sequences from a single diploid individual. Knudsen and Miyamoto (2007) incorporated missing data, sequencing error, and multiple reads for diploid individuals into a full-likelihood coalescent model of a sample of DNA sequences. Their model can be used to estimate θ and ε jointly. However, the computational intensity of calculating the likelihood limits the application of the model to small sample sizes (i.e., fewer than 20 sequences). Assuming a low but unknown ε, Achaz (2008) proposed two new moment estimators of θ based on the SNP number and frequency spectrum while ignoring singletons, which sequencing error skews the most, and constructed two new test statistics for neutrality based on the new estimators. Jiang et al. (2009) developed a method to estimate θ and the recombination rate for shotgun resequencing data and proposed some refinements to increase its robustness to sequencing errors.
Here, we propose a new approach based on the generalized least squares (GLS) method to estimate θ using the SNP frequency spectrum of a random sample of DNA sequences from a population. It can be considered a modification and extension of Fu (1994b)'s best linear unbiased estimator (BLUE) under the assumption of a moderate to low ε, either known or unknown. When ε is unknown, it can be estimated sequentially after the estimation of θ. The rationale here is to incorporate the variance–covariance structure of the SNP frequency spectrum, with either known or unknown ancestral states of the SNPs, and thereby reduce the variance of the estimator. Its computational intensity is higher than that of Achaz (2008)'s moment estimators but much lower than that of Knudsen and Miyamoto (2007)'s full-likelihood method, which makes it applicable to samples with thousands of sequences. In the following sections, we first introduce our GLS estimators along with some corrected or uncorrected estimators of θ; then, we compare these estimators using coalescent simulation assuming different population parameters. We demonstrate their application on a sample of DNA sequences of a 10-kilobase noncoding region on human chromosome 22. Finally, we discuss the advantages, limitations, and some technical issues of the GLS estimators.
The number of sequencing errors at each site is usually modeled as a binomial or Poisson variable with mean nε. As defined in the Introduction, n is the sample size (i.e., the number of sequences) and ε is the average error rate per site. Here, we simplify the model by assuming that each site has at most one sequencing error, which occurs with probability p ≈ nε. As a rule of thumb, this assumption is approximately valid when nε is small (nε << 1). We further assume that when a sequencing error occurs on an ancestral (or mutant) allele, the probability that it leads to an existing allele (mutant or ancestral) is u and the probability that it yields a new allele is 1 − u. For example, if only point mutation is considered and assuming that a nucleotide has equal probability of being read incorrectly as one of the other three nucleotides, u = 1/3. For a sample of n sequences of length L randomly sampled from a population, we designate ξi as the (unknown) number of true polymorphic sites with mutation size i (or SNP class i, i.e., there are i copies of the mutant allele in the sample) and ξi,k as the observed number of sites with n − i − k copies of the ancestral allele, i copies of the mutant allele, and k copies of the new allele (other than the mutant and ancestral alleles) produced by sequencing error (k = 0 or 1).
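The at-most-one-error simplification can be checked numerically. In the minimal sketch below, the number of errors at a site is modeled as Binomial(n, ε); the particular values of n and ε are illustrative assumptions, not values from the paper:

```python
# Sketch: under a Binomial(n, eps) model for errors at a site, the
# probability of two or more errors is negligible when n*eps << 1,
# which justifies the at-most-one-error simplification.
from math import comb

def error_count_pmf(n, eps, k):
    """P(exactly k of n reads carry an error at a site), Binomial(n, eps)."""
    return comb(n, k) * eps**k * (1 - eps)**(n - k)

n, eps = 100, 1e-4                 # assumed sample size and per-site error rate
p_one = error_count_pmf(n, eps, 1)
p_two_plus = 1 - error_count_pmf(n, eps, 0) - p_one
print("P(1 error):", p_one)        # close to n*eps = 1e-2
print("P(>=2 errors):", p_two_plus)
```

With these values, P(≥2 errors) is several orders of magnitude below P(1 error), so treating p as the probability of a single error is a reasonable approximation.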
If ε or p is unknown but assumed to be relatively small, it is easy to see that most error-derived polymorphisms are singletons (contributing to ξ1 or ξn−1) because, under the infinite-site model with a reasonable θ, most sites are monomorphic, so an error almost always creates a new singleton (Achaz 2008). In other words, errors contribute little to ξi for 1 < i < n − 1. By ignoring singletons, we remove most of the errors, so that the remaining SNP classes are nearly error free and new estimators can be developed from them. For example, Achaz (2008) proposed the K − ξ1 and K − η1 estimators, which are modified Watterson (1975) estimators, and the π − ξ1 and π − η1 estimators, which are modified Tajima (1983) estimators (see details below). Fu (1994b) proposed θ estimators based on the GLS method (BLUEu and BLUEf, called BLUE estimators in Fu 1994b), which make full use of the observed numbers of SNP frequency classes, so that they have much less variability than either Watterson (1975)'s or Tajima (1983)'s estimator.
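To illustrate why singletons absorb most of the error, the sketch below inflates a clean expected spectrum with the roughly Lp error-derived sites. E[ξi] = θ/i is the standard neutral expectation for the unfolded spectrum; all numeric values are toy assumptions:

```python
# Under the standard neutral model, E[xi_i] = theta / i (unfolded spectrum).
# Errors (probability p per site, at most one per site) land almost entirely
# in the singleton class, adding roughly L*p expected sites to xi_1.
# All values below are assumed toy numbers.

def expected_spectrum(theta, n):
    """Expected unfolded SNP frequency spectrum, classes 1 .. n-1."""
    return [theta / i for i in range(1, n)]

theta, n, L, p = 10.0, 20, 10_000, 2e-3   # p ~ n*eps with eps = 1e-4
clean = expected_spectrum(theta, n)
observed = clean[:]                        # copy, then inflate singletons
observed[0] += L * p
print("relative inflation of xi_1:", observed[0] / clean[0])
print("xi_2 and above unchanged:", observed[1:] == clean[1:])
```

Even a modest per-site error probability triples the expected singleton count here while leaving every other class untouched, which is exactly what singleton-excluding estimators exploit.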
Following Achaz (2008)'s idea that, because singletons are much more likely than other SNP classes to be errors, an estimator based on the SNP spectrum excluding singletons will be robust to errors, we briefly describe here a modification of Fu (1994b)'s GLS method that ignores all singletons and triallelic sites. See Fu (1994a, 1994b) for more technical details of the method. We first rewrite the expectations, variances, and covariances of the observed numbers of SNP classes as:
where ρi = E(ξi)/θ. The values of ρi and of the variances and covariances depend on the population model and can be obtained theoretically or numerically using simulation. Assuming a Wright–Fisher model with constant population size, no selection, and no recombination, their explicit formulas are known (Fu 1995). Writing this in matrix form, we have
Because the calculation of Σ needs a prior estimate of θ, BLUE − ξ1 is calculated iteratively (Fu 1994a, 1994b): 1) choose an initial value of θ, say θ0; 2) calculate Σ with θ0; 3) calculate BLUE − ξ1 with Σ and let θ1 = BLUE − ξ1; 4) update Σ with θ1; and 5) repeat steps 3 and 4 until BLUE − ξ1 converges.
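The iterative scheme can be sketched as follows. This is not the paper's exact procedure: it assumes the standard neutral expectation E[ξi] = θ/i and substitutes a simplified diagonal, Poisson-like covariance for the full variance–covariance matrix of Fu (1995); like BLUE − ξ1, it excludes singletons. The function name is ours:

```python
# Sketch of the iterative GLS idea behind BLUE-xi1, with two loud
# simplifications: E[xi_i] = theta/i (standard neutral model) and a
# diagonal, Poisson-like covariance Var(xi_i) ~ theta/i in place of the
# exact variance-covariance matrix of Fu (1995). Singletons are excluded.

def gls_theta(xi, n, theta0=1.0, tol=1e-6, max_iter=200):
    """xi[i-1] = observed count of SNP class i; classes 2 .. n-1 are used."""
    idx = list(range(2, n))                  # ignore singletons (i = 1)
    y = [xi[i - 1] for i in idx]
    x = [1.0 / i for i in idx]               # design: E[y_i] = theta * x_i
    theta = theta0
    for _ in range(max_iter):
        w = [i / theta for i in idx]         # weights = 1/Var, Var ~ theta/i
        num = sum(wi * xv * yv for wi, xv, yv in zip(w, x, y))
        den = sum(wi * xv * xv for wi, xv in zip(w, x))
        new = num / den                      # weighted least-squares solution
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta

# With a noise-free spectrum xi_i = theta/i, the estimate recovers theta.
n, theta_true = 10, 5.0
xi = [theta_true / i for i in range(1, n)]
print(gls_theta(xi, n))
```

The real estimator replaces the diagonal weights with the inverse of the full estimated covariance matrix, but the fixed-point structure (estimate θ, rebuild the weights, re-estimate) is the same.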
If the ancestral state of each SNP is unknown (i.e., we do not know which allele is ancestral and which is mutant), then we need to fold the counting of ξi into ηi to include all SNPs with a minor allele frequency of i and a major allele frequency of n − i or vice versa. In the folded case,
and Σ is changed accordingly. The corresponding estimator, designated BLUE − η1, can be calculated with the same iterative process as described above.
After θ is estimated, p can be approximated as

p̂ = [ξ1 − E(ξ1 | θ̂)] / L    (4)

for unfolded counting or

p̂ = [η1 − E(ξ1 | θ̂) − E(ξn−1 | θ̂)] / L    (5)

for folded counting, where E(ξ1 | θ̂) and E(ξn−1 | θ̂) are calculated using equation (1) with θ replaced by θ̂. Then, ε̂ is simply p̂/n. For convenience, we use the subscript of the θ estimator used in p̂ to designate the different p̂s. For example, p̂BLUE−ξ1 means the p̂ calculated using BLUE − ξ1.
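A minimal sketch of this sequential error-rate estimate follows, under the simplifying assumption E(ξ1 | θ̂) = θ̂ (the standard neutral, unfolded expectation); the excess of observed singletons over that expectation is attributed to error. The function name and all numeric values are ours:

```python
# Sketch of the sequential error-rate estimate: after estimating theta,
# attribute the excess of observed singletons over their expectation to
# error. We assume E(xi_1 | theta) = theta (standard neutral, unfolded),
# a simplification of the expectation in the text.

def estimate_eps(xi1_obs, theta_hat, n, L):
    """Return eps_hat from the observed singleton count."""
    p_hat = max(0.0, (xi1_obs - theta_hat) / L)  # per-site error probability
    return p_hat / n                             # eps = p / n

# Toy numbers: 45 observed singletons, theta_hat = 15, n = 100, L = 10 kb.
print(estimate_eps(45, 15.0, 100, 10_000))
```

The clamp at zero reflects that a singleton deficit carries no information about ε under this model; the downward bias discussed later in the paper comes from errors that hit already-polymorphic sites.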
When ε is known, we can develop a more precise GLS estimator of θ by making full use of this information. First, we need to derive the expectations, variances, and covariances of the observed configuration counts ξj,k. For simplicity, letting ξj,k = 0 if j < 0 or j > n, then
where I(·) is an indicator function, which equals 1 if its condition is satisfied and 0 otherwise; a^i_{m,j}, b^i_{m,j}, c^i_{m,j1,k,j2}, and d^i_{m,j1,k,j2} are functions of i, j, j1, j2, m, k, n, and u. Their detailed expressions and the derivation of the above formulas can be found in the Appendix. The expectations, variances, and covariances of the ξi can be calculated using equations (1)–(3).
In practice, some configurations are indistinguishable even when the ancestral states are known. To avoid this difficulty, we combine the observations of each set of indistinguishable configurations. The general formulas for calculating the expectation of the combined observations and the variance–covariance of a combined observation with other (combined) observations are
A GLS estimator can be calculated as follows. If the ancestral state of each site is known or can be inferred from outgroups, let
The GLS estimator of θ is
where Σ again is an estimate of the variance–covariance matrix of Y, which can be calculated using equations (8)–(16). The same iterative process described above is then used to calculate the GLS estimate. In the case that ancestral states are unknown, we need to further fold the observations, combining each configuration with its complement because they are indistinguishable. Let
and follow the same steps as described for GLS to calculate the GLS estimator with folded observations. Later in this paper, we designate GLSf as the GLS estimator using folded observations and GLSu as that using unfolded observations.
Three types of corrected and uncorrected estimators of θ were compared in this study and are briefly introduced below. Johnson and Slatkin (2006)'s method was not compared because here we assume the detailed Phred quality scores are unknown. Neither did we include Knudsen and Miyamoto (2007)'s method, because of its computational intensity and limitation to small sample sizes. Hellmann et al. (2008)'s and Johnson and Slatkin (2008)'s estimators based on the total number of polymorphic sites should have similar performance, so only the latter was included in our comparison.
The first type of estimator (type I) is based on the total number of polymorphic sites. An uncorrected estimator of this type is the widely used Watterson's estimator θK = K/an (Watterson 1975), where an = 1 + 1/2 + … + 1/(n − 1) and K is the total number of polymorphic sites in the sample. Assuming a known ε, Johnson and Slatkin (2008) proposed a corrected estimator Kc (with modification):
and m is the total number of possible alleles at a site. Assuming an unknown but small ε, Achaz (2008) proposed two corrected estimators of this type. They are the K − ξ1 estimator, (K − ξ1)/(an − 1), for unfolded observations and the K − η1 estimator, (K − η1)/(an − n/(n − 1)), for folded observations.
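The type I estimators can be sketched as follows. The singleton-excluding form relies on E[ξ1] = θ and E[K] = an·θ under the standard neutral model, so E[K − ξ1] = (an − 1)θ; this is our rendering of the idea in the spirit of Achaz (2008), not necessarily his exact published formula:

```python
# Watterson's estimator and a singleton-excluding variant in the spirit
# of Achaz (2008): since E[xi_1] = theta and E[K] = a_n * theta under the
# standard neutral model, E[K - xi_1] = (a_n - 1) * theta.

def a_n(n):
    """Harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    return sum(1.0 / i for i in range(1, n))

def theta_watterson(K, n):
    return K / a_n(n)

def theta_k_xi1(K, xi1, n):
    """Singleton-excluding type I estimator (unfolded)."""
    return (K - xi1) / (a_n(n) - 1.0)

# Noise-free check: with theta = 8 and n = 20, K = a_n * theta, xi_1 = theta.
n, theta = 20, 8.0
K = a_n(n) * theta
print(theta_watterson(K, n), theta_k_xi1(K, theta, n))
```

On noise-free expectations both forms return the same θ; with error-inflated singletons, only the first is biased.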
The second type of estimator (type II) is based on the average difference between two sequences. An uncorrected estimator of this type is Tajima's θπ = [2/(n(n − 1))] Σi<j kij (Tajima 1983), where kij is the number of differences between sequences i and j. Johnson and Slatkin (2008)'s corrected estimator of this type (with modification) is
Achaz (2008)'s corrected estimators of this type are

π − ξ1 = [n/(n − 2)] (π − 2ξ1/n)

for unfolded observations and

π − η1 = [(n − 1)/(n − 3)] (π − 2η1/n)

for folded observations.
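A sketch of the type II idea computed from the unfolded spectrum follows. The singleton correction uses the fact that singletons contribute 2ξ1/n to π with expectation 2θ/n, so rescaling the singleton-free part by n/(n − 2) restores unbiasedness; again, this is our rendering of the idea, not necessarily the exact published formula:

```python
# Tajima's pi from the unfolded spectrum, and a singleton-excluding
# correction in the spirit of Achaz (2008): singletons contribute
# 2*xi_1/n to pi, with expectation 2*theta/n under the standard neutral
# model, so rescaling by n/(n-2) restores unbiasedness.

def theta_pi(xi, n):
    """xi[i-1] = count of SNP class i; pi = mean pairwise difference."""
    return sum(i * (n - i) * xi[i - 1]
               for i in range(1, n)) * 2.0 / (n * (n - 1))

def theta_pi_xi1(xi, n):
    """Singleton-excluding type II estimator (unfolded)."""
    pi = theta_pi(xi, n)
    return (pi - 2.0 * xi[0] / n) * n / (n - 2.0)

n, theta = 20, 8.0
xi = [theta / i for i in range(1, n)]   # noise-free expected spectrum
print(theta_pi(xi, n), theta_pi_xi1(xi, n))
```

Because π weights class i by i(n − i), it already downweights singletons, which is why the uncorrected π estimator is the least error-sensitive of the uncorrected estimators discussed later.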
The third type of estimator (type III) compared is the GLS estimator. Our estimators are corrected GLS estimators, which can be considered an extension to Fu (1994b)‘s uncorrected GLS estimators BLUEu and BLUEf for unfolded and folded counting, respectively.
We compared the performance of the estimators using coalescent simulations (e.g., Hudson 2002). We assumed a Wright–Fisher model with constant population size, no selection, no recombination, and a simple infinite-site model for mutation. For each combination of parameters, 10,000 samples were simulated. Only point mutations (u = 1/3) were simulated. The GLS estimation iteration was stopped either when the absolute change of the GLS estimate between updates fell below a small threshold or when 200 updates of equation (18) had been conducted (see Methods for details). Sequencing error at each site was simulated to follow a binomial distribution with parameter p. Because BLUE − ξ1, BLUE − η1, K − ξ1, K − η1, π − ξ1, π − η1, BLUEu, and BLUEf assume at most two alleles at each site, for those estimators, if a site has more than one type of non-ancestral allele, the one with the largest count is regarded as the mutant allele and all other non-ancestral alleles are regarded as errors from ancestral alleles. For example, if a site has two non-ancestral alleles with counts 1 and 2, then it will be added to ξ2 instead of ξ3. ε̂ was calculated with equation (4) using BLUE − ξ1, K − ξ1, and π − ξ1 or with equation (5) using BLUE − η1, K − η1, and π − η1.
For each set of simulated samples with a given combination of parameters, we compared the mean and variance of each estimator. We also calculated the mean squared error (MSE) of each estimator and the variance of an optimal estimator, Vopt. For θ, we used the large-sample approximation of Vopt (Fu and Li 1993). Because we know the actual number of errors introduced in each simulated sample, say e, we simply used e/(nL) as the optimal estimator of ε and calculated its variance in the simulated samples. The ratio Vopt/MSE indicates the relative efficiency of each estimator with respect to the optimal estimator.
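The efficiency measure used in the comparison can be sketched with toy numbers (all values below are made up purely for illustration):

```python
# Relative efficiency as used in the comparison: the MSE of an estimator
# across simulated samples, against the variance of an "optimal"
# reference estimator. A ratio near 1 means near-optimal performance.

def mse(estimates, truth):
    """Mean squared error of a list of estimates against the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def relative_efficiency(estimates, truth, var_optimal):
    return var_optimal / mse(estimates, truth)

# Toy illustration with made-up numbers: four estimates of theta = 10.
est = [9.0, 11.0, 10.5, 9.5]
print(relative_efficiency(est, 10.0, var_optimal=0.5))
```

Note that MSE folds bias and variance together, which is why estimators that underestimate ε at large p lose efficiency even as their variances shrink.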
As expected, with increasing p, via either an increasing error rate ε or an increasing sample size n, the uncorrected θ estimators perform poorly (data not shown). The four uncorrected θ estimators (θK, θπ, BLUEu, and BLUEf) are biased upward significantly with increasing ε because they assume every polymorphism is a result of mutation. As a result, the uncorrected θ estimators are the least efficient compared with the corrected ones (data not shown). Among them, BLUEu and BLUEf are the most sensitive to sequencing error, whereas θπ is the least sensitive. Looking at this in another way, although under the null hypothesis of no sequencing error the BLUE estimators are the most efficient, followed by θK and then θπ, the order is totally reversed when p is larger. The above observations can be partially explained by the weight each estimator puts on ξ1. Although θK puts equal weight on every ξi, θπ puts less weight on ξi than on ξj if i(n − i) < j(n − j), so it downweights singletons. On the contrary, BLUEu puts more weight on ξ1 because, under the assumption of no error, ξ1 is a more reliable observation than the other classes. However, the opposite is true with errors.
On the other hand, the efficiency of the corrected θ estimators is less affected by increasing p. As for those with unknown ε, all are approximately unbiased. Their variances are relatively unchanged with the increase of ε and even decrease slightly with an increase of n (data not shown). Among them, the GLS estimators are the most efficient, followed by the type I and then the type II estimators (figs. 1 and 2; K − η1 and π − η1 are not shown because they have very similar efficiency to K − ξ1 and π − ξ1, respectively). For the estimators with known ε, all are approximately unbiased except that GLSu and GLSf are biased upward when nε is large. The variances of Kc, GLSu, and GLSf increase with the increase of ε or n except when n is small (say <100). As a result, their efficiency decreases with increasing p (figs. 1 and 2). With a small to moderate nε, GLSu and GLSf are the most efficient of the known-ε estimators (figs. 1 and 2). In summary, with a small to moderate ε (when nε < 1 and ε is known), GLSu and GLSf are the most efficient θ estimators; otherwise, BLUE − ξ1 and BLUE − η1 are the most efficient estimators.
Efficiency of θ and ε estimators with increasing ε. Only the subscripts of the estimators are shown.
Efficiency of θ and ε estimators with increasing n. Only the subscripts of the estimators are shown.
Increasing θ, either by increasing θ/L or by increasing L, affects the variances but not the means of the θ estimators. The variances of all estimators increase with the increase of θ. As a result, the efficiency of the corrected estimators decreases with an increase of θ, with the exception of Kc, whose efficiency actually increases when θ/L is small (fig. 3). Among the corrected estimators, the GLS estimators perform better than the others, whereas those with unfolded counting perform better than those with folded counting (fig. 3).
Efficiency of θ and ε estimators with increasing L and θ/L. Only the subscripts of the estimators are shown.
The differences between the estimators of p are smaller than those between their corresponding θ estimators. Because the estimation of p is based on a prior estimate of θ, it is easy to understand that a more accurate estimate of θ will lead to a more accurate estimate of p. In general, using a θ estimator with unfolded counting will provide a more efficient estimate of p than using an estimator with folded counting. Among the θ estimators with unfolded (or folded) counting, the GLS estimator usually gives the highest efficiency, whereas the type II estimator usually gives the lowest. With an increase of p, the efficiency of the estimators of p increases to a certain level and then decreases (figs. 1 and 2). Even though their variances converge toward that of the optimal estimator with an increase of p, they tend to underestimate ε at the same time; at a certain point, the efficiency gain from the variance is not sufficient to compensate for the efficiency loss from the bias. With an increase of L, the biases of the estimators do not change, but their variances decrease at the same pace as that of the optimal estimator. With an increase of θ/L, both the biases and the variances of the estimators increase. As a result, their efficiency always decreases with an increase of θ, and it decreases faster with an increase of θ/L than with an increase of L (fig. 3).
Zhao et al. (2000) sequenced a 10-kilobase noncoding region on human chromosome 22 from 64 individuals collected worldwide. Homologous sequences were obtained from a chimpanzee and an orangutan as outgroups. Among the 78 variant sites originally found in the alignment, 43 (including all 24 singletons) were verified by restriction fragment length polymorphism, resequencing, or subcloning. Four errors were found among the singletons, including three originating from nonvariant sites and one originating from a doubleton. The authors also described an error that changed a site from the 10-mutant class to the 9-mutant class. It is therefore possible for us to recover an uncleaned data set with five errors.
Here, we demonstrate the application of the estimators using both the cleaned data set and the uncleaned data set. The cleaned sequences were retrieved from GenBank and aligned using MUSCLE (Edgar 2004). Ancestral states of alleles were determined using the outgroup sequences. After trimming and removing insertions and deletions, the data set contains 69 polymorphic sites (table 1). Estimates of θ = 14.964 and ε = 2.5 × 10^-6 were obtained from the cleaned data (table 2). If we assume there is no error (ε = 0), GLSu = 16.300 and BLUEu = 16.238 (table 2) can be regarded as an upper bound of the true θ. We then restored the five errors to the data set according to Zhao et al. (2000) (table 1). The results show that the uncorrected estimators, including BLUEu, are inflated by the erroneous singletons, whereas BLUE − ξ1 is not (table 2). Actually, BLUE − ξ1 decreases slightly to 14.499, mostly because of a decrease in the number of doubletons. Because we already know there are at least five errors, we know a lower bound of ε̂ = 5/(nL) = 4.15 × 10^-6. Using this ε̂, GLSu = 15.452, which is our best approximation of an upper bound of the true θ with the uncleaned data.
Counts of Sites with Different Numbers of Mutant Alleles in the Example Data Set
Estimates of θ and ε Using the Example Data Set
The results reported here show that the GLS estimators of θ perform well with a small to moderate sequencing error rate ε. When nε < 1, the GLS estimators with a known ε (GLSu and GLSf) are the most efficient estimators. Among the θ estimators with unknown ε, the GLS estimators (BLUE − ξ1 and BLUE − η1) are relatively more efficient than the other corrected estimators. In general, the GLS estimators using unfolded observations are superior to those using folded observations. In addition, because the GLS estimators are based on summary statistics of the sample, the computation is much faster than that of Knudsen and Miyamoto (2007)'s method based on the full likelihood. From this, we conclude that the GLS estimators strike a good balance between estimation efficiency and computational efficiency.
The GLS estimators also have obvious limitations. First, they assume a maximum of one sequencing error at each site, which may not hold when n and ε are large. When nε > 1, GLSu and GLSf tend to overestimate θ. We also observed that the efficiency of GLSu and GLSf worsens faster than that of the other estimators with an increase of p. The reason is similar to that for the efficiency reversal of BLUEu, BLUEf, θK, and θπ with and without error, as briefly discussed in the Results. That is, when there are more errors than assumed (or than can be corrected), the counting of rare variants is further skewed. Compared with the other estimators, GLSu and GLSf put much more weight on rare variants than on common variants, which causes them to be biased more than the other estimators. Although BLUE − ξ1 and BLUE − η1 are more robust to p, their efficiency decreases when p is large. Second, the GLS estimators cannot easily handle missing data. However, in our limited simulations, we observed that if missing data are simply imputed using the observed allele frequencies, a random missing rate of up to 10% seems to have only a small effect on the efficiencies of GLSu, GLSf, BLUE − ξ1, and BLUE − η1 (data not shown). Third, their computational complexity is higher than those of the type I and type II estimators.
Choosing the most efficient θ estimator with error depends on our knowledge of the error rate. If ε is known and nε < 1, GLSu or GLSf is probably a good choice. If ε is unknown, we could first use BLUE − ξ1 or BLUE − η1 to estimate θ and then estimate p. When a sequence is very long or the sample size is very large, computational intensity may be a limitation. In this case, K − ξ1 or K − η1 is a good alternative.
In this paper, ε is estimated sequentially after an estimate of θ is obtained. Although this estimate is biased downward, it is reasonable when θ/L is small. It is possible to estimate ε and θ jointly using an iterative process similar to that used in the GLS estimators. That is, before we update the GLS estimate, we update p with
where θ̃ is the estimate of θ from the previous step. However, our simulations show that the performance is not as good as simply estimating θ while ignoring singletons and then estimating ε using the singletons and the θ estimate.
There are some technical issues associated with the computational complexity of the GLS estimators. One is the calculation of Σ^-1 in equation (18). Depending on the numerical method used, a true inverse of Σ may sometimes be hard to calculate. In our experience, when that happens, a generalized inverse can be used without noticeable problems. Another issue is the convergence of the GLS iteration. In our limited simulations, the convergence rate was typically greater than 99.9% for most combinations of parameters and never less than 99.7%.
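The fallback from a true inverse to a generalized inverse can be sketched as follows (using NumPy's Moore–Penrose pseudoinverse; the function name is ours):

```python
# Fallback from a true inverse to the Moore-Penrose generalized inverse
# when the estimated covariance matrix is numerically singular, as
# suggested in the text. Requires NumPy.
import numpy as np

def safe_inverse(sigma):
    """Invert sigma; fall back to the pseudoinverse if it is singular."""
    sigma = np.asarray(sigma, dtype=float)
    try:
        return np.linalg.inv(sigma)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(sigma)   # generalized inverse

singular = [[1.0, 1.0], [1.0, 1.0]]    # rank-deficient example
print(safe_inverse(singular))
```

For a rank-deficient covariance estimate, the pseudoinverse still yields a usable weighting matrix for the GLS step, which matches the observation that it can be substituted without noticeable problems.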
Several java programs for calculating the GLS estimators are available upon request or can be downloaded from http://sites.google.com/site/jpopgen/.
This research was supported by National Institutes of Health grant 5P50GM065509-07. We thank the two anonymous reviewers for their comments and suggestions.
Assume that originally there are ξi sites of mutation size i. Under the sequencing error model, such sites can produce sites with configuration (mutant, ancestral, error) equal to (i, n − i, 0), (i + 1, n − i − 1, 0), (i − 1, n − i + 1, 0), (i, n − i − 1, 1), or (i − 1, n − i, 1). We denote a configuration as (j, n − j − k, k) and its number as ξj,k, with k = 0 or 1. Then,
Similarly, we have
Assume all ξi s (represented collectively by ξ) are known. Then, observations derived from sites of different mutation sizes are independent, where * represents any valid configuration and ξi,* denotes an observation derived from sites of mutation size i. So the corresponding conditional covariances are 0 when the source mutation sizes differ. Then
Since the ξi s are unknown,