PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of molbiolevolLink to Publisher's site
 
Mol Biol Evol. 2009 July; 26(7): 1479–1490.
Published online 2009 March 24. doi:  10.1093/molbev/msp059
PMCID: PMC2734145

Inferring Population Mutation Rate and Sequencing Error Rate Using the SNP Frequency Spectrum in a Sample of DNA Sequences

Abstract

One challenge of analyzing samples of DNA sequences is to account for the nonnegligible polymorphisms produced by error when the sequencing error rate is high or the sample size is large. Specifically, those artificial sequence variations will bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may further bias the estimators of the population mutation rate equation n1 for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate θ, given a SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, error rate ε can be either known or unknown. In the latter case, ε can be estimated given an estimation of θ. Using coalescent simulation, we compared our estimators with other estimators of θ. The results showed that the GLS estimators are more efficient than other θ estimators with error, and the estimation of ε is usable in practice when the θ per bp is small. We demonstrate the application of the estimators with 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing θ estimators with error.

Keywords: coalescent theory, sequencing error, mutation rate, SNP frequency spectrum, generalized least squares

1. Introduction

Sequencing errors can produce many problems for evolutionary or population genetical analysis of samples of DNA sequences (Clark and Whittam 1992, Johnson and Slatkin 2008). Given a random sample of sequence from a population, artificial polymorphisms caused by sequencing error will skew both the number and the frequency spectrum of the observed single nucleotide polymorphisms (SNPs). This will further skew any estimations or tests based on the number and/or frequency spectrum of the SNPs if errors are not properly taken into account. The bias will be more prominent with increased sample size because sequencing error accumulates linearly with sample size while θ increases slower, as implied by coalescent theory. Johnson and Slatkin (2008) suggested a rule of thumb that an uncorrected estimate will be biased significantly if equation n2, where n is the sample size (i.e., number of sequences), ε is the average error rate per site, L is the sequence length of the given locus, and θ is the population mutation rate of the locus. θ is equal to equation n3 for diploids and equation n4 for haploids, where N is the effective population size and μ is the mutation rate for the given locus. equation n5 typically ranges from equation n6 for species with extremely low diversity, such as the bog turtle (Rosenbaum et al. 2007), to equation n7 for species with extremely high diversity, such as human immunodeficiency virus (Achaz et al. 2004). For human populations, equation n8 is approximately equation n9 (Crawford et al. 2005). Therefore, with a typical sequencing error rate of equation n10 on the Sanger sequencing platform (Zwick 2005), uncorrected estimates will have problem with larger than 50 individuals (100 chromosomes) in human populations. If next-generation sequencing technologies are used, their higher sequencing error (equation n11) (Mackay et al. 2008, Shendure and Ji 2008) may further reduce the upper limit of sample size to only 10 chromosomes or less.

In response to the problem, researchers are developing new unbiased estimators and tests for population samples with sequencing error incorporated into the analysis. Johnson and Slatkin (2006) proposed maximum likelihood estimators for θ and the scaled exponential growth rate using the SNP frequency spectrum while accounting for sequencing error via Phred quality scores. They also studied the effects of sequencing error on two uncorrected estimators of θ, Tajima's An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg (Tajima 1983) and Watterson's An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg (Watterson 1975), and proposed two unbiased estimators of θ assuming known ε (see discussions below) (Johnson and Slatkin 2008). Hellmann et al. (2008) proposed a method to correct An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg, assuming known ε. It is similar to Johnson and Slatkin (2008)‘s θ estimator based on the total number of polymorphic sites but takes account of the uncertainty of chromosome sampling in a shot-gun sequencing of mixed diploid individuals. Lynch (2008) proposed several methods to correct An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg for high-coverage shot-gun genomic sequences from a single diploid individual. Knudsen and Miyamoto (2007) incorporated missing data, sequencing error, and multiple reads for diploid individuals into a full-likelihood coalescent model of a sample of DNA sequences. Their model can be used to estimate θ and ε jointly. However, the computational intensity of calculating the likelihood limits the application of the model to small sample sizes (i.e., less than 20 sequences). Assuming a low but unknown ε, Achaz (2008) proposed two new moment estimators of θ based on the SNP number and frequency spectrum while ignoring singletons, on which sequencing error skews the most, and constructed two new test statistics for neutrality based on the new estimators. Jiang et al. (2009) developed a method to estimate θ and recombination rate for shot-gun resequencing data and proposed some refinements to increase its robustness to sequencing errors.

Here, we propose a new approach based on the generalized least squares (GLS) method to estimate θ using the SNP frequency spectrum of a random sample of DNA sequences from a population. It can be considered as a modification and extension of Fu (1994b)‘s best linear unbiased estimator (BLUE) estimator under the assumption of a moderate to low ε, either known or unknown. When ε is unknown, it can be estimated sequentially after the estimation of θ. The rational here is to incorporate the variance–covariance structure of the SNP frequency spectrum, with either known or unknown ancestral states of the SNPs, and reduce the variation of the estimator. Its computational intensity is higher than Achaz (2008)‘s moment estimators but much lower than Knudsen and Miyamoto (2007)‘s full-likelihood method, which makes it applicable to samples with thousands of sequences. In the following sections, we first introduce our GLS estimators along with some corrected or uncorrected estimators of θ; then, we compare these estimators using coalescent simulation assuming different population parameters. We demonstrate their application on a sample of DNA sequences of a 10-kilobase noncoding region on human chromosome 22. Finally, we discuss the advantages, limitations, and some technical issues of the GLS estimators.

2. Methods

2.1. Assumptions

Sequencing error at each site is usually modeled as a binomial or Poisson variable with parameter equation n12. As defined in the Introduction, n is the sample size (i.e., number of sequences) and ε is the average error rate per site. Here, we simplify the model by assuming that each site has at most one sequencing error, which occurs with probability P. As a rule of thumb, this assumption is approximately valid with equation n13. We further assume that when a sequencing error occurs on an ancestral (or mutant) allele, the probability that it leads to an existing allele (mutant or ancestral) is u and the probability that it yields a new allele is equation n14. For example, if only point mutation is considered and assuming that a nucleotide has equal probability to be read incorrectly as one of the other three nucleotides, equation n15. For a sample of n sequences with length L randomly sampled from a population, we designate equation n16 as the (unknown) number of true polymorphic sites with mutation size i (or SNP class i, i.e., there are i copies of the mutant allele in the sample) and equation n17 as the observed number of sites with equation n18 copies of the ancestral allele, i copies of the mutant alleles and k copy of the new allele (other than mutant and ancestral alleles) produced by sequencing error (equation n19).

2.2. Approximate GLS Estimators of θ When ε is Unknown

If ε or p is unknown but assumed to be relatively small, it is easy to see that most errors are singletons (equation n20 or equation n21) because equation n22 with assumption of infinite site model and a reasonable θ (Achaz 2008). In another words, equation n23 for equation n24. By ignoring singletons, we removed most of the errors so that equation n25 in the remaining SNP classes based on which new estimators can be developed. For example, Achaz (2008) proposed An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgK − ξ1 and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgK− η1 estimators, which are modified Watterson (1975)‘s An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg estimator and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgπ− ξ1 and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgπ − η1 estimators, which are modified Tajima (1983)‘s An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg estimator (see details below). Fu (1994b) proposed a θ estimator based on the GLS method (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf, called BLUE estimators in Fu 1994b), which makes full use of the observed numbers of SNP frequency classes, so that it has much less variability than either Watterson (1975)‘s or Tajima (1983)‘s estimator.

Following Achaz (2008)‘s idea that because singletons are much more likely to be errors comparing to other SNP class, an estimator based on the SNP spectrum excluding singletons will be robust to errors. Here, we briefly describe the modification of Fu (1994b)‘s GLS method by ignoring all singletons and triple allele. See Fu (1994a, 1994b) for more technique details of the method. We first rewrite the expectations, variances, and covariances of the observed number of SNP classes as:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx4_ht.jpg
(1)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx5_ht.jpg
(2)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx6_ht.jpg
(3)

where equation n26. The values of ρi and equation n27 are dependent on the population model, which can be obtained theoretically or numerically using simulation. Assuming a Wright–Fisher model with constant population size, no selection, and recombination, their explicit formulas are known (Fu 1995). Writing in a matrix form, we have

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx7_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx8_ht.jpg

and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg is calculated with equations (2) and (3). Then, the corresponding GLS estimator of θ is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx10_ht.jpg

Because the calculation of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg needs prior estimation of θ, the An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE−ξ1 is calculated iteratively (Fu 1994a, 1994b): 1) give an initial value of θ, say θ0 = An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg, 2) calculate An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg with θ0; 3) calculate An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − ξ1 with An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg and let θ1 =An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − ξ1; 4) update An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg with equation n28; and 5) repeat step 3 and 4 until An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − ξ1 converges.

If the ancestral state of each SNP is unknown (i.e., we do not know which allele is ancestral and which is mutant) then we need to fold the counting of ξi and equation n29 to include all SNPs with a minor allele frequency i and major allele frequency equation n30 or vice versa. In the folded case,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx11_ht.jpg

and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg is changed accordingly. The corresponding estimator, designated as An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − η1, can be calculated with the same iterative process as described above.

After the θ is estimated, p can be approximated as

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx12_ht.jpg
(4)

for unfolded counting or

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx13_ht.jpg
(5)

for folded counting, where E1 |An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpg) and En−1|An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpg) are calculated using equation (1) with θ replaced by An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpg. Then, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpg is simply An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx15_ht.jpg/n. For convenience, we use the subscript of the θ estimator used in An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpg to designate different An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpgs. For example, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpgBLUE − ξ1 means the An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpg calculated using An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE− ξ1.

2.3. GLS Estimators of θ When ε is Known

When ε is known, we can develop a more precise GLS estimator of θ by making full use of this information. First, we need to derive expectations, variance, and covariances of equation n31. For simplicity, letting equation n32 if equation n33 or equation n34 then

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx16_ht.jpg
(6)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx17_ht.jpg
(7)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx18_ht.jpg
(8)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx19_ht.jpg
(9)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx20_ht.jpg
(10)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx21_ht.jpg
(11)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx22_ht.jpg
(12)

where equation n35 is an index function, which equals to 1 if equation n36 or 0 otherwise; am,ji, bm,ji, c m,j1,k,j2i, and dm,j1,k,j2i are functions of i, j, equation n37, equation n38, m, k, n, and u. Their detailed expression and the derivation of the above formulas can be found in the Appendix. equation n39, equation n40, and equation n41equation n42 can be calculated using equations (1)(3).

In practice, equation n43 and equation n44 are indistinguishable even when the ancestral states are known. To avoid the above difficulties, we combine observations of equation n45 and equation n46. The general formulas for calculating the expectation of the combined observations and the variance–covariance of the combined observation with other (combined) observations are

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx23_ht.jpg
(13)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx24_ht.jpg
(14)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx25_ht.jpg
(15)
An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx26_ht.jpg
(16)

where equation n47, equation n48, and equation n49 can be any given equation n50 or equation n51. For example, to apply equations (13)(14) to equation n52 and equation n53, we simply let equation n54, equation n55, equation n56, and equation n57. By incorporating equation n58, we have

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx27_ht.jpg

A GLS estimator can be calculated as follows. If the ancestral state of each site is known or can be inferred from outgroups, let

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx28_ht.jpg

then

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx29_ht.jpg
(17)

The GLS estimator of θ is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx30_ht.jpg
(18)

again An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg is an estimation of the variance–covariance matrix of Y , which can be calculated using equations (8)(16). The same iterative process described above is then used to calculate An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLS. In the case that ancestral states are unknown, we need to further fold equation n59 with equation n60An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx31_ht.jpg, and equation n61 with equation n62An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx32_ht.jpg because they are indistinguishable. Let

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx33_ht.jpg

and follow the same steps as described for An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLS. We can calculate the GLS estimator with folded observations. Later in this paper, we designate An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf to be the GLS estimator using folded observations and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu to be that using unfolded observations.

2.4. Comparing to Other Estimators of θ Using Simulation

Three types of corrected and uncorrected estimators of θ were compared in this study, which will be briefly introduced below. Johnson and Slatkin (2006)‘s method was not compared because here, we assume the detailed Phred quality scores is unknown. Neither did we include Knudsen and Miyamoto (2007)‘s method because of its computational intensity and limitation to small sample sizes. Hellmann et al. (2008) and Johnson and Slatkin (2008)‘s estimators based on the total number polymorphic sites should have similar performances, so only the latter was included in our comparison.

The first type of estimator (type I) is based on the total number of polymorphic sites. An uncorrected estimator of this type is the widely used Watterson's estimator An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg = K/an (Watterson 1975), where equation n63 and K is the total number of polymorphic sites in the sample. Assuming a known ε, Johnson and Slatkin (2008) proposed a corrected estimator (with modification):

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx34_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx35_ht.jpg

and m is the total number of possible alleles in a site. Assuming an unknown but small ε, Achaz (2008) proposed two corrected estimators of this type. They are An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx36_ht.jpg for unfolded observations and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx37_ht.jpg for folded observations.

The second type of estimator (type II) is based on the average difference between two sequences. An uncorrected estimator of this type is Tajima's An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx38_ht.jpg (Tajima 1983), where equation n64 is the number of difference between sequences i and j. Johnson and Slatkin (2008)‘s corrected estimator of this type (with modification) is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx39_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx40_ht.jpg

Achaz (2008)‘s corrected estimators of this type are

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx41_ht.jpg

for unfolded observations and

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx42_ht.jpg

for folded observations.

The third type of estimator (type III) compared is the GLS estimator. Our estimators are corrected GLS estimators, which can be considered an extension to Fu (1994b)‘s uncorrected GLS estimators An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf for unfolded and folded counting, respectively.

We compared the performances of the estimators using coalescent simulations (e.g., Hudson 2002). We assumed a Wright–Fisher model with constant population size, no selection, no recombination and a simple infinite sites model for mutation. For each combination of parameters, 10,000 samples were simulated. Only point mutations (equation n65) were simulated. The GLS estimation iteration was stopped either when the absolute value change between the update of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLS is smaller than equation n66 or when 200 updates of equation (18) was conducted (see Methods for details). Sequencing error at each site was simulated to follow a binomial distribution with parameter p. Because An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − ξ1, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − η1, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx43_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx44_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx45_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx46_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf assume at most two alleles at each site, for those estimators if a site has more than one type of non-ancestral allele, the one with the largest count is regarded as the mutant allele and all other non-ancestral alleles are regarded as errors from ancestral alleles. For example, if a site has two non-ancestral alleles with counts 1 and 2 then it will be added to equation n67 instead of equation n68. An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx15_ht.jpg was calculated with equation (4) using An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE −ξ1, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx43_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx45_ht.jpg or with equation (5) using An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUE − η1, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx44_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx46_ht.jpg.

3. RESULTS

3.1. Simulation Comparison

For each set of simulated samples with a given combination of parameters, we compared the mean and variance of each estimator. We also calculated the mean squared error (MSE) of each estimator and the variance of an optimal estimator, equation n69. For θ, the large-sample approximation of equation n70 is equation n71 (Fu and Li 1993). Because we know the actual number of errors introduced in each simulated sample, say equation n72, we simply use equation n73 as the optimal estimator of ε and calculated its variance equation n74 in simulated samples. The ratio equation n75 indicates the relative efficiency of each estimator to the optimal estimator.

As expected, with increasing p, via either an increasing error rate ε or an increasing sample size n, the uncorrected θ estimators perform poorly (data not shown). The four uncorrected θ estimators (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf) are upper biased significantly with increasing ε because they assume every polymorphism is a result of mutation. As a result, the uncorrected θ estimators are the least efficient estimators comparing to the corrected ones (data not shown). Among them, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf are the most sensitive to sequencing error, whereas An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg is the least sensitive. Looking at this in another way, although under the null hypothesis of no sequencing error, the order of their relative efficiency is An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx47_ht.jpg, the order is totally reversed when p is larger. The above observations can be partially explained by the weight each estimator puts on equation n76. Although An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg puts equal weight on every equation n77, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg puts less weight on equation n78 than on equation n79 if equation n80. On the contrary, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu puts more weight on equation n81 because under the assumption of no error, equation n82 is a more reliable observation than equation n83. However, the opposite is true with errors.

On the other hand, the efficiency of corrected θ estimators is less affected by increasing p. As to those with unknown ε, all are approximately unbiased. Their variances are relatively unchanged with the increase of ε or even decrease slightly with an increase of n (data not shown). The order of their efficiency is An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx48_ht.jpg (figs. 1 and and2,2, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx44_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx46_ht.jpg are not shown because they have a very similar efficiency as An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx43_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx45_ht.jpg, respectively). For the estimators with known ε, all of them are approximately unbiased except that An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf are upper biased when equation n84. The variances of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgkc, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf increase with the increase of ε or n except when n is small (say < 100). As a result, their efficiency decreases with increasing p (figs. 1 and and2).2). With equation n85, the order of their efficiency is An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx49_ht.jpg (figs. 1 and and2).2). In summary, with a small to moderate ε (when equation n86 and known), An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf are the most efficient θ estimators; otherwise, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx51_ht.jpg are the most efficient estimators.

FIG. 1.

Efficiency of θ and ε estimators with increase of ε. Only the subscripts of the estimators were shown.

FIG. 2.

Efficiency of θ and ε estimators with increase of n. Only the subscripts of the estimators were shown.

Increasing θ, either by increasing equation n87 or by L, affects the variances but not the means of the θ estimators. The variances of all estimators increase with the increase of θ. As a result, the efficiency of corrected estimators decrease with an increase of θ, with the exception of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgKc, which actually increases when equation n88 is small (fig. 3). Among the corrected estimators, the GLS estimators perform better than others, whereas those with unfolded counting perform better than those with folded counting (fig. 3).

FIG. 3.

Efficiency of θ and ε estimators with increase of L and θ/L. Only the subscripts of the estimators were shown.

The differences between the estimators of p are smaller compared with their corresponding θ estimators. Because the estimation of p is based on prior estimation of θ, it is easy to understand that a more accurate estimation of θ will help with a more accurate estimation of p. In general, using a θ estimator with unfolded counting will provide a more efficient estimation of p than using an estimator with folded counting. Among those θ estimators with unfolded (or folded) counting, the GLS estimator usually has the highest efficiency, whereas the type II estimator usually has the lowest efficiency. With an increase of p, the efficiency of the estimators of p increase to a certain level and then decreases (figs. 1 and and2).2). Even though their variances converge toward equation n89 with an increase of p, they tend to underestimate ε at the same time; and at a certain point, the efficiency gain from the variance is not sufficient to compensate for the efficiency loss from bias. With an increase of L, the biases of the estimators do not change but their variances decrease at the same pace as the optimal estimator. With the increase of equation n90, both the biases and the variances of the estimators increase. As a result, their efficiency always decreases with an increase of θ, which is faster with an increase of equation n91 than with an increase of L (fig. 3).

3.2. Application Example

Zhao et al. (2000) sequenced a 10-kilobase noncoding region on human chromosome 22 from 64 individuals collected worldwide. Homologous sequences were obtained from a chimpanzee and an orangutan as outgroups. Among the 78 variant sites originally found from the alignment, 43 (including all 24 singletons) were verified by restriction fragment length polymorphism, resequencing, or subcloning. Four errors were found from the singletons, including three originated from nonvariant sites and one originated from a doubleton. The authors also described an error that changed a site from the 10 mutant class to the 9 mutant class. So it is possible for us to recover an uncleaned data set with five errors.

Here, we demonstrate the application of the estimators using both the cleaned data set and the uncleaned data set. The cleaned sequences were retrieved from GenBank and aligned using MUSCLE (Edgar 2004). Ancestral states of alleles were determined using the outgroup sequences. After trimming and removing insertions and deletions, the data set includes equation n92 and equation n93 and 69 polymorphic sites (table 1). An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg = 14.964 and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx52_ht.jpg = 2.5 × 10−6 were estimated using the cleaned data (table 2). If we assume there is no error (equation n94), An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu = 16.300 and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu = 16.238 (table 2) can be regarded as an upper bound of the true θ. We then restored the five errors back to the data set according to Zhao et al. (2000) (table 1). Results show that An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu are inflated by the erroneous singletons, whereas An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg is not (table 2). Actually, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg decreases slightly to 14.499 mostly due to a decrease in the number of doubletons. Because we already know there are at least five errors, we know a lower bound of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpg = 5/nL = 4.15 × 10−6. Using this An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx14_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu = 15.452, which is our best approximation of an upper bound of the true θ with the uncleaned data.

Table 1

Counting of Sites with Different Number of Mutant Alleles in the Example Data Set

Table 2

Estimations of θ and ϵ Using the Example Data Set

4. DISCUSSION

The results reported here show that the GLS estimators of θ perform well with a small to moderate sequencing error rate ε. When equation n95, the GLS estimators with a known ε (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf) are the most efficient estimators. For the θ estimators with unknown ε, the GLS estimators (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx51_ht.jpg) are relatively more efficient than other corrected estimators. In general, the GLS estimators using unfolded observation are superior to those using folded observations. In addition, because the GLS estimators are based on summary statistics of the sample, the computation is much faster than Knudsen and Miyamoto (2007)‘s method based on full likelihood. From this, we conclude that the GLS estimators are a good balance between estimation efficiency and computation efficiency.

The GLS estimators also have obvious limitations. First, it assumes a maximum of one sequencing error for each site, which may not be true when n and ε are large. When equation n96, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf tend to overestimate θ. We also observe that the efficiency of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf worsens faster than that of other estimators with increase of p. The reason is similar to that of the efficiency reverse of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEu, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgBLUEf, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx2_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx1_ht.jpg with and without error, as we briefly discussed in the Results. That is, when there are more errors than assumed (or can be corrected), the counting of rare variants is further skewed. Comparing to other estimators, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf put much more weight on rare variants than on common variants, which causes them to be biased more than other estimators. Although An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx51_ht.jpg are more robust to p, their efficiency decreases when p is large. Second, they cannot easily handle missing data. However, with our limited simulations, we observed that if missing data is simply imputed using the observed allele frequencies, a random missing rate of up to 10% seems to only have a small effect on the efficiencies of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx51_ht.jpg (data not shown). Third, the computational complexity is higher than those of the type I and type II estimators.

Choosing the most efficient θ estimator with error depends on our knowledge of the error rate. If ε is known and equation n97, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSu or An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLSf is probably a good choice. If ε is unknown, we could first use An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx50_ht.jpg or An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx51_ht.jpg to estimate θ and then estimate p. When a sequence is very long or the sample size is very large, computational intensity may be a limitation. In this case, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx43_ht.jpg or An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx44_ht.jpg is a good alternative.

In this paper, ε is estimated sequentially after an estimation of θ. Although this estimation is biased downward, it is reasonable when θ/L is small. It is possible to estimate ε and θ jointly using a similar iterative process as that used in the GLS estimators. That is, before we update An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLS , we update p with

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx53_ht.jpg
(19)

where An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpg is the estimation of θ in the previous step. However, our simulation shows that the performance is not as good as a simple estimate of θ ignoring singletons and then estimating ε using singletons and the estimate of θ.

There are some technical issues associated with the computational complexity of the GLS estimators. One is the calculation of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg − 1 in equation (18). Depending on the numerical methods used, sometimes a true inverse of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx9_ht.jpg may be hard to calculate. In our experience, when that happens, a generalized inverse can be used without noticeable problems. Another issue is the convergence of An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx3_ht.jpgGLS. With our limited simulations, the convergence rate is typically larger than 99.9 % for most combination of parameters and never less than 99.7 %.

Several java programs for calculating the GLS estimators are available upon request or can be downloaded from http://sites.google.com/site/jpopgen/.

Acknowledgments

This research was supported by National Institutes of Health grant 5P50GM065509-07. We thank the two anonymous reviewers for their comments and suggestions.

APPENDIX

Assume that originally, there are equation n98 sites of mutation size i. Under the sequencing error model, it can produce sites with configuration (mutant, ancestral, and error) (equation n99), (equation n100), (equation n101), (equation n102), or (equation n103). We denote the configuration as (equation n104) and its number as equation n105, with equation n106. Then,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx54_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx55_ht.jpg

Similarly, we have

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx56_ht.jpg

Assume all equation n107s (represented by ξ) are known. Then, equation n108 and equation n109 are independent, where * represents any valid configuration and equation n110. So that equation n111 when equation n112. Let equation n113 and equation n114, then

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx57_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx58_ht.jpg

Then,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx59_ht.jpg

Since equation n115s are unknown,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsp059fx60_ht.jpg

References

  • Achaz G. Testing for neutrality in samples with sequencing errors. Genetics. 2008;179:1409–1424. [PubMed]
  • Achaz G, Palmer S, Kearney M, Maldarelli F, Mellors JW, Coffin JM, Wakeley J. A robust measure of HIV-1 population turnover within chronically infected individuals. Mol. Biol. Evol. 2004;21:1902–1912. [PubMed]
  • Clark AG, Whittam TS. Sequencing errors and molecular evolutionary analysis. Mol. Biol. Evol. 1992;9:744–752. [PubMed]
  • Crawford DC, Akey DT, Nickerson DA. The patterns of natural variation in human genes. Annu. Rev. Genomics. Hum. Genet. 2005;6:287–312. [PubMed]
  • Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic. Acids. Res. 2004;32:1792–1797. [PMC free article] [PubMed]
  • Fu YX. A phylogenetic estimator of effective population-size or mutation-rate. Genetics. 1994a;136:685–692. [PubMed]
  • Fu YX. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics. 1994b;138:1375–1386. [PubMed]
  • Fu YX. Statistical properties of segregating sites. Theor. Popul. Biol. 1995;48:172–197. [PubMed]
  • Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133:693–709. [PubMed]
  • Hellmann I, Mang Y, Gu Z, Li P, de la Vega FM, Clark AG, Nielsen R. Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. Genome. Res. 2008;18:1020–1029. [PubMed]
  • Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. [PubMed]
  • Jiang R, Tavaré S, Marjoram P. Population genetic inference from resequencing data. Genetics. 2009;181:187–197. [PubMed]
  • Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome. Res. 2006;16:1320–1327. [PubMed]
  • Johnson PLF, Slatkin M. Accounting for bias from sequencing error in population genetic estimates. Mol. Biol. Evol. 2008;25:199–206. [PubMed]
  • Knudsen B, Miyamoto MM. Incorporating experimental design and error into coalescent/mutation models of population history. Genetics. 2007;176:2335–2342. [PubMed]
  • Lynch M. Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 2008;25:2409–2419. [PMC free article] [PubMed]
  • Mackay T, Richards S, Gibbs R. Proposal to sequence a Drosophila genetic reference panel: a community resource for the study of genotypic and phenotypic variation [white paper]. 2008 FlyBase. org. Available from http://flybase.org/ static_pages/news/wpapers.html. Last accessed on April 14, 2009.
  • Rosenbaum P, Robertson J, Zamudio K. Unexpectedly low genetic divergences among populations of the threatened bog turtle (Glyptemys muhlenbergii) Conserv. Genet. 2007;8:331–342.
  • Shendure J, Ji H. Next-generation DNA sequencing. Nat. Biotechnol. 2008;26:1135–1145. [PubMed]
  • Tajima F. Evolutionary relationship of DNA-sequences in finite populations. Genetics. 1983;105:437–460. [PubMed]
  • Watterson GA. Number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 1975;7:256–276. [PubMed]
  • Zhao Z, Jin L, Fu YX, et al. Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc. Natl. Acad. Sci. U.S.A. 2000;97:11354–11358. [PubMed]
  • Zwick ME. A genome sequencing center in every lab. Eur. J. Hum. Genet. 2005;13:1167–1168. [PubMed]

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press