Forward diffusion equations played an important role during the development of classical population genetics, as they were originally introduced by R. Fisher and S. Wright to model the evolutionary process (Kimura 1964
). With the arrival of modern DNA sequencing technologies, forward diffusion processes have been applied to the inference of demographic parameters and the effects of natural selection (Williamson et al. 2005
; Boyko et al. 2008
; Gutenkunst et al. 2009
). These studies have been limited to scenarios with one, two, and three simultaneous populations. In this article, we have introduced a different approach to solving the forward diffusion equations by means of truncated polynomial expansions. These methods yield exact solutions of the AFS in the absence of migration; they can be used equally to study demographic models in one, two, and three populations with gene flow, and furthermore they can be applied to study models with four simultaneous populations.
We have applied our method to the study of the human expansion out of Africa and peopling of the Americas by means of a model with four simultaneous populations. Our four-population model can be seen as a combination of two three-population models that were studied before by means of diffusion-theory–based techniques (Gutenkunst et al. 2009
). Similarly, we used the Environmental Genome Project SNP database. The demographic parameters that we have inferred in this model are similar to many recent results (see Reich 2001
; Keinan et al. 2007
; Gravel et al. 2011
; Li and Durbin 2011
). However, some of our inferred parameters are substantially different from the parameters inferred in another study that used numerical solutions to forward diffusion equations (Gutenkunst et al. 2009
In this article we have also studied the behavior of different numerical solutions of the diffusion PDEs that approximate the AFS under a specified demographic model. In particular, we have compared the polynomial-based approach introduced in Lukic et al. (2011)
with the finite-difference approach implemented in Gutenkunst et al. (2009)
. Although the methods exhibit comparable behaviors, we found that the polynomial-based approach obtains better results in the regime where it yields exact solutions of the AFS, i.e.
, in the zero-migration limit (see and and ). Also, we confirmed that the polynomial-based approach allows us to broadly predict the magnitude of the numerical error as a function of the model parameters for a given truncation parameter Λ. The finite-difference approach exhibited a better behavior in the models with strong intensity of migration that we have considered in this work. However, in the case of the method introduced in Gutenkunst et al. (2009)
there is not a general theory that allows us to predict how the numerical error behaves as a function of the parameter space.
Numerical errors and bias-corrected confidence intervals
A common limitation that both approaches sometimes exhibit is that small numerical errors in the computation of the AFS can propagate to large biases in the parameter space when we search for the maxima of the numerical approximation of the likelihood function. Hence, even if biases due to numerical errors can be minimized, in some cases small but significant numerical sources of error that affect the statistical accuracy of the inferred demographic parameters will remain. For example, exhibits several cases where the bias is very large. These biases are not expected to diminish as the sample size grows, or as the number of SNPs increases, because they are due to numerical artifacts. One could minimize these biases by choosing a larger truncation Λ. However, numerical floating-point errors in the evaluation of polynomials become important sources of numerical error when the degree of the polynomial is large enough. In our implementation, we found that for values of Λ > 40 these sources of numerical error became larger than the truncation error due to a finite choice of Λ.
To minimize the impact of such biases one can either consider models with fewer parameters or introduce statistical corrections to the propagated numerical errors. Here, we apply standard bootstrap methods to correct for bias in the estimators and confidence intervals (see Efron and Tibshirani 1994
; DiCiccio and Efron 1996
) of some of the models studied above. If
is the maximum-likelihood estimator of a demographic model, where Λ denotes the truncation parameter of the numerical approximation, the bias is defined as
is the expected biased estimator and
is the expected unbiased estimator. Although we do not know the unbiased estimator, we can estimate the bias by means of the parametric bootstrap. In particular, we simulate a large number of SNP allele frequencies under the estimated parameters
and consider the associated AFS for each set of simulated SNPs. We denote by
the maximum-likelihood estimate associated with the b
th simulated AFS (where 1 ≤ b
is the number of bootstraps). Therefore, the bootstrap estimate of bias is
is the original maximum-likelihood estimate. This approximation will be valid as long as the bias is small and the number of bootstraps B
is large enough. Given
, we can now compute the bias-corrected 95% confidence intervals using nonparametric bootstraps (see DiCiccio and Efron 1996
is the 100 α
th percentile of the nonparametric bootstrap distribution.
As an example, in model 3 we used the parameters inferred by maximum likelihood that are shown in to simulate 10,000 SNPs per bootstrap by means of Monte Carlo methods. After estimating the bias with the parametric bootstrap, the 95% bias-corrected confidence intervals for the parameters of model 3 are 0.988−1.04 for 4NAu, 0.722–1.26 for N1/NA, 0.255–0.488 for N2/NA, 0.0136–0.0197 for T/2NA, 0.479–0.83 for 2NAm1→2, and 1.69–2.24 for 2NAm2→1.
Overcoming current limitations
Several limitations exist in the use of joint allele-frequency spectra with many populations. The most important one is that given the joint density of population frequencies
, the time needed to compute an AFS grows exponentially with the number of populations. Therefore, the only way to extract demographic information from such a high number of populations requires reducing the number of cells in the AFS to be computed. Because of this, future applications of our method for K
> 3 will require the use of either adaptive frequency spectra or projections of high-dimensional AFS into triplets or couplets of populations. For instance, by integrating out populations we can compute the joint density of frequencies associated with every triplet of populations from the higher joint density as
The associated three-population AFS is derived from Equation 1 and the density
A second question that remains to be explored concerns the bias of the estimator. For a given observed AFS, our method yields a sequence of maximum-likelihood peaks
labeled by the number of polynomials Λ used. The convergence of the numerical approximation to the exact AFS in the limit of an infinite number of polynomials guarantees that
is an unbiased estimator. It is important to understand the asymptotic behavior of the sequence of peaks
given a few finite values of the sequence
. In this study we computed different estimators for several values of Λ, to confirm that our estimators were converging toward the unbiased true estimator. In practice, this is an elaborate approach that requires running nonlinear optimization algorithms for different values of the parameters used in the approximation. A better understanding of this asymptotic behavior will allow the simplification of the analysis of propagated numerical errors associated with finite values of Λ. Similarly, it is important to identify the maximum sample size (number of SNPs) used in each bootstrap that allows us to estimate accurate confidence intervals for a given Λ and a demographic model. The importance of choosing the right sample size for a given Λ lies in that statistical error decreases as sample size increases and propagated numerical error decreases as Λ increases. Therefore, for any given Λ there exists a large enough sample size that, if used to estimate confidence intervals, will yield significantly biased intervals that are difficult to correct via standard statistical methods. This is due to the fact that the numerical error produced by a given Λ will be significantly larger than the statistical error produced by a large enough sample size. In our four-population model we used only 170 SNPs per bootstrap, which yields a conservative estimate of the confidence intervals as we confirmed in simulations. However, in the general case when more data are available and it is not clear what sample size to use in the bootstrap for a given Λ, one can combine parametric and nonparametric bootstrap techniques to estimate the sample size and the bias. In the parametric bootstrap, one generates simulated data with specified parameters that one uses to estimate the bias due to numerical errors. Also, one can determine which sample sizes are small enough to produce larger statistical errors than numerical errors. Then one can use this knowledge to estimate accurate confidence intervals by applying corrections for bias in the nonparametric bootstrap. This approach, although elaborate, will help to estimate accurate confidence intervals for any value of Λ in a wide variety of models.