Using genetic data to infer the time of migrations has always been difficult, and the resulting time estimates often come with wide confidence intervals, making the dates unreliable and inferences problematic. Here, we have introduced an approach that takes advantage of dense genome-wide SNP data to improve precision and reduce bias in making inferences about the timing of human migrations. By using an admixed population, one can capitalize on the fact that the genome recombines each generation, producing chromosomes that are a mixture of the parental genetic material. The structure of an admixed genome contains temporal information about an admixture event: a greater number of narrower ancestry blocks indicates more recombination events, and hence greater time depth.
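The relationship between time depth and block structure can be illustrated with a toy simulation. The sketch below is not the method of this study; it assumes a single admixture pulse, no genetic drift, crossovers accumulating as a Poisson process at a rate of one per Morgan per generation, and independent assignment of ancestry to each segment. The function name `simulate_blocks` and all parameter values are hypothetical.

```python
import random

def simulate_blocks(t, m, length_morgans, rng):
    """Return ancestry block widths (in Morgans) on one toy chromosome.

    t: generations since a single admixture pulse (assumed, must be > 0)
    m: admixture proportion from the minor parental population
    """
    # Crossovers accumulated over t generations: Poisson process, rate t per Morgan.
    positions = []
    x = rng.expovariate(t)
    while x < length_morgans:
        positions.append(x)
        x += rng.expovariate(t)
    edges = [0.0] + positions + [length_morgans]

    # Each segment between crossovers draws its ancestry independently.
    labels = [rng.random() < m for _ in range(len(edges) - 1)]

    # Merge adjacent segments of the same ancestry into visible blocks.
    widths = []
    start = edges[0]
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            widths.append(edges[i] - start)
            start = edges[i]
    widths.append(edges[-1] - start)
    return widths

if __name__ == "__main__":
    rng = random.Random(1)
    for t in (10, 100):
        w = simulate_blocks(t, 0.25, 10.0, rng)
        print(f"t={t}: {len(w)} blocks, mean width {sum(w) / len(w):.4f} Morgans")
```

Under these assumptions, older admixture yields more and narrower blocks. Real populations are finite, so the number of blocks does not keep growing in this simple way.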
Simulations indicate that the WT coefficients can be used to obtain accurate estimates of the time of admixture from suitable genome-wide SNP data. We therefore applied the method to three datasets, consisting of about 650,000 SNPs, to estimate the amount and time of admixture for three human populations: African-Americans, Polynesians, and Fijians. In addition, we analyzed and dated admixture in five HGDP populations of African and Middle-Eastern origin. At first glance, it may appear that the simulated and empirical data differ in that the simulations used fully-differentiated populations, which is not the case for the empirical data. However, as explained in more detail in the Results and Discussion (Basic Setup section), the number of SNPs is adjusted in the StepPCO sliding windows until the ancestral populations can be statistically-differentiated, just as with the simulated data.
For African-Americans, we estimate an average of 19% European ancestry, with a wide range of less than 5% to more than 40% European ancestry across individuals. Both the average and the observation of a wide range of individual admixture estimates are in keeping with previous studies [10]. The estimated time of admixture is about 180 years ago (95% CI: 120-240 years ago), which is probably an underestimate since admixture in the African-American population is ongoing (implying that new ancestry blocks are being continuously introduced by new recombination events, which potentially removes older block structure by replacing narrower ancestry blocks with new, wider blocks).
We tested the performance of the method on Fijians and Polynesians, as both populations are of admixed Asian and Melanesian ancestry [6]. Previous demographic analyses of the genome-wide SNP data used in this study strongly support both an admixed Asian/Melanesian ancestry for Fijians and Polynesians as well as subsequent additional gene flow from Melanesia into Fiji, but not Polynesia [19]. Based on this previously established scenario, we estimated an average of about 25% (from 18 to 28%) Melanesian ancestry in Polynesians, in good agreement with previous estimates based on the same [19] or other [6] data. The estimated time of admixture is about 90 generations ago, or 2,700 years (95% CI: 2,300-3,900 years), in good agreement with a previous estimate of about 3,000 years ago based on an ABC simulation approach for the same data [19]. For Fiji, the estimated amount of Melanesian ancestry was about 40%, and this admixture is estimated to have occurred about 37 generations ago, or 1,100 years (95% CI: 870-1,170 years). An ABC simulation-based approach for the same data gave an estimated date of 62 generations for this admixture in Fijians, about twice as long ago as our estimate. We speculate that, as in the case of the African-Americans, the estimate based on WT coefficients may be biased toward more recent dates if the gene flow to Fiji occurred over a period of time, as more recent gene flow replaces older, narrower ancestry blocks with newer, wider ancestry blocks. The range of individual Melanesian ancestry estimates is much wider for Fiji (from 22 to 63%) than for Polynesia (from 18 to 28%), which may indeed indicate a longer period of gene flow into Fiji.
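The year figures quoted above are consistent with a generation time of 30 years (90 generations corresponding to 2,700 years). That value is inferred here from the reported numbers and is not stated in this section, so the following conversion should be read as a sketch under that assumption; the function name is hypothetical.

```python
# Assumption: 30 years per generation, inferred from "90 generations ... 2,700 years".
YEARS_PER_GENERATION = 30

def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert an admixture date in generations to years."""
    return generations * years_per_generation

if __name__ == "__main__":
    estimates = {
        "Polynesia (WT coefficients)": 90,
        "Fiji (WT coefficients)": 37,
        "Fiji (ABC simulation)": 62,
    }
    for label, g in estimates.items():
        print(f"{label}: {g} generations ~ {generations_to_years(g)} years")
```

With this generation time, 37 generations gives 1,110 years, which matches the rounded figure of 1,100 years reported for Fiji.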
Our results for the Mozabite, Mandenka, Bedouin, Druze and Palestinian populations are similar to those for HAPMIX for inferring local ancestry, and in addition our method seems to perform better with respect to more ancient admixture events (as also shown with simulated data: Figure ). In particular, we dated the admixture event in the Mozabites and the Druze to 131 and 90 generations ago respectively, 30 generations more than the corresponding estimates obtained with HAPMIX [17]. HAPMIX estimation of the time since admixture is based on the number of calculated ancestry transitions (that is, the number of breakpoints); both our simulations and previous simulations [17] indicate that in finite-size populations the number of breakpoints does not increase with time according to expectations (see Equation 1 above), but rather stabilizes, leading to underestimates in admixture dates (Figure ). Furthermore, because human populations are closely related and not very well differentiated, direct estimation of the number of breakpoints and block width as a measure of time since admixture for human genetic data is problematic for two reasons. Firstly, to have enough power to reliably assign chromosomal segments to an ancestral population, it is necessary to use relatively large genomic windows, which correspondingly reduces detection of closely-spaced breakpoints. Secondly, for every location in the genome that potentially carries a breakpoint, a formal decision has to be made as to whether to consider it a true breakpoint or not. This transformation of the 'raw' signal into a discrete signal potentially leads either to some poorly defined breakpoints being overlooked, or conversely to random effects becoming inflated and falsely considered as a true signal. These errors, however small, will accumulate over the many measurements taken. By contrast, the spectral analysis approach implemented here does not require any data transformation and is applied to the 'raw' signal directly. This has the advantage of preserving the statistical nature of the signal until the final averaging step, and thus does not involve detection of the exact location (and presence) of breakpoints, where inevitably large errors in estimation could occur. Although we followed Price et al. 2009 in using African and European parental groups for the admixed Mozabite, Mandenka, Bedouin, Druze and Palestinian groups from the CEPH-HGDP, in fact previous studies have shown that the Druze, Bedouin and Palestinian populations are admixed primarily along a European-Central Asian axis, with little African admixture, and the Mandenka exhibit very little European admixture [18]. Here, we report dates for the presumptive European gene flow, to compare our results to the previous study [17], but it is important to keep in mind that our method (like all admixture methods) requires the use of pre-defined parental groups. Incorrect identification of the ancestral groups contributing to an admixed group will obviously lead to erroneous conclusions, hence careful attention must be paid when identifying parental groups. This is especially true for groups that are suggested to have experienced admixture a long time ago, and hence had more time to experience genetic drift (which is always expected to act in a direction orthogonal to the axis of admixture). In such cases, it is difficult to distinguish between an admixed population that has been subject to genetic drift, and a population that has experienced admixture along a different axis of variation.
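The contrast drawn above between calling discrete breakpoints and analyzing the raw signal spectrally can be illustrated with a small example. The sketch below uses the Haar wavelet as a stand-in (the specific wavelet and the StepPCO signal used in the study are not reproduced here) and an idealized square-wave "ancestry signal"; the function names are hypothetical. It shows that the scale at which wavelet energy concentrates tracks block width without any breakpoint being called.

```python
import math

def haar_energy_by_level(signal):
    """Energy of Haar detail coefficients per level (level 1 = finest scale)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies

def square_wave(n, block):
    """Idealized ancestry signal: alternating blocks of 1.0 and 0.0."""
    return [1.0 if (i // block) % 2 == 0 else 0.0 for i in range(n)]

def dominant_level(signal):
    """Level (1 = finest) carrying the most detail energy."""
    energies = haar_energy_by_level(signal)
    return energies.index(max(energies)) + 1

if __name__ == "__main__":
    for block in (4, 32):
        level = dominant_level(square_wave(256, block))
        print(f"block width {block}: dominant Haar level {level}")
```

Narrow blocks (width 4) place their energy at level 3, wide blocks (width 32) at level 6, so a shift of energy toward finer scales corresponds to older admixture in this idealized setting. Real signals are noisy, which is why the energy is averaged rather than read off a single level.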
Theoretically, there are no limitations as to how far back in time one can get good estimates of admixture time with WT coefficients. The performance of the method is influenced by two factors: the density of SNPs analyzed and the degree of differentiation between the two parental populations. Increasing SNP density would allow the estimated time horizon for detecting admixture to be moved further back. We therefore expect that full sequence data will increase the sensitivity and resolution of our method. The analysis presented here was based on about 650,000 markers; the current estimate for the number of SNPs in the human genome is around 15 million [42]. Full sequence data will thus provide a roughly twentyfold increase in SNP density, and thereby allow for a twentyfold reduction in the size of the sliding window. Thus, assuming that the newly added SNPs are no less informative for population differentiation than the SNPs on the Affymetrix arrays, we expect that analysis of full sequence data should offer at least a twentyfold improvement in the potential time depth for admixture estimates for human populations. However, given that there is relatively little genetic differentiation between human populations, distinguishing among parental populations requires relatively large segments of the genome, and this also poses a restriction on the time depth of the method. The more closely related the parental populations, the larger the window size needed to span a sufficient number of informative SNPs. Obviously, this limitation will persist regardless of the type of molecular data considered. However, because of the strong ascertainment bias associated with the SNPs genotyped on the arrays, we expect that SNP data generated using the array technology necessarily underestimate the variation that exists between human populations, and recent studies suggest that this underestimation could be considerable [43]. Moreover, the method introduced here can be used with any species for which suitable genome-wide data exist, and the larger the genetic difference among the parental populations, the less genome-wide data needed for accurate admixture estimates.
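The arithmetic behind the "twentyfold" figure can be checked directly from the two SNP counts quoted above. The sketch below assumes, as the text does, that window size scales inversely with SNP density and that added SNPs are equally informative; the function names are hypothetical.

```python
ARRAY_SNPS = 650_000       # markers used in this analysis (from the text)
GENOME_SNPS = 15_000_000   # estimated SNPs in the human genome (from the text)

def density_gain(array_snps=ARRAY_SNPS, genome_snps=GENOME_SNPS):
    """Fold increase in SNP density when moving from array to full sequence data."""
    return genome_snps / array_snps

def relative_window_size(gain):
    """Window size relative to the array-based window, assuming inverse scaling."""
    return 1.0 / gain

if __name__ == "__main__":
    gain = density_gain()
    print(f"density gain: {gain:.1f}x")
    print(f"relative window size: {relative_window_size(gain):.3f}")
```

The ratio is about 23, so "twentyfold" is a slightly conservative rounding.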
An additional advantage of the StepPCO method is that it provides an estimate of the admixture proportions for each individual within an admixed population. Individual-level estimates of admixture obtained here via StepPCO for African-Americans, Polynesians, and Fijians were quite similar to those obtained via the maximum-likelihood based approach frappe
], indicating that StepPCO gives reliable results. Furthermore, the StepPCO method also provides information about the distribution of admixture along each chromosome. As such, this approach is also promising for disease gene mapping (in recently admixed populations), and for studying local selection. According to neutral expectations the admixture level should be constant along the genome, but a locus favored by positive selection in the admixed population should appear to have greater admixture proportions than would be expected from the genome-wide average. We are currently investigating the utility of this approach for identifying candidate genes subject to local selection.
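The idea of comparing local admixture proportions with the genome-wide average can be sketched as a simple outlier scan. This is a naive illustration and not the procedure under investigation by the authors: it assumes a hypothetical input of per-individual local ancestry values (between 0 and 1) at a common set of loci, and it flags loci whose mean ancestry exceeds the genome-wide mean by more than a chosen number of standard deviations. The function name and the threshold are assumptions.

```python
from statistics import mean, pstdev

def flag_ancestry_outliers(local_ancestry, z_threshold=3.0):
    """Return indices of loci with unusually high mean ancestry.

    local_ancestry: list of per-individual lists, one ancestry value per locus
    (hypothetical input format).
    """
    n_loci = len(local_ancestry[0])
    # Mean ancestry at each locus across individuals.
    per_locus = [mean(ind[j] for ind in local_ancestry) for j in range(n_loci)]
    genome_mean = mean(per_locus)
    genome_sd = pstdev(per_locus)
    if genome_sd == 0:
        return []  # perfectly uniform ancestry: nothing to flag
    return [
        j for j, p in enumerate(per_locus)
        if (p - genome_mean) / genome_sd > z_threshold
    ]

if __name__ == "__main__":
    # Toy data: 5 individuals, 20 loci, elevated ancestry at locus 7.
    row = [0.2] * 20
    row[7] = 0.9
    data = [list(row) for _ in range(5)]
    print(flag_ancestry_outliers(data))
```

In practice a locus flagged this way would only be a candidate; drift and errors in local ancestry inference can also produce deviations from the genome-wide average.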
In conclusion, we have shown that wavelet transformation is a useful and novel means of dating admixture events from genome-wide data. Other potential extensions of the methodology introduced here include admixture scenarios involving more than two parental populations, and implementing spectral analysis of the raw genomic signal directly, rather than from the StepPCO signal. There is potentially much more to be learned from surfing the wavelets of the genome.