|Home | About | Journals | Submit | Contact Us | Français|
The founding of New World populations by Asian peoples is the focus of considerable archaeological and genetic research, and there persist important questions on when and how these events occurred. Genetic data offer great potential for the study of human population history, but there are significant challenges in discerning distinct demographic processes. A new method for the study of diverging populations was applied to questions on the founding and history of Amerind-speaking Native American populations. The model permits estimation of founding population sizes, changes in population size, time of population formation, and gene flow. Analyses of data from nine loci are consistent with the general portrait that has emerged from archaeological and other kinds of evidence. The estimated effective size of the founding population for the New World is fewer than 80 individuals, approximately 1% of the effective size of the estimated ancestral Asian population. By adding a splitting parameter to population divergence models it becomes possible to develop detailed portraits of human demographic history. Analyses of Asian and New World data support a model of a recent founding of the New World by a population of quite small effective size.
Archeological evidence, as well as anatomical, linguistic, and genetic evidence, have shown that the original human inhabitants of the Western Hemisphere arrived from Asia during the Late Pleistocene [1–4]. However, there persists uncertainty on the source, within Asia, of peoples who migrated to the New World , on the timing of the earliest migration [6–10], and on whether there have been multiple migrations [3,11–13].
For complex historical subjects such as the colonization of the Americas, there are many ways that models can be constructed, examined, and compared. One approach is to develop a portrait based on a particular kind of data, such as linguistic , skeletal , or archaeological  data, or on DNA sequence data from a particular portion of the human genome such as the mitochondria [4,16–19] or the Y chromosome . Yet each source of data has unique sources of variation. In the case of genetic data there occurs a large stochastic variance of the coalescent history among genes that causes different loci to vary widely in levels of genetic variation and in apparent patterns of relationships among populations [20–22]. This stochastic variance is sometimes overlooked, for example in discussions of the histories of the individual DNA sequence haplotypes , and it is easy to underestimate the many possible histories that are consistent with a finding that haplotypes are shared by different populations [23–25].
To accommodate the stochastic variance among loci, population geneticists have turned in recent years to Bayesian and likelihood methods that explicitly take into account the range of possible gene tree histories that are consistent with a given dataset [26–30]. For questions on population divergence, the focus has been on models of population splitting in which an ancestral population divides into two descendant populations, after which there may be gene flow between the descendant populations. These “isolation with migration” (IM) models can have a large number of parameters, and they offer the possibility of capturing many of the dynamics that occur in the early stages of population divergence or speciation [30–33].
Figure 1A shows the basic IM model, in which the ancestral and descendant populations each have a constant size. Each of the terms in the model is explained in Table 1. Basic limitations of this model are that it cannot provide details on how descendant populations were founded or whether population sizes have changed. Certainly for human populations there is considerable genetic evidence that population sizes have grown [34–37], and it would be helpful if it were possible to capture information on the sizes of descendant populations as they are formed. For example, if one descendant population formed as a small founder population that later grew to a large size, such dynamics would not be revealed in the fitting of the basic IM model. To allow the study of such histories, an additional parameter has been added to the IM model. Figure 1B shows a model in which an ancestral population splits in two, with the relative sizes of those two new populations reflected in the parameter s, where 0 < s < 1. At the time of the split, descendant population 1 has size sNA from which it moves to size N1 at the time of sampling. Similarly, population 2 begins with size (1 − s)NA from which it moves to size N2 at the time of sampling. Figure 1B depicts one population growing and the other shrinking, but in fact either population is free to either grow or shrink under this model.
These models were applied to questions on the founding of New World populations from Asia. A total of nine DNA sequence datasets that included Asian and Native American (Amerind-speaking) samples were drawn from the literature (Figure 2 and Table 1) and analyzed jointly using a procedure that provides posterior probability distributions for each of the model parameters [30,33]. The stochastic variance among loci is clearly evident in the variation of FST values (between Asian and New World samples) observed among the loci. Of the nine loci included in the present study, three have fairly high FST values, while the remainder are either negative (undefined) or near zero (Table 1).
Asian samples were arbitrarily designated as being from population 1 and the New World samples from population 2. In this case, 1 − s is the fraction of the ancestral population that founded the New World population. The analyses also require that prior distributions be specified for each model parameter. It was assumed that the New World was founded by a minority of the ancestral Asian population, corresponding to a specified uniform prior distribution for s between 0.5 and 1. For the other parameters, flat prior distributions were selected that would span the entire range of the posterior densities (i.e., uninformative priors) . However, in some cases the posterior distributions were quite flat over the highest portions of parameter ranges. In these cases the choice of the upper bound on the prior distribution does affect the posterior distribution, and we are not able to use an uninformative prior distribution. However, parameters can still be estimated on the basis of the locations of peaks in the parameter regions that can be assessed, and the effect of altering the prior distribution on these estimates can be determined.
The overall picture that emerges is one in which the New World was very recently founded by a small number of individuals (effective size of about 70), and then grew by a factor of about 10. The data do suggest that there has been gene exchange between Asia and the New World since that time; however, the likelihood surfaces are quite flat, so confidence in gene flow estimates is low.
The method assumes that the loci have not been subject to recombination or to directional or balancing selection. For recombination, we used only those loci that showed no evidence of recombination by the four-gamete test . It is possible that this has missed some recombination since the time of common ancestry. Regarding natural selection, the study was limited to loci that had not individually been reported to show evidence of directional or balancing selection. However, it is possible that when considered together, and polymorphism and divergence from chimpanzees are considered under a common neutral model, that there is evidence of selection. An HKA test  of the eight loci with estimates of divergence from chimpanzees (Table 2) yielded a p value of 0.054, which is nearly statistically significant. This test assumes, as do the models analyzed in this study, that all loci are sampled from the same panmictic population , and it is possible that the differing geographic sources of the loci included in the study may have contributed some variation.
The estimated posterior distributions are shown in Figure 3. For the initial analysis, allowing for exponential population size changes, the posterior distribution for t yielded both a major and a minor peak (the curve for t with a high tupper, Figure 3D). Given the mutation rate estimates (see Table 1), the location of the major peak (t = 0.032) corresponds to 7,130 y, whereas the location of the minor peak (t = 0.27) corresponds to 44,400 y. Given the remote possibility of such an ancient time as the latter, analyses were also done with a smaller upper bound on t of 0.2 (identified as “low tupper” in Figure 3), which corresponds to 33,000 y. Analyses were done with this reduced upper limit for t for both models in Figure 1, allowing for population size change and for the case of fixed population sizes. In the case of constant population sizes, the distribution for t shows a peak (t = 0.038) very near those for the analyses under population size change; however, the highest posterior density is found at the upper limit of t. When the constant population size model was run with a higher upper limit on t, the posterior distribution showed the same low value peak as well as a steadily rising curve for higher values of t (unpublished data).
The archaeological portrait of early New World populations has largely centered around widespread Clovis sites that have an earliest estimated age of about 13,000 y before the present [15,40,41]. The oldest generally agreed upon New World archaeological date is from the non-Clovis Monte Verde site in Southern Chile, which has been dated to about 14,000 y before the present [10,42,43]. Clearly the time points associated with our estimates of t are more recent than expected, given the archaeological estimates. However, these distributions do span the time periods that have been most discussed. For example, a time of 14,000 y has a relatively high probability in each of the analyses (Figure 3E). Given that people have lived in the New World probably for only several hundred generations, it is noteworthy both that the posterior densities for t do show clear peaks in the expected time period and that the probability estimates drop to zero as t approaches zero. In other words, the data contain a clear signal of a nonzero, albeit recent, founding time of New World populations.
With regard to migration, each of the three analyses show nonzero peaks for both directions of gene flow. These may well reflect the occurrence of more than one episode of migration to the New World. For example, it has been suggested on the basis of mitochondrial DNA haplotypes and glaciation history that an initial migration along a coastal route may have been followed later by another migration, possibly through an ice-free noncoastal corridor . However, the posterior distributions shown here have little resolution, as all of the curves for m1 and m2 are broad, and all have high probability at the lower limit of resolution, indicating that zero gene flow is nearly as well supported by the data as are nonzero gene flow levels.
The ancestral population parameter, θA, shows a relatively narrow distribution with a very consistent peak location across the three analyses. These attributes are partly to be expected, given that the very large majority of the variation in the samples is older than t. In effect, more information is available for θA than for the other parameters. The estimated effective size of the ancestral population is about 9,000 (Table 3), which is roughly consistent with previous estimates for Asian samples . The current Asian population parameter (θ1) revealed broad distributions and estimates that are near those for the ancestral population. Although the estimates of current effective size in Asia vary among the analyses (Table 3), they are all fairly close to the ancestral size estimates, suggesting that there has not been much population growth in Asia since t. Also consistent with the apparent constancy of population size is the distribution of s, the splitting parameter, which shows a peak at 0.992, signifying that only a small portion (less than 1%) of the ancestral Asian population left to found the New World population.
In contrast to the Asian population, the New World population parameter (θ2) is much smaller, and suggests a recent New World effective population size of less than 1,000 (Table 3). However, given the estimate of the effective size of the founding New World population (about 70; Table 4), the overall picture is of a nearly 10-fold growth in the New World effective size since t.
In order to gain a sense of how consistent the data actually are with the model and the parameter estimates, 500 simulated datasets were generated under the model in Figure 1B, with sample sizes and true parameter values (see Table 2, column 3) that were the same as for the actual data. From each simulated dataset, the average number of pairwise differences between sequences were calculated within each population (Asia and the New World) and between these populations. The average of these values from the 500 simulated datasets, and the observed values from the actual data, are shown in Table 5. In general, the observed and expected values are similar; however, one consistent pattern of departure is that the data from the New World, for most loci, show more variation by this measure than were found in the simulated data.
The method described is one of several new approaches that can glean information about ancestral population sizes [30,45–47]. By including a new parameter for population splitting, it is possible to generate estimates not only of the size of the ancestral population, but also of the founding size of each founder population.
Taken together, the analyses in this study suggest a recent founding of the New World Amerind-speaking peoples by a small population of effective size near 70, followed by population growth in the New World. It is interesting that the analyses do not suggest much population size change in Asia since the time of the founding of the New World population. Given the very broad distributions for θ1, it is possible that the true value of this parameter is much higher than suggested by the peak location, and that there has been considerable population growth in Asia. The analyses reveal very broad distributions for migration parameters, and although the peak locations suggest that gene flow has been fairly high (2Nm values greater than 1; see Table 3), the estimated probabilities of migration rates having been zero are also high (Figure 3G and and3H).3H). Also, because Eskimo-Aleut and Na Déne speakers were not included in this study, the question of separate migrations for these groups has not been addressed .
As parameter-rich as the method is, neither this nor any mathematical model can be expected to fully represent the complex history of two related populations. However, the same is essentially true of narrative models, as investigators are always constrained by limited data and the need to keep explanations as simple as possible given their data. In this light, the IM model provides a fairly complete framework for some oft-debated questions on human history. With the addition of a new parameter, the IM framework can now also be used to address questions about the founding size of populations and of population size change.
In the context of human demographic history, the most problematic assumption under the IM model is that each population is panmictic. Certainly this is not the case today, and it is likely to have even been less true in times past. This raises the general and important question of how local patterns of population structure affect regional or global estimates of diversity [44,48,49]. Although this question cannot be answered here, the analyses do suggest that some kinds of departures from panmixia have not occurred. For example, if the New World had been founded by a local population that had long been separated from other Asian populations, then the estimate of t would be expected to reflect this older population structure, rather than the founding of the New World. Our generally low estimates of t argue against this scenario. Similarly, if the sampled Asian populations had been highly structured, with many long-separated local populations, then this would have inflated the estimates of NA and N1, respectively. However, the generally low estimates of effective population size argue against this particular kind of population structure.
The analyses presented here share with some other genetic studies estimated dates for the peopling of the Americas that are more recent than archeologically based estimates [8,9,16]. However, the difficulty of estimating such recent events using genetic data alone should not be overestimated . When considering human populations within the past few tens of thousands of years, two gene copies that share the same haplotype will often have had a common ancestor far longer ago than any of the dates in question. Similarly, genetic evidence on the peopling of the Americas has been interpreted both as consistent with multiple migrations  and as indicating just a single founder event [16,19,50]. Divergent interpretations are understandable, given that a finding of two populations that share sequence haplotypes at a locus can be taken as evidence of two quite different models: (1) a recent population separation; or (2) gene exchange between populations.
The available data do not yet allow precise estimates of founding time nor of whether there has been gene flow between the New World and Asia following the initial founding event. However, the new method implements a parameter-rich model of divergence and has the potential to recover the history of complex divergence processes. The method can also be applied to a large number of loci, with large sample sizes, and in the future can be expected to provide increasingly detailed portraits of human population divergence.
Given the prevailing model of the founding of New World populations via a Bering land bridge, the descendant populations were defined as the Amerind speakers of the New World and the peoples of northeastern Asia. Greenberg et al.  proposed that New World populations include three linguistic groups (Eskimo-Aleut, Na Déne, and Amerind), each associated with a separate episode or period of migration. Because of the limited number of published comparative DNA sequence studies that include samples from Eskimo-Aleut and Na Déne group, the present study was limited to samples from Amerind-speaking populations. Asian samples were limited to those from China, Mongolia, Korea, and Siberia. These are partly arbitrary boundaries selected as a balance between the need to include as many loci as possible and uncertainty about the present locations of descendants of those Asian populations that gave rise to the founders of the New World.
The model fitting requires data from loci that do not show evidence of recombination and that do not show clear evidence of directional or balancing selection. All available datasets from the literature that met these criteria and that had multiple DNA sequences from both of the designated sample regions were selected. The selected loci are listed in Table 1. The input data file is provided in Dataset S1, and a list of sample locations is provided in Protocol S1.
At the center of the method for estimating the parameters is an expression for the posterior probability distribution of model parameters Θ, given the data. For the case of multiple loci
where Θ refers to the vector of parameters of the model, Xi refers to the data for locus i, and Gi is the genealogy for locus i . With n loci, the full set of parameters includes six or seven demographic parameters, depending on the inclusion of s, as well as n locus-specific mutation rate scalars . A genealogy includes the topology of an ultrametric tree, the associated coalescence times, and the times of migrations on each branch of the tree . For a given locus i, the probability f(Xi|Gi) is calculated using the mutation model for that locus and the branch lengths in the genealogy. The probability f(Gi|Θ) is calculated using expressions from basic coalescent theory [30,51–55]. By integrating over all possible genealogies that are consistent with the data, the results obtained are not conditioned on any particular estimate of the genealogy, and they necessarily incorporate all of the stochastic variance that arises among independent loci under the model.
The integration in Equation 1 cannot be solved directly for any but the simplest of models, but it can be approximated using a Markov chain simulation . This approach was originally applied to the IM model by Nielsen and Wakeley , and then augmented to include multiple loci  and additional mutation models [32,57].
Over the course of a simulation the genealogy for a given locus varies for topology, branch lengths, and migration times. However, the probability of the data for a locus given a particular genealogy, f(Xi|Gi), depends only upon the branch lengths and the mutation model for that locus . Although inclusion of s will affect the genealogies that arise in the course of the simulation, there will be no effect on the calculation of the probability of the data for a given genealogy (i.e., f(Xi|Gi) is not a function of s), and thus including s has no effect on the applicability of the method to diverse mutation models. In contrast, the probability of a genealogy given a set of parameter values, f(Gi|Θ), depends strongly on s because the probability of individual coalescent and migration times are functions of population size.
The calculation of f(Gi|Θ) is most directly done by taking the product of the probabilities of each of the coalescent and migration events that occur in the genealogy. Griffiths and Tavare  developed the general theory for the probability distribution of coalescence times when the population size is changing. Given a function v(τ)= Nτ/N0 of the population size at time τ, relative to that at time 0, they provide a general expression for the distribution of coalescent times. For population 1, the effective size goes from N1 at time zero, to sNA at time t. If it is assumed that the size change is exponential over this period, then for population 1,
and for population 2,
One additional complication that arises is that when the population is growing exponentially back into the past (decreasing in size as time moves forward), there is a finite probability that the time to coalescence will be infinity . Thus, for population 1 when sNA is less than N1, it is necessary to calculate the probability of coalescence time conditioned on there being a coalescent event.
Migration under an exponentially changing population size can also be incorporated under this same framework with two changes. First, unlike coalescence, where the rate is inversely proportional to population size, the rate of migration is directly proportional to population size. Second, as time goes backward in the coalescent, the migration rate from population 1 to population 2 (i.e., m1) actually corresponds to the movement of genes from population 2 to population 1 as time moves forward. This means that in the coalescent under changing population size, we expect that the migration rate from population 1 to 2 will vary with the size of population 2. Thus the corresponding relative rate function for migration from population 2 to population 1 is
and for migration in the reverse direction it is
These intensity functions for coalescence and migration were used to develop an expression for f(Gi|Θ) that includes s, and that could be directly incorporated into the update criteria for all of the demographic, mutation, and inheritance scalars described in Hey and Nielsen . Also needed, in order to allow for changing population size, are the update criteria for s and the update criteria for the genealogies. For s, updates are drawn from a uniform distribution over the user-specified prior range (e.g., in the current study, an interval within the range of 0.5 to 1). An update from s to s* will affect the probability of all genealogies and thus has an acceptance probability, under the Metropolis Hastings criterion, of
where n is the number of loci and Gi is the current genealogy for locus i (see Equation 3 in Hey and Nielsen ). If we assume a uniform prior distribution for s, such that the prior probability of s, f(s), is constant for all s, and if we choose updates such that the q(s* → s) = q(s → s*) , then this simplifies to
For genealogy updates the same proposal distribution of genealogies that was used in the case without s was retained, and then this proposal distribution was incorporated into the update criteria . If f(Gi|s) denotes the probability of the genealogy for locus i, given the other parameters including s, and f(Gi) is the Hastings term for the proposal probability of the genealogy for locus i, given the other parameters1 excluding s, then the update criteria for the genealogy for locus i is
The IM computer program  was modified to include the additional parameter. The program is available from http://lifesci.rutgers.edu/~heylab/HeylabSoftware.htm#IM. For the Markov chain simulation that is implemented by the program, it is difficult to assess how well the method works, because of the need to generate large numbers of simulated datasets and because of the long run times required . To conduct testing, a program was written to generate simulated datasets under the models in Figure 1. Datasets were simulated in groups of 10 or 20, each having 10–20 loci, for a given set of parameter values, and for a range of parameter values. Figure 4 shows the marginal posterior densities estimated from each of 20 independent simulations for a case of modest population growth with the following parameter values. θ1 = 10; θ2, = 10; θA = 10; t = 2.5; s = 0.2; m1 = 0.04; and m2 = 0.1. For each parameter, the mean of the 20 estimates is shown, and in general these are fairly close to the true value, though there is considerable variance for the peak locations in individual runs. To test whether the locations of these distributions are consistent with the true values of the parameters (i.e., the values used in the simulations), probabilities were combined by treating each simulation as an independent test of the same hypothesis . For each posterior density pi, i = 1,…,20, is the chance that a parameter value is more extreme (i.e., departs more from the mean of the distribution) than is the actual true value. That is, if x is the area of the curve to the left of the true value then pi = 2x if x < 0.5 and pi = 2(1 − x) if x > 0.5. If the pi's are uniformly distributed, then the quantity
is χ2 distributed with 40 degrees of freedom (i.e., two times the number of densities). The z values were as follows: θ1, 35.5; θ2, 26.4; θA, 41.7; t, 41.1; s, 26.4; m1, 29.9; m2, 44.1; and the mean of the seven values was 35.0. In the corresponding χ2 distribution, 90% of the probability mass falls above 29.05; 50% falls above 39.3; and 10% falls above 51.8 . Clearly these values are not entirely independent of each other, but they all fall in the middle of the χ2 distribution with a mean (35.0) close to the 50% point of the χ2 distribution (39.3).
From these simulations, and many others (additional results provided in Protocol S1), it is clear that sample sizes do need to be large for the posterior distributions to be informative. With data from fewer than five loci or fewer than ten individuals per population per locus, it is often the case that distributions are very flat or that there are multiple peaks. There is a tradeoff in sampling effort required for different kinds of histories. When t is small, sampling effort should be shifted to larger sample sizes per locus, whereas when t is large, sampling effort should be shifted toward more loci. This tradeoff is a byproduct of the fact that the stochastic variance among loci, that is associated with coalescent and migration events in genealogies at times near t, goes up as t increases. Another tradeoff that arises is between s and the migration rate parameters. Just as the frequency of polymorphic sites can be used to estimate changes in population size , it can also be appreciated that the information for s must reside in the distribution of times of node intervals in the descendant populations. Migration can have dramatic effects on node interval times within populations. In practice, via simulation, the method does not resolve a sharp peak for s for populations that have had more than moderate amounts of migration (e.g., 2Nm values are greater than 0.5; see Protocol S1).
Each of the three analyses were done using at least three independent runs, with ten or more independent chains under Metropolis coupling  as described by Geyer . Each chain was initiated with a burn-in period of 100,000 updates, and the total run length of each analysis was between 10 million and 30 million updates. The mixing properties of individual runs were monitored by measuring the autocorrelation of individual parameters over the course of the run, and by estimating the effective sample size for each of the parameters as a function of the autocorrelation estimates (see p. 499 in ). Analyses were taken to have converged upon the stationary distribution if independent runs generated similar distributions, with each having a lowest effective sample size of 50 for the time parameter (the parameter to show the slowest rate of mixing).
To convert estimates of parameters that include the mutation rate to more easily interpreted units, a value of 6 million y since the splitting of human and chimpanzee lineages was used [65–69]. The geometric mean of the human-chimpanzee DNA sequence divergence of each locus, except ATM (see Table 2), was calculated and then used as a molecular clock calibration for converting the estimate of the time parameter, t, to divergence in years. The geometric mean mutation rate across these loci was estimated to be 4.66 × 10−6 mutations per year. The geometric mean is used rather than an arithmetic mean, because under the multilocus model, the mutation rate by which demographic parameters are scaled is the geometric mean of the individual locus-specific mutation rates .
To convert the estimates of the population mutation rate parameters (θ1, θ2, and θA) to estimates of effective population size (N1, N2, and NA, respectively) a measure of mutation rate on a scale of generations is needed. Thus, an assumption was made of 20 y per generation, and the geometric mean divergence between humans and chimpanzees for each species contrast was divided by 12 million y then multiplied by 20 y per generation. These calculations yielded a geometric mean value of 9.32 × 10−5 mutations per generation. These mutation rate values were then used to convert individual θ estimates to effective population size estimates (i.e., θ = 4Nu, and N = θ/4u).
Migration parameters in the model can be used to obtain population migration rate estimates (i.e., M = 2Nm, the product of the effective number of gene copies and the per gene copy migration rate) using an estimate of the population mutation rate (θ = 4Nu). Thus θ × m/2 = (4Nu × m/u)/2 = 2Nm .
This is the input file that contains all of the data and that was analyzed using the IM computer program.
(582 KB TXT).
(92 KB DOC).
John Wakeley, Tad Schurr, and David Meltzer provided input on an early draft of the paper. Rasmus Nielsen provided some helpful suggestions on parameter updating. Thanks also to three reviewers for very helpful suggestions and critique.
Competing interests. The author has declared that no competing interests exist.
Author contributions. JH conceived and designed the model and analyses, selected the datasets, wrote the computer programs, performed the analyses, and wrote the paper.
Citation: Hey J (2005) On the number of New World founders: A population genetic portrait of the peopling of the Americas. PLoS Biol 3(6): e193.