The distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual provides information about the history of population size change over time. Existing methods for reconstructing the detailed TMRCA distribution have analyzed large samples of individuals at non-recombining loci like mitochondrial DNA13
. However, the statistical resolution of inferences from any one locus is poor, and power fades rapidly moving back in time as there are few independent lineages probing deep time depths (in humans, no information at all is available from mitochondrial DNA beyond about 200kya when all humans share a common maternal ancestor11
). In contrast, a diploid genome sequence harbors hundreds of thousands of independent loci, each with its own TMRCA between the two alleles an individual carries. In principle it should be possible to reconstruct the TMRCA distribution across the autosomes and chromosome X by studying how the local density of heterozygous sites changes across the genome, reflecting segments of constant TMRCA separated by historical recombination events. To explore whether we could take advantage of this idea to learn about the detailed TMRCA distribution from a diploid whole genome sequence, we proposed the PSMC model, which is a specialization of the sequentially Markovian coalescent model14
to the case of two chromosomes (). The free parameters of this model include the scaled mutation and recombination rates, and piecewise constant ancestral population sizes (Methods). We scaled results to real time assuming 25 years per generation and a neutral mutation rate of 2.5×10−8
. The consequences of uncertainty in the two scaling parameters will be discussed later in the text.
Illustration of the PSMC model and its application to simulated data
To validate our model, we simulated a hundred 30Mbp sequences with a sharp out-of-Africa bottleneck followed by a population expansion, and inferred population size history with PSMC (). PSMC is able to recover the parameters used in the simulation and the variance of the estimate is small between 20kya–3Mya. More recently than 20kya or more anciently than 3Mya few recombination events are left in the present sequence, which reduces the power of PSMC, and therefore the estimated effective population size (Ne
) in these time intervals is not as accurate and has large variance. To test the robustness of the model, we introduced variable mutation rates and recombination hotspots in the simulation (Supplementary Text
). The inference is still close to the true history (). A uniform rate of SNP ascertainment errors does not change our qualitative results, either (Figure S2
). On the other hand, the simulations also reveal a limitation of PSMC in recovering sudden changes in effective population size. For example the instantaneous reduction from 12,000 to 1,200 at 100kya in the simulation was spread over the preceding 50,000 years in the PSMC reconstruction.
We applied the PSMC model to real data from recently published genome sequences (). reveals that all populations are very similar in estimated Ne
history between 150 and 1500kya. YRI differentiates from non-African populations around 100–120kya (at 110kya, NeYRI
=15313±559 and NeCHN
=12829±485). This evidence of early population differentiation is potentially consistent with the archaeological evidence of anatomically modern humans found in the Near East around 100kya12
. European and East Asian populations are nearly identical in estimated Ne
beyond 11kya. From a peak of 13,500 at 150kya, the Ne
dropped by a factor of 10 to 1,200 between 40–20kya, before a sharp increase whose precise magnitude we do not have power to measure. We also observe a less marked bottleneck in YRI from a peak of 16,100 around 100–150kya to 5,700 at 50kya, recovering earlier16
than the out-of-Africa populations with increases back to 8,700 by 20kya, coinciding with the Last Glacial Maximum. All populations show increased Ne
between 60–200kya, about the time of origin of anatomically modern humans17
. An alternative to an increase in actual population size during this time would be that there was population structure involving separation and admixture11,16
Properties of the input sequences.
We also see in all populations an increase in estimated Ne
beyond 1Mya, with a sharp increase beyond 3Mya. Although it is tempting to read into this the transition from the previously estimated larger Ne
at the time of the split from chimpanzee18
, our method may also be subject to artifacts in this region due to regions of balancing selection or clustered false heterozygotes related to segmental duplications (Figure S3
Analyzing a European female X chromosome (EUR3.X) yields a history similar to that from autosomes scaled by 0.75, as expected for the X chromosome (). We do not observe a more severe bottleneck on the X chromosome19
. To investigate the relationship between African and non-African populations, we combined X chromosomes from YRI and a non-African to construct a pseudo-diploid genome. From , we can see that although African and non-African populations might have started to differentiate as early as 100–120kya, they largely remain as one population until approximately 60–80kya, the time point when the YRI1-EUR1.X curve clearly leaves EUR3.X. This supports the recent analysis of the relationship between the Neandertal genome and that of modern humans20
, which concluded that West Africans and non-Africans descended from a homogeneous ancestral population in the last 100,000 years with subsequent minor admixture out of Africa from Neandertals, rather than an alternative explanation of ancient (>300,000 year old) sub-structure separating West African and non-African populations.
From , we also notice surprisingly that there is a low Ne
between African and non-African populations until approximately 20kya, suggesting substantial genetic exchanges between these populations long after the initial separation. Complete separation would correspond to very large or effectively infinite Ne
, as seen below 20kya. To explore whether the inferred recent gene flow is a modeling artifact, we simulated complete divergence at 60kya according to the Schaffner et al.
, and saw increased rather than reduced Ne
in the period 20–60kya (brown line in ). To explore further, we extracted from YRI1-KOR.X segments that PSMC indicates coalesce more recently than 50kya. These comprise 220 segments covering 31.2Mbp (>20% X chromosome). We observe 1,363 base-pair differences in 20.7Mbp of callable sequence in these segments, corresponding to an average divergence time of 37.4kya. In contrast, if we apply the same process to the simulated data from the Schaffner et al.
model, the apparently recently diverged segments cover only 0.4% of the simulated chromosome. The human-macaque divergence in the 220 segments was only 4% lower than the chromosome average, so regional variability in mutation rates cannot explain these results. In summary the existence of long segments of low divergence between YRI1 and KOR suggests substantial genetic exchange between West African and non-African populations up until 20–40kya, and is not consistent with a simple separation approximately 60kya.
The evidence for continued gene flow between Africans and non-Africans prior to the separation of Europeans and East Asians (Supplementary Section S4.2
) is more recent than the archaeologically documented time of the out-of-Africa dispersal since there are fossils in both Europe and Australasia that date to >40 kya22
. An important caveat to this result is uncertainty of the per-year mutation rate 1.0×10−9
/25). While this mutation rate agrees well with the rates estimated between primates averaged over millions of years (Supplementary Section S3.1
), generation intervals as high as 29 years per generation over the last few thousands of years23
and present mutation rates lower than 2.5×10−8
are possible in principle, and these could make our recent date estimates somewhat older, although it is difficult to imagine a date of final gene flow of as old as 60kya as being consistent with the data due to these inaccuracies. Our analyses can also not exclude the possibility that the divergence time inferred from X chromosomes may not be representative due to sex-biased demographic processes19
, highlighting the importance of repeating this analysis on autosomal data once haploid whole genome sequences become available24
. Intriguingly, a recent study using an orthogonal type of data (analysis of allele frequencies) also inferred that gene flow between Africans and non-Africans continued until strikingly recently, in the case of that study, until 17–26kya25
. An important goal for future work is to determine whether these recent dates reflect real history, and if so to obtain more detail about the timing and scale of the events involved.
In this paper we have introduced a novel method to infer the history of effective population size from genome wide diploid sequence data. It is relatively straightforward to apply, with less potential ascertainment bias in comparison to existing methods that use selective genotyping data or the resequencing data from a few loci. Furthermore, our method is computationally tractable and typically uses much more primary sequence data than the existing methods, which allows us to estimate population size at each time going back in history, rather than assume a parametric structure of times, divergences and size changes. The results described above concerning the timing and depth of the out-of-Africa bottleneck are broadly consistent with previous studies though our results are more detailed (Supplementary Section S4.2
). The hypothesis that there was significant ongoing genetic exchange throughout the bottleneck is surprising in light of current views about human migrations; however, it is not inconsistent with the archaeological literature, and should motivate further research. There is the potential to extend this type of SMC-HMM approach to data from multiple individuals, which would access more recent times, but this will require inference over a substantially more complex hidden state space of trees on the haplotypes, with each Markov path representing an ancestral recombination graph14
. In addition, beyond humans, there is the potential to apply the method to investigate the population size history of other species for which a single diploid genome sequence has been obtained (Supplementary Section S2.2