Conceived and designed the experiments: SG MV KS TKO ARL EGB JCMC CDB. Analyzed the data: SG FZ JKB MM AME JLRF EEK CRG WG. Contributed reagents/materials/analysis tools: SG FZ JKB AME CRG BKM JD GB TKO ARL EGB. Wrote the paper: SG MM AME CDB.
There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low-coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is in MXL, in CLM, and in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the South American ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas thousand years ago (kya), supports that the MXL ancestors split kya, with a subsequent split of the ancestors to CLM and PUR kya. The model also features effective populations of in Mexico, in Colombia, and in Puerto Rico. Modeling identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe.
Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.
Populations of the Americas have a rich and heterogeneous genetic and cultural heritage that draws from a diversity of pre-Columbian Native American, European, and African populations. Characterizing this diversity facilitates the development of medical genetics research in diverse populations and the transfer of medical knowledge across populations. It also represents an opportunity to better understand the peopling of the Americas, from the crossing of Beringia to the post-Columbian era. Here, we take advantage of the sequencing of individuals of Colombian (CLM), Mexican (MXL), and Puerto Rican (PUR) origin by the 1000 Genomes Project to improve our demographic models for the peopling of the Americas. The divergence among African, European, and Native American ancestors to these populations enables us to infer the continent of origin at each locus in the sampled genomes. The resulting patterns of ancestry suggest complex post-Columbian migration histories, starting later in CLM than in MXL and PUR. Whereas European ancestral segments show evidence of relatedness, a demographic model of synonymous variation suggests that the Native American ancestors to the MXL, PUR, and CLM panels split within a few hundred years of each other, over 12 thousand years ago. Together with early archeological sites in South America, these results support rapid divergence during the initial peopling of the Americas.
The 1000 Genomes Project released sequence data for 66 Mexican-American (MXL), 60 Colombian (CLM), and 55 Puerto Rican (PUR) individuals using an array of technologies including low-coverage whole genome sequence data, high-coverage exome capture data, and OMNI 2.5 genotyping data. These data provide a unique window into the settlement of the Americas that complements archeological data and the more limited genetic data previously available. Here we interpret these data to answer basic questions about the pre- and post-Columbian demographic history of the Americas.
People reached the Americas by crossing Beringia during the Last Glacial Maximum, likely between 16 and 20 kya. The presence of early South American sites such as Monte Verde suggests a rapid occupation of the continent, which is also supported by recent mitochondrial DNA studies. A coastal route has been proposed to explain this rapid expansion, but other migration routes, possibly concurrent, have also been proposed. This original peopling of the Americas, followed by European contact starting in 1492 and substantial African slave trade starting in 1502, has created a diverse genetic heritage in American populations.
The initial settlement of the Caribbean has been much debated. People reached the islands around 7 kya, probably from a Mesoamerican source. Around 4.5 kya, a second wave of migrants probably reached the islands, likely coming from the Orinoco Delta or the Guianas in South America and speaking Arawakan languages. By approximately 1.3 kya, they had established large Taíno communities throughout the Greater Antilles, including Puerto Rico.
The earliest available account reports 600,000 Native Americans in Puerto Rico at the time of European arrival, not counting women and children (Vázquez de Espinosa 1629). More conservative estimates suggest 110,000 individuals , and as few as 30,000 inhabitants in 1508 . All references agree that the Native American population was subsequently largely decimated through disease, forced labor, emigration, and war. Despite the bottleneck at contact, admixture and the subsequent population growth on the Island resulted in a Native American genetic contribution averaging of the modern population of million .
The MXL were sampled in Los Angeles, USA and the CLM in Medellin, Colombia. These panels represent urban populations, but recent urbanization means that they derive ancestry from larger geographic areas. Among respondents to the 2005 Colombia Census in Medellin, were born in the city, and were born in another part of Colombia, with a sizable proportion from the surrounding Department of Antioquia. Given this high rate of within-country migration, but a relatively low rate of migration from outside Colombia, we can think of the sample as representing a diverse sample from Antioquia. Similarly, the 1.2M Angelenos of Mexican origin in the 2010 US census represent the added contributions of multiple waves of migrations starting with the city's foundation in 1781 and received contributions from diverse states.
The use of genetic data to study Native American history is well established. The bulk of these studies rely on Y chromosome and mitochondrial DNA (mtDNA), with a number of studies using increasingly dense sets of autosomal markers. Such studies provided evidence for a bottleneck recovery into the Americas 16–12 kya, and for complex models of migrations and admixture within Native groups.
In this article, we use the 1000 Genomes data and a diversity of population genetic tools to delve deeper into the founding of the Puerto Rican, Mexican, and Colombian populations. To propose models for Native American demography, we must first quantify the African, European, and Native American contributions to these populations. Because of strong sex-asymmetric migrations, autosomal and sex-linked markers exhibit substantial differences in ancestry proportions. Focusing on the autosomal regions, we infer the locus-specific pre-Columbian continental ancestry in each sample, and estimate the timing and intensity of different migration waves that contributed to these populations. Using identity-by-descent analysis, we identify relatedness among the different ancestral groups and estimate recent effective population sizes.
We also propose a three-population model based on the diffusion approximation to study the distribution of allele frequencies across the Native American ancestors of the MXL, PUR, and CLM. We present statistical methods that take advantage of admixture linkage patterns to disentangle the histories of each continental group. The large sample of sequence data allows for the joint inference of split times and effective population sizes among the Native ancestors to the three panels. Finally, through an expectation maximization (EM) framework, we estimate genome-wide allele frequencies in the inferred Native components of MXL, CLM, and PUR genomes.
A broad summary of the data and analysis pipelines used in this article is displayed in Figure 1.
To estimate the global proportions of African, European, and Native American ancestry in the CLM, MXL, and PUR, we combined them with YRI, CEU, and a panel of Native American samples  and performed an admixture  analysis (Figure 2(a)) and principal component analysis (Figure S1). Dense genotyping arrays allow for inference of ancestry at the level of individual loci, using software such as RFMix . Trio-phased OMNI data was used to generate such locus-specific ancestry calls for 66 CLM, 68 MXL, and 64 PUR individuals, including all sequenced individuals, as part of the 1000 Genomes Project. Summing up the local ancestry contribution inferred by RFMix provides an alternate estimate of ancestry proportions.
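The last step, summing local ancestry calls into global proportions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name and the (label, start, end) tract representation are hypothetical, and tract coordinates are assumed to be in centimorgans.

```python
from collections import defaultdict


def global_ancestry_from_tracts(tracts):
    """Return the fraction of the genome assigned to each ancestry label.

    `tracts` is an iterable of (ancestry_label, start_cm, end_cm) tuples
    covering all haplotypes of one individual (hypothetical input format).
    """
    totals = defaultdict(float)
    for label, start_cm, end_cm in tracts:
        if end_cm < start_cm:
            raise ValueError("tract end must not precede tract start")
        totals[label] += end_cm - start_cm

    genome_length = sum(totals.values())
    if genome_length == 0:
        raise ValueError("no ancestry tracts with positive length")

    # Normalize the per-ancestry lengths so that the proportions sum to one.
    return {label: length / genome_length for label, length in totals.items()}
```

Averaging these per-individual proportions over a panel gives a population-level estimate that can be compared with the clustering-based one.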
Using admixture, we find Native American proportions being in PUR, in CLM, and in MXL (Figure 2a). RFMix finds values falling within percentage points of these values, and within one percentage point of the values inferred in the 1000 Genomes project through related methods . Estimates of African ancestry showed a larger difference across methods, with admixture (RFMix) estimates at in PUR, in CLM, and in MXL.
The inferred Native American ancestry proportions are in good agreement with results from the GALA study , which reported proportions of in Puerto Rico and in Mexico. The PUR result is also comparable to the of Native ancestry inferred in a different Puerto Rican sample . By contrast, none of the populations from Colombia in  show median ancestry proportions quite similar to the CLM sample from Medellin, the closest being the sample from the surrounding Department of Antioquia, with Native, African and European.
Figure 2(c–d) shows a principal component analysis restricted to segments of inferred Native ancestry. We find that the MXL individuals cluster primarily with southern Mexican Native groups (mostly Mixe), and the CLM cluster primarily with the Embera, Kogii, and Wayu, all of which were sampled in Colombia North-West of the Andes, where Medellin is also located. The PUR clusters principally with populations South-East of the Andes, surrounding the Guianas and the Orinoco River basin (Ticuna, Guahibo, Palikur, Jamamadi, Piapoco), although a few populations from further south are also close in PCA space, particularly the Guaraní and the Chané, together with some Kaqchikel, Toba, and Wichi individuals. The Piapoco and the Palikur speak Arawakan languages. The other groups with known Arawakan-speaking ancestors in our panel are the Chané, whose ancestors spoke Arawakan and likely originated in Guiana, and the Guaraní, through gene flow from the Chané. Taken together, these clustering patterns support a demic diffusion of the Arawakan/Taínos into Puerto Rico by a South American route, and reduced gene flow between Native American groups living in the Andes or to the west, and groups living east of the Andes.
Because continuous tracts of local ancestry are progressively broken down by recombination, the length distribution of continuous ancestry tracts can reveal details of the timing and mode of the migration processes. We used RFMix to infer ancestry tracts (Text S1), and the software tracts  to infer the migration rates and model likelihoods under different scenarios. Tracts can predict the distribution of ancestry block length for arbitrary models of time-varying migration, under the assumptions that the migrants are themselves not admixed, and that the admixed population follows Wright-Fisher reproduction. Since admixture only begins after two populations are in contact, the admixed population is founded when the second population arrives. Tracts determines the time and ancestry proportions at the onset of admixture and the time and magnitude of subsequent migrations by maximum likelihood. Because of limited statistical power, we start with a simple model in which each population contributes a single pulse of migration. We then progressively introduce models with additional periods of migration when justified by information criteria, as described in Text S1. The models that best describe the data are shown in Figures 3 and S2. Parameters for these, together with confidence intervals obtained through bootstrap over individuals, are provided in Table S1 in the Text S1 file.
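The link between tract length and admixture time can be illustrated with a minimal sketch. It assumes a single pulse of admixture T generations ago, an infinitely long chromosome, and no drift, so that tracts from a source contributing a fraction m of the ancestry are approximately exponentially distributed with rate T(1 - m) per Morgan. This is a common back-of-the-envelope approximation and not the model implemented in tracts; both function names are hypothetical.

```python
def mean_tract_length_cm(generations, source_fraction):
    """Approximate mean ancestry tract length (cM) under a single pulse.

    Assumption: each recombination event in the past `generations`
    generations ends a tract with probability (1 - source_fraction),
    giving an exponential length distribution.
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    if not 0.0 <= source_fraction < 1.0:
        raise ValueError("source_fraction must be in [0, 1)")
    rate_per_morgan = generations * (1.0 - source_fraction)
    return 100.0 / rate_per_morgan  # convert Morgans to centimorgans


def pulse_time_from_mean_length(mean_length_cm, source_fraction):
    """Invert the approximation above to get a rough admixture time."""
    if mean_length_cm <= 0:
        raise ValueError("mean_length_cm must be positive")
    if not 0.0 <= source_fraction < 1.0:
        raise ValueError("source_fraction must be in [0, 1)")
    return 100.0 / (mean_length_cm * (1.0 - source_fraction))
```

Longer tracts therefore point to more recent admixture, which is why departures from a single exponential decay carry information about additional migration pulses.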
For MXL, we considered a model introduced in : three populations start contributing migrants at the same time, but Europeans and Native Americans keep contributing at a constant rate. The best-fitting model has an onset of admixture 15.1 generations ago (ga), with a CI of , in good agreement with  despite a different genotyping chip and local ancestry inference method.
In PUR, we found evidence for two periods of European and African migration, the first ga ( CI ) and the most recent period at ga ( CI 5.9–8.8). This model is in excellent agreement with historical records, which suggest that isolated Native populations contributed little gene flow to the colony after the initial contact period, and that substantial slave trade and European immigration continued until the second half of the 19th century. We do not mean to imply that migrations actually occurred in exactly two distinct pulses; we do not have the resolution to distinguish more than two pulses per population. However, the inference of a migration pulse 6.8 ga indicates that migrations occurred during a period spanning this date. This complex scenario, with multiple waves of migration from African and European individuals, is consistent with the observation that European and African ancestries vary across the island, whereas no evidence of such variation was found in Native ancestry.
The inferred onset of admixture in CLM is 13.0 ga ( CI ), significantly later than that in both MXL and PUR and consistent with later European settlement in western Colombia compared to Mexico and Puerto Rico. We also find evidence for a small but statistically significant second wave of Native American migration, 4.8 ga ( CI 4–6). As above, this does not necessarily indicate a single, punctual event, but probable contact between an admixed population and Native American individuals during that period. By contrast, we find no evidence for continuing African gene flow in CLM.
We used germline  and the trio-phased OMNI data above to identify segments identical-by-descent (IBD) within and across populations (see Text S1). Not surprisingly, we found more IBD segments within populations (23936) compared to across populations (1440), and within-population segments were longer (Figure S3).
The MXL population exhibits significantly less within-population IBD compared to the other two panels (Figure 4). The amount of IBD among unrelated individuals can be used to infer the underlying population size under the assumption of panmixia: the larger a population, the more distant the expected relationship between any two individuals. Using IBD segments longer than 4 cM, we infer effective population sizes of 140,000 in MXL, 15,000 in CLM, and 10,000 in PUR. As we will show, these largely reflect post-admixture population sizes.
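The inverse relationship between population size and IBD sharing can be made concrete with a simple coalescent calculation. The sketch below is an illustration under stated assumptions and is not claimed to reproduce the estimator used here: it assumes a constant-size panmictic population in which two haplotypes coalesce with probability 1/n_hap per generation, and a segment length around a site that follows an Erlang-2 distribution with rate 2t per Morgan given a common ancestor t generations ago. Integrating over t gives an expected fraction of the genome in segments longer than u Morgans of (1 + 4x)/(1 + 2x)^2, with x = u * n_hap. Function names are hypothetical, and converting n_hap to a diploid effective size depends on the convention used.

```python
import math


def expected_ibd_fraction(n_hap, min_length_morgans):
    """Expected genome fraction shared IBD in segments above a length cutoff.

    Assumes constant size, panmixia, and a pairwise coalescence
    probability of 1 / n_hap per generation.
    """
    x = min_length_morgans * n_hap
    return (1.0 + 4.0 * x) / (1.0 + 2.0 * x) ** 2


def infer_haploid_size(observed_fraction, min_length_morgans,
                       lower=1.0, upper=1e9, iterations=200):
    """Invert expected_ibd_fraction by bisection on a log scale.

    The expected fraction decreases monotonically with n_hap, so if the
    prediction at the midpoint exceeds the observation, the true size
    must be larger than the midpoint.
    """
    if not 0.0 < observed_fraction < 1.0:
        raise ValueError("observed_fraction must be in (0, 1)")
    for _ in range(iterations):
        mid = math.sqrt(lower * upper)
        if expected_ibd_fraction(mid, min_length_morgans) > observed_fraction:
            lower = mid
        else:
            upper = mid
    return math.sqrt(lower * upper)
```

With a 4 cM cutoff (0.04 Morgans), a tenfold increase in population size lowers the expected sharing by roughly an order of magnitude, so small panels with abundant IBD map to small inferred sizes.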
We expect long IBD segments to be inherited from a recent common ancestor, and therefore to have identical continental ancestry. Comparing the RFMix ancestry assignments on chromosomes that have been identified as IBD by germline thus provides a measure of the consistency of the two methods (see  for a related metric). Rates of IBD-Ancestry mismatch ranged from in segments of to less than for segments longer than 40 Mb (Figure S4).
Patterns of ancestry in IBD segments within a population differ markedly from those across populations (Figure 5): IBD segments within populations contain many ancestry switches. This indicates that many common ancestors lived after contact, and that the effective population sizes estimated using IBD largely reflect post-contact demography. The IBD patterns in cross-population IBD segments exhibited fewer ancestry switches than a random control (Figure S5), as may be expected if common ancestors often predate the onset of admixture. Cross-population IBD segments were also found to be overwhelmingly of European origin: among the 120 longest cross-population IBD segments, 117 are in European-inferred segments, two are among Native segments, and one is among African segments. This is not due to overall ancestry proportions, as can be observed by considering the alternate (non-IBD) haplotypes at the same positions (Figure S5). This is likely a result of the colonization history, in which European colonists rapidly spread from a relatively specific region over a large continent. This interpretation is supported by the admixture analysis (Figure S6), showing a common cluster of ancestry for the European component dominant in PUR, CLM, MXL, and Andean populations, but not in CEU, Eskimo-Aleut, and Na-Dene. Finally, we were interested in testing whether the relationship between IBD and ancestry can be used to date recombination events. The ancestry within an IBD segment represents the ancestry state of the most recent common ancestor. The shorter the IBD segment, the older the ancestor, and the less time available since the onset of admixture to create ancestry switch points through recombination. Indeed, we find that the density of ancestry switch-points on IBD tracts increases with IBD tract length in PUR (bootstrap , see Text S1) and in MXL (bootstrap ), whereas the results are not significant in CLM.
Thus we can use ancestry patterns in admixed populations not only to recognize recombination events but also to help date most recent common ancestors and recombination events (see Text S1 for details). The small amount of cross-population IBD among Native American tracts tells us that the ancestral Native populations were not as closely related as European founders, consistent with historical and anthropological data.
To infer split times and population sizes of the Native ancestors, we consider the joint site frequency spectrum (SFS). The SFS is informative of demography because stochastic differences in allele frequencies accumulate over time and at a rate that depends on population sizes. We use the diffusion-approximation framework implemented in  to perform the inference. We focus on synonymous sites in the 1000 Genomes exome capture data of 60 CLM, 66 MXL, and 55 PUR individuals because the high coverage reduces sequencing artifacts and synonymous sites are less affected by selection compared to non-synonymous sites. A complete model with admixture would require at least one European, one African, and three Native American populations, which is beyond the 3-population limit of the software. We therefore wish to focus on variants within Native American backgrounds.
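For readers unfamiliar with the joint SFS, the following minimal sketch shows how it is tabulated from per-site allele counts. It is only an illustration of the data structure: the function name and the input format (one tuple of allele counts per site, one entry per population) are assumptions, and the sparse dictionary representation is a convenience rather than the format expected by any particular inference software.

```python
from collections import Counter


def joint_sfs(allele_counts, sample_sizes):
    """Tabulate a joint site frequency spectrum.

    `allele_counts` holds one tuple per variable site, giving the number
    of copies of the allele of interest in each population.
    `sample_sizes` gives the number of sampled chromosomes per population.
    Returns a Counter mapping each count tuple to its number of sites.
    """
    spectrum = Counter()
    for counts in allele_counts:
        counts = tuple(counts)
        if len(counts) != len(sample_sizes):
            raise ValueError("one count per population is required")
        for count, size in zip(counts, sample_sizes):
            if not 0 <= count <= size:
                raise ValueError("allele count outside sample size")
        spectrum[counts] += 1
    return spectrum
```

Each entry, for example the number of sites with one copy in the first population and none in the other two, is a summary statistic whose expectation depends on split times and population sizes.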
Unfortunately, trio-phased sequencing data was not available for most samples. Because of phasing uncertainty, the actual ancestry assignment for variants at ancestry-heterozygous loci is uncertain. To overcome this, we introduce a negative ascertainment scheme, in which we only consider variable sites that have not been observed in any of the non-Native populations in the 1000 Genomes data set. The effect of this ascertainment scheme is to remove the majority of variants that predate the split of Native Americans from the rest of the populations. An additional benefit of this approach is that the impact of European and African tracts incorrectly assigned as Native American will be substantially reduced. We hypothesized that the effect of negative ascertainment could be approximately modeled by a strict bottleneck at the Native/non-Native split time. This was confirmed through simulations (see Text S1).
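The negative ascertainment filter itself is simple to express. The sketch below assumes a hypothetical site representation, a dictionary with an "alt_counts" mapping from population label to the number of observed copies of the variant allele, and treats a population missing from that mapping as having no observations; both choices are illustrative assumptions.

```python
def negative_ascertainment(sites, non_native_populations):
    """Keep only sites whose variant allele is unseen in every listed population.

    `sites` is a list of dicts with an "alt_counts" entry mapping
    population labels to variant allele counts (hypothetical format).
    Populations absent from "alt_counts" are treated as count zero.
    """
    kept = []
    for site in sites:
        counts = site["alt_counts"]
        if all(counts.get(pop, 0) == 0 for pop in non_native_populations):
            kept.append(site)
    return kept
```

Because any variant seen even once outside the admixed panels is discarded, variants on mislabeled European or African tracts are unlikely to survive the filter, which is the robustness property described above.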
We considered a simple 3-population demographic model starting with a constant population of size . At time the population size changes to . From this population of size , population diverged with size at time and populations and diverge at a later time with respective sizes and . We considered all three split orderings, with . In the optimal model, illustrated on Figure 6, we have , , . This model is a vast oversimplification of the historical demographic processes. However, given the limited statistical power to reconstruct time-dependent demographic histories using allele frequency data (e.g. ), such simple models with step-wise constant population sizes provide useful coarse-grained pictures of human demography. The population sizes in this model are effective population sizes: they are the size of Wright-Fisher populations that best explain the observed patterns of polymorphism. They differ from census sizes because of population size fluctuations, overlapping generations, sex bias, offspring number dispersion, and other departures from the Wright-Fisher assumptions. The ratio is expected to converge to large values to reflect both the negative ascertainment scheme (see Methods) and the expansion post-founding of the Americas. The current data does not enable us to model these two effects separately, so the recovery time can be thought of as an interpolation between the two events. When performing likelihood optimization, tended to slowly increase without bound. Beyond a value of 100, this had minimal impact on the likelihood function and other parameter estimates. We therefore fixed this value to to facilitate optimization and prevent numerical instabilities. All other parameters, and the order of population splits, were chosen to maximize the model likelihood.
We find dramatic differences in the inferred population sizes of the Native Ancestors to the MXL, CLM, and PUR (see Table 1), with the MXL showing by far the largest effective population size at 64,000, times larger than the CLM and 32 times larger than the PUR. Given the many sources of uncertainty and model limitations, these ratios are in good qualitative agreement with pre-Columbian populations estimated at 14M in central Mexico , 3M in Colombia , and somewhat over 110,000 in Puerto Rico . This could largely be a coincidence, given that the Native ancestors to the MXL and CLM were not panmictic populations over present-day political divisions. Another possible explanation for the differences in effective population sizes is a serial founder model after the crossing of Beringia: CLM and PUR would have experienced stricter and longer bottlenecks compared to MXL due to greater distances traveled from Beringia. The crossing to Puerto Rico is likely to have introduced intense bottlenecks in PUR, resulting in a smaller recent effective population size.
The model suggests that PUR and CLM ancestral populations did not share serial founding events past the split with the MXL ancestors and split well before the expected arrival of the Arawak people of the Caribbean. Indeed, the first and second split times ( and , respectively) are remarkably close to each other, with (bootstrap CI: , see S1, Figure S7, and Table 1). This corresponds to a difference of about 500 years, 12,000 years ago. In fact, the splits are so close that it is impossible to distinguish which population split first, with bootstrap instances supporting all three orderings: the Taíno ancestry does not appear much more closely related to either CLM or MXL Native ancestors. This is also consistent with the PCA results shown in Figure 2, showing a clear distinction between Native American groups in eastern and western Colombia.
Despite strong historical evidence for extensive population bottlenecks suffered by Native American populations following the arrival of Europeans, we could not detect the presence of such bottlenecks through allele frequency analysis. However, the presence of such bottlenecks may affect our interpretation of effective population sizes. To quantify this, we fixed the timing and magnitudes of bottlenecks using non-genetic sources, and re-inferred model parameters. Dobyns proposed a maximum population reduction of in the Native American population after European contact, but this number is expected to vary from location to location. Because we are studying admixed populations, the size of the bottleneck is related to the number of individuals that contributed to the admixed population, thus Dobyns' estimate may not apply. In PUR, where the decline was particularly abrupt, we considered a decline of spanning years (see Text S1). We found that inferred parameters were little affected by the existence of such a bottleneck, with the exception of the effective population size in the pre-bottleneck PUR population, which would be 3.9 times larger than in the no-bottleneck model. Assuming an additional bottleneck in the CLM population led to a similar 4-fold increase in inferred pre-bottleneck CLM population size, with little effect on inferred split times. These are significant effects, but are less than the inferred differences in effective population sizes. Thus, in the absence of extreme differences in the recent bottlenecks experienced by the three populations, the observed differences in population sizes likely point to differences in pre-Columbian demography.
By calibrating our results using , towards the most recent end of the range of plausible values for the peopling of the Americas (see e.g.,  and references therein), we find a mutation rate of (bootstrap CI: ), within the range of recently published human mutation rates . The narrowest confidence interval reported in  was , obtained from a de novo exome sequencing study . Our sampling confidence interval is narrower than this value, but the main source of uncertainty here is the degree to which the bottleneck in our model reflects the bottleneck at the founding of the Americas, or the earlier split with the ancestors to the Chinese (CHB) and Japanese (JPT) sample, as well as uncertainty with respect to the timing of these two events (see Figure 7). The effect of changing the founding time or mutation rate assumptions would be to rescale all parameters and confidence intervals accordingly. Thus the absolute uncertainty on individual parameters is larger than the sampling uncertainty suggests.
There is scarce publicly available, genome-wide data about Native American genomic diversity. The 1000 Genomes dataset offers the opportunity to provide a diversity resource for Native American genomics by reconstructing the genetic makeup of Native American populations ancestral to the PUR, CLM, and MXL. This is particularly interesting in the case of the Puerto Rican population, where such reconstruction may be the only way to understand the genetic make-up of the pre-Columbian inhabitants of the Islands. Using the expectation maximization method presented in the Methods section, we estimated the allele frequencies in the Native-American-inferred part of the genomes of the sequenced individuals. These estimates are available at http://genomes.uprm.edu/Taino/.
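The Methods section is not reproduced here, so the following is only a generic sketch of how an expectation maximization scheme can recover ancestry-specific allele frequencies from unphased genotypes; it should not be read as the authors' implementation. It assumes that, at a given site, each individual is summarized by the number of Native-inferred haplotypes (0, 1, or 2) and the number of variant alleles (0, 1, or 2). The only ambiguous case is an ancestry-heterozygous individual with a heterozygous genotype, for which the E-step assigns the variant to the Native haplotype with probability proportional to p(1 - q), where p and q are the current Native and non-Native frequency estimates. The function name and input format are hypothetical.

```python
def em_native_allele_frequency(samples, max_iterations=200, tolerance=1e-10):
    """Estimate variant frequencies on Native (p) and other (q) haplotypes.

    `samples` is a list of (n_native_haplotypes, n_variant_alleles)
    pairs, each value in {0, 1, 2}, for one site.
    """
    n_native = sum(a for a, _ in samples)
    n_other = sum(2 - a for a, _ in samples)
    if n_native == 0:
        raise ValueError("no Native-inferred haplotypes at this site")

    p, q = 0.5, 0.5
    for _ in range(max_iterations):
        alt_native, alt_other = 0.0, 0.0
        for a, g in samples:
            if a == 2:            # both haplotypes Native: no ambiguity
                alt_native += g
            elif a == 0:          # neither haplotype Native: no ambiguity
                alt_other += g
            elif g == 2:          # one of each ancestry, both carry the variant
                alt_native += 1.0
                alt_other += 1.0
            elif g == 1:          # ambiguous: E-step weight for the Native haplotype
                native_carrier = p * (1.0 - q)
                total = native_carrier + q * (1.0 - p)
                weight = native_carrier / total if total > 0 else 0.5
                alt_native += weight
                alt_other += 1.0 - weight
        # M-step: expected variant counts divided by haplotype counts.
        new_p = alt_native / n_native
        new_q = alt_other / n_other if n_other > 0 else q
        converged = abs(new_p - p) + abs(new_q - q) < tolerance
        p, q = new_p, new_q
        if converged:
            break
    return p, q
```

The number of Native haplotypes available at a site (`n_native` above) controls how precise the resulting estimate can be, which is why confidence intervals vary from site to site.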
Figure 8 shows the distribution of the number of Native American haplotypes per site and the resulting confidence intervals for allele frequency in each population for exome capture target regions. Absolute confidence intervals are narrow for rare variants, and reach a maximum for SNPs at intermediate frequency; the leftmost peak in the bimodal distribution corresponds to the large number of rare variants, whereas the rightmost peak encompasses a broader range of frequencies.
Focusing on the variants with observations in all populations and within the exome capture regions, where coverage and accuracy were highest, the most significantly different among Native groups is rs11183610 on chromosome 12, with an estimated frequency of in MXL Native ancestry, in CLM Native ancestry, and in PUR Native Ancestry. The MXL-PUR difference remains significant after Bonferroni correction (bootstrap , see Methods). The bulk of the differentiation among populations is likely due to genetic drift, but such sub-continental ancestry informative markers are also interesting candidates for further selection scans.
The bottleneck at the founding of the Americas provides a unique opportunity to obtain precise estimates of the human autosomal mutation rate, as reported in Table 1 and Figure 7. One remaining challenge in interpretation is whether the ‘founding time’ studied here corresponds to the bottleneck at the founding of the Americas, or the split time of the Native Americans with the Asian populations. Fortunately, this uncertainty can be addressed by sequencing either trio-phased populations from the Americas, or individuals of Native American ancestry without large amounts of recent European and African ancestry. In either case, the dramatic events that led to the initial peopling of the Americas, together with the early dates of South American archaeological sites, provide us with estimates of the human mutation rate that are more precise than pedigree-based estimates. A more thorough study of the robustness of these estimates to model assumptions is therefore desirable.
We find a substantially larger effective population size in Mexico than in the other two populations through both IBD-based and allele-frequency-based estimates. These methods are sensitive to different time-scales: IBD analysis largely reflects post-Columbian events, as evidenced by the large number of mixed ancestry IBD segments in Figure 5(a). Allele frequencies reflect older events as well, and we showed that recent bottlenecks alone are unlikely to be responsible for the much larger effective MXL population size. To interpret the population size differences, we must consider the recent histories of the populations studied here. The MXL panel was recruited in Los Angeles among Mexican-American individuals, who may come from different regions in Mexico, a much wider geographical region than Puerto Rico, thus likely more populated. A natural question is whether the larger effective population sizes in MXL reflect a large panmictic population in Mexico, or a large number of small, previously isolated populations. Figure 2 and previous studies provide compelling evidence that there is substantial population structure within Native groups of Mexico. However, Figure 2 also shows that the Native component of the MXL forms a relatively homogeneous cluster together with populations from southern Mexico. The much larger Native populations in central and southern Mexico are likely to have contributed the most to the Native American ancestry of Mexican mestizos, and thus Mexican-Americans. Even though the MXL may have ancestors in different parts of Mexico, their Native genetic origins likely reflect the demographic history of the areas in Mexico with the highest Native American population sizes.
Because Puerto Rico is an island, building a relatively complete population genetic model for the population may be more tractable. Clearly, our model of single idealized pre-Columbian Native American, European, and African populations joining to form a panmictic admixed population is an oversimplification. African and European ancestry proportions vary across the island, and eastern parts of Puerto Rico, with elevated proportions of African ancestry, are underrepresented in this study. By contrast, we do not have evidence for variation in the amount or composition of the Native American ancestry across the island, and it is likely that the conclusions about the pre-Columbian Native American fraction of the population are robust to sampling ascertainment. Interestingly, we find that the distribution of ancestry tract length in a sample of individuals of Puerto Rican descent in south Florida gave very similar results, despite different location, sequencing platform, and local ancestry inference method. Historical gene flow inference using individuals of Colombian descent in south Florida provided comparable estimates of the time of admixture onset, but different patterns of recent gene flow; as is typical in demographic inference, inference of recent events is more sensitive to population structure.
Our analyses largely rely on accurate estimates of local ancestry patterns along the genome obtained through RFMix. This method has been shown to provide more than accuracy on three-way admixture using comparable reference panels, an accuracy level that enables accurate estimation of genome-wide diversity. To ensure that our results are robust to residual errors, we further took into account the difficulty of calling short ancestry tracts in our migration estimates, and performed negative ascertainment of non-Native American alleles in the demographic inference. Some of these results can be verified by independent sequencing of contemporary or ancient individuals with more uniform ancestry. However, understanding the genetic history of admixed populations will continue to rely on statistically picking apart the contributions of different ancestral populations, and the development of improved statistical methods, particularly for admixture that is ancient or between closely related populations, remains highly desirable.
The genetic heterogeneity in continental ancestry proportions among populations of the Americas is well appreciated. Our results emphasize more fine-scale aspects of this diversity: because of the similarity between European founders of different populations and the high divergence among the Native American ancestors, populations that appear similar under classical analyses such as principal component analysis may still harbor population-specific Native American haplotypes that must be carefully accounted for when performing rare-variant association testing in cosmopolitan cohorts. Similarly, the choice of a replication cohort for an identified risk variant should be guided by the ancestral background on which the variant is found. The PUR may be an excellent replication cohort for a result found in CLM if the background is European. If the background is Native American, a different cohort with related Native ancestry would likely be much more appropriate. Understanding the genetics of the different ancestral populations of the Americas, and the relatedness among these ancestral groups, will therefore facilitate the development of association methods that account for and take advantage of this rich diversity.
Ideally, we would have been able to directly model the joint site-frequency spectrum (SFS) of all the ancestral populations to the PUR, CLM, and MXL. However, because we are interested in distinguishing the Native American ancestries to the three populations, this would require modeling at least 5 populations, which is beyond the scope of current methods. We would like to use the inferred local ancestry to focus on the Native American ancestry only, but this is difficult because most Native American haplotypes are in segments heterozygous for ancestry. Because of phasing errors, allele-specific ancestry can be incorrectly assigned. To minimize the impact of such mis-assigned ancestry and to ensure that we focused on variants of genuine Native American ancestry, we discarded all variants observed in 1000 Genomes individuals of African, European, and Asian ancestry, as well as variants observed in Hispanic/Latino populations in segments with no Native American ancestry inferred.
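The filtering step described above amounts to a set difference. The following is a minimal sketch of that idea, not the pipeline used in the study: the `Variant` key layout, the function name `negative_ascertainment`, and the toy variants are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, alternate allele).
Variant = Tuple[str, int, str]


def negative_ascertainment(
    candidates: Iterable[Variant],
    outgroup_variants: Set[Variant],
    non_native_segment_variants: Set[Variant],
) -> Set[Variant]:
    """Keep only candidates never observed on a non-Native American background."""
    excluded = outgroup_variants | non_native_segment_variants
    return {v for v in candidates if v not in excluded}


# Toy data, made up for illustration.
candidates = [("22", 100, "A"), ("22", 200, "T"), ("22", 300, "G"), ("22", 400, "C")]
# Variants seen in African, European, or Asian reference individuals.
seen_in_outgroups = {("22", 100, "A")}
# Variants seen in admixed individuals within segments lacking Native American ancestry.
seen_without_native = {("22", 300, "G")}

kept = negative_ascertainment(candidates, seen_in_outgroups, seen_without_native)
print(sorted(kept))
```

Only the two variants absent from both exclusion sets survive the filter.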
We then considered all remaining variable sites that were assigned Nat/Nat or Nat/Eur diploid ancestry, and calculated the expected frequency distribution under the assumption of perfect negative ascertainment, that is, that all remaining variants were on the Native American background. Because the European backgrounds are expected to carry a number of singletons, this would result in an overestimate of the number of singletons in the Native American ancestry. Fortunately, this bias is easy to estimate empirically: we first choose segments of Eur/Eur ancestry to mimic the European haplotypes in our sample. After performing the negative ascertainment scheme on these genotypes, we can directly estimate the bias in the negative ascertainment scheme. In practice, this correction is very low except for singletons, as expected. The number of excess singletons was 129 for CLM, 73 for PUR, and 40 for MXL. The largest non-singleton correction is 1.3 for doubletons in CLM.
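Applying the empirical correction is a per-bin subtraction, sketched below. The excess values for singletons (129) and doubletons (1.3) are the CLM figures quoted above; the observed counts, the zero correction assumed for tripletons, and the function name `correct_spectrum` are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def correct_spectrum(observed, excess):
    """Subtract the estimated European-background excess from each frequency bin."""
    observed = np.asarray(observed, dtype=float)
    excess = np.asarray(excess, dtype=float)
    # Counts cannot be negative, so clip at zero after subtraction.
    return np.clip(observed - excess, 0.0, None)


# Bins: singletons, doubletons, tripletons.
observed_clm = [1000.0, 400.0, 250.0]   # hypothetical observed counts
excess_clm = [129.0, 1.3, 0.0]          # 129 and 1.3 from the text; 0.0 is an assumption

corrected = correct_spectrum(observed_clm, excess_clm)
print(corrected)
```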
Because negative ascertainment removes a significant proportion of the variants that were present at the Native American split from other populations, we hypothesized that this effect could be well-approximated by a severe bottleneck at the time of split between non-Native and Native American ancestry.
Figure 9 provides a simulated example, wherein a marginal spectrum (top) is compared to a spectrum negatively ascertained using 100 diploid individuals from the ‘outgroup’ population (middle) and to a bottleneck approximation equivalent (bottom). More quantitatively, we simulated a two-population sample that diverged 12.1 kya, negatively ascertained it using a population that diverged 16.5 kya, and attempted to model this as a two-population model with an early bottleneck. The inferred bottleneck timing was within of the split time with the outgroup, and the three population sizes and split time between populations 1 and 2 were within of the correct value. These biases are well within the acceptable range given other biases and uncertainties.
We wish to estimate the allele frequencies at each site among segments of Native American origin, but we have to contend with a finite sample and inaccurate phasing. We therefore choose to model the underlying population frequency across all populations using Bayes' rule
where is the observed genotype data, , and is the diploid local ancestry calls (e.g., for populations A and B). From this distribution we can calculate expected frequency and confidence intervals. We report inferred frequencies and confidence intervals at non-monomorphic sites.
To estimate , we write as the frequencies of the non reference allele in populations and . We have , for ancestry and genotype heterozygous segments, , and so forth. To estimate , we first observe that because we are considering population frequencies, rather than sample frequencies, is independent of : . This suggests the use of a self-consistent, expectation-maximization procedure. We estimate the underlying frequency distribution as
the sum over the estimated probabilities at each site. We can thus iterate Equations (1) and (2) until self-consistency is reached to estimate both allele frequency distributions and single-site allele frequencies in each population.
A final caveat is that the sum runs over all sites, including monomorphic ones. If we only observe the subset of sites that are polymorphic, an additional step is needed. If is the number of monomorphic (unobserved) sites (denoted as ), and represents the sum over polymorphic sites, we have
Intuitively, we are correcting for the proportions of sites at every frequency that might have gone undetected. Results are reported using 20 EM iterations, for sites where all individuals had both ancestry and genotype calls, and data can be downloaded at http://genomes.uprm.edu/Taino/.
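In symbols, the self-consistent procedure described above can be summarized as follows. This is a sketch under assumed notation, not a transcription of the numbered equations: $\vec f = (f_A, f_B)$ for the population frequencies, $g_{s,i} \in \{0, 1, 2\}$ for the genotype of individual $i$ at site $s$, $a_{s,i} \in \{AA, AB, BB\}$ for the diploid ancestry call, and $S$ for the number of sites are labels chosen here for illustration. The correction for unobserved monomorphic sites is not included.

```latex
\begin{align}
% Bayes' rule at a single site s
P(\vec f \mid g_s, a_s) &=
  \frac{P(g_s \mid \vec f, a_s)\, P(\vec f \mid a_s)}
       {\sum_{\vec f'} P(g_s \mid \vec f', a_s)\, P(\vec f' \mid a_s)} \\
% likelihood: individuals are treated as independent given the frequencies
P(g_s \mid \vec f, a_s) &= \prod_i P(g_{s,i} \mid \vec f, a_{s,i}) \\
% heterozygous genotype on a homozygous-ancestry segment
P(g_{s,i} = 1 \mid \vec f, AA) &= 2 f_A (1 - f_A) \\
% heterozygous genotype on a heterozygous-ancestry segment
P(g_{s,i} = 1 \mid \vec f, AB) &= f_A (1 - f_B) + f_B (1 - f_A) \\
% population frequencies do not depend on the ancestry calls
P(\vec f \mid a_s) &= P(\vec f) \\
% self-consistent update of the frequency distribution
P(\vec f) &\leftarrow \frac{1}{S} \sum_{s=1}^{S} P(\vec f \mid g_s, a_s)
\end{align}
```

The first and last relations are alternated until the distribution $P(\vec f)$ stops changing.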
To test this method, we considered 84 diploid individuals, each formed by drawing two chromosomes (without replacement) from 84 CEU and 84 YRI individuals, resulting in a simulated 50–50 admixture proportion. We considered 100,000 sites on chromosome 22, and performed the EM inference as described.
Among the 85,677 sites that were found to be polymorphic, only 13 had a sample allele frequency departing from the confidence interval for the European ancestry, and 51 among the African ancestry. Confidence intervals encompass much more than of sample allele frequencies, emphasizing that the width of the confidence interval largely reflects the uncertainty about the population frequency given a fixed sample frequency, rather than the phasing uncertainty.
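The expectation-maximization scheme and a test of this kind can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the code used for the reported results: it discretizes the two population frequencies on a grid, simulates synthetic frequencies rather than drawing CEU and YRI chromosomes, uses far fewer sites and individuals, and ignores the monomorphic-site correction. The function names `genotype_table` and `em_frequencies` are hypothetical. Only unphased genotypes and diploid ancestry calls enter the likelihood, so no allele-specific ancestry is needed.

```python
import numpy as np


def genotype_table(grid):
    """P(g | f_A, f_B, a) on a grid; a = number of haplotypes from population A."""
    fa = grid[:, None]   # f_A varies along the first grid axis
    fb = grid[None, :]   # f_B varies along the second grid axis
    k = grid.size
    table = np.empty((3, 3, k, k))
    # a = 0: both haplotypes from population B
    table[0, 0] = (1 - fb) ** 2
    table[0, 1] = 2 * fb * (1 - fb)
    table[0, 2] = fb ** 2
    # a = 1: one haplotype from each population
    table[1, 0] = (1 - fa) * (1 - fb)
    table[1, 1] = fa * (1 - fb) + fb * (1 - fa)
    table[1, 2] = fa * fb
    # a = 2: both haplotypes from population A
    table[2, 0] = (1 - fa) ** 2
    table[2, 1] = 2 * fa * (1 - fa)
    table[2, 2] = fa ** 2
    return table


def em_frequencies(genotypes, ancestry, grid, n_iter=20):
    """genotypes, ancestry: integer arrays (sites, individuals) with values 0, 1, 2."""
    table = genotype_table(grid)
    # Per-site likelihood on the (f_A, f_B) grid: product over individuals.
    lik = table[ancestry, genotypes].prod(axis=1)
    prior = np.full((grid.size, grid.size), 1.0 / grid.size ** 2)
    for _ in range(n_iter):
        post = lik * prior
        post /= post.sum(axis=(1, 2), keepdims=True)   # Bayes' rule at each site
        prior = post.mean(axis=0)                      # update the frequency distribution
    post = lik * prior
    post /= post.sum(axis=(1, 2), keepdims=True)
    return prior, post


# Toy simulation with 50-50 admixture (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n_sites, n_ind = 200, 40
grid = np.linspace(0.0, 1.0, 21)
true_fa = rng.uniform(0.05, 0.95, n_sites)
true_fb = rng.uniform(0.05, 0.95, n_sites)
from_a = rng.random((n_sites, n_ind, 2)) < 0.5           # ancestry of each haplotype
hap_freq = np.where(from_a, true_fa[:, None, None], true_fb[:, None, None])
alleles = rng.random((n_sites, n_ind, 2)) < hap_freq     # allele carried by each haplotype
ancestry = from_a.sum(axis=2)                            # diploid ancestry call
genotypes = alleles.sum(axis=2)                          # unphased genotype

prior, post = em_frequencies(genotypes, ancestry, grid)
est_fa = (post.sum(axis=2) * grid).sum(axis=1)           # posterior mean of f_A per site
print("correlation with true f_A:", np.corrcoef(true_fa, est_fa)[0, 1])
```

Credible intervals for each site can be read from the cumulative sums of the marginal posterior `post.sum(axis=2)`.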
Because the demographic model considered here does not involve migrations between Native groups, we considered the composite likelihood of three pairwise two-population allele frequency distributions, rather than the full three-population spectrum. This allows for much faster inference and better convergence of the numerical optimization. In principle, it also enables the joint inference of more than three populations. We showed through simulations that the use of a composite likelihood had an effect on inferred parameters that was much smaller than other sources of uncertainty. We used grids of 20, 40, and 60 grid points per population, and projected Native American allele frequencies to sample sizes of 10 in PUR, 20 in CLM, and 40 in MXL.
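Two ingredients of this paragraph, projecting a spectrum to a smaller sample size and summing pairwise log-likelihoods, can be sketched briefly. The code below is illustrative only: it assumes hypergeometric down-sampling of a one-dimensional spectrum and a Poisson log-likelihood with the model-independent factorial term dropped, and the function names and toy 5×5 spectra are hypothetical rather than taken from the inference software used in the study.

```python
from math import comb

import numpy as np


def project_sfs(sfs, m):
    """Project a 1D spectrum over n chromosomes down to m <= n chromosomes."""
    n = len(sfs) - 1
    out = np.zeros(m + 1)
    for i, count in enumerate(sfs):
        for j in range(m + 1):
            # Hypergeometric probability of j copies in a subsample of size m.
            out[j] += count * comb(i, j) * comb(n - i, m - j) / comb(n, m)
    return out


def poisson_loglik(model, data):
    """Poisson log-likelihood of a spectrum, up to a model-independent constant."""
    model = np.asarray(model, dtype=float)
    data = np.asarray(data, dtype=float)
    mask = model > 0
    return float(np.sum(data[mask] * np.log(model[mask]) - model[mask]))


def composite_loglik(models, data):
    """Sum of pairwise two-population log-likelihoods."""
    return sum(poisson_loglik(models[pair], data[pair]) for pair in data)


# Four singletons in a sample of 4 chromosomes, projected to 2 chromosomes.
print(project_sfs([0, 4, 0, 0, 0], 2))

# Toy pairwise spectra (random counts, for illustration only).
rng = np.random.default_rng(1)
pairs = [("MXL", "CLM"), ("MXL", "PUR"), ("CLM", "PUR")]
data = {p: rng.integers(1, 50, size=(5, 5)).astype(float) for p in pairs}
perfect = composite_loglik(data, data)
inflated = composite_loglik({p: 1.5 * v for p, v in data.items()}, data)
print(perfect > inflated)
```

Each pairwise term only needs a two-population model spectrum, which is what makes the composite approach cheaper than evaluating the full three-population spectrum.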
The first two principal components for 1000 Genomes populations, showing the distribution of admixed populations.
Ancestry tract length distribution in MXL compared to the predictions of the best-fitting migration model (displayed below). Solid lines represent model predictions and shaded areas are one-sigma confidence regions surrounding the predictions, assuming a Poisson distribution.
Distribution of IBD lengths within populations (red) and across populations (purple).
IBD inconsistency rate as a function of IBD length. Long IBD segments exhibit significantly fewer ancestry inconsistencies. The line represents within-population IBD; the red dots represent across-population IBD.
Ancestry assignments in a control formed by taking the non-IBD matching haplotypes at loci where the alternate haplotypes are IBD.
Results of admixture analysis with K=3 to K=12, with Native American populations grouped by geographic origin.
(a) Bootstrap distributions and (b) pairwise correlations for demographic inference parameters. Vertical red bars mark the optimal parameters.
Supplementary methods include additional description of statistical and filtering methods used in this article.
The members of the 1000 Genomes project are: Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers BM, Korbel JO, Lander ES, Lee C, Lehrach H, Mardis ER, Marth GT, McVean GA, Nickerson DA, Schmidt JP, Sherry ST, Wang J, Wilson RK, Gibbs RA, Dinh H, Kovar C, Lee S, Lewis L, Muzny D, Reid J, Wang M, Wang J, Fang X, Guo X, Jian M, Jiang H, Jin X, Li G, Li J, Li Y, Li Z, Liu X, Lu Y, Ma X, Su Z, Tai S, Tang M, Wang B, Wang G, Wu H, Wu R, Yin Y, Zhang W, Zhao J, Zhao M, Zheng X, Zhou Y, Lander ES, Altshuler DM, Gabriel SB, Gupta N, Flicek P, Clarke L, Leinonen R, Smith RE, Zheng-Bradley X, Bentley DR, Grocock R, Humphray S, James T, Kingsbury Z, Lehrach H, Sudbrak R, Albrecht MW, Amstislavskiy VS, Borodina TA, Lienhard M, Mertes F, Sultan M, Timmermann B, Yaspo ML, Sherry ST, McVean GA, Mardis ER, Wilson RK, Fulton L, Fulton R, Weinstock GM, Durbin RM, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Schmidt JP, Davies CJ, Gollub J, Webster T, Wong B, Zhan Y, Auton A, Gibbs RA, Yu F, Bainbridge M, Challis D, Evani US, Lu J, Muzny D, Nagaswamy U, Reid J, Sabo A, Wang Y, Yu J, Wang J, Coin LJ, Fang L, Guo X, Jin X, Li G, Li Q, Li Y, Li Z, Lin H, Liu B, Luo R, Qin N, Shao H, Wang B, Xie Y, Ye C, Yu C, Zhang F, Zheng H, Zhu H, Marth GT, Garrison EP, Kural D, Lee WP, Leong WF, Ward AN, Wu J, Zhang M, Lee C, Griffin L, Hsieh CH, Mills RE, Shi X, von Grotthuss M, Zhang C, Daly MJ, DePristo MA, Altshuler DM, Banks E, Bhatia G, Carneiro MO, del Angel G, Gabriel SB, Genovese G, Gupta N, Handsaker RE, Hartl C, Lander ES, McCarroll SA, Nemesh JC, Poplin RE, Schaffner SF, Shakir K, Yoon SC, Lihm J, Makarov V, Jin H, Kim W, Kim KC, Korbel JO, Rausch T, Flicek P, Beal K, Clarke L, Cunningham F, Herrero J, McLaren WM, Ritchie GR, Smith RE, Zheng-Bradley X, Clark AG, Gottipati S, Keinan A, Rodriguez-Flores JL, Sabeti PC, Grossman 
SR, Tabrizi S, Tariyal R, Cooper DN, Ball EV, Stenson PD, Bentley DR, Barnes B, Bauer M, Cheetham R, Cox T, Eberle M, Humphray S, Kahn S, Murray L, Peden J, Shaw R, Ye K, Batzer MA, Konkel MK, Walker JA, MacArthur DG, Lek M, Sudbrak R, Amstislavskiy VS, Herwig R, Shriver MD, Bustamante CD, Byrnes JK, De La Vega FM, Gravel S, Kenny EE, Kidd JM, Lacroute P, Maples BK, Moreno-Estrada A, Zakharia F, Halperin E, Baran Y, Craig DW, Christoforides A, Homer N, Izatt T, Kurdoglu AA, Sinari SA, Squire K, Sherry ST, Xiao C, Sebat J, Bafna V, Ye K, Burchard EG, Hernandez RD, Gignoux CR, Haussler D, Katzman SJ, Kent WJ, Howie B, Ruiz-Linares A, Dermitzakis ET, Lappalainen T, Devine SE, Liu X, Maroo A, Tallon LJ, Rosenfeld JA, Michelson LP, Abecasis GR, Kang HM, Anderson P, Angius A, Bigham A, Blackwell T, Busonero F, Cucca F, Fuchsberger C, Jones C, Jun G, Li Y, Lyons R, Maschio A, Porcu E, Reinier F, Sanna S, Schlessinger D, Sidore C, Tan A, Trost MK, Awadalla P, Hodgkinson A, Lunter G, McVean GA, Marchini JL, Myers S, Churchhouse C, Delaneau O, Gupta-Hinch A, Iqbal Z, Mathieson I, Rimmer A, Xifara DK, Oleksyk TK, Fu Y, Liu X, Xiong M, Jorde L, Witherspoon D, Xing J, Eichler EE, Browning BL, Alkan C, Hajirasouliha I, Hormozdiari F, Ko A, Sudmant PH, Mardis ER, Chen K, Chinwalla A, Ding L, Dooling D, Koboldt DC, McLellan MD, Wallis JW, Wendl MC, Zhang Q, Durbin RM, Hurles ME, Tyler-Smith C, Albers CA, Ayub Q, Balasubramaniam S, Chen Y, Coffey AJ, Colonna V, Danecek P, Huang N, Jostins L, Keane TM, Li H, McCarthy S, Scally A, Stalker J, Walter K, Xue Y, Zhang Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Y, Habegger L, Harmanci AO, Jin M, Khurana E, Mu XJ, Sisu C, Li Y, Luo R, Zhu H, Lee C, Griffin L, Hsieh CH, Mills RE, Shi X, von Grotthuss M, Zhang C, Marth GT, Garrison EP, Kural D, Lee WP, Ward AN, Wu J, Zhang M, McCarroll SA, Altshuler DM, Banks E, del Angel G, Genovese G, Handsaker RE, Hartl C, Nemesh JC, Shakir K, Yoon SC, Lihm J, Makarov V, Degenhardt 
J, Flicek P, Clarke L, Smith RE, Zheng-Bradley X, Korbel JO, Rausch T, Sttz AM, Bentley DR, Barnes B, Cheetham R, Eberle M, Humphray S, Kahn S, Murray L, Shaw R, Ye K, Batzer MA, Konkel MK, Walker JA, Lacroute P, Craig DW, Homer N, Church D, Xiao C, Sebat J, Bafna V, Michaelson JJ, Ye K, Devine SE, Liu X, Maroo A, Tallon LJ, Lunter G, Iqbal Z, Witherspoon D, Xing J, Eichler EE, Alkan C, Hajirasouliha I, Hormozdiari F, Ko A, Sudmant PH, Chen K, Chinwalla A, Ding L, McLellan MD, Wallis JW, Hurles ME, Blackburne B, Li H, Lindsay SJ, Ning Z, Scally A, Walter K, Zhang Y, Gerstein MB, Abyzov A, Chen J, Clarke D, Khurana E, Mu XJ, Sisu C, Gibbs RA, Yu F, Bainbridge M, Challis D, Evani US, Kovar C, Lewis L, Lu J, Muzny D, Nagaswamy U, Reid J, Sabo A, Yu J, Guo X, Li Y, Wu R, Marth GT, Garrison EP, Leong WF, Ward AN, del Angel G, DePristo MA, Gabriel SB, Gupta N, Hartl C, Poplin RE, Clark AG, Rodriguez-Flores JL, Flicek P, Clarke L, Smith RE, Zheng-Bradley X, MacArthur DG, Bustamante CD, Gravel S, Craig DW, Christoforides A, Homer N, Izatt T, Sherry ST, Xiao C, Dermitzakis ET, Abecasis GR, Kang HM, McVean GA, Mardis ER, Dooling D, Fulton L, Fulton R, Koboldt DC, Durbin RM, Balasubramaniam S, Keane TM, McCarthy S, Stalker J, Gerstein MB, Balasubramanian S, Habegger L, Garrison EP, Gibbs RA, Bainbridge M, Muzny D, Yu F, Yu J, del Angel G, Handsaker RE, Makarov V, Rodriguez-Flores JL, Jin H, Kim W, Kim KC, Flicek P, Beal K, Clarke L, Cunningham F, Herrero J, McLaren WM, Ritchie GR, Zheng-Bradley X, Tabrizi S, MacArthur DG, Lek M, Bustamante CD, De La Vega FM, Craig DW, Kurdoglu AA, Lappalainen T, Rosenfeld JA, Michelson LP, Awadalla P, Hodgkinson A, McVean GA, Chen K, Tyler-Smith C, Chen Y, Colonna V, Frankish A, Harrow J, Xue Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Y, Harmanci AO, Jin M, Khurana E, Mu XJ, Sisu C, Gibbs RA, Fowler G, Hale W, Kalra D, Kovar C, Muzny D, Reid J, Wang J, Guo X, Li G, Li Y, Zheng X, Altshuler DM, Flicek P, Clarke L, Barker 
J, Kelman G, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Roa A, Smirnov D, Smith RE, Streeter I, Toneva I, Vaughan B, Zheng-Bradley X, Bentley DR, Cox T, Humphray S, Kahn S, Sudbrak R, Albrecht MW, Lienhard M, Craig DW, Izatt T, Kurdoglu AA, Sherry ST, Ananiev V, Belaia Z, Beloslyudtsev D, Bouk N, Chen C, Church D, Cohen R, Cook C, Garner J, Hefferon T, Kimelman M, Liu C, Lopez J, Meric P, O'Sullivan C, Ostapchuk Y, Phan L, Ponomarov S, Schneider V, Shekhtman E, Sirotkin K, Slotta D, Xiao C, Zhang H, Haussler D, Abecasis GR, McVean GA, Alkan C, Ko A, Dooling D, Durbin RM, Balasubramaniam S, Keane TM, McCarthy S, Stalker J, Chakravarti A, Knoppers BM, Abecasis GR, Barnes KC, Beiswanger C, Burchard EG, Bustamante CD, Cai H, Cao H, Durbin RM, Gharani N, Gibbs RA, Gignoux CR, Gravel S, Henn B, Jones D, Jorde L, Kaye JS, Keinan A, Kent A, Kerasidou A, Li Y, Mathias R, McVean GA, Moreno-Estrada A, Ossorio PN, Parker M, Reich D, Rotimi CN, Royal CD, Sandoval K, Su Y, Sudbrak R, Tian Z, Timmermann B, Tishkoff S, Toji LH, Tyler-Smith C, Via M, Wang Y, Yang H, Yang L, Zhu J, Bodmer W, Bedoya G, Ruiz-Linares A, Ming CZ, Yang G, You CJ, Peltonen L, Garcia-Montero A, Orfao A, Dutil J, Martinez-Cruzado JC, Oleksyk TK, Brooks LD, Felsenfeld AL, McEwen JE, Clemm NC, Duncanson A, Dunn M, Green ED, Guyer MS, Peterson JL.
Conceived and designed the experiments: SG MV KS TKO ARL EGB JCMC CDB. Analyzed the data: SG FZ JKB MM AME JLRF EEK CRG WG. Contributed reagents/materials/analysis tools: SG FZ JKB AME CRG BKM JD GB TKO ARL EGB. Wrote the paper: SG MM AME CDB.
This study was supported by NSF grant 7188155, NHGRI grant HG005715, and NIH R01 GM090087 (to CDB); UCSF Chancellor's Research Fellowship, Dissertation Year Fellowship, and in part by NIH Training Grant T32 GM007175 (to CRG); 1P60 MD006902, R01 HL088133, R01 ES015794, RWJF Amos Medical Faculty Development Award, the Sandler Foundation, and the American Asthma Foundation (to EGB); and the Pew Latin American Fellows Program (MM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.