Ancestry for Mesoamerican and South American samples was assessed initially using mtDNA and Y-STRs. We sought to identify samples with maximal New World and minimal European ancestry for additional high-throughput genotyping in a larger study of worldwide genetic variation [9]. The analysis of mtDNA HVS I and II showed that all Bolivian and Totonac samples belong to haplogroups A2, B2, C1 and D1, consistent with pre-Columbian New World maternal ancestry. Mitochondrial haplogroup A2 was the predominant lineage in the Totonacs (63%), while haplogroup B2 was prevalent in the Bolivians (71%). All Totonacs and 17 Bolivians (61%) had pre-Columbian Y-chromosomes (Q1a3a1). Consistent with historical accounts of male European admixture, 11 Bolivians (39%) carried Y-chromosome lineages that are common in Europe (R1b, J2, G) (Figure 1).
Figure 1 Sampling locations and the distribution of major mtDNA and Y-chromosome haplogroups for Mesoamerican Totonacs and South American Bolivians.
Totonac and Bolivian samples were genotyped on Affymetrix 6.0 microarrays. Following filtering (see Methods), a final autosomal dataset of 815,377 SNPs was assembled for the Totonacs (24), the Bolivians (23), and four HapMap populations (YRI, CEU, CHB, and JPT). Allele-sharing distances among individuals were estimated. A principal components analysis (PCA) of the individual distance estimates shows that most New World Bolivians and Totonacs are tightly clustered and more similar to eastern Asians than to Europeans (Figure 2a, panel 1). Nine Bolivians have substantially greater genetic affinity to HapMap Europeans than to other New World individuals based on their allele-sharing distances, suggesting European admixture in these samples. In the context of other southern Native American populations, the nine admixed Bolivians and one Mayan diverge from all other groups, while the Totonacs are loosely clustered but relatively distinct from other samples (Figure 2a, panel 2).
Figure 2 a) Principal components plot of individual pairwise genetic distance estimates. Panel 1 – most New World Totonac and Bolivian individuals are clustered and have smaller estimated distances to the HapMap CHB/JPT than to the CEU or YRI (~815K ...
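As an illustration of the distance step described above, the following is a minimal sketch of one common definition of allele-sharing distance: one minus the proportion of alleles shared identical by state, averaged over SNPs typed in both individuals. The genotype coding (0, 1 or 2 copies of an allele, `None` for a missing call), the function names, and the toy genotypes are assumptions made for illustration; the exact estimator used in the study is described in its Methods and may differ.

```python
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """Mean per-SNP allele-sharing distance between two individuals.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call.
    At each SNP the pair shares 2 - |g1 - g2| alleles identical by state,
    so the per-SNP distance is |g1 - g2| / 2.
    """
    diffs = [abs(a - b) / 2 for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        raise ValueError("no SNPs typed in both individuals")
    return sum(diffs) / len(diffs)


def distance_matrix(genotypes):
    """Symmetric distance lookup keyed by (name, name) pairs."""
    names = sorted(genotypes)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        d = allele_sharing_distance(genotypes[a], genotypes[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical toy genotypes, not data from the study.
toy = {
    "ind_A": [0, 1, 2, 2, None],
    "ind_B": [0, 1, 2, 1, 0],
    "ind_C": [2, 1, 0, 0, 2],
}
D = distance_matrix(toy)
```

A matrix of this kind is the input to the principal components step; individuals with small mutual distances fall close together in the resulting plot.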
To estimate ancestry and the fraction of European admixture in each individual, we used the model-based population structure analysis implemented in the Admixture program [17]. The nine Bolivians identified as having potential European admixture by PCA show substantial European ancestry (22–47%) (Figure 2b). This analysis also detected one additional Bolivian with a small amount of European ancestry that was not clearly discerned by PCA. Inclusion of the HapMap African and East Asian populations in the population structure analysis yielded 2–8% potential African admixture in 8 Bolivians and 3 Totonacs. Though separated geographically by ~5,000 km, Bolivians and Totonacs remained identified by a single ancestry component (K) until K = 8 (not shown).
With the appropriate reference populations, high-density SNP data can be used to map the ancestry of chromosomal regions in admixed individuals. We constructed representative reference populations from the CEU samples and the non-admixed New World samples. Reference population genotypes were phased, and the Hapmix algorithm was used to estimate the probability that each SNP allele originated from one of the reference groups. This procedure was also performed for a randomly selected individual from each reference population. After optimizing parameters (see Methods), the average estimated fraction of European admixture in the 10 admixed Bolivians ranged from 0.13 to 0.48 (Table 1). These values were highly concordant with estimates from the population structure analysis performed using the Admixture algorithm (r
European admixture estimates for admixed Bolivian samples
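Concordance between two sets of per-individual admixture estimates, such as those from Hapmix and Admixture, is typically summarized with a Pearson correlation coefficient. The sketch below shows that calculation under stated assumptions: the function name is hypothetical and the two input lists are invented placeholder values, not the estimates reported in the table.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (one entry per individual).
local_ancestry_estimates = [0.13, 0.22, 0.31, 0.40, 0.48]
structure_estimates = [0.14, 0.21, 0.33, 0.39, 0.47]
r = pearson_r(local_ancestry_estimates, structure_estimates)
```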
To better assess potential African admixture in the native Bolivians and Totonacs, we tested each population against the African YRI reference population using Hapmix. No African haplotype segments were found in the native Bolivians. Admixed Bolivian samples could not be tested against the African reference because the number of ancestry components exceeds two. The Totonacs yielded a total of 3 heterozygous YRI segments of less than 2.9 Mb found in two samples. This third approach suggests very minimal African admixture in the Totonacs or native Bolivian samples and excludes recent African admixture based on the small segment size.
The two New World populations provided us with an opportunity to compare ancestry predictions based on mtDNA, Y-chromosome, and autosomal data in non-admixed and admixed populations. The autosomal SNPs show that Totonacs have, at most, ~1.3% average admixture. All Totonac mtDNA and Y-chromosome haplogroups are consistent with pre-Columbian New World ancestry. In contrast, the Bolivians have, on average, ~12.1% admixture, attributable to 10 of the 23 individuals. Because five Bolivians with J or G Y-chromosome haplogroups were not typed on microarrays, our estimate of European autosomal admixture in the Bolivians is likely conservative due to this bias. Three of the ten admixed Bolivians carried pre-Columbian New World mtDNA and Y-chromosome haplogroups yet harbored ~5–30% autosomal European admixture at the individual level, demonstrating that ancestry prediction based on mtDNA and Y-chromosome haplogroups alone does not necessarily capture an individual’s actual ancestry.
To estimate the average age of the admixture event, we calculated the likelihood of the data from each individual and chromosome under models that assumed different numbers of generations since admixture. The sum of likelihoods over all admixed individuals and chromosomes is maximized for a European admixture event 12 generations ago (Figure 3). This result suggests an approximate time of admixture of 360–384 years ago, assuming a generation time of 30–32 years.
Figure 3 Estimate for the age of admixture in Bolivians. The Hapmix log likelihoods, summed over all individuals and chromosomes, are plotted for generations 2 through 35.
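The dating logic can be sketched as follows: sum the log likelihoods across individuals and chromosomes for each candidate number of generations, take the generation count with the largest total, and convert it to years using the assumed generation time. The function names and the toy log-likelihood values are hypothetical; only the 12-generation result and the 30–32 year generation time come from the text.

```python
def best_generation(loglik_by_generation):
    """Pick the generation count with the largest summed log likelihood.

    loglik_by_generation maps a number of generations to a list of
    log likelihoods, one per (individual, chromosome) combination.
    """
    totals = {g: sum(vals) for g, vals in loglik_by_generation.items()}
    best = max(totals, key=totals.get)
    return best, totals


def years_since_admixture(generations, generation_time=(30, 32)):
    """Convert generations to a (low, high) range in years."""
    low, high = generation_time
    return generations * low, generations * high


# Hypothetical log likelihoods for three candidate models.
toy_loglik = {
    10: [-105.0, -98.5],
    12: [-101.2, -97.1],
    14: [-104.4, -99.0],
}
generations, totals = best_generation(toy_loglik)
year_range = years_since_admixture(generations)
```

With 12 generations and a generation time of 30–32 years, the conversion gives 12 × 30 = 360 and 12 × 32 = 384 years, matching the range stated above.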
We tested for familial relationships among the admixed Bolivians using a maximum-likelihood approach as implemented in the Estimation of Recent Shared Ancestry (ERSA) software package [19]. Only one of 45 pairwise comparisons among the ten admixed samples showed significant familial ties (p < 0.001; estimated at 9th-degree relatives (6–19 degrees, 95% CI)), indicating that the admixture in the Bolivians is not explained by recent shared ancestry. Additionally, we used ERSA to test for relatedness in all Bolivians and Totonacs. Among Bolivians, we found 1 second-degree, 3 fourth-degree, 6 fifth-degree, and 10 sixth-degree relatives in 253 pairwise tests. Among Totonacs, there were 3 third-degree, 21 fourth-degree and 240 fifth- through seventh-degree relationships in 276 pairwise tests, typical of a population isolate with shared ancestry. Between Bolivians and Totonacs, no pairwise tests showed significant shared ancestry due to relatedness.
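The numbers of pairwise tests quoted above follow from the sample sizes, since n individuals form n(n-1)/2 unordered pairs. The short sketch below reproduces those counts and, as an added illustration, the fraction of pairs reported as related in each group; the dictionary layout and names are assumptions, while the sample sizes and relative counts are taken from the text.

```python
from math import comb

# Sample sizes and counts of significant relative pairs as reported in the text.
groups = {
    "admixed Bolivians": {"n": 10, "related_pairs": 1},
    "Bolivians": {"n": 23, "related_pairs": 1 + 3 + 6 + 10},
    "Totonacs": {"n": 24, "related_pairs": 3 + 21 + 240},
}


def pairwise_tests(n):
    """Number of unordered pairs among n individuals, n(n-1)/2."""
    return comb(n, 2)


summary = {}
for name, group in groups.items():
    tests = pairwise_tests(group["n"])
    fraction = group["related_pairs"] / tests
    summary[name] = (tests, fraction)
    print(f"{name}: {tests} pairwise tests, {fraction:.1%} of pairs related")
```

The contrast between the two fractions makes the population-isolate pattern in the Totonacs explicit: most Totonac pairs show detectable relatedness, whereas only a small minority of Bolivian pairs do.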
With the goal of producing a small set of AIMs that can rapidly identify indigenous American ancestry, we analyzed autosomal SNPs for New World ancestry information content in the non-admixed Bolivian and Totonac samples. SNPs were screened to identify those with low allele-frequency variance between the non-admixed Bolivians and Totonacs and high allele-frequency variance between the combined New World populations and each Old World population (YRI, CEU, CHB/JPT). A set of 324 AIMs was identified (see Methods and Additional file 1: Table S1).
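The screening rule described above can be sketched as a simple filter on allele frequencies. This is an illustrative sketch only: the variance thresholds (`within_max`, `between_min`), the population labels used as dictionary keys, the unweighted averaging of the two New World frequencies, and the toy SNPs are all assumptions; the criteria actually applied are given in the Methods.

```python
from statistics import pvariance

OLD_WORLD = ("YRI", "CEU", "CHB/JPT")


def is_candidate_aim(freqs, within_max=0.005, between_min=0.10):
    """Apply a two-part allele-frequency variance screen to one SNP.

    freqs maps population labels to the frequency of the same allele.
    The thresholds are hypothetical placeholders.
    """
    # Part 1: the two New World populations must have similar frequencies.
    within = pvariance([freqs["BOL"], freqs["TOT"]])
    if within > within_max:
        return False
    # Part 2: the combined New World frequency (unweighted mean here, an
    # assumption) must differ strongly from every Old World population.
    new_world = (freqs["BOL"] + freqs["TOT"]) / 2
    return all(pvariance([new_world, freqs[pop]]) >= between_min
               for pop in OLD_WORLD)


# Hypothetical SNPs for illustration.
toy_snps = {
    "snp_informative": {"BOL": 0.95, "TOT": 0.90, "YRI": 0.05, "CEU": 0.10, "CHB/JPT": 0.20},
    "snp_differs_within": {"BOL": 0.90, "TOT": 0.30, "YRI": 0.05, "CEU": 0.10, "CHB/JPT": 0.20},
    "snp_shared_with_asia": {"BOL": 0.90, "TOT": 0.90, "YRI": 0.10, "CEU": 0.10, "CHB/JPT": 0.80},
}
candidates = [name for name, f in toy_snps.items() if is_candidate_aim(f)]
```

Requiring the between-group condition for every Old World population, rather than on average, is what excludes SNPs such as the third toy example, which separates the New World from Africa and Europe but not from East Asia.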
The 324 markers accurately distinguished the Totonac and Bolivian samples from other populations in a population structure analysis (Figure 4a). No Old World sample exceeded 14% inferred New World ancestry, while all non-admixed Bolivian and Totonac samples had at least 91% inferred New World ancestry (median 98%). The admixture estimates in the ten admixed Bolivian samples using these 324 AIMs were correlated with estimates from 120,958 unlinked genome-wide SNPs (r
Figure 4 Structure analysis of Bolivians and Totonacs using a panel of 324 AIMs. a) New World ancestry is predicted for all Bolivians and Totonacs. A non-New World ancestry component is correctly distinguished in the ten Bolivians with European admixture. b) A ...
To assess the utility and portability of the AIMs to other New World populations and to compare these AIMs to other AIM sets, we merged our data with samples from the Human Genome Diversity Project (HGDP) [20], which were typed on the Affymetrix platform. We also added worldwide populations examined previously by our group [9]. The five HGDP New World populations (Surui, Karitiana, Colombian, Maya, and Pima; N = 5 each), Bolivians, and Totonacs were assessed with all the AIMs present in both data sets (173 AIMs). These AIMs have power to distinguish all seven New World populations from 61 different Old World groups (Figure 4b). Kosoy et al. identified a set of 128 AIMs [21], and forty-seven of these AIMs were present in the merged data set. The 47 Kosoy AIMs identify Native American ancestry but do not separate the closely related Old World populations (central and eastern Asians) from the New World populations as effectively as an equal number of New World AIMs identified in this study (Figure 4c). The estimated fraction of non-New World ancestry for the larger panel remained well-correlated with genome-wide estimates based on 130,288 unlinked SNPs (r
0.001). Thus, our panel of AIMs represents a small set of loci that efficiently identifies Native American ancestry in two unrelated ascertainment populations and five independent indigenous groups from Meso- and South America. We emphasize that our AIMs were designed for and are most effective for identifying New World ancestry under a two ancestry component model (K
We assessed the minimum number of AIMs that could still effectively distinguish the Native American ancestry component in Totonacs, Bolivians, and admixed Bolivians. Using a resampling strategy, all 324 AIMs were ranked empirically for their ability to correctly estimate Native American ancestry in each of these populations as compared to the estimate from 120,958 genome-wide SNPs (see Methods). The root-mean-squared error for these ranked sets of AIMs shows that the best 40–50 AIMs provide nearly the same accuracy for estimating Native American ancestry as all 324 AIMs (Figure 5). These estimates are within 7% of the genome-wide average and are conservative, under-estimating the actual proportion of Native American ancestry.
Figure 5 Accuracy and performance of the 324 AIMs. The root mean square error (RMSE) between Native American ancestry estimates using AIMs and the ancestry estimate using 120,958 genome-wide markers. (Note: the full AIMs panel (324 markers) produces ancestry estimates ...
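The accuracy measure used above can be written out directly. The sketch below computes the root-mean-squared error between AIM-based and genome-wide ancestry estimates, plus the mean signed error, whose negative sign corresponds to the conservative (under-estimating) behavior noted in the text. The function names and the per-individual values are hypothetical; the resampling and ranking procedure itself is not reproduced here.

```python
from math import sqrt


def rmse(estimates, reference):
    """Root-mean-squared error between paired ancestry estimates."""
    if len(estimates) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return sqrt(sum(squared) / len(squared))


def mean_signed_error(estimates, reference):
    """Average of (estimate - reference); negative means under-estimation."""
    return sum(e - r for e, r in zip(estimates, reference)) / len(reference)


# Hypothetical Native American ancestry fractions for three individuals.
genome_wide = [1.00, 0.80, 0.60]
aim_subset = [0.97, 0.76, 0.60]

error = rmse(aim_subset, genome_wide)
bias = mean_signed_error(aim_subset, genome_wide)
```

Evaluating `rmse` for the top 10, 20, 30 and so on ranked AIMs, and plotting the result against panel size, would give a curve of the kind summarized in Figure 5.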
We next extended our SNP screening procedure to identify the most highly-differentiated New World SNPs in our data set. We selected SNPs comprising the upper 5% tails for standardized allele-frequency variance and Kullback–Leibler divergence of the derived allele for the New World (non-admixed Bolivians and Totonacs) versus each of the Old World groups. We obtained the intersection of the SNPs identified in these comparisons to find New World SNP alleles that were present in, but highly divergent from, the same alleles in each major Old World group, thus obtaining alleles with low variance in the Americas but high variance and high divergence between New and Old World groups. We found 22 SNPs in 17 genomic regions meeting these criteria (Table 2, Additional file 2: Table S2 and Additional file 3: Table S3).
Location and regional genomic features of highly-differentiated New World SNPs
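The tail-and-intersect screen can be illustrated with a small sketch. It assumes a biallelic SNP, for which the Kullback–Leibler divergence between a New World allele frequency p and an Old World frequency q is p ln(p/q) + (1-p) ln((1-p)/(1-q)), and it clamps frequencies away from 0 and 1 to keep the logarithms finite. The clamp value, function names, and toy scores are assumptions; the standardized allele-frequency variance statistic is not reproduced here, but its scores could be passed through the same `upper_tail` and `shared_outliers` helpers.

```python
from math import log


def kl_divergence(p, q, eps=1e-6):
    """KL divergence between two biallelic allele-frequency distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))


def upper_tail(scores, fraction=0.05):
    """SNP ids whose score falls in the top `fraction` of the distribution."""
    n_keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def shared_outliers(scores_by_comparison, fraction=0.05):
    """Intersect the upper tails across all comparisons."""
    tails = [upper_tail(s, fraction) for s in scores_by_comparison.values()]
    return set.intersection(*tails)


# Hypothetical scores for 40 SNPs in two New World vs Old World comparisons.
vs_yri = {f"snp{i}": float(i) for i in range(40)}
vs_ceu = dict(vs_yri)
vs_ceu["snp38"] = 0.0  # an outlier against YRI only
outliers = shared_outliers({"YRI": vs_yri, "CEU": vs_ceu})
```

Taking the intersection rather than the union is the step that restricts the result to SNPs that are extreme against every Old World group at once.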
To evaluate the effects of selection and drift on the regions containing the highly differentiated alleles, we performed a genome-wide scan for selection in the New World samples using a multi-locus composite likelihood ratio test of allele-frequency differentiation as implemented in the XP-CLR program [22]. This method tests for alleles whose frequencies have changed more rapidly than predicted under a model of genetic drift and may be especially effective for detecting older selection signals. We used the combined non-admixed New World samples as the test population and the Old World Eurasians (CHB, JPT, and CEU) as the reference group. We also considered the CHB/JPT and the CEU as reference populations separately. Of the 22 SNPs identified as highly differentiated, 13 were included in the top 1% of the XP-CLR scan for selective sweeps, and the other 9 were in the top 10%, suggesting moderate effects of selection at these regions.
To control for the possibility that our highly-differentiated SNPs and the XP-CLR method are detecting similar signals based only on allele frequency differences between New and Old World populations, we used XP-EHH to identify candidate selection regions in the New World samples with extended haplotype homozygosity. To find candidate regions most likely to be specific to the Americas, we performed the XP-EHH test using the closely related CHB/JPT population as the reference group. Two SNPs identified by the highly-differentiated SNP screen occurred in genomic blocks that scored second and fifth in the XP-EHH test, and within these genomic blocks, the highest scoring XP-EHH SNP was located within 24 and 33 kb of the highly-differentiated SNPs, respectively. These high-scoring regions are contained within the solute carrier family 6 (SLC6A11) and the activin A type 1B (ACVR1B) receptor genes. An additional 11 highly-differentiated SNPs, from 9 independent regions, were located within genomic blocks that scored in the top 2.5% of the XP-EHH distribution (see Table