In this study, we developed, validated, and tested a novel panel of AIMs designed to accurately estimate the ancestral components (African, European, and Native American) of contemporary Latin American populations. We developed a new algorithm (provided in the web resources online) capable of taking genome-wide data from multiple populations within each continental group and identifying the most informative, well-balanced and portable markers to estimate ancestry proportions.
The ancestral samples used to identify the AIMs represented a wide variety of populations within each continental group. Specifically, we used six samples from Mesoamerica and the South American Andes as representatives of the ancestral Native American populations that make up modern Latin Americans. Our Native American samples had a median Native American ancestry of 97.7% (25th to 75th percentile: 93.2% to 100%) based on ancestry ascertainments using genome-wide data. Given the history of European colonization in the Americas, a small amount of European genetic admixture (2.3%, 1×10−5: 6.2%) is not surprising. However, a small amount of European admixture would be expected to result in an underestimate of the information content of our AIMs. Although we did not include Native American populations from English-speaking North America in our analysis, our selection of markers excluded those with significant heterogeneity between Native American populations. Thus, we have no reason to believe the markers cannot be applied to North American populations, though the use of these markers for populations outside of Latin America should be pursued with caution.
We also included two samples from Africa (Yoruba from Nigeria and Luhya from Kenya, in East Africa). Historical records and genetic analyses indicate that most of the slaves imported into the Americas originated in West Africa [24]. Although it would have been ideal to include multiple West African ancestral populations, we included the Luhya sample in our study because unlike the Yoruba, who are descendants of the Benue-Congo subfamily of the Niger-Congo language family, the Luhya are a Bantu-speaking population, and many of the enslaved Africans brought to the Americas were Bantu speakers. Multiple studies show that the Luhya and other Bantu-speaking groups from East Africa are more closely related to West African Bantu speakers than to other East African ethnic groups [24], [25]. In addition, a small but significant number of slaves originated in Southeastern Africa [26], [27], [28]. Finally, we used three European samples to estimate ancestral frequencies in Europe. Importantly, samples from Italy and the Iberian Peninsula, which have been the largest sources of European migrants to Latin America, were included in this analysis.
By excluding markers with significant within-continent heterogeneity, the selected panel of AIMs should be broadly portable to populations from throughout the Americas. Moreover, the exclusion of markers exhibiting substantial within-continent heterogeneity serves to ensure that there is relatively little bias in the estimates of ancestral allele frequency. This is because any bias would have had to occur in all of the ancestral populations within a given continent, at a similar magnitude and in the same direction. On the other hand, by design, the AIMs panel would not be expected to differentiate within-continent population substructure. Indeed, we found that the eight Native American populations genotyped with the AIMs panel were indistinguishable in principal component space beyond the first principal component, which represented the degree of European admixture (data not shown). There are several reasons we chose to exclude markers that could have potentially been used to differentiate within-continent substructure. First, the principal reason for designing this panel was to identify continental ancestry proportions in admixed samples, as continental admixture is the most important source of population structure in Latin Americans. Second, because we had a limited number of Native American ancestral groups available for study, we would have only been able to generate AIMs that distinguished Mesoamerican populations from Andean populations. Third, the use of heterogeneity filters was an important element of quality control, as it served to filter out alleles with extreme frequencies due to bias. Finally, because the genetic differences within continental groups are smaller than between continental groups, we would have required many more markers to accurately determine within-continent substructure.
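The filtering idea described above can be illustrated with a minimal sketch. This is not the algorithm released with the paper: the function name `select_aims`, the two thresholds, the marker names, and the allele frequencies are all hypothetical, and the sketch ignores the balancing of information across the three ancestries. It only shows the general pattern of keeping markers that differ strongly between continents while differing little between populations within a continent.

```python
from statistics import mean

def select_aims(freqs, min_delta=0.3, max_within_range=0.1):
    """Keep markers with large between-continent and small within-continent differences.

    freqs maps marker -> continent -> list of allele frequencies,
    one frequency per ancestral population in that continent.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    selected = []
    for marker, by_continent in freqs.items():
        # Within-continent heterogeneity: largest frequency spread across populations.
        within = max(max(f) - min(f) for f in by_continent.values())
        # Between-continent informativeness: largest gap between continental means.
        continental_means = [mean(f) for f in by_continent.values()]
        delta = max(continental_means) - min(continental_means)
        if within <= max_within_range and delta >= min_delta:
            selected.append(marker)
    return selected

# Hypothetical frequencies for three made-up markers.
example_freqs = {
    # Informative and homogeneous within continents: kept.
    "rsA": {"African": [0.90, 0.88], "European": [0.10, 0.12, 0.11],
            "NativeAmerican": [0.15, 0.14]},
    # Informative but heterogeneous among African populations: dropped.
    "rsB": {"African": [0.90, 0.50], "European": [0.10, 0.12, 0.11],
            "NativeAmerican": [0.15, 0.14]},
    # Homogeneous but uninformative: dropped.
    "rsC": {"African": [0.50, 0.52], "European": [0.48, 0.50, 0.49],
            "NativeAmerican": [0.51, 0.50]},
}

selected = select_aims(example_freqs)
print(selected)
```

Using the spread of frequencies as the heterogeneity measure is a simplification chosen for readability; a formal heterogeneity test would be the more usual choice in practice.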
We validated the panel of AIMs by comparing ancestry estimates derived from the subset of AIMs to estimates derived from genome-wide data in four Latin American populations, three from Mexico and one from Puerto Rico. Overall, the ancestral estimates for both Puerto Ricans and all Mexican groups were consistent with previously published literature [29]. Specifically, Bryc et al. found that Puerto Ricans had 23.6%±12% African ancestry, consistent with our finding of 20.6%±12.3%, and Mexicans had 5.6%±2% African ancestry, consistent with our findings of between 3.5%±3.1% and 5.4%±3.6%. The Native American component in the three Mexican populations (64.2%±17.6%, 54.4%±16.9%, and 49.6%±17.4% in Mexicans from Mexico City, INMEGEN, and GALA studies, respectively) is also consistent with results obtained by Bryc et al. (50.1%±13%) and in a study by Silva-Zolezzi et al. of diverse Mexican Mestizo populations (55.2%±15.4%) [30].
There was strong correlation between ancestral estimates obtained from the AIMs panel and those obtained from GWAS, providing strong support for the use of the AIMs panel to accurately estimate ancestry. For over 95 percent of the samples, the estimates of ancestry using AIMs were within 10% of the value obtained using GWAS data.
The correlation was lower for the minor ancestral components (African ancestry in Mexican populations and Native American ancestry in Puerto Rican populations). This reflected the more limited between-subject variance in the minor ancestral component. Since the coefficient of determination (R2) represents the proportion of variance in the outcome variable (in this case, the true measure of ancestry), explained by the predictor (estimates of ancestry using AIMs), in cases where there is more limited variance in the outcome variable such as estimates of African ancestry in Mexicans and Native American ancestry in Puerto Ricans, we observe a lower R2. Nonetheless, measures of individual error in estimate, such as the root mean squared error, are comparable for all three ancestral estimates in both Puerto Ricans and Mexicans, suggesting that the panel performs consistently across all ancestral components, and in most cases, the estimate of ancestry using AIMs lies within 10% of the true measure of ancestry, as can be seen in .
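The relationship between variance and R2 can be made concrete with a small numerical sketch. The ancestry values and the error pattern below are invented for illustration, and the helper names `r_squared` and `rmse` are not from the paper. The same per-subject error of 0.02 is added to a widely varying "major" component and to a narrowly varying "minor" component.

```python
from math import sqrt
from statistics import mean

def r_squared(truth, estimate):
    """Squared Pearson correlation between two equal-length sequences."""
    mt, me = mean(truth), mean(estimate)
    cov = sum((t - mt) * (e - me) for t, e in zip(truth, estimate))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_e = sum((e - me) ** 2 for e in estimate)
    return cov * cov / (var_t * var_e)

def rmse(truth, estimate):
    """Root mean squared error of the estimate relative to the truth."""
    return sqrt(mean([(e - t) ** 2 for t, e in zip(truth, estimate)]))

# Hypothetical genome-wide ("true") ancestry values.
truth_major = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]          # wide spread
truth_minor = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]  # narrow spread

# Identical error pattern for both components (magnitude 0.02, uncorrelated with truth).
errors = [0.02, -0.02, -0.02, 0.02, 0.02, -0.02, -0.02, 0.02]
est_major = [t + e for t, e in zip(truth_major, errors)]
est_minor = [t + e for t, e in zip(truth_minor, errors)]

print("major: R2 = %.3f, RMSE = %.3f" % (r_squared(truth_major, est_major),
                                         rmse(truth_major, est_major)))
print("minor: R2 = %.3f, RMSE = %.3f" % (r_squared(truth_minor, est_minor),
                                         rmse(truth_minor, est_minor)))
```

The two components have the same RMSE by construction, yet the narrowly varying component has a much lower R2, which is the pattern described above for the minor ancestral components.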
The small systematic errors in the estimation of ancestry with AIMs are likely due to the bounding of ancestry proportions at 0 and 1. The most a minor ancestral component can be underestimated is equal to its true value (for example, an ancestral estimate of 4% can at most be underestimated by 4%, if it is estimated to be 0%), but it can be overestimated much more substantially. Conversely, the major ancestral component cannot be overestimated by more than the difference between 100% and its true value, but it can be significantly underestimated. This effect is most notable in with African ancestry in Mexicans from Mexico City, where the bounding is visible as what appears to be a line with a slope of −1 that forms the lower limit of error estimates for ancestry proportions less than 0.05. The slight increase in noise from AIMs panels compared to genome-wide estimates should then result in overestimates of minor components and underestimates of major components, consistent with observation.
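A minimal sketch of the bounding argument follows. It assumes that the effect of the constraint can be approximated by clipping a noisy estimate to the interval [0, 1]; ancestry-estimation software enforces the constraint differently, so this is only an illustration. The noise values and the function name `clipped_errors` are hypothetical.

```python
from statistics import mean

def clipped_errors(true_p, noise):
    """Errors (estimate minus truth) when noisy estimates are clipped to [0, 1]."""
    return [min(1.0, max(0.0, true_p + n)) - true_p for n in noise]

# Symmetric, zero-mean noise (hypothetical values).
noise = [-0.06, -0.03, 0.0, 0.03, 0.06]

for true_p in (0.04, 0.50, 0.96):
    errs = clipped_errors(true_p, noise)
    # The error can never fall below -true_p or rise above 1 - true_p.
    print("true = %.2f  min error = %+.3f  max error = %+.3f  mean error = %+.4f"
          % (true_p, min(errs), max(errs), mean(errs)))
```

For a true proportion of 0.04 the most negative possible error is -0.04, which traces the slope of −1 lower limit when plotted against the true proportion. The mean error is positive for the minor component, negative for the major component, and zero for a component far from either bound.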
We used the panel of AIMs to genotype 373 individuals from 18 Latin American populations. The samples were very diverse, and included individuals from several indigenous groups, African descendants and Mestizos from five different countries. Generally speaking there is strong concordance between ethnicity and admixture estimates. Specifically, seven out of eight indigenous samples showed a high degree of Native American ancestry. In particular, the four isolated groups from Venezuela (Warao, Panare and Pemon from the Amazon and Wayu from the northwestern region of Venezuela) showed very little evidence of European or African admixture. The three indigenous groups from Colombia (Coyaima, Pastos and Awa) had average Native American proportions higher than 80%, and a relatively small European contribution. That our AIMs panel could effectively estimate ancestry in lowland South American Native American populations (such as those in Venezuela) despite the fact that our AIMs were derived from Mesoamerican and Andean populations is reassuring and demonstrates that our strategy of excluding markers with significant heterogeneity ensures the generalizability of the markers. The indigenous Wichi from Argentina had considerably lower Native American ancestry and higher European ancestry (0.41 and 0.54, respectively) than the indigenous groups from Venezuela and Colombia. This is consistent with a recent study of Y-chromosomes that found widespread European paternal ancestry among Amerindian groups, including the Wichi, in Argentina [31]. Interestingly, we observed cryptic and previously unreported European admixture in the two isolated Indigenous populations from Southern Colombia, a fairly common phenomenon in Native American populations [32].
In Bolivia, we found that the individuals from the Departments of Beni, Cochabamba and the Altiplano region of the La Paz Department had, on average, high Native American contributions. However, in the subtropical area of Yungas, many of the individuals recruited in the small community of Tocaña and one of the individuals recruited in the nearby town of Coroico had high African ancestry (median = 0.78, 0.74: 0.80). The subtropical Yungas region is home to several scattered Afro-Bolivian communities. These Afro-Bolivians are the descendants of African slaves who were brought to work on the Potosi mines and coca plantations [33]. Our data indicate that the admixture process in this Afro-Bolivian community has been primarily with the indigenous groups living in this region (median Native American ancestry = 0.13, 0.09: 0.20; median European ancestry = 0.04, 0.02: 0.06).
Two additional groups of African descent were included in this study, the Mulaló and Chocó from Colombia. African slaves were brought to Colombia early during the colonial period for gold mining, sugar cultivation, and cattle ranching. The proportion of African ancestry in these two Afro-Colombian groups was slightly lower than in the Afro-Bolivian community (0.54, 0.46: 0.69 in the Mulaló and 0.76, 0.64: 0.83 in the Chocó). Unlike the Afro-Bolivian sample from the Yungas region, in which most of the non-African contribution came from indigenous groups, the two Afro-Colombian samples had similar European and Native American ancestral contributions (). This highlights the diverse history of admixture in different areas within Latin America. Similar observations have been reported by Castro de Guerra and colleagues [34], [35], who compared two African-derived populations in Venezuela and found that one, the Patenemos, showed mostly European ancestry, while the other population, Ganga, was principally admixed with Native American ancestry. We estimated that the time since admixture in the three samples of African descent is approximately 6 to 7 generations, corresponding to between 174 and 203 years, indicating that the admixture process in these groups has been relatively recent. Although the point estimates of the years since admixture fall approximately 50 to 100 years after the time when slaves were introduced into the region for gold and silver mining, the credible intervals are wide, so our estimates are not inconsistent with the historical record [36].
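The conversion from generations to years above can be checked with a short calculation. The generation time of 29 years is an inference from the reported figures (174 / 6 and 203 / 7 both equal 29) and is not stated explicitly in the text; the function name `generations_to_years` is illustrative.

```python
# Assumed generation time, inferred from 174 / 6 = 29 and 203 / 7 = 29.
YEARS_PER_GENERATION = 29

def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a number of generations since admixture into calendar years."""
    return generations * years_per_generation

for g in (6, 7):
    print(g, "generations ->", generations_to_years(g), "years")
```

A different assumed generation time would shift the estimated dates proportionally, which adds to the uncertainty already reflected in the wide credible intervals.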
Our samples from the four Mestizo populations from Chile, Colombia, and Venezuela showed a wide variability in the ancestral proportions, though the primary ancestral contributions were European and Native American. Only some of the subjects from Maracaibo, on the Caribbean coast of Venezuela, had greater than 10% African ancestry, as did some of the Puerto Rican subjects used to validate the AIMs. This is unsurprising, given that the rest of our Mestizo populations are from Mexico, Chile and the Northwest of Colombia, areas where the slave trade was not prominent. This is consistent with the findings of Wang et al., who examined thirteen Mestizo populations in Latin America and found extensive variation in Native American and European ancestry and relatively low levels of African ancestry [37]. We estimated between eight and thirteen generations since admixture for the Mestizo samples, corresponding to between 230 and 375 years, reflecting the earlier settlement of substantial contingents of Europeans in Colombia than in Chile [38].
One striking finding in this paper is the rich ancestral variation in the Americas, even within a single country. For example, among the six Colombian populations examined (three Native American populations, one Mestizo population, and two Afro-Colombian populations), median Native American ancestry varied between 0.13 in the Chocó and 0.86 in the Coyaima, African ancestry varied between 0.02 in the three Amerindian populations and 0.74 in the Chocó, and European ancestry varied between 0.09 in the Coyaima and 0.52 in the Colombian Mestizos. Likewise, even among the Bolivians in a single administrative department (state), there was a wide variation in African and Native American ancestry (). These patterns of variation in ancestry within small regions seem to be a common feature across the Americas and have also been recently found on the island of Puerto Rico [3]. This has broad implications for genetic association studies in Latin American subjects, as there is a strong potential for population stratification, even in samples from a single country or a single administrative region within a country, and emphasizes the importance of incorporating ancestry estimates into future genetic association studies in these populations. We anticipate the primary use of this panel of AIMs will be to control for population stratification in genetic association and medical genetic studies. Thus, the ability of our panel of AIMs to effectively control for population stratification, as evidenced by its ability to reduce the genomic inflation factor in a highly stratified study of Type II diabetes in Mexican subjects, is an important source of validation. Even small subsets of AIMs from the panel adequately control for population stratification, suggesting that the panel should adequately cope with the significant patterns of variation in ancestry seen in Latin Americans. Nonetheless, because the panel of markers is not designed to identify within-continent heterogeneity, it is possible that it may not adequately control for finer population substructure.
In summary, we have developed and validated a panel of 446 AIMs to estimate European, Native American and African admixture proportions. The markers were selected to have low heterogeneity within continents, in order to be portable throughout the Americas. This panel was specifically designed to provide accurate individual admixture estimates and to control for the effects of population stratification in association studies in admixed populations. The use of this panel will minimize the risk of false positives in candidate gene studies, or in research efforts designed to replicate signals identified in genome-wide association studies, even in studies with substantial population stratification.
Our analysis of subsets of this panel has shown that to successfully control for population stratification in association studies, panels with 314, 194 and even 88 AIMs provide adequate estimates of the ancestral proportions with greatest variance that are strongly correlated with the genome-wide estimates (R2 of 0.9 or higher) and have mean absolute error under 5%. Panels with 314, 194 and 88 AIMs all adequately controlled for the effects of population stratification in the Mexico City sample. The inflation factor (lambda) was reduced from 1.40 when using sex and age as covariates, to less than 1.04 when incorporating ancestry estimates based on genome-wide data and panels of 314, 194 and 88 AIMs, and reasonable control for population stratification could be achieved with even smaller panels.
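For readers unfamiliar with the inflation factor, the sketch below shows one standard way to compute it from association p-values: the median of the observed 1-degree-of-freedom chi-square statistics divided by the expected median under the null (about 0.455). The p-values are synthetic and the function names are illustrative; this is not the analysis pipeline used in the study.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the 75th percentile of the standard normal.
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2

def p_to_chi2(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    return _STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2

def genomic_inflation(p_values):
    """Genomic inflation factor: observed median chi-square over expected median."""
    return median(p_to_chi2(p) for p in p_values) / EXPECTED_MEDIAN_CHI2

# Synthetic p-values: evenly spread (no inflation) versus skewed toward zero.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
inflated_p = [p ** 1.5 for p in null_p]

print("lambda under the null:  %.3f" % genomic_inflation(null_p))
print("lambda with inflation:  %.3f" % genomic_inflation(inflated_p))
```

A value close to 1 indicates that the test statistics follow the null distribution, whereas values well above 1, such as the 1.40 reported without ancestry adjustment, suggest confounding such as population stratification.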
There are several important limitations to our AIMs panel. It is important to point out that the density of the markers in this panel is inadequate for admixture mapping, although the enclosed Python script could be used to identify a sufficient number of AIMs to perform an admixture mapping study [39]. Several research groups have already made available denser genome-wide panels of AIMs for admixture mapping in African Americans [40], [41], [42] and Hispanics [43], [44], [45], although none of these panels was designed for admixture models including three ancestral populations. The AIMs were selected for their information content on African, European and Native American ancestry. These have been the major population groups contributing ancestry in the Americas since the 15th century. However, in many locations within the Americas, the history of human migration and admixture has been extremely complex, and has involved other population groups, such as East Asians and South Asians [46]. This panel of AIMs should be applied cautiously to populations (or individuals) with such complex admixture histories. Finally, while the panel has been validated to study the history of recent admixture in Latin America, it is unlikely to be effective in inferring finer scale population history.
As with all panels of AIMs, our panel is vulnerable to ascertainment bias, because the AIMs were selected to maximize the difference in continental ancestral allele frequencies. However, there are several factors that minimized the impact of this bias. First, we had a large sample size of all ancestral groups, particularly the European populations. Since the standard error of the estimate of allele frequency is inversely proportional to the square root of the number of individuals, the large sample sizes minimize the standard error in allele frequency estimates. Secondly, we used multiple populations within each continental group, and excluded any markers that showed large amounts of heterogeneity among ancestral groups within each continent. Thus, samples biased in one population (due to chance or genotyping error) are likely to have been filtered out. Finally, when we applied our panel to new populations, it produced credible ancestry estimates, which compare favorably to ancestry ascertained from genomewide data not subject to ascertainment bias.
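The sample-size argument can be illustrated with the usual binomial approximation for the standard error of an allele frequency, sqrt(p(1 − p) / 2N) for N diploid individuals. The sketch assumes independent sampling of the 2N chromosomes; the sample sizes and the function name `allele_freq_se` are illustrative and are not taken from the study.

```python
from math import sqrt

def allele_freq_se(p, n_individuals):
    """Binomial standard error of an allele frequency estimated from diploid samples.

    Assumes the 2 * n_individuals sampled chromosomes are independent draws.
    """
    return sqrt(p * (1.0 - p) / (2 * n_individuals))

# Quadrupling the number of individuals halves the standard error.
for n in (50, 200, 800):
    print("N = %4d  SE = %.4f" % (n, allele_freq_se(0.5, n)))
```

Because the standard error shrinks only with the square root of the sample size, large ancestral samples are needed before chance fluctuations in allele frequency become negligible relative to the between-continent differences used to select the markers.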
This panel is intended to be an important resource for the community, and we have provided the source code for the algorithm used to generate the AIMs, as well as allele frequency data and anonymized ancestral African, European, and shuffled Native American genotype information. We hope that investigators can use the selected panel of AIMs, which can be easily genotyped on readily available platforms, as a cost-effective tool to estimate continental ancestry in modern populations of the Americas.