Small Ancestry Informative Marker Sets Distinguish Major Population Groups
A set of 128 SNPs selected on the basis of informativeness (In) between four continental groups (European, Amerindian, West African and East Asian) passed our initial quality filters (see Methods). This informative marker set (designated 128 In4) was first evaluated using Fst as a general measure of the ability to separate continental population groups. The markers showed large Fst differences between the continental populations and relatively small differences within large groups of disparate individuals within these continental groups (). The South Asian group, not used in the marker selection, showed substantial differences from the European group, consistent with previous observations that this sub-continental group is distinct (Yang, et al., 2005). In addition, there was a larger intercontinental difference among the Amerindian groups, as previously observed (Price, et al., 2007; Tian, et al., 2007).
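As a rough illustration of the kind of Fst comparison described above, the following Python sketch computes a simple Nei-style Fst between two populations from per-SNP allele frequencies. It is not necessarily the estimator used in this study (see Methods); the function name and the example frequencies are hypothetical.

```python
def fst_two_pops(freqs_a, freqs_b):
    """Nei-style Fst for two populations at biallelic SNPs.

    freqs_a, freqs_b: per-SNP frequencies of the same allele in each population.
    Returns the ratio of summed (H_T - H_S) to summed H_T across SNPs.
    This is a simplified illustration, not the estimator used in the paper.
    """
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2.0
        h_total = 2.0 * p_bar * (1.0 - p_bar)                       # pooled heterozygosity
        h_within = (2.0 * pa * (1.0 - pa) + 2.0 * pb * (1.0 - pb)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical frequencies: a highly differentiated marker versus an uninformative one.
fst_informative = fst_two_pops([0.9], [0.1])
fst_uninformative = fst_two_pops([0.3], [0.3])
```

Markers chosen for high informativeness would be expected to give values near the upper end of this scale between continental groups and values near zero within them.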
| Table 1. Summary of Fst values between and within Ancestry Groups |
Population structure analyses using a Bayesian cluster analysis (STRUCTURE) showed a clear distinction between the continental population groups when the number of clusters was set to four (K=4). The 128 In4 set consistently identified diverse individuals corresponding to European, West African, Amerindian, and East Asian population groups (, ). Adding a further cluster (K=5) also allowed the identification of individuals from another genetically distinguishable population, corresponding to a South Asian sub-continental group ().
| Table 2. Summary of Population Structure Results Using Markers Selected by Informativeness Between Four Continental Populations |
The ability of smaller sets of In4 markers (96, 64, 48 and 24) to discern population genetic structure was also examined. In each case, the smaller set consisted of the highest-ranking In4 SNPs (Supplementary Table S1; see Supplementary Table S2 for additional summary information). The individual estimation of continental ancestry was nearly identical when 128, 96 or 64 In4 markers were used (e.g. compare ). A summary of all the results shows that as few as 24 In4 markers could identify the same general population clusters (). Specifically, for both West African and European ancestry the results are very consistent, with similar population proportion estimates seen even when comparing the 128 In4 and 24 In4 results. For the Amerindian and East Asian continental population groups there is a modest fall-off in concordance with self-identification as the number of markers decreases. For example, the cluster membership that corresponds best to self-identified Amerindian ancestry (pop 4) decreased from 0.94 (128 In4) to 0.88 (24 In4) (). The fall-off is more pronounced for the estimated contribution from pop 5 (corresponding to South Asian background) in the South Asian population (0.75/0.68/0.70/0.59/0.55). The increased uncertainty for the South Asian contribution may be explained by the relatively low Fst values between South Asian and European/East Asian populations observed for the In4 markers (), which in turn reflect the selection criteria (see Methods).
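The ranking step behind the nested marker sets can be sketched as follows. This assumes the informativeness-for-assignment statistic (In) in its standard form for K equally weighted source populations, In = Σ_j [ -p_j ln p_j + Σ_i (p_ij / K) ln p_ij ], where p_ij is the frequency of allele j in population i and p_j is its mean across populations. The function names, SNP identifiers and frequencies are hypothetical, and the actual selection (see Methods) may apply additional criteria.

```python
import math


def informativeness(freqs):
    """I_n for one biallelic SNP.

    freqs: frequency of one allele in each of K populations (K = 4 for In4).
    Assumes equal weighting of the K populations.
    """
    k = len(freqs)
    total = 0.0
    # Loop over both alleles of the SNP.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        p_mean = sum(allele_freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for p in allele_freqs:
            if p > 0:
                total += (p / k) * math.log(p)
    return total


def top_markers(freq_table, n):
    """Return the n SNP identifiers with the highest I_n (hypothetical helper)."""
    ranked = sorted(freq_table, key=lambda snp: informativeness(freq_table[snp]),
                    reverse=True)
    return ranked[:n]


# Hypothetical allele frequencies in four continental groups.
example = {
    "snpA": [0.95, 0.05, 0.90, 0.10],
    "snpB": [0.50, 0.50, 0.50, 0.50],
    "snpC": [0.80, 0.30, 0.60, 0.40],
}
best_two = top_markers(example, 2)
```

Because each smaller set is taken from the top of the same ranking, the 24-marker set is a subset of the 48-marker set, and so on up to the full 128.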
Population structure analyses of different population groups are also influenced by which subjects are included. When the subject set is limited to individuals of particular self-identified backgrounds, the results show more distinct cluster assignments. This is illustrated () when East Asian and South Asian subjects are excluded from the analyses and the number of assumed population groups is set to three (K=3). In addition, small numbers of markers chosen using other criteria may distinguish well between two or three population groups but provide inaccurate information on population groups not included in the selection. The performance of subsets of markers selected using either European/West African informativeness or European/Amerindian informativeness is provided in Supplementary Table S3.
Ability to Exclude Subjects of Disparate Ancestry for Specific Studies
One practical use of continental AIMs is to identify sets of individuals corresponding to a particular continental group. The ability of the In4 sets to exclude subjects from the different self-identified groups is summarized () using the predominant population group cluster membership as the standard for each continent. Two criteria, 10% non-membership and 15% non-membership, are shown. In general, the 128 In4 AIMs and the smaller sets showed nearly complete exclusion of individuals with other self-identified ancestries when considering any of the continental groups. However, for the European group, there was a large decrease in the performance of smaller marker sets (<64 markers) with respect to exclusion of South Asian subjects.
| Table 3. Comparison of the Ability of AIMs to Distinguish Different Continental Populations |
For both Amerindians and East Asians, the exclusion criteria used in these analyses would also exclude a relatively large number of subjects of these specific ancestries. For example, a 10% non-Amerindian exclusion criterion would exclude 17% of the Amerindian subjects using 128 In4. While this result is probably partially due to European admixture, there is also some difficulty in fully resolving Amerindian (AMI) and East Asian (EAS) ancestry at this level. This issue is less severe when the criterion is set at 15% non-membership but is much more problematic when smaller In4 marker sets are used (). Nevertheless, investigators can use these criteria to improve analyses by excluding most subjects of disparate ancestry, regardless of whether they result from incorrect self-identification or from mislabeling of samples.
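The exclusion criteria described above amount to a threshold on each subject's membership in the predominant cluster for the target group. A minimal Python sketch follows, assuming the cluster memberships have already been extracted from the STRUCTURE output into a dictionary; the function name, subject IDs, cluster labels and values are all hypothetical.

```python
def apply_exclusion(memberships, target_cluster, max_non_membership=0.10):
    """Return IDs of subjects retained for the target continental group.

    memberships: {subject_id: {cluster_label: membership proportion}}
    A subject is retained when membership in target_cluster is at least
    1 - max_non_membership (0.10 and 0.15 are the two criteria in the text).
    """
    threshold = 1.0 - max_non_membership
    return [sid for sid, q in memberships.items()
            if q.get(target_cluster, 0.0) >= threshold]


# Hypothetical cluster memberships for three subjects.
example = {
    "s1": {"AMI": 0.95, "EUR": 0.05},
    "s2": {"AMI": 0.88, "EUR": 0.12},
    "s3": {"AMI": 0.05, "EAS": 0.95},
}
kept_10 = apply_exclusion(example, "AMI", 0.10)
kept_15 = apply_exclusion(example, "AMI", 0.15)
```

In this toy example the stricter 10% criterion drops subject s2, who would be retained at 15%. This mirrors the trade-off noted in the text between excluding subjects of other ancestries and retaining self-identified members of the target group.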
Use of Ancestry Informative Markers for Admixture Studies
Another major use of continental AIMs is in admixture studies. The differences in admixture proportions estimated using the 128 In4 AIMs are illustrated () and summarized () for African American (AFA), Mexican American (MAM), Mexican (MXN) and Puerto Rican (PRA) population groups. These STRUCTURE results, like those for the continental populations, are robust and yield consistent admixture proportions across multiple runs using appropriate analysis parameters (see Methods). The results also show that the overall admixture proportions of these groups (AFA, MAM, MXN and PRA) can be ascertained with small numbers of In4 AIMs.
To further evaluate how consistently different marker subsets can estimate individual admixture, we examined the correlation of ancestry assignments. Using the 128 In4 results as the standard, we compared the estimated contribution of one of the ancestral parental populations to each of three different admixed populations: the West African contribution in AFA, the European contribution in PRA, and the Amerindian contribution in MAM and MXN. The latter two groups (MAM and MXN) were combined since their admixture proportions are similar. Marker sets chosen for their optimum ability to discriminate between four ancestral populations (In4 sets) and between two ancestral populations (In2 sets) were examined (). The correlation values (r²) for the West African contribution in AFA are high, ranging from 0.988 for 96 In4 to 0.835 for 24 In4, suggesting that small numbers of markers are sufficient to identify the West African contribution. Similar results in AFA were also observed using the marker sets selected specifically to distinguish European and West African ancestry (e.g. 0.976 for 48 In2 European/West African). As anticipated, the markers chosen for European/Amerindian differences did not accurately distinguish European/African admixture.
For the Amerindian contribution in MAM and MXN, the correlation values using In4 markers were also strong but did show a discernible decrease when 48 or 24 In4 markers were examined. For the In2 AIMs optimized for European/Amerindian differences, the results showed stronger correlations (e.g. 0.798 for 48 In2 European/Amerindian versus 0.733 for 48 In4). Similar results are also shown for the European contribution in PRA; however, the correlations were markedly lower. The correlations for the European contribution in the PRA population were 0.877, 0.587, 0.560, and 0.519 for 96 In4, 64 In4, 48 In4 and 24 In4, respectively.
The low correlation between estimates of the European contribution in PRA may be explained by the fact that three ancestral populations (European, Amerindian, and West African) make substantial contributions to the PRA population. This is unlike AFA and MAM/MXN, each of which has two main contributing ancestral populations: West African and European in AFA, and Amerindian and European in MAM/MXN. Using r² >0.8 as a threshold for high correlation, any of the In4 sets should be acceptable for estimating the West African contribution in AFA; the 128 In4, 96 In4 and 64 In4 sets are sufficient for the Amerindian contribution in Mexican and Mexican American populations; and the 128 In4 and 96 In4 sets should provide sufficiently accurate information for the European contribution in PRA.
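The comparison described above can be sketched in Python: per-individual ancestry estimates from a reduced marker set are correlated with those from the 128 In4 standard, and the squared Pearson correlation is compared with the 0.8 threshold. The helper name is hypothetical; the dictionary of r² values reproduces the figures reported above for the European contribution in PRA.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists of estimates."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# r2 values reported in the text for European contribution in PRA,
# each relative to the 128 In4 estimates.
reported_pra = {"96 In4": 0.877, "64 In4": 0.587, "48 In4": 0.560, "24 In4": 0.519}

# Marker sets meeting the r2 > 0.8 threshold used in the text.
acceptable = [name for name, r2 in reported_pra.items() if r2 > 0.8]
```

With the reported values, only the 96 In4 set passes the threshold for PRA, in line with the recommendation above.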
To further measure the precision of the ancestry estimation of individual subjects in admixed populations, we examined the 90% confidence intervals. For each individual, the 90% Bayesian confidence interval was taken from the STRUCTURE output. For each set of AIMs, the average size of this confidence interval was then calculated (). Comparison of these results shows how the size of the individual confidence intervals depends on the number of markers and on the admixed population being analyzed. These confidence limits show that in studies of AFA, smaller sets can still provide good precision in individual admixture measurement. However, for MAM/MXN, relatively larger numbers of AIMs are required. The confidence limits are smaller when In2 marker sets optimized for the particular admixed population are used. However, the 96 In4 and 128 In4 sets appear to perform very well in each of the admixed groups.
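The summary statistic described above is simply the mean width of the per-individual intervals. A minimal Python sketch follows, assuming the lower and upper bounds have already been extracted from the STRUCTURE output; the function name and the example bounds are hypothetical.

```python
def mean_interval_width(intervals):
    """Average width of per-individual 90% intervals.

    intervals: list of (lower, upper) bounds on one ancestry proportion,
    one pair per individual.
    """
    return sum(upper - lower for lower, upper in intervals) / len(intervals)


# Hypothetical bounds for three individuals under two marker sets.
larger_set = [(0.70, 0.90), (0.10, 0.30), (0.45, 0.65)]
smaller_set = [(0.55, 0.95), (0.00, 0.40), (0.35, 0.75)]

width_larger = mean_interval_width(larger_set)
width_smaller = mean_interval_width(smaller_set)
```

A narrower mean width indicates more precise individual estimates, which is the basis for comparing marker sets within each admixed population.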
| Table 4. Summary of Confidence Intervals Using Different Marker Sets |
The ability to exclude subjects of other continental ancestry in admixed populations was also examined (Supplementary Table S4). For AFA, nearly all individuals of non-West African or European ancestry could be excluded at the 15% exclusion criterion while retaining nearly all of the subjects of self-identified AFA ancestry using 64 or more In4 AIMs. However, for the MAM/MXN subjects, much looser criteria (>30% non-Amerindian or European ancestry) were necessary to include >90% of self-identified MAM/MXN subjects, even with 96 In4 AIMs. This is probably due to the small West African contribution present in the MAM/MXN populations, which requires a larger number of AIMs to define this admixture component well.
Performance of AIM sets in Association Studies
As another assessment of the performance of the AIM sets, we examined whether these AIMs could correct for false-positive association results in models of population-specific disease susceptibility loci. Using 200K genotypes from the I-control database and additional genotypes available from other ongoing studies (see Methods), we designated specific genotypes as disease surrogates and identified true (located in a close genetic position to the modeled SNP) and false (unlinked) associated SNPs. These population sets included genotypes for each of the 128 In4 AIMs, since each is included within the Illumina 300K array. Three disease gene models were specified using surrogate phenotypes defined by SNPs in strong LD with 1) a nonsynonymous substitution in SLC24A5 on chromosome 15 that is under strong positive selection in Europeans, 2) the lactase persistence (lactose tolerance) phenotype on chromosome 2 that is under strong positive selection in northern European populations, and 3) a nonsynonymous coding variant in ADH1B that is under positive selection in East Asian populations (see Methods for additional details).
The surrogate phenotypes were specified in a sample set of 865 individuals primarily from three disparate continental populations: European (254 subjects), East Asian (283 subjects) and African (as represented by 328 African American subjects). In addition, the phenotype defined by SLC24A5 was examined in 1847 African American subjects. For each of the phenotypes examined, both putative true positives (SNPs located close to the chromosomal position of the modeled genotype) and false positives (unlinked SNPs) were found with strong association (p <0.01 after Bonferroni correction) ( and Supplementary Table S5).
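The significance criterion used above (p <0.01 after Bonferroni correction) can be sketched in Python as follows. The function name, SNP labels and p-values are hypothetical; in a genome-wide analysis the number of tests would be on the order of the number of SNPs examined rather than the three shown here.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag SNPs whose Bonferroni-adjusted p-value is below alpha.

    p_values: {snp_id: unadjusted p-value}; the number of tests is len(p_values).
    """
    n_tests = len(p_values)
    return {snp: min(1.0, p * n_tests) < alpha for snp, p in p_values.items()}


# Hypothetical unadjusted p-values for three SNPs.
flags = bonferroni_significant({"snp1": 1e-9, "snp2": 0.004, "snp3": 0.5})
```

Here only snp1 remains significant after adjustment: snp2 has an adjusted p-value of 0.012, which exceeds the 0.01 threshold.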
As expected, principal components analysis (PCA) using the entire 200K SNP set was effective in correcting the false-positive associations for each of the three surrogate phenotypes examined in mixed population sets ( and Supplementary Table S5). The 128 In4 and 96 In4 AIM sets were nearly as effective in correcting the false-positive associations. Smaller In4 sets also corrected most of the false-positive results; however, these sets failed in some of the analyses, e.g. the false association for rs4871195 in the LCT model remained significant for 64 In4 and smaller sets. For the admixed AFA population group, similar results were observed ( and Supplementary Table S5). Here, the smallest set (24 In4) showed incomplete correction. Together, these analyses show that relatively small numbers of AIMs can correct for false-positive results in these Mendelian models.
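The principle behind this correction can be illustrated with a deliberately simplified Python sketch. It is not the PCA procedure used in this study: a single hypothetical ancestry covariate stands in for the principal components, and a plain correlation stands in for the association test. The genotype and phenotype are both regressed on the ancestry covariate and the association is recomputed on the residuals. All data and function names are invented for illustration.

```python
import math


def residualize(values, covariate):
    """Residuals of a simple linear regression of values on one covariate."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    return [v - (mean_v + slope * (c - mean_c)) for v, c in zip(values, covariate)]


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: two populations (ancestry 0 or 1) that differ in both
# allele frequency and phenotype mean, with no association within either one.
ancestry = [0, 0, 0, 0, 1, 1, 1, 1]
genotype = [0, 1, 0, 1, 1, 2, 1, 2]
phenotype = [0, 0, 1, 1, 2, 2, 3, 3]

raw_r = pearson_r(genotype, phenotype)
adjusted_r = pearson_r(residualize(genotype, ancestry),
                       residualize(phenotype, ancestry))
```

In this toy data set the pooled correlation is driven entirely by the difference between the two populations and vanishes once ancestry is accounted for. A true association at a linked SNP would, by contrast, be expected to persist within populations and survive the adjustment.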