To provide a resource for assessing continental ancestry in a wide variety of genetic studies we identified, validated and characterized a set of 128 ancestry informative markers (AIMs). The markers were chosen for informativeness, genome-wide distribution, and genotype reproducibility on two platforms (TaqMan® assays and Illumina arrays). We analyzed genotyping data from 825 subjects with diverse ancestry, including European, East Asian, Amerindian, African, South Asian, Mexican, and Puerto Rican. A comprehensive set of 128 AIMs and subsets as small as 24 AIMs are shown to be useful tools for ascertaining the origin of subjects from particular continents, and to correct for population stratification in admixed population sample sets. Our findings provide general guidelines for the application of specific AIM subsets as a resource for wide application. We conclude that investigators can use TaqMan assays for the selected AIMs as a simple and cost efficient tool to control for differences in continental ancestry when conducting association studies in ethnically diverse populations.
Analyses of population genetic structure have shown that continental population groups can be identified by examining differences in allele frequencies (Rosenberg, et al., 2005; Rosenberg, et al., 2002). Over the last several years studies have demonstrated that thousands of individual single nucleotide polymorphisms (SNPs) distributed throughout the genome have very large differences in allele frequencies between two or more continental populations (Mao, et al., 2007; Price, et al., 2007; Smith, et al., 2004; Tian, et al., 2007; Tian, et al., 2006). These studies have set the framework for both admixture mapping and adjusting for population genetic structure in association testing. The latter is particularly important since differences in population genetic structure between cases and controls can confound SNP-disease associations, leading to false positive or negative findings (Campbell, et al., 2005; Clayton, et al., 2005; Freedman, et al., 2004; Helgason, et al., 2005; Marchini, et al., 2004). Methods to measure, and therefore address, differences in population structure in association testing have been developed (Epstein, et al., 2007; Hoggart, et al., 2003; Price, et al., 2006; Pritchard, et al., 2000b; Purcell, et al., 2007; Satten, et al., 2001). In the context of whole genome association (WGA) scans, these methods can be readily applied. However, for follow-up association studies to further define critical candidate regions in larger population sets, or for analyses of additional populations, a small set of ancestry informative markers (AIMs) is highly desirable.
While differences within continental populations, and population substructure, must also be considered (Bauchet, et al., 2007; Price, et al., 2008; Seldin, et al., 2006; Tian, et al., 2008), the larger difference in allele frequencies between continental populations potentially creates the greatest confounding problem in interpreting such association studies. At this point a large number of WGA studies have been conducted in populations of primarily or exclusively European ancestry. Thus, the issue of confounding by population stratification will become particularly evident as more genetic association studies are conducted among multiethnic, and therefore substantially admixed, populations in order to evaluate ethnic disparities in disease risk. Addressing these differences in population structure is particularly relevant for extending genetic associations to underserved minority groups that include substantial admixture between continents.
The current study was undertaken to provide a resource for determining and quantifying differences in continental populations using the smallest number of SNPs possible as a cost- and time-efficient strategy. Previous studies by both our group and others have shown that AIM sets of 200 markers or fewer can discern continental structure (Parra, et al., 2004; Salari, et al., 2005; Yang, et al., 2005). However, the use of such markers has been sporadic, the validation of many of the markers incomplete, and in some cases the markers have been limited to specific platforms that cannot be readily and inexpensively used by multiple laboratories. The current study, utilizing the widely used TaqMan® platform, provides a set of AIMs that distinguish continental groups and that can be widely applied to genetic studies. In addition, the application of AIMs depends in part on availability of genotypes. Our study also provides genotypes of continental populations as a research community resource. Most importantly, the current study shows both the value and limitations of using smaller subsets of AIMs by providing guidance in practical application.
DNA samples or genotypes used for population structure analyses were from 825 individuals that included: 128 European Americans (NYCPEA), 60 CEPH Europeans (CEU), 56 Yoruban African (YRI), 19 Bini West African, 23 Kanuri West African, 50 Mayan Amerindians, 26 Quechuan Amerindians, 29 Nahua Amerindians, 40 Mexican Americans (MAM), 26 Mexican (MXN), 28 Puerto Rican American (PRA), 43 Chinese (CHB), 43 Chinese American (CHAH), 43 Japanese (JPT), 8 Vietnamese American (VAH), 1 Korean American (KAH), 45 Filipino American (NYCPFA), 2 unspecified East Asian Americans (OEAS), 3 Japanese American (JAH), and 64 South Asian Indian Americans (SAS).
These populations were based on self-identified ethnic affiliation. The NYCPEA, NYCPFA and PRA were from New York City and were collected as part of the New York Cancer Project (Mitchell, et al., 2004). The Mayan samples were collected from two villages, Bola De Oro and Cienega Grande, from Chimaltenango, Guatemala (provided by G.S. and J.B.); the Quechuan individuals were from Peru (provided by J.B.); the Nahua were from central Mexico (provided by M.EAR); the MXN were from Mexico City (provided by ME.AR.); the MAM and AFA were from California; and the CHAH, VAH, KAH, and SAS were from Houston (provided by J.B.). For the West African samples, the Bini are a Niger-Congo group of Bantu speakers from Edo State and the Kanuri are a group of Nilo-Saharan speakers from the Lake Chad region of northern Nigeria (provided by R.K.). The CEU and YRI were HapMap panel genotypes (Altshuler, et al., 2005) and the JPT and CHB were from the iControlDB (www.illumina.com/iControlDB, Illumina, San Diego, CA).
Additional genotypes used in modeling studies included 1) EURNIHLN genotypes (254 subjects) that were available from the NIH Laboratory of Neurogenetics at the Coriell Queue website, 2) East Asian genotypes from the iControlDB (198 subjects), 3) East Asian samples (85) genotyped at North Shore and 4) African American genotypes (1847 subjects) from the iControlDB. For the modeling studies we limited the genotypes to autosomal SNPs that were typed in >95% of each of the included subjects and that were in HWE (p>0.001) within a given self-identified group and in combined samples from a given continent.
The subjects studied were all healthy and not first-degree relatives of each other based on self-reporting. All DNA and blood samples were obtained according to protocols and informed-consent procedures approved by institutional review boards, and were labeled with an anonymous code number.
Fst was determined using Genetix software (Belkhir, et al., 2001) that applies the Weir and Cockerham algorithm (Weir and Cockerham, 1984). This algorithm defines Fst as (MSP-MSG)/[MSP+(nc-1)MSG], where MSP denotes the observed mean square errors for loci between populations, MSG denotes the mean square errors for loci within populations, and nc is the average sample size across populations corrected for its variance. The pairwise Fst values thus provide a measurement of inter-population genetic variance in comparison to intra-population genetic variance. Hardy-Weinberg (HW) equilibrium was examined using an exact test implemented in the FINETTI software that can be accessed interactively at the internet address provided in the Web Resources section. Population admixture proportions were determined using the Bayesian clustering algorithms developed by Pritchard and implemented in the program STRUCTURE v2.1 (Falush, et al., 2003; Pritchard, et al., 2000a). Informativeness between multiple population groups was determined using the In algorithm (Rosenberg, et al., 2003).
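As a rough illustration of the variance-ratio form of Fst given above, the following sketch computes a single-locus, allele-based value from per-population allele frequencies and allele counts. It is a simplified ANOVA-style estimator that ignores the within-individual variance component of the full Weir and Cockerham method, so it is not a substitute for Genetix; the function name and input layout are hypothetical.

```python
def fst_single_locus(freqs, n_alleles):
    """Simplified allele-based Fst = (MSP - MSG) / (MSP + (nc - 1) * MSG).

    freqs     -- frequency of one allele in each population
    n_alleles -- number of sampled alleles (2 x individuals) per population
    Assumption: the within-individual component of the full
    Weir and Cockerham estimator is ignored.
    """
    r = len(freqs)
    total = sum(n_alleles)
    # Sample-size-weighted mean allele frequency across populations.
    p_bar = sum(n * p for n, p in zip(n_alleles, freqs)) / total
    # Mean square between populations.
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(n_alleles, freqs)) / (r - 1)
    # Mean square within populations.
    msg = sum(n * p * (1 - p) for n, p in zip(n_alleles, freqs)) / (total - r)
    # Average sample size corrected for variation in sample size.
    nc = (total - sum(n * n for n in n_alleles) / total) / (r - 1)
    denom = msp + (nc - 1) * msg
    return (msp - msg) / denom if denom > 0 else 0.0
```

With 80 subjects (160 alleles) per group, a SNP fixed for different alleles in two groups gives a value of 1, while identical frequencies give a value slightly below 0, as expected for an unbiased variance-ratio estimator.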
For STRUCTURE, unless otherwise noted in the results, each analysis was performed without any prior population assignment and was performed at least 3 times with similar results using > 10,000 replicates and 5000 burn-in cycles under the admixture model. For analyses using smaller marker sets (24 and 48 markers) longer runs were necessary to achieve similar results on multiple run comparisons. For 24 and 48 marker sets, 50,000 replicates and 10,000 burn-in cycles were used with the exception of 24 markers selected using In4 (four population informativeness). For these analyses, 100,000 replicates and 20,000 burn-in cycles were necessary. For all analyses reported we used the “infer α” option with a separate α estimated for each population (where α is the Dirichlet parameter for degree of admixture). Runs were performed under the λ = 1 option where λ parameterizes the allele frequency prior and is based on the Dirichlet distribution of allele frequencies.
Fst, In and allele frequencies were determined using sets of 80 subjects representing European (EURA), West African (AFR), Amerindian (AMI) and East Asian (EAS) ancestry. These included the following distribution of subjects: 1) EURA, CEPH (17 subjects), NYCPEA (63 subjects); 2) AFR, YRI (45 subjects), Bini (17 subjects), and Kanuri (18 subjects); 3) AMI, Mayan (38 subjects), Nahua (23 subjects), and Quechuan (19 subjects); and 4) EAS, HCB (15 subjects), Filipino (16 subjects), diverse ethnic Chinese American (25 subjects), JPT (15 subjects), Japanese American (1 subject), Korean American (1 subject), and Vietnamese American (7 subjects).
For modeling studies, association tests were performed using the EIGENSTRAT statistical package (Price, et al., 2006). False discovery rate statistics (Devlin and Roeder, 1999) were determined using HelixTree 5.0.2 software (Golden Helix, Bozeman, MT, USA).
TaqMan® SNP genotyping assays were developed for each of the SNPs used in the current study (Supplementary Table S1) and are commercially available (Applied Biosystems, Foster City, CA; cf. www.allsnps.com). Assays were performed with the TaqMan Genotyping Master Mix, using conditions recommended by the manufacturer, on an ABI 7900 Sequence Detection System (Applied Biosystems, Foster City, CA).
For the current studies the deCODE (Kong, et al., 2002) genetic map was used. The position of each SNP was determined by interpolation using markers that were both on the genetic map and for which an unambiguous physical map position was available in NCBI build 35. Any markers that were not in the same relative order in both the genetic and physical maps were omitted as anchors for the interpolation of the genetic positions of the SNPs.
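The interpolation step described above can be sketched as simple linear interpolation between flanking anchor markers. This is a minimal illustration, assuming the anchors have already been filtered so that their order agrees in the genetic and physical maps; the function name and the example coordinates are hypothetical.

```python
from bisect import bisect_right

def interpolate_cm(pos_bp, anchors):
    """Estimate the genetic position (cM) of a SNP at physical position pos_bp.

    anchors -- list of (physical_bp, genetic_cM) pairs for one chromosome,
               sorted by physical position and consistent in order on both maps.
    """
    bps = [bp for bp, _ in anchors]
    if not bps[0] <= pos_bp <= bps[-1]:
        raise ValueError("position lies outside the anchor range")
    i = bisect_right(bps, pos_bp)
    if i == len(bps):
        # The SNP sits exactly on the last anchor.
        return anchors[-1][1]
    (bp0, cm0), (bp1, cm1) = anchors[i - 1], anchors[i]
    # Linear interpolation between the two flanking anchors.
    return cm0 + (cm1 - cm0) * (pos_bp - bp0) / (bp1 - bp0)
```

For example, with anchors at 1 Mb (1.0 cM) and 2 Mb (3.0 cM), a SNP at 1.5 Mb is placed at 2.0 cM.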
The SNPs chosen for inclusion were based on two large sets of previous genotyping results in our laboratory (Tian, et al., 2007; Tian, et al., 2006) and were limited to those SNPs that overlapped with the 300K genome-wide Illumina SNP array. 250 SNPs were chosen by selecting the best SNP in each 10 cM deCODE bin that met the criteria of a large allele frequency difference (>45%) between EURA and AMI groups and a small allele frequency difference (<5%) between two disparate AMI groups (Pima and Mayan). Similarly, 250 SNPs with large frequency differences (>45%) between African and European groups were selected. From these 500 SNPs we reduced the number for testing to 184 based on the following criteria: 1) in silico design criteria for TaqMan assays; 2) genome-wide distribution pattern (minimum inter-marker distance = 8 cM on deCODE map); and 3) EAS differences based on HapMap results in JPT and CHB. TaqMan® SNP genotyping assays were designed for the 184 SNPs and tested using DNA panels. Of these, 128 SNPs passed our quality filters: they demonstrated reproducible genotyping results in population samples of diverse origin, had >90% complete typing results in each population, and were in HW equilibrium (p>0.01) in the EURA group. A small number of SNPs were not in HW equilibrium in specific populations (2 SNPs in AFR, 3 SNPs in AMI, and 3 SNPs in EAS). These SNPs did not overlap between these groups and only 2 SNPs showed HW p<0.005. These SNPs were not excluded, because recent admixture in these self-identified ethnic groups could result in departure from HW. Summary information for the final set of 128 SNPs is provided in Supplementary Table S1.
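The bin-wise selection described above might be sketched as follows. This is only an illustration: the text does not define how the "best" SNP in a bin was ranked, so the sketch assumes it is the SNP with the largest EURA/AMI frequency difference, and it assumes the AMI frequency is the mean of the two AMI groups. The dictionary keys are hypothetical.

```python
def pick_bin_markers(snps, bin_cm=10.0, min_delta=0.45, max_within=0.05):
    """Pick one candidate AIM per genetic-map bin.

    snps -- iterable of dicts with hypothetical keys:
            'id', 'chrom', 'cm', 'f_eura', 'f_ami_a', 'f_ami_b'
    Assumptions: 'best' means the largest EURA/AMI frequency difference,
    and the AMI frequency is the mean of the two AMI groups.
    """
    best = {}
    for s in snps:
        delta = abs(s["f_eura"] - (s["f_ami_a"] + s["f_ami_b"]) / 2)
        within = abs(s["f_ami_a"] - s["f_ami_b"])
        # Require a large between-continent and a small within-continent difference.
        if delta <= min_delta or within >= max_within:
            continue
        key = (s["chrom"], int(s["cm"] // bin_cm))
        if key not in best or delta > best[key][0]:
            best[key] = (delta, s["id"])
    return sorted(snp_id for _, snp_id in best.values())
```

The same function could be reused for the African/European comparison by passing the relevant frequencies under the same keys.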
Subsets of the 128 marker set were chosen using the In algorithm (Rosenberg, et al., 2003) with the goal of finding the most informative markers distinguishing one or more of the following: 1) four continental populations EURA, AFR, AMI, and EAS; 2) three continental populations (EURA, AFR, and AMI); or 3) two continental populations (EURA and AFR or EURA and AMI). Each subset was determined using 80 subjects from each ethnic group (described in Statistical Methods) and marker selection was based on the most informative set for each analysis (provided in Supplementary Table S2).
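To make the ranking step concrete, the sketch below computes the informativeness for assignment statistic for a biallelic SNP, using the form I_n = Σ_j [ -p_j ln p_j + (1/K) Σ_i p_ij ln p_ij ], where p_ij is the frequency of allele j in population i and p_j is its average over the K populations, and then keeps the top-ranked markers. The function names and the frequency-table layout are hypothetical, and equal weighting of populations is assumed.

```python
import math

def informativeness(freqs):
    """I_n for one biallelic SNP.

    freqs -- frequency of one allele in each of K populations
             (populations are weighted equally).
    """
    k = len(freqs)
    total = 0.0
    # Loop over the two alleles of the SNP.
    for allele in (freqs, [1.0 - p for p in freqs]):
        mean_p = sum(allele) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        # 0 * ln(0) is taken as 0, so zero frequencies are skipped.
        total += sum(p * math.log(p) for p in allele if p > 0) / k
    return total

def top_markers(freq_table, n):
    """Return the n SNP ids with the highest I_n (freq_table: id -> list of frequencies)."""
    ranked = sorted(freq_table, key=lambda snp: informativeness(freq_table[snp]), reverse=True)
    return ranked[:n]
```

A SNP with identical frequencies in all populations scores 0, and a SNP fixed for different alleles in two populations scores ln 2, the maximum for two populations.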
To test whether a limited number of AIMs can correct for false positive results observed in case-control studies due to population stratification, we modeled three population-specific loci as disease phenotypes. The modeling was done in the following step-wise manner independently for each surrogate phenotype: 1) surrogate cases and controls (with available SNP genotypes on the Illumina 300K platform) were chosen on the basis of genotypes for a population-specific marker; 2) 200K SNPs that passed quality control filters in the surrogate case-control sample sets were tested for association using the HelixTree software package; 3) significantly associated markers (by Armitage χ2 test, χ2 ≥ 26.6, p ≤ 0.05 with Bonferroni correction for 200,000 tests) in or near the locus designating the surrogate phenotype were defined as true positive signals, while significantly associated SNPs outside the locus were defined as false positives; 4) six to ten SNPs with the strongest false positive associations and a similar number of true positive associations with χ2 values comparable to the false positives were selected for further analysis; 5) the genotypes for the chosen true and false positively associated markers were combined with genotypes for the markers in the selected sets (all 200K SNP markers, 128 In4, 96 In4, 64 In4, 48 In4, and 24 In4), and were tested for association, correcting for substructure by principal component analysis using EIGENSTRAT (Price, et al., 2006); 6) the positively associated markers were re-analyzed for association using correction for population stratification with an appropriate number of principal components (PC1 or PC2 depending on the study population, determined by the plateau of χ2 values).
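The significance threshold in step 3 can be checked with a short calculation. The sketch below assumes the Armitage trend statistic is referred to a χ2 distribution with 1 degree of freedom, for which the upper-tail probability equals erfc(sqrt(x/2)); the function names are hypothetical.

```python
import math

def chi2_1df_pvalue(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def bonferroni_chi2_threshold(alpha=0.05, n_tests=200_000):
    """Smallest 1-df chi-square value with p-value <= alpha / n_tests (by bisection)."""
    target = alpha / n_tests
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_1df_pvalue(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi
```

With 200,000 tests the per-test significance level is 0.05/200,000 = 2.5 × 10^-7, and the corresponding χ2 cut-off agrees with the value of 26.6 quoted above to one decimal place.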
The surrogate phenotypes were assigned based on SNPs selected from haplotype analyses of three regions that contained genes with strong ancestry association. The models chosen were for SLC24A5, the lactase gene (LCT) and ADH1B. SLC24A5, coding for a K gated Na/Ca exchanger, is located on chromosome 15 and plays a role in human skin pigmentation (Lamason, et al., 2005). That study provided evidence that a non-synonymous genetic substitution (rs1426654, A/G 111) is under strong positive selection in Europeans, with allele A nearly fixed in various European populations (98.7 to 100%), whereas allele G is present at 97 to 100% frequency in African and East Asian HapMap populations (Lamason, et al., 2005). Since genotypes for rs1426654 were not available in our dataset, individuals homozygous for allele A of rs2675348, in complete linkage disequilibrium (LD) with allele A of rs1426654 (r2 = 1.00 in HapMap CEU samples), were designated as surrogate cases, while individuals with A/G and G/G genotypes were designated as surrogate controls (the frequency of allele A is 1.0 in CEU, 0.5 in CHB, 0.589 in JPT, and 0.25 in YRI).
The second locus chosen for modeling a population-specific phenotype is LCT, located on chromosome 2. A variant within the LCT gene, rs4988235 (C/T -13910), is associated with lactase persistence, leading to the ability to digest milk in adults, and has been demonstrated to be under strong positive selection in Europeans (Bersaglieri, et al., 2004; Hamblin and Di Rienzo, 2000; Tishkoff, et al., 2007). Allele A is found at 0.75 frequency in HapMap European samples, but is absent in HapMap YRI, CHB and JPT samples. Since rs4988235 genotypes were not available for our sample set, allele A of rs1446585, a nearby SNP in strong LD with allele T of rs4988235 (r2 = 0.73 in HapMap CEU samples), was used for modeling. Individuals homozygous for allele A of rs1446585 were designated as surrogate cases, while individuals with A/G and G/G genotypes were designated as surrogate controls (the frequency of allele A is 0.792 in CEU, and 0.00 in CHB, JPT, and YRI).
The third locus is for the alcohol dehydrogenase ADH1B gene, where a nonsynonymous coding genetic variant rs1229984 (Arg47His) is reported to be under positive selection in East Asia (Han, et al., 2007). Allele A is found at 0.77 frequency in HapMap CHB and JPT, but is absent in CEU and YRI samples. Since genotypes for rs1229984 were not available for our sample set, allele A for rs10008281, a nearby SNP in strong LD with allele T for rs1229984 (r2=0.53 in HapMap CEU samples) was used for modeling. [Note: since the trait is modeled on the proxy SNP, the performance of AIM sets should be unaffected by the r2.] Individuals homozygous for allele A for rs10008281 were designated as surrogate cases, while individuals with A/G and G/G genotypes were designated as surrogate controls (allele A is 0.82 in CHB and 0.83 in JPT, and 0.28 in CEU and YRI).
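The case/control rule used for all three surrogate phenotypes (homozygotes for the proxy allele are cases, all other genotypes are controls) can be written as a small helper. This is a sketch only; the two-character genotype encoding and the use of "N" for missing calls are assumptions, not part of the original data format.

```python
def surrogate_status(genotype, case_allele="A"):
    """Assign a surrogate phenotype from a proxy SNP genotype.

    genotype -- two-character string such as 'AA', 'AG' or 'GG'
                (hypothetical encoding; 'N' marks a missing call)
    Returns 'case' for homozygotes of case_allele, 'control' for other
    called genotypes, and None when the genotype is missing.
    """
    if genotype is None or len(genotype) != 2 or "N" in genotype:
        return None
    return "case" if genotype == case_allele * 2 else "control"
```

For the SLC24A5 model, for example, this would be applied to the rs2675348 genotypes of each subject.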
A set of 128 SNPs selected on the basis of informativeness (In) between four continental groups (European, Amerindian, West African and East Asian) passed our initial quality filters (see Methods). This informative marker set (designated 128 In4) was first evaluated using Fst as a general measure of the ability to separate continental population groups. The markers showed large Fst differences between the continental populations and relatively small differences within large groups of disparate individuals within these continental groups (Table 1). The South Asian group, not used in the marker selection, showed substantial differences with the European group, consistent with previous observations that this sub-continental group is distinct (Yang, et al., 2005). In addition, there was a larger intracontinental difference among the Amerindian groups, as previously observed (Price, et al., 2007; Tian, et al., 2007).
Population structure analyses using a Bayesian cluster analysis (STRUCTURE) showed a clear distinction between the continental population groups when the number of clusters was defined at 4 (K=4). The 128 In4 set consistently identified diverse individuals corresponding to European, West African, Amerindian, and East Asian population groups (Fig. 1a, Table 2). Adding an additional cluster (K=5) also allowed the identification of individuals from another genetically distinguishable population, corresponding to a South Asian sub-continental group (Fig. 1b).
The ability of smaller sets of In4 markers (96, 64, 48 and 24) to discern population genetic structure was also examined. Here, the smaller sets were in each case the highest ranking In4 SNPs (Supplementary Table S1 and see Supplementary Table S2 for additional summary information). The individual estimation of continental ancestry was nearly identical when 128, 96 or 64 In4 markers were used (e.g. compare Fig. 1c with 1a). A summary of all the results shows that as few as 24 In4 markers could in fact identify the same general population clusters (Table 2). Specifically, for both West African and European ancestry the results are very consistent, with similar population proportion measurements seen even when comparing 128 In4 with 24 In4 results. For the Amerindian and East Asian continental population groups there is a modest fall-off in the concordance with self-identification as the number of markers decreases. For example, the cluster membership that corresponds best to self-identified Amerindian ancestry (pop 4) decreased from 0.94 (128 In4) to 0.88 (24 In4) (Table 2). However, the difference is more pronounced for the estimated contribution from pop5 (corresponding to South Asian background) in the South Asian population (0.75/0.68/0.70/0.59/0.55). The increased uncertainty for South Asian contribution may be explained by the relatively low Fst values between South Asian and European/East Asian populations observed for the In4 markers (Table 1), which in turn reflect the selection criteria (see Methods).
The population structure analyses of different population groups are also influenced by which subjects are included. When the subject set is limited to only those individuals of particular self-identified backgrounds the results show more distinct cluster assignments. This is illustrated in Fig. 1d when East Asian and South Asian subjects are excluded from the analyses and the number of assumed population groups is defined as three (K=3). In addition, small numbers of markers chosen using other criteria may provide good distinction between two or three population groups but provide inaccurate information on other non-included population groups. The performance of subsets of markers selected using either European/West African informativeness or European/Amerindian informativeness is provided in Supplementary Table S3.
One practical aspect of utilizing continental AIMs is to identify sets of individuals corresponding to a particular continental group. The ability of In4 sets to exclude subjects from the different self-identified groups is summarized in Table 3 using the predominant population group cluster membership as the standard for each continent. Two criteria, 10% non-membership and 15% non-membership, are shown. In general, the 128 In4 AIMs and smaller sets showed nearly complete exclusion of individuals with other self-identified ancestries when considering any of the continental groups. However, for Europeans, there was a large decrease in the performance of smaller marker sets (<64 markers) with respect to exclusion of South Asian subjects.
For both Amerindians and East Asians the exclusion criteria used in these analyses also would result in excluding a relatively large number of subjects of these specific ancestries. For example, 10% non-Amerindian exclusion would result in excluding 17% of the Amerindian subjects using 128 In4. While this result is probably partially due to European admixture, there also is some difficulty in fully resolving AMI and EAS ancestry at this level. This issue is less severe when the criterion is set at 15% non-membership but is much more problematic when smaller In4 marker sets are used (Table 3). Nevertheless, investigators can use these criteria to improve analyses by excluding most subjects of disparate ancestry, regardless of whether their inclusion resulted from incorrect self-identification or from mislabeling of samples.
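The 10% and 15% non-membership criteria discussed above amount to a threshold on each individual's estimated cluster membership. A minimal sketch, assuming the per-individual membership proportions have already been parsed from the clustering output into a dictionary (the layout and function name are hypothetical):

```python
def apply_exclusion(memberships, cluster, max_nonmember=0.10):
    """Split samples by membership in one continental cluster.

    memberships   -- dict: sample id -> list of cluster proportions (summing to ~1)
    cluster       -- index of the cluster taken as the standard for the continent
    max_nonmember -- largest tolerated membership outside that cluster
    Returns (kept, excluded) lists of sample ids.
    """
    kept, excluded = [], []
    for sample, q in memberships.items():
        nonmember = 1.0 - q[cluster]
        (kept if nonmember <= max_nonmember else excluded).append(sample)
    return kept, excluded
```

Calling the function with `max_nonmember=0.15` gives the looser of the two criteria; comparing the two `kept` lists shows how many self-identified members would be lost at the stricter threshold.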
Another major use of continental AIMs is in admixture studies. The differences in admixture proportions estimated using the 128 In4 AIMs are illustrated in Fig. 1 and summarized in Table 2 for African American (AFA), Mexican American (MAM), Mexican (MXN) and Puerto Rican (PRA) population groups. These results using STRUCTURE, similar to those with continental populations, are robust and yield consistent admixture proportions in multiple runs using appropriate analysis parameters (see Methods). The results also show that the overall admixture proportions of these groups (AFA, MAM, MXN and PRA) can be ascertained with small numbers of In4 AIMs.
In order to further evaluate how consistently different marker subsets can estimate individual admixture, we examined the correlation of ancestry assignments. Using the 128 In4 results as the standard, we compared the estimated contribution of one of the ancestral parental populations contributing to each of three different admixed populations. These include West African contribution in AFA, European contribution in PRA, and Amerindian contribution in MAM and MXN. The latter two groups (MAM and MXN) were combined since the admixture proportions are similar. Marker sets chosen for their optimum ability to discriminate between four ancestral populations (In4 sets) and two ancestral populations (In2 sets) were examined (Fig. 2). The correlation values (r2) for West African contribution in AFA are high, ranging between 0.988 for 96 In4 and 0.835 for 24 In4, suggesting that a small number of markers is sufficient to identify West African contribution. Similar results in AFA were also observed using the marker sets selected specifically to distinguish European and West African (e.g. 0.976 for 48 In2 European/West African). As anticipated, the markers chosen for European/Amerindian differences did not accurately distinguish European/African admixture.
For Amerindian contribution in MAM and MXN the correlation values using In4 markers were also strong but did show a discernible decrease when 48 or 24 In4 markers were examined. For the In2 AIMs optimized for European/Amerindian differences, the results showed stronger correlations (e.g. 0.798 for 48 In2 European/Amerindian versus 0.733 for 48 In4). Similar results are also shown for the European contribution in PRA; however, the correlations were markedly lower. The correlations for European contribution in the PRA population were 0.877, 0.587, 0.560, and 0.519 for 96 In4, 64 In4, 48 In4 and 24 In4, respectively.
The low correlation between estimates for European contribution in PRA may be explained by the fact that three ancestral populations, Europeans, Amerindians, and West Africans, have substantial contributions in the PRA population. This is unlike AFA and MAM/MXN, where there are two main contributing ancestral populations, West African and Europeans, and Amerindian and Europeans. Using r2 >0.8 as a threshold for high correlation, any of the In4 sets should be acceptable to estimate West African contribution in AFA, 128 In4, 96 In4 and 64 In4 are sufficient for Amerindian contribution in Mexican and Mexican American populations, and 128 In4 and 96 In4 sets should provide sufficiently accurate information for European contribution in PRA.
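The r2 comparison used above can be reproduced from two vectors of per-individual ancestry estimates, one from the 128 In4 reference analysis and one from a smaller marker set. The sketch below computes the squared Pearson correlation; the function name and the 0.8 acceptance check are illustrative only.

```python
def r_squared(reference, subset):
    """Squared Pearson correlation between two lists of ancestry estimates
    for the same individuals (for example 128 In4 versus 48 In4)."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_s = sum(subset) / n
    sxy = sum((a - mean_r) * (b - mean_s) for a, b in zip(reference, subset))
    sxx = sum((a - mean_r) ** 2 for a in reference)
    syy = sum((b - mean_s) ** 2 for b in subset)
    return sxy * sxy / (sxx * syy)

def is_acceptable(reference, subset, threshold=0.8):
    """Apply the r2 > 0.8 rule of thumb for 'high correlation'."""
    return r_squared(reference, subset) > threshold
```

Both inputs must list the same individuals in the same order, and each must show some variation, otherwise the correlation is undefined.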
To further measure the precision of the ancestry estimation of individual subjects in admixed populations, we examined the 90% confidence intervals. For each individual the 90% Bayesian confidence interval was measured (STRUCTURE output). For each set of AIMs, the average size of this confidence interval was then calculated (Table 4). Comparison of these results shows the decrease in individual confidence intervals based on the number of markers and the dependency on the admixed population being analyzed. These confidence limits show that in studies of AFA, smaller sets can still provide good precision in individual admixture measurement. However, for MAM/MXN relatively larger numbers of AIMs are required. The confidence limits are smaller when In2 marker sets optimized for the particular admixed population are used. However, the 96 In4 and 128 In4 sets appear to perform very well in each of the admixed groups.
The ability to exclude subjects of other continental ancestry in admixed populations was also examined (Supplementary Table S4). For AFA, nearly all individuals of non-West African or European ancestry could be excluded at the 15% exclusion criterion while maintaining nearly all of the subjects of self-identified AFA ancestry using 64 or more In4 AIMs. However, for the MAM/MXN subjects much looser criteria (>30% non-Amerindian or European ancestry) were necessary to include >90% of self-identified MAM/MXN even with 96 In4 AIMs. This is probably due to the small West African contribution present in the MAM/MXN populations, which requires a larger number of AIMs to enable good definition of this admixture component.
As another assessment of the performance of the AIM sets, we examined whether these AIMs could correct for false positive association results in models for population-specific disease susceptibility loci. Using 200K genotypes from the iControlDB database and additional genotypes available from other ongoing studies (see Methods), we specified specific genotypes as disease surrogates and identified true (located in a close genetic position to the modeled SNP) and false (unlinked) associated SNPs. These population sets included genotypes for each of the 128 In4 AIMs since each is included within the Illumina 300K array. Three disease gene models were specified using the surrogate phenotypes defined by SNPs in strong LD with 1) a nonsynonymous genetic substitution in SLC24A5 on chromosome 15 under strong positive selection in Europeans, 2) a lactase persistence variant on chromosome 2 that is under strong positive selection in northern European populations and 3) a nonsynonymous coding variant in ADH1B under positive selection in East Asian populations (see Methods for additional details).
The surrogate phenotypes were specified in a sample set of 865 individuals primarily from three disparate continental populations: European (254 subjects), East Asian (283 subjects) and African (as represented by 328 African American subjects). In addition, the phenotype defined by SLC24A5 was examined in 1847 African American subjects. For each of the phenotypes examined, both putative true positives (SNPs located close to the chromosomal position of the modeled genotype) and false positives (unlinked SNPs) were found with strong association (p <0.01 after Bonferroni correction) (Fig. 3 and Supplementary Table S5).
As expected, principal components analysis (PCA) using the entire 200K SNP set was effective in correcting the false positive associations for each of the three surrogate phenotypes examined in mixed population sets (Fig. 3a, b, c and Supplementary Table S5). The 128 In4 and 96 In4 AIM sets were nearly as effective in correcting the false positive associations. Smaller In4 sets also corrected most of the false positive results; however, these sets failed on some of the analyses, e.g. the false association for rs4871195 in the LCT model remained significant for 64 In4 and smaller sets. For the admixed AFA population group, similar results were observed (Fig. 3d and Supplementary Table S5). Here, the smallest set (24 In4) showed incomplete correction. Together, these analyses show that relatively small numbers of AIMs can correct for false positive results in these Mendelian models.
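The idea behind this correction can be illustrated with a deliberately reduced example. The sketch below is not EIGENSTRAT: it assumes a single, precomputed ancestry axis (for example the first principal component of the AIM genotypes), removes its linear effect from both genotype and phenotype, and forms the statistic (N - K - 1) × r2 with K = 1, which is our understanding of the form described by Price et al. (2006). All names are hypothetical.

```python
def residualize(values, axis):
    """Remove the least-squares fit of `values` on one ancestry axis.
    The axis must vary across samples."""
    n = len(values)
    mean_v = sum(values) / n
    mean_a = sum(axis) / n
    saa = sum((a - mean_a) ** 2 for a in axis)
    beta = sum((a - mean_a) * (v - mean_v) for a, v in zip(axis, values)) / saa
    return [v - mean_v - beta * (a - mean_a) for a, v in zip(axis, values)]

def adjusted_chi2(genotypes, phenotypes, axis):
    """Association statistic after adjusting genotype and phenotype for one axis.
    Computed as (N - K - 1) * r^2 with K = 1 adjustment axis."""
    g = residualize(genotypes, axis)
    y = residualize(phenotypes, axis)
    sgg = sum(v * v for v in g)
    syy = sum(v * v for v in y)
    if sgg == 0 or syy == 0:
        return 0.0
    r = sum(a * b for a, b in zip(g, y)) / (sgg * syy) ** 0.5
    return (len(g) - 1 - 1) * r * r
```

When genotype and phenotype differ between two populations but are unrelated within each population, the adjusted statistic falls to zero, which is the behavior expected for the false positive SNPs in the models above; a SNP that tracks the phenotype within populations retains its signal.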
The current study was undertaken to provide researchers with a set of validated AIMs for distinguishing continental populations. We believe that the results provide strong confidence that these 128 In4 AIMs and subsets of these SNPs can be used for characterizing sample sets from diverse population groups. These markers can be applied either to identify those individuals from a particular study that are members of one continental population group or alternatively used to adjust for population stratification due to differences in continental population frequency in cases and controls. The former will reduce population heterogeneity that may also correspond to reducing genetic heterogeneity for specific traits. The latter can, as shown in our modeling studies, allow the reduction or elimination of false positive results.
Our analyses provide guidelines for application, especially with regard to using the program STRUCTURE (Falush, et al., 2003; Pritchard, et al., 2000a). Other computational programs including ADMIXMAP (Hoggart, et al., 2004) can also be applied with very similar results (data not shown). In general, as indicated in the Methods section, the performance of smaller AIM SNP sets in STRUCTURE analyses is only consistently reproducible when very large numbers of iterations are used. This is not a major limitation, since computational time is modest when small sets of markers are used even with large sample sets; several thousand samples will require <24 hours for 100,000 replicates using STRUCTURE and 48 markers. However, smaller marker sets (especially those <64) provide a poorer ability to exclude subjects of disparate continental ancestry and will provide less precision in the individual ancestry assignment. For larger studies (sample sizes of several thousand) the precision of individual assignments will be less consequential than for smaller studies in which the investigation will be more dependent on the accurate assessment of ancestry of each individual. Thus the choice of the number of SNP AIMs depends on the populations being studied as well as practical aspects of genotyping. However, as shown in our study, the 96 In4 SNP AIMs perform well for each of the potential applications with only a very modest reduction of potential information compared with the 128 In4 set. Even smaller numbers perform adequately in particular situations but may require additional confidence in the prior information, i.e. confidence in self-identification of population membership.
A major application of SNP AIMs is to reduce false positives in association studies. For traits associated with continental ancestry, our modeling studies found that relatively small numbers of SNP AIMs (64 or more) could adequately adjust for differences in ancestry stratification between cases and controls. It is notable that without the use of AIMs we observed many false positives even when the loci used in the surrogate models were not in complete linkage disequilibrium with the true ancestry-associated trait (i.e. r² = 0.73 for model 2 and r² = 0.53 for model 3). This suggests that it is necessary to adjust for population structure for traits that are only partially associated with continental ancestry and underscores the importance of applying these or similar methods when subjects of mixed ancestry are studied. Our modeling studies also examined the use of AIMs in association tests for an admixed population (African Americans). As with the subject sets containing individuals from multiple continents, these studies showed that relatively small numbers of highly informative SNP AIMs (64 or more) can adequately adjust for population substructure and eliminate false positive results. Additional studies will be necessary to determine the efficacy of these AIMs in more complex sample sets and other population groups.
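A small worked example shows how ancestry stratification alone can produce a false positive and how adjustment removes it. The sketch below is an illustration under stated assumptions, not the modeling procedure used in this study: it replaces continuous AIM-derived ancestry estimates with two discrete ancestry strata and uses a Cochran-Mantel-Haenszel (CMH) test as a stand-in for covariate adjustment. The allele counts are hypothetical, and treating alleles as independent observations is itself an assumption. Within each stratum the allele frequency is identical in cases and controls, so the SNP has no true effect.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction)
    for a list of 2x2 tables, one per ancestry stratum."""
    diff = 0.0
    var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n  # observed minus expected count
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return diff ** 2 / var

# Hypothetical allele counts per stratum:
# (case allele 1, case allele 2, control allele 1, control allele 2)
stratum_1 = (960, 240, 320, 80)   # allele 1 frequency 0.8; mostly cases
stratum_2 = (80, 320, 240, 960)   # allele 1 frequency 0.2; mostly controls

pooled = tuple(x + y for x, y in zip(stratum_1, stratum_2))
unadjusted = chi_square_2x2(*pooled)
adjusted = cmh_statistic([stratum_1, stratum_2])

print("unadjusted chi-square:", unadjusted)
print("ancestry-adjusted CMH chi-square:", adjusted)
```

With these counts the pooled test yields a chi-square of 288, far above the 3.84 threshold for nominal significance at 1 degree of freedom, whereas the stratified statistic is 0. In real sample sets ancestry is continuous rather than discrete, so ancestry estimates would typically enter the association test as covariates instead.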
The identification of the ancestry groups using non-hierarchical clustering algorithms, or for that matter PCA, is enhanced by the inclusion of representatives of the parental population groups. The analyses performed in the current study included representatives of the different continental groups. Their inclusion is particularly important when admixed populations are being examined and, even without specifying population membership, allows more accurate cluster separation. In general, and specifically for the studies reported herein, we did not specify population membership, an available option in the STRUCTURE program. (Similar results are obtained using this option but with larger confidence intervals; data not shown.) To facilitate the appropriate application of the AIMs described in this study, the genotypes of the continental population groups are provided as a resource to the scientific community (Supplementary Table S6).
Finally, for each of the SNP AIMs used in the current study a TaqMan® SNP genotyping assay is readily available (Supplementary Table S2). We also note that each of the SNPs is part of the Illumina 300K array, which should enable inspection and use of the genotypes that are provided in the I-control database. A summary of the information for each SNP is provided in Supplementary Tables S2 and S6. In addition, since many researchers may wish to use a smaller AIM set, we have optimized a panel of 96 SNPs for which robust TaqMan assays are available as a cost-effective format (see Supplementary Table S1 footnote b).
This work was supported by NIH grants AR050267, DK071185, and P30 CA093373, and by Applied Biosystems. The researchers would also like to thank the Swedish Research Council for its support to MEAR and the Christine Landgraf Memorial Research Fund (LMB).
P.W. and F.D.L.V. declare competing financial interests.