Admixture mapping is a powerful gene mapping approach [5]. However, its power relies on the ability of ancestry informative markers (AIMs) to infer ancestry along the chromosomes of individuals from previously separated but recently admixed populations. In a recent paper from ASHG, Royal et al. (2010) [29] outlined the challenges, opportunities and implications of genetic ancestry inference. Over 40 companies provide genetic ancestry testing to the public. However, these companies differ in their approaches, the types of ancestry markers used and the tests they offer. The promise of utilizing genetic ancestry information to advance medical genomics depends on our ability to infer ancestry correctly and precisely using informative markers.
Several methods have been proposed to measure the ancestry informativeness of markers and to choose a panel of markers to be genotyped while maintaining the power to detect ancestral chromosome segments at each genomic location. The choice of measure should depend on how efficiently each one selects the most ancestry informative markers. However, there is no consensus as to which criteria to use to select markers for ancestry inference or admixture mapping, and the performance of these methods has not been carefully evaluated and compared. The rule of thumb is to select markers with large allele frequency differences between ancestral populations. However, the number of markers required for population assignment will depend on the populations under consideration, their respective level of genetic differentiation and the desired stringency of assignment [30]. For instance, in humans, the level of genetic variation between populations is only 5%-10%, whereas in dogs it is about 27% [31]. As a result, the number and types of markers required for individual assignment and discrimination among populations differ between the populations or species under consideration [30]. Previous studies selected markers based on different datasets and marker types using only one method at a time, and there has not yet been a formal comparison of the performance of these methods. There is therefore a need to compare all methods using the same datasets and to evaluate their efficiency and accuracy in estimating ancestral proportions for admixed populations.
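As a rough illustration of how three of the measures respond to allele frequency differences, the sketch below computes δ, FST and In for a single biallelic SNP in two populations. It rests on assumptions not stated in this section: δ is taken as the absolute allele frequency difference, FST as Wright's heterozygosity-based form with equally weighted populations, and In as the informativeness for assignment computed with natural logarithms. The function names are hypothetical, and the estimators actually used in the study may differ (for example, in sample-size corrections). FIC and SIC are omitted because they additionally require an assumed ancestral proportion.

```python
import math


def delta(p1, p2):
    """Absolute allele frequency difference between two populations."""
    return abs(p1 - p2)


def fst(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP.

    Assumes two populations weighted equally (an illustrative choice).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # monomorphic in both populations: no information
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def _xlogx(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness_in(p1, p2):
    """Informativeness for assignment (I_n), biallelic SNP, two populations."""
    total = 0.0
    for a1, a2 in ((p1, p2), (1 - p1, 1 - p2)):  # loop over the two alleles
        mean = (a1 + a2) / 2
        total += -_xlogx(mean) + (_xlogx(a1) + _xlogx(a2)) / 2
    return total


# Hypothetical frequencies: a strongly differentiated SNP and a weak one.
for p1, p2 in ((0.9, 0.1), (0.55, 0.45)):
    print(p1, p2, delta(p1, p2), fst(p1, p2), informativeness_in(p1, p2))
```

Under these definitions a SNP fixed for different alleles in the two populations reaches the maximum of each measure (δ = 1, FST = 1, In = ln 2), and a SNP with identical frequencies scores zero on all three.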
In this study, we applied five different analytic tools to evaluate the concordance of selected informative SNPs using the same dataset. Using the 500 top-ranked markers for each measure and requiring a physical distance of at least 100 kb between consecutive AIMs, we observed the following overlap between the different measures: δ vs FST (n = 479), δ vs FIC (n = 220), δ vs SIC (n = 319), δ vs In (n = 424), FST vs FIC (n = 230), FST vs SIC (n = 329), FST vs In (n = 445), FIC vs SIC (n = 395), FIC vs In (n = 258), and SIC vs In (n = 354) (Additional file 13, Table S7). On average, the overlap of each measure with the other four was 361, 371, 276, 349, and 370 for δ, FST, FIC, SIC, and In, respectively. FIC had the least overlap with the other measures. However, based on the current cutoff values used for each measure, the δ measure included a number of loci that were not selected by the remaining four methods. FST, FIC, and In gave relatively smaller and similar AIM panels, whereas SIC gave a very small panel of AIMs (Additional file 14, Figure S7). Analyses based on deciles showed that the sets of SNPs selected for admixture mapping at the highest or lowest SNP information content were highly similar across the different measures of informativeness. Towards the middle of the informativeness scales, the agreement among the sets of SNPs selected by different methods to discriminate between populations decreased (Figure , , and ). Furthermore, FIC and SIC were more likely to pick the same set of SNPs, as were δ, FST, and In, and FIC was more likely to choose SNPs that were not chosen by the other measures.
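The average overlaps quoted above can be checked directly from the ten pairwise counts. The short sketch below does this arithmetic; the count of 445 is taken to be the FST vs In pair, which is the only pair not otherwise listed, and the helper names are illustrative only.

```python
import math

MEASURES = ["delta", "FST", "FIC", "SIC", "In"]

# Pairwise overlap among the top 500 markers of each measure (Table S7 counts
# as quoted in the text; 445 is assumed to be the FST vs In pair).
PAIRWISE_OVERLAP = {
    ("delta", "FST"): 479,
    ("delta", "FIC"): 220,
    ("delta", "SIC"): 319,
    ("delta", "In"): 424,
    ("FST", "FIC"): 230,
    ("FST", "SIC"): 329,
    ("FST", "In"): 445,
    ("FIC", "SIC"): 395,
    ("FIC", "In"): 258,
    ("SIC", "In"): 354,
}


def mean_overlap(measure):
    """Mean overlap of one measure with the other four."""
    counts = [n for pair, n in PAIRWISE_OVERLAP.items() if measure in pair]
    return sum(counts) / len(counts)


def round_half_up(x):
    """Round to nearest integer, with .5 rounded up (unlike Python's round)."""
    return math.floor(x + 0.5)


averages = {m: round_half_up(mean_overlap(m)) for m in MEASURES}
print(averages)
```

With this reading of the counts, the computed means (360.5, 370.75, 275.75, 349.25 and 370.25) round to the reported values of 361, 371, 276, 349 and 370, and FIC has the lowest mean overlap.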
Analytically, the FIC and SIC measures require pre-defined ancestral proportions in an admixed population, whereas FST, δ, and In do not. We ran a sensitivity analysis, using arbitrary ancestral proportions for the CEU and YRI populations, to study the impact of ancestral proportion on the choice of informative markers. Compared with FIC, SIC was less sensitive to the proportion of ancestry contribution in the selection of AIMs (Additional file 15, Table S8 and Additional file 16, Table S9). The proportion of ancestry had virtually no effect on the selection of the top 1% of AIMs for SIC. For two proportions of ancestry contribution differing by no more than 0.1, the sets of AIMs selected by FIC were 56%-71% in common and those selected by SIC were 75%-100% in common. Therefore, when using FIC, it is important to have a good a priori estimate of the proportion of ancestry contribution.
Some of these measures have limitations. For example, FIC favors the selection of markers that are closer to fixation in one parental population and may not be appropriate for assessing marker informativeness when there are more than two ancestral populations [32]. Compared with other methods such as FST, δ is easy to calculate and independent of mutation and model assumptions. However, δ has the major limitation of being useful only for admixed populations derived from two parental populations, and it does not account for multiallelic loci. FST may not be appropriate for assessing the level of genetic information in SNP markers when the number of populations is > 2, as the method could result in the selection of SNP markers that are specific to the single most genetically distinct population. SNP markers that are specific to only the most distinct population are expected to have low heterozygosity, whereas genetic markers with high expected heterozygosity are informative and therefore useful in individual assignment analysis [33]. Although most researchers traditionally focus on global axes of variation in a dataset, substantial information about population ancestry exists locally, across chromosomes. Adjusting for global ancestry alone may lead to false positives when chromosomal (local) population ancestry is an important confounding factor [34]. In a recent chromosome-based study by Baye (2011) [35], fine-scale substructure was detectable beyond the broad population-level classifications that had previously been explored using genome-wide average estimates. The study of population ancestry in terms of local ancestry has broader practical relevance because genetic diversity is directly related to recombination rate (meiosis), which differs among regions of the genome, and genes are not randomly distributed along chromosomes. The current analytical approach using genome-wide average estimates will control for confounding due to global ancestry but not for confounding due to local ancestry, because the global ancestry information is obtained from all markers across the genome and may not accurately reflect local ancestry variation. It is becoming increasingly important to recognize local ancestry variation, especially when populations have been recently admixed [35]. Future studies should focus on the application of these measures to important genomic regions.
Many factors affect the accuracy of the estimation of ancestry contributions, including but not limited to sample size, the panel of AIMs used, the number of AIMs used, and the underlying distribution of ancestry contribution of the individuals in the sample. The use of a phased HapMap dataset allowed us to simulate individuals that share common founding populations. Moreover, the ancestry proportion for each simulated individual is known, which allows true and estimated individual admixture values to be compared and thus enables the different methods to be compared by estimation accuracy. Our findings indicate that the different measures of marker informativeness [δ, FST, FIC, SIC, and In] performed well, and that as few as the top 20 ranked informative markers were adequate for accurate classification of ancestral populations. This is in agreement with the commonly made claim in the literature on marker selection for population assignment that 'classification accuracy can be substantially improved if only a subset of loci is used in the assignment test' [36]. For instance, Lao et al. (2006) [37] found that 10 SNP markers from a 10 K SNP array contained enough genetic information to differentiate individuals from Africa, Europe, Asia and America, and that no further gain in power of assignment was achieved by including more SNP markers. Indeed, it is generally considered that uninformative markers (i.e., monomorphic loci) may add variability and noise to the results and compromise the power of population genetic studies [38].
Although the marker selection methods explored in this study agreed to a large extent in identifying the most informative SNPs, there were differences in their performance in ancestry estimation. The simulation study revealed that In was the best in selecting the set of AIMs giving the smallest bias and mean square error in ancestry estimation. Analysis based on random subsets of top 1% to 10% ranked AIMs indicated that, compared to other methods, AIM panels selected by In behaved consistently and reasonably well for both the ASW population and simulated admixed populations. These results illustrate that effective exploration of all these methods can help to not only identify the most informative markers but also produce an optimal minimum set of markers that can accurately and efficiently differentiate among populations.
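To make the bias and mean square error criteria concrete, the sketch below simulates individuals with a known ancestry proportion, estimates that proportion from an AIM panel, and summarizes the estimation error. It is a deliberately simplified stand-in, not the simulation used in this study: alleles are drawn independently from the admixed allele frequency rather than from phased HapMap haplotypes, ancestry is estimated by a grid-search maximum likelihood under a two-population model, and the panel frequencies, seed and function names are all hypothetical.

```python
import math
import random


def simulate_genotypes(q, freqs, rng):
    """Genotypes (0, 1 or 2 allele copies) for one individual.

    q is the ancestry proportion from population 1; freqs holds
    (p1, p2) allele frequencies in the two ancestral populations.
    """
    genotypes = []
    for p1, p2 in freqs:
        p = q * p1 + (1 - q) * p2  # allele frequency in the admixed genome
        genotypes.append(sum(rng.random() < p for _ in range(2)))
    return genotypes


def log_likelihood(q, genotypes, freqs):
    """Binomial log-likelihood of the genotypes given ancestry proportion q."""
    ll = 0.0
    for g, (p1, p2) in zip(genotypes, freqs):
        p = q * p1 + (1 - q) * p2
        p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
        ll += g * math.log(p) + (2 - g) * math.log(1 - p)
    return ll


def estimate_ancestry(genotypes, freqs, grid=101):
    """Grid-search maximum likelihood estimate of q on [0, 1]."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: log_likelihood(q, genotypes, freqs))


def bias_and_mse(true_q, estimates):
    """Mean error (bias) and mean square error of the estimates."""
    errors = [e - true_q for e in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return bias, mse


rng = random.Random(1)           # fixed seed so the run is reproducible
panel = [(0.9, 0.1)] * 20        # hypothetical panel of 20 AIMs, delta = 0.8
true_q = 0.8
estimates = [
    estimate_ancestry(simulate_genotypes(true_q, panel, rng), panel)
    for _ in range(200)
]
bias, mse = bias_and_mse(true_q, estimates)
print(f"bias = {bias:.4f}, MSE = {mse:.4f}")
```

Repeating the run with panels chosen by different informativeness measures, and comparing the resulting bias and MSE, mirrors the kind of comparison described above, although real panels would have heterogeneous allele frequencies and linkage between markers.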
We suggest that the different measures may provide unique insights into a marker's informativeness under different scenarios, including varying ancestral proportions and the presence of more than two ancestral populations. To identify all potentially informative SNPs, results from all measures could be considered. For example, the union of the top 500 SNPs for all five measures could be considered as the best AIMs panel. Researchers need to be aware of the differences between the various methods for evaluating the ancestry informativeness of SNP markers. Furthermore, as we attempted in our simulation studies using either the average rank or the minimal rank across all five measures, combining information from more than one method may provide a reliable, although not necessarily optimal, means of selecting markers for ancestry inference. Further research on this topic may shed light on how best to integrate different measures to obtain the set of AIMs most effective for the populations under consideration. We believe that the information that a set of markers provides for assigning individuals to their source populations, or for discriminating among them, must be critically evaluated before investing millions of dollars in an admixture or ancestry-related project. We anticipate that identifying more complex patterns of ancestry will require exploration of these and newer methods, yet to be defined, to identify an optimal set of markers. This should become increasingly feasible as genotyping costs decrease and the available data on different populations grow. This in turn will allow the development of higher-resolution genogeographic and ethnic maps and help investigators design genetic association studies in stratified homogeneous groups.
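The rank-combination strategies mentioned above (average rank, minimal rank, and the union of top-ranked SNPs) can be sketched as follows. The marker identifiers, scores and function names are hypothetical, ties are broken alphabetically for reproducibility, and the study's own implementation may have handled ties and missing markers differently.

```python
def rank_markers(scores):
    """Map marker -> rank (1 = most informative) from marker -> score."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {marker: i + 1 for i, marker in enumerate(ordered)}


def combine_ranks(score_tables, how="average"):
    """Order markers by average or minimal rank across several measures.

    score_tables maps measure name -> {marker: score}. Only markers
    scored by every measure are kept.
    """
    if how not in ("average", "min"):
        raise ValueError("how must be 'average' or 'min'")
    rank_tables = [rank_markers(s) for s in score_tables.values()]
    shared = set.intersection(*(set(r) for r in rank_tables))
    combined = {}
    for marker in shared:
        ranks = [r[marker] for r in rank_tables]
        combined[marker] = sum(ranks) / len(ranks) if how == "average" else min(ranks)
    return sorted(combined, key=lambda m: (combined[m], m))


def union_of_top(score_tables, k):
    """Union of the k top-ranked markers from each measure."""
    panel = set()
    for scores in score_tables.values():
        ranks = rank_markers(scores)
        panel.update(m for m, r in ranks.items() if r <= k)
    return panel


# Hypothetical scores for three markers under three measures.
tables = {
    "delta": {"rs1": 0.80, "rs2": 0.60, "rs3": 0.10},
    "FST": {"rs1": 0.64, "rs2": 0.30, "rs3": 0.02},
    "In": {"rs1": 0.20, "rs2": 0.37, "rs3": 0.01},
}
print(combine_ranks(tables, how="average"))
print(combine_ranks(tables, how="min"))
print(sorted(union_of_top(tables, k=1)))
```

Average rank rewards markers that score consistently well across measures, whereas minimal rank and the union keep any marker that a single measure ranks highly, so the latter two tend to give larger, more inclusive panels.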