Conceived and designed the experiments: PD PP. Performed the experiments: PD JL PP. Analyzed the data: PD JL PP. Contributed reagents/materials/analysis tools: PD PP. Wrote the paper: PD PP.
Recent large-scale studies of European populations have demonstrated the existence of population genetic structure within Europe and the potential to accurately infer individual ancestry when information from hundreds of thousands of genetic markers is used. In fact, when genomewide genetic variation of European populations is projected down to a two-dimensional Principal Components Analysis plot, a surprising correlation with actual geographic coordinates of self-reported ancestry has been reported. This substructure can hamper the search for susceptibility genes for common complex disorders, leading to spurious correlations. The identification of genetic markers that can correct for population stratification therefore becomes of paramount importance. Analyzing 1,200 individuals from 11 populations genotyped for more than 500,000 SNPs (Population Reference Sample), we present a systematic exploration of the extent to which geographic coordinates of origin within Europe can be predicted with small panels of SNPs. Markers are selected to correlate with the top principal components of the dataset, as we have previously demonstrated. Performing thorough cross-validation experiments, we show that it is indeed possible to predict individual ancestry within Europe down to a few hundred kilometers from actual individual origin, using information from carefully selected panels of 500 or 1,000 SNPs. Furthermore, we show that these panels can be used to correctly assign the HapMap Phase 3 European populations to their geographic origin. The SNPs that we propose can prove extremely useful in a variety of different settings, such as stratification correction or genetic ancestry testing, and the study of the history of European populations.
The study of human population genetic structure and the selection of Ancestry Informative Markers (AIMs) have attracted considerable attention, mainly due to their implications for diverse areas of genetics and a variety of research scenarios, ranging from forensics to population genetics and medical genetics. Within the European continent, early studies of population genetic structure sought to address questions on the origin of different ethnic groups as well as the historic and genetic relationships among them. Indeed, studies of variation on the non-recombining portion of the Y chromosome supported the hypothesis of an initial settlement of Europe by Paleolithic hunter-gatherer communities, as well as the European re-colonization from glacial refugia in the South and, later, from a rapidly expanding farming population originating from Anatolia , , . The advent of large-scale genotyping allowed us to further explore these hypotheses and also revealed the practical implications of identifying and understanding European population genetic structure. In the search for susceptibility genes for common complex disorders, it became evident that population stratification within Europe does exist and that it can lead to spurious results when coupled with phenotype correlations with geography , .
With the volume of rich genotypic data rapidly increasing, Principal Components Analysis (PCA) emerged as a powerful technique that can be used to summarize and process the vast amounts of available information. PCA is a linear dimensionality reduction technique that can effectively extract the fundamental structure of a dataset without any need for modeling of the data. It has been used to decompose the complex genetic structure of human populations ,  and it can be successfully applied to infer genetic ancestry as well as substructure in a given sample , , , . Furthermore, as we have recently described, PCA can be applied to identify AIMs, which in this case represent SNPs that are correlated with significant Principal Components (PCA Informative Markers - PCAIMs) , . In fact, we demonstrated that small panels of such SNPs can successfully reproduce the structure of a dataset as identified by PCA, without any prior knowledge or hypothesis about the origin of studied individuals or artificial assignment of individuals to pre-defined clusters , .
Leveraging the power of PCA, recent large-scale studies have allowed us to appreciate the fact that population genetic structure within Europe is discernable at a fine scale, when information from hundreds of thousands of genetic markers that span the entire genome is used , , , , , . A number of recent studies analyzed thousands of individuals across Europe using information from genomewide genotypes and showed that the top two principal components capture a significant amount of variation across European populations , , , . These studies also demonstrated a surprising correlation of the top two principal components with longitude and latitude by showing that the two-dimensional PCA plot of genomewide genotypes yields patterns that are reminiscent of the geographic map of Europe , . This information can subsequently be used to place individuals within a few hundred kilometers of their reported origin .
In our work here we explore the extent to which geographic coordinates within Europe can be predicted based solely on information from small subsets of genetic markers. We investigate a subset of the Population Reference Sample (POPRES), comprising 1,200 individuals from 11 populations , . Using algorithmic tools that we have previously described, we select small subsets of Single Nucleotide Polymorphisms (SNPs) that correlate well with population structure, as captured by PCA , . This is the first study to systematically explore this question as a classification problem by performing thorough cross-validation experiments in order to assign individuals of “unknown” origin to specific geographic locations in Europe.
We analyzed a subset of the Population Reference Sample (POPRES) as described in  consisting of 1,387 samples. We focused on populations with at least 40 available samples, thus retaining 1,200 individuals from 11 populations. These samples have been genotyped using the Affymetrix 500K array. We kept 447,212 autosomal SNPs after removing markers with missing entries. We also analyzed the two European HapMap Phase 3 populations: CEPH Europeans (CEU) and Tuscans (TSI).
We computed PCA scores for each SNP using the algorithm of  and we selected the SNPs with the highest scores (PCAIMs). In order to remove redundancy from the selected set of markers, we employed a method that we have previously described in .
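As a rough illustration of the selection step, the sketch below scores each SNP by the sum of its squared loadings on the top principal components and keeps the highest-scoring SNPs. The function names, the mean-centering step, and the exact form of the score are assumptions made for illustration only; the scoring algorithm actually used is the one in the cited work, and the redundancy removal step is not shown.

```python
import numpy as np

def pcaim_scores(genotypes, n_components=2):
    """Score SNPs by their squared loadings on the top principal components.

    genotypes: (individuals x SNPs) numeric array (encoding is an assumption).
    Returns one non-negative score per SNP.
    """
    # Mean-center each SNP column (an assumption of this sketch).
    X = genotypes - genotypes.mean(axis=0)
    # Thin SVD: the rows of vt are the right singular vectors (SNP loadings).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (vt[:n_components] ** 2).sum(axis=0)

def top_snps(scores, panel_size):
    """Indices of the panel_size highest-scoring SNPs, best first."""
    return np.argsort(scores)[::-1][:panel_size]
```

With such a score, SNPs that carry no variation along the retained components receive a score near zero and are never selected.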
We used as ground-truth geographic coordinates the ones provided in , which typically place a sample at the capital city of his/her country of origin. In order to predict coordinates for unassigned individuals, we used a simple k-Nearest Neighbors (NN) approach. k-NN algorithms first compute the distance of the new sample from the individuals in the database and then identify the k nearest neighbors of the new sample. In order to predict the coordinates of the new sample, we simply compute the average of the coordinates of its k nearest neighbors (we set k to ten). In all our experiments our distance metric was the standard Euclidean distance. The distance was computed on the projection of the genotypic data on their top two principal components. We experimented with different values of k (the number of nearest neighbors) ranging from ten up to 20 in increments of one, but we did not observe a consistent advantage in using any value above ten. Similarly, we experimented with various schemes using weighted averages of the coordinates of the top k nearest neighbors (for example, the contribution of the coordinates of a neighbor to the final prediction could be weighted by some power of the inverse of its distance to the new sample); once more, we did not observe a consistent advantage in using such schemes. While we cannot rule out that more advanced classification methodologies and/or better distance metrics might be applicable in order to improve prediction accuracy, it is quite interesting and exciting that standard, simple methods are quite accurate and useful.
We ran two different cross-validation experiments.
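The nearest-neighbor prediction described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function name, argument names, and the `power` parameter (which switches on an inverse-distance weighting of the kind mentioned above) are hypothetical.

```python
import numpy as np

def predict_coordinates(train_pcs, train_coords, query_pc, k=10, power=0.0):
    """Predict (latitude, longitude) for one sample from its nearest neighbors.

    train_pcs:    (n, 2) projections of the reference samples on the top two PCs
    train_coords: (n, 2) ground-truth latitude/longitude of the reference samples
    query_pc:     (2,)   projection of the new sample on the same two PCs
    power:        0 gives the plain average; a positive value weights each
                  neighbor by (1 / distance) ** power
    """
    # Euclidean distance in the two-dimensional PC space.
    dists = np.linalg.norm(train_pcs - query_pc, axis=1)
    nearest = np.argsort(dists)[:k]
    if power == 0:
        return train_coords[nearest].mean(axis=0)
    # Guard against division by zero for exact matches.
    weights = 1.0 / np.maximum(dists[nearest], 1e-12) ** power
    return (weights[:, None] * train_coords[nearest]).sum(axis=0) / weights.sum()
```

Setting `power=0` corresponds to the unweighted average that the text reports as sufficient.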
We cross-validated a total of 1,200 individuals from the eleven populations in our dataset that had more than 40 samples. In each of the 1,200 repetitions of this experiment, we left out one individual (test set) and used the remaining individuals as the training set. We then used the training set individuals to compute panels of AIMs of various sizes (PCAIMs with redundancy removal) and then we employed our NN algorithm in order to predict the coordinates of origin of the test set individual.
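The leave-one-out procedure can be outlined as follows. This is a simplified sketch under several assumptions: SNPs are scored by squared loadings on the top two principal components of the training set, the panel is then re-projected on its own top two components, and redundancy removal is omitted. All names are hypothetical.

```python
import numpy as np

def leave_one_out_errors(genotypes, coords, panel_size, k=10):
    """Absolute (latitude, longitude) error for each held-out individual.

    genotypes: (n individuals x m SNPs) numeric array
    coords:    (n, 2) ground-truth latitude/longitude
    """
    n = genotypes.shape[0]
    errors = np.empty((n, 2))
    for i in range(n):
        train = np.delete(np.arange(n), i)          # everyone except sample i
        mean = genotypes[train].mean(axis=0)
        X = genotypes[train] - mean                 # center on training data only
        # Select the panel using the training set alone.
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        scores = (vt[:2] ** 2).sum(axis=0)
        panel = np.argsort(scores)[::-1][:panel_size]
        # Project training and test samples on the panel's top two PCs.
        _, _, vtp = np.linalg.svd(X[:, panel], full_matrices=False)
        train_pcs = X[:, panel] @ vtp[:2].T
        test_pc = (genotypes[i, panel] - mean[panel]) @ vtp[:2].T
        # Average the coordinates of the k nearest training samples.
        dists = np.linalg.norm(train_pcs - test_pc, axis=1)
        nearest = np.argsort(dists)[:k]
        predicted = coords[train][nearest].mean(axis=0)
        errors[i] = np.abs(predicted - coords[i])
    return errors
```

The key point of the design is that the held-out individual never influences centering, SNP selection, or the principal components used for its own prediction.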
Our second cross-validation experiment uses as training set the POPRES samples and as the test set the HapMap Phase 3 CEU and TSI populations. While extracting genotypes for our POPRES-based panels from the HapMap data we excluded individuals from the HapMap populations that had more than 10% missing entries on our panels.
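The missing-data filter applied to the HapMap individuals amounts to a per-individual missingness threshold. A minimal sketch follows, assuming that missing genotypes are encoded as NaN (the encoding and the function name are assumptions, not taken from the study).

```python
import numpy as np

def filter_by_missingness(panel_genotypes, max_missing=0.10):
    """Drop individuals with more than max_missing missing panel genotypes.

    panel_genotypes: (individuals x panel SNPs) float array, NaN = missing.
    Returns the retained rows and the boolean mask that selects them.
    """
    missing_rate = np.isnan(panel_genotypes).mean(axis=1)
    keep = missing_rate <= max_missing
    return panel_genotypes[keep], keep
```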
More details on data encoding, PCA, and our SNP selection procedures are available in Methods S1.
Our first experiment measured the prediction accuracy of our NN algorithm using all available SNPs. The average latitudinal error is 0.99 degrees (a very small deviation) and the average longitudinal error is 2.52 degrees. Interestingly, we get a better prediction of the North-North West to South-South East axis as opposed to the East to West axis. It is also worth noting that the largest average error was in the German samples and that the most accurately predicted populations were the Southern European and Irish ones. In our supporting online material (http://www.cs.rpi.edu/~drinep/POPRESAIMS, Text S1) we included plots for each of the eleven largest populations in our sample showing the mean and the standard deviation for each of the predicted populations.
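To relate errors expressed in degrees to distances in kilometers, one standard option is the haversine great-circle formula. The sketch below is a generic illustration and not a computation taken from the study; the mean Earth radius of 6,371 km and the function name are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

One degree of latitude corresponds to roughly 111 km everywhere, whereas one degree of longitude shrinks with the cosine of the latitude, so the same longitudinal error in degrees covers less ground in Northern Europe than in the South.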
As a first step, in order to verify our methodology, we attempted to evaluate whether there exist small panels of AIMs that could accurately reproduce the results of coordinate prediction using all 450K available markers. We started by selecting the top 5,000 PCA-Informative markers using two significant principal components. We then removed redundant markers using the algorithm of  and constructed three different panels of PCAIMs: P1 containing 500 markers, P2 containing 800 markers, and P3 containing 1,000 markers. The goal of this experiment is to illustrate that a relatively small (at most about 0.2% of the total number of available SNPs), albeit carefully selected, set of markers suffices for ancestry inference. Table 1 reports the performance of our three PCAIMs panels. The performance of all panels is quite satisfactory, with the error of the largest panel typically being no more than twice the error obtained with all 450K markers. Especially in countries where the error was large even using all 450K markers (for example, Germany), our panels perform almost as well as the full set of markers. It is important to emphasize that this experiment simply illustrates the fact that the information contained in the full set of 450K markers can be efficiently summarized using only a small number of carefully selected representative AIMs. However, we have not yet selected AIMs in the setting of a true cross-validation experiment. Indeed, the AIMs selected above were the result of processing the full dataset, without splitting it into training and test sets first; this will be done in our next experiment. Finally, we note that detailed lists of all panels (P1, P2, and P3) appear in the online material accompanying this work (http://www.cs.rpi.edu/~drinep/POPRESAIMS, Text S1).
We performed 1,200 splits of data, where in each split we constructed a test set consisting of one individual and the remaining individuals were used as a training set in order to select PCAIMs and predict the coordinates of the test set sample. Figure 1 and Table 2 summarize the performance of our PCAIM panels over all 1,200 individuals in all test sets. The overall performance of our approach using even small panels of PCAIMs is quite remarkable for almost all populations. Especially in terms of latitude, the average error never exceeds three degrees using our largest panel. Even with the smallest panel of 500 SNPs we show satisfactory prediction accuracy that actually exceeds the three degree error threshold only for the Spanish and Portuguese populations. With respect to the more challenging longitudinal predictions, we observe that they are somewhat worse when compared with the performance of all 450K SNPs. In particular, the error in the Serbian population increases to an average of 5.6 degrees (as opposed to less than one degree using all SNPs). Similar increases of a factor of two are observed in the Irish and Italian populations, while the Portuguese population suffers a three-fold loss in accuracy. This illustrates that the East-West axis in Europe is somewhat harder to predict with high accuracy using a small number of SNPs, necessitating either larger panels of SNPs or more advanced methods.
In our second cross-validation experiment we evaluated the performance of the SNP panels derived using the full POPRES data as training set in order to classify individuals from the two European HapMap Phase 3 populations (CEPH Europeans-CEU and Tuscan Italian-TSI). We extracted the genotypes corresponding to CEU and TSI individuals from HapMap release 27 (build 36) raw data and then used our NN prediction algorithm to predict coordinates for the samples using all available SNPs as well as panels P1, P2, and P3. For the TSI and CEU samples, we chose to use as ground truth coordinates our predictions using all 450K available SNPs. Figure 2 illustrates the location of the CEU and TSI populations in the European map, with the red circle denoting the average CEU or TSI subject and the horizontal red lines illustrating the standard deviation in latitude and longitude. The red x and the blue x (along with the corresponding lines) illustrate our coordinate predictions using the 1,000 and 500 SNP panels that were selected in the POPRES data. Note that not all SNPs of those panels were present in the HapMap data; for example, for the CEU samples, we found 994 SNPs from P3 in the HapMap data, and 496 SNPs from P1. (These numbers were slightly smaller, 927 and 459 respectively, for the TSI data.) Both our panels do a good job of predicting the location of CEU and TSI samples. In the TSI samples there is essentially no error in the North-South axis, but we are off by a few degrees in the East-West axis using our largest panel. For the CEU data, both latitudinal and longitudinal predictions are off by only a few degrees. In Figures S1 and S2 we show histograms of the (latitudinal and longitudinal) errors for our CEU and TSI samples using our 1,000 SNP panel.
These figures highlight that two thirds of the samples are very accurately predicted (with an error of two degrees at most in terms of latitude and eight degrees in terms of longitude), but there also exist some isolated samples that are quite inaccurately predicted; these samples increase somewhat disproportionately the average prediction error and its standard deviation. This is very obvious in the case of the TSI samples, where, in terms of longitude, five samples have 18 degrees of error (they had their nearest neighbors in the Spanish and Portuguese populations), thus considerably driving up the error, while over 60 samples had less than three degrees of longitudinal error.
This study is a comprehensive investigation of the possibility to recover geographic coordinates of individual ancestry within Europe based solely on information from carefully selected panels of genetic markers. Analyzing 1,200 individuals from 11 European populations and more than 440,000 SNPs, we show that it is indeed possible to predict individual ancestry within Europe down to a few hundred kilometers from the place of origin, using information from relatively small, albeit carefully selected, subsets of SNPs. Importantly, our findings are supported by thorough cross-validation experiments, both on the analyzed subset of the POPRES dataset ,  and the European HapMap populations. More than 1,200 SVDs for large matrices were computed, which, however, took only two weeks to run on commodity hardware, thanks to the efficient algorithms that we use. Interestingly, within Europe, individual origin seems much easier to predict along the North to South axis than along the East to West axis. This could indicate increased gene flow along the latter axis.
The reduction in the number of markers needed for ancestry inference is made possible through the use of our PCA-based method for the selection of AIMs and our redundancy removal algorithm. Different metrics have been proposed in order to select AIMs, most of which, such as δ or Wright's F_ST, rely on the maximization of allele frequency differences between pre-defined populations , , , , . A closely correlated measure, Informativeness for assignment (I_n) as defined by Rosenberg et al. , computes a mutual information based metric on allele frequencies. Our algorithm, on the other hand , , does not rely on prior hypotheses about individual ancestry and is naturally coupled with other PCA-based algorithms, such as PCA-based stratification correction methods and the ancestry inference techniques that we describe here. Furthermore, as we have also demonstrated, the performance of our method for AIM selection is comparable or even superior, in some cases, to that of the metric of , .
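For comparison with the PCA-based scores, two of the frequency-based metrics mentioned above can be written down directly. The sketch below computes the absolute allele frequency difference between two populations (often written δ) and the informativeness for assignment in the form given by Rosenberg et al., using natural logarithms. The function names and the input layout are assumptions made for illustration; both metrics, unlike PCAIMs, require individuals to be assigned to pre-defined populations first.

```python
import numpy as np

def delta(p1, p2):
    """Absolute difference in the frequency of one allele between two populations."""
    return abs(p1 - p2)

def informativeness(freqs):
    """Informativeness for assignment of a single marker.

    freqs: (K populations x N alleles) allele frequencies; each row sums to 1.
    Computes sum_j [ -p_j log p_j + (1/K) sum_i p_ij log p_ij ],
    where p_j is the average frequency of allele j across populations.
    """
    freqs = np.asarray(freqs, dtype=float)
    n_pops = freqs.shape[0]
    mean = freqs.mean(axis=0)

    def xlogx(x):
        # x * log(x) with the convention 0 * log(0) = 0.
        out = np.zeros_like(x)
        positive = x > 0
        out[positive] = x[positive] * np.log(x[positive])
        return out

    return float(-xlogx(mean).sum() + xlogx(freqs).sum() / n_pops)
```

The value is zero when all populations share the same allele frequencies and reaches its maximum, log K, when each population is fixed for a different allele.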
Recent studies have underlined the existence of population substructure within Europe and a few of them have also explored the potential to uncover individual ancestry based on subsets of selected AIMs , , , , , , . Heath et al.  investigated a panel of 391 PC correlated SNPs for ancestry inference in a sample of 6,000 individuals from across Europe. They showed some degree of correlation between predicted ancestry and ground truth; however, since this was not their main goal, they did not attempt cross-validation of this marker set. McEvoy et al.  focused on Northern European ancestry, studying a genomewide dataset of 2,099 individuals from eight populations of Northern European origin (including the admixed populations of European Australian and American individuals). They identified panels of AIMs based on the measure. Again, individual PC scores in the studied Northern European populations, especially using the larger panels, were significantly correlated with PC scores using the full dataset . Finally, Tian et al.  focused on AIMs selection for population differentiation along the North to South axis, by selecting -based SNPs for differentiation of Northern versus Southern European populations. However, that study focused on a relatively small sample of distinct European populations with a small number of samples for most populations. Here, we expand these studies by offering SNP panels for ancestry inference and stratification correction, based on the largest publicly available dataset for European population structure.
The SNPs that we propose here as ancestry informative for European populations can prove extremely useful for stratification correction in studies seeking to identify etiological genes for common complex disorders, when candidate susceptibility loci are targeted in larger samples, following an initial genome scan. In such cases, the inclusion of AIMs genotyping is essential, especially if underlying population structure related to the phenotype is suspected. Furthermore, these SNPs warrant further study, as they could underlie observed differences in disease frequency across Europe (for instance, the well-noted North to South gradient in the incidence of autoimmune disorders, such as type 1 diabetes ). Although such SNPs could have reached their population-differentiating frequencies and patterns due to demographic factors, it is possible that natural selection has operated on them. In fact, the top SNPs on our lists reside in the lactase gene region, which is well known to have undergone a recent selective sweep . Further work will shed light on the relative contribution of migration and drift versus natural selection in shaping the patterns of genomewide variation in the European population.
Distribution of the latitudinal (panel A) and longitudinal error (panel B) when using a panel of 994 SNPs selected on the POPRES samples to predict the coordinates of origin of the HapMap Phase 3 CEU samples. We consider as ground truth for the CEU samples our predictions using all 450K available SNPs.
Distribution of the latitudinal (panel A) and longitudinal error (panel B) when using a panel of 927 SNPs selected on the POPRES samples to predict the coordinates of origin of the HapMap Phase 3 TSI samples. We consider as ground truth for the TSI samples our predictions using all 450K available SNPs.
The collections and methods for the Population Reference Sample (POPRES) are described in . The datasets used for the analyses described in this manuscript were obtained from dbGaP through dbGaP accession number phs000145.v2.p2.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was supported, in part, by a National Science Foundation Computing and Communication Foundations (NSF CCF) 0447950 CAREER award to PD; an NSF CCF 0824684 award to PD; a European Molecular Biology Organization Arab Science & Technology Foundation (EMBO ASTF) 235-2009 Short-Term Fellowship to PD; and two Tourette Syndrome Association (TSA) Research Grant Awards to PP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.