Recent studies suggest that combining high-throughput genotyping technologies with dense geographic sampling can shed light on unanswered questions regarding human population structure (refs 1–5). For instance, it is not clear to what extent populations within continental regions exist as discrete genetic clusters rather than as a genetic continuum, nor how precisely one can assign an individual to a geographic location on the basis of their genetic information alone.
To investigate these questions, we surveyed genetic variation in a sample of 3,192 European individuals collected and genotyped as part of the larger Population Reference Sample (POPRES) project (ref. 7). Individuals were genotyped at 500,568 loci using the Affymetrix 500K single nucleotide polymorphism (SNP) chip. When available, we used the country of origin of each individual’s grandparents to determine the geographic location that best represents each individual’s ancestry; otherwise, we used the self-reported country of birth (see Methods and Supplementary Tables 1 and 2). After removing SNPs with low quality scores, we applied various stringency criteria to avoid sampling individuals from outside Europe, to create more even sample sizes across Europe, to exclude individuals with grandparental ancestry from more than one location, and to avoid potential complications of SNPs in high linkage disequilibrium (see Methods and Supplementary Table 3). Although our main result holds even when we relax nearly all of these stringency criteria, we focus our analyses on genotype data from 197,146 loci in 1,387 individuals (Supplementary Table 2), for whom we have high confidence of individual origins.
We used principal components analysis (PCA; ref. 8) to produce a two-dimensional visual summary of the observed genetic variation. The resulting figure bears a notable resemblance to a geographic map of Europe. Individuals from the same geographic region cluster together and major populations are distinguishable. Geographically adjacent populations typically abut each other, and recognizable geographical features of Europe such as the Iberian peninsula, the Italian peninsula, southeastern Europe, Cyprus and Turkey are apparent. The data reveal structure even among French-, German- and Italian-speaking groups within Switzerland, and between Ireland and the United Kingdom (labelled IE and GB). Within some countries individuals are strongly differentiated along the principal component (PC) axes, suggesting that in some cases the resolution of the genetic data may exceed that of the available geographic information.
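As a rough illustration of the kind of PCA summary described above, the following sketch computes the top two PC scores from a matrix of 0/1/2 allele counts. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `genotype_pca`, the per-SNP standardisation by the binomial standard deviation, and the removal of monomorphic SNPs are all illustrative choices, and missing genotypes are assumed to have been handled beforehand.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Return PC scores and the fraction of variance explained by each PC.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts,
    assumed complete (no missing calls). Hypothetical helper.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0              # sample allele frequency per SNP
    keep = (freq > 0.0) & (freq < 1.0)       # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    # Centre each SNP and scale by its binomial standard deviation
    # (an assumed normalisation, common in PCA of genotype data).
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_explained = s ** 2 / np.sum(s ** 2)
    return scores, var_explained[:n_components]
```

Plotting the two columns of `scores` against each other, coloured by reported country of origin, would give the kind of two-dimensional summary discussed in the text.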
[Figure: Population structure within Europe.]
When we quantitatively compare the geographic position of countries with their PC-based genetic positions, we observe few prominent differences between the two (Supplementary Fig. 1), and those that exist can be explained either by small sample sizes (for example, Slovakia (SK)) or by the coarseness of our geographic data (a problem for large countries, for example, Russia (RU)); see Supplementary Information for more detail. Our method also identifies a few individuals who exhibit large differences between their genetic and geographic positions (Supplementary Fig. 2). These individuals may have mis-specified ancestral origins or be recent migrants. In addition, although the sample used here is unlikely to include many members of smaller genetically isolated populations that exist within countries (for example, Basques residing in Spain or France, Orcadians in Scotland, or individuals of Jewish ancestry), in rare cases outlying individuals could reflect membership of such groups. For example, a small set of Italian individuals cluster ‘southwest’ of the main Italian cluster and one might speculate they are individuals of insular Italian origin (for example, Sardinia or Sicily).
The overall geographic pattern fits the theoretical expectation for models in which genetic similarity decays with distance in a two-dimensional habitat, as opposed to expectations for models involving discrete well-differentiated populations. Indeed, in these data genetic correlation between pairs of individuals tends to decay with distance. For spatially structured data, theory predicts that the top two principal components (PCs 1 and 2) will be correlated with perpendicular geographic axes (ref. 9), which is what we observe (r² = 0.71 for PC1 versus latitude; r² = 0.72 for PC2 versus longitude; after rotation, r² = 0.77 for ‘north–south’ in PC-space versus latitude, and r² = 0.78 for ‘east–west’ in PC-space versus longitude). In contrast, when K discrete populations are sampled, one expects discrete clusters to be separated out along K − 1 of the top PCs (ref. 8). In our analysis, neither the first two PCs nor subsequent PCs separate clusters as one would expect for a set of discrete, well-differentiated populations (see ref. 8).
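The comparison between PC axes and geographic axes can be sketched as follows. The code computes a squared Pearson correlation and grid-searches a rotation of the two PC axes that maximises the summed r² against latitude and longitude. The criterion, the grid search, and the names `r_squared`, `rotate` and `best_rotation` are assumptions made for illustration; this section does not state how the rotation was chosen.

```python
import numpy as np


def r_squared(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def rotate(coords, angle_deg):
    """Rotate 2-D coordinates counter-clockwise by angle_deg about the origin."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t), np.cos(t)]])
    return coords @ rot.T


def best_rotation(pcs, lat, lon, step_deg=1.0):
    """Grid-search the angle maximising r2(axis 0, lat) + r2(axis 1, lon).

    Assumes column 0 of pcs plays the 'north-south' role and column 1 the
    'east-west' role. r2 ignores sign, so angles in [0, 180) suffice.
    """
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(0.0, 180.0, step_deg):
        r = rotate(pcs, angle)
        score = r_squared(r[:, 0], lat) + r_squared(r[:, 1], lon)
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle, best_score
```

Because r² is insensitive to the sign of an axis, the search says nothing about whether an axis needs to be flipped; that would have to be checked separately when drawing a map-like plot.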
The direction of the PC1 axis and its relative strength may reflect a special role for this geographic axis in the demographic history of Europeans (as first suggested in ref. 10). PC1 aligns north-northwest/south-southeast (NNW/SSE, −16 degrees) and accounts for approximately twice as much variation as PC2 (0.30% versus 0.15%; first eigenvalue = 4.09, second eigenvalue = 2.04). However, caution is required because the direction and relative strength of the PC axes are affected by factors such as the spatial distribution of samples (results not shown; see also ref. 9). More robust evidence for the importance of a roughly NNW/SSE axis in Europe is that, in these same data, haplotype diversity decreases from south to north (A.A. et al., submitted).
As the fine-scale spatial structure evident in the PCA summary suggests, European DNA samples can be very informative about the geographical origins of their donors. Using a multiple-regression-based assignment approach, one can place 50% of individuals within 310 km of their reported origin and 90% within 700 km (Supplementary Table 4; results based on populations with n > 6). Across all populations, 50% of individuals are placed within 540 km of their reported origin, and 90% within 840 km (Supplementary Fig. 3 and Supplementary Table 4). These numbers exclude individuals who reported mixed grandparental ancestry, who are typically assigned to locations between those expected from their grandparental origins (results not shown). Note that the distances between assigned and reported origins would probably be smaller if finer-scale information on origin were available for each individual.
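A regression-based assignment of this general kind might be sketched as below: latitude and longitude are each regressed on functions of PC1 and PC2, each individual is predicted from a model fitted without them, and the error is measured as a great-circle distance. The quadratic design matrix, the leave-one-out scheme, the Earth radius of 6,371 km, and the function names are assumptions for illustration; the section itself only says that a multiple-regression-based approach was used.

```python
import numpy as np


def design(pcs):
    """Assumed predictors: intercept, PC1, PC2, their product and squares."""
    p1, p2 = pcs[:, 0], pcs[:, 1]
    return np.column_stack([np.ones(len(p1)), p1, p2, p1 * p2, p1 ** 2, p2 ** 2])


def loo_predict(pcs, lat, lon):
    """Leave-one-out predictions of (lat, lon) from the PC coordinates."""
    x = design(np.asarray(pcs, dtype=float))
    y = np.column_stack([lat, lon]).astype(float)
    pred = np.empty(y.shape)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out individual i
        beta, *_ = np.linalg.lstsq(x[mask], y[mask], rcond=None)
        pred[i] = x[i] @ beta
    return pred


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km between points given in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
```

Given predictions `pred`, the summary statistics quoted in the text correspond to something like `np.percentile(great_circle_km(lat, lon, pred[:, 0], pred[:, 1]), [50, 90])`.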
[Figure: Performance of assignment method.]
Population structure poses a well-recognized challenge for disease-association studies (for example, refs 11–13). The results obtained here reinforce that the geographic distribution of a sample is important to consider when evaluating genome-wide association studies among Europeans (for example, refs 3–5, 11). A crucial part is also played by spatial variation in phenotype. To examine this, we simulated genome-wide association data for quantitative trait phenotypes with varying degrees of linear latitudinal or longitudinal trends (Supplementary Fig. 4). Even for phenotypes modestly correlated with geography (for example, ≥5% of variance explained by latitude or longitude) the uncorrected P-value distribution shows a clear excess of small values, suggesting that population structure correction may be important even in seemingly closely related populations such as Europeans. Note that many factors, including sample size and distribution of sampling locations, will influence the effects of stratification on P-value distributions, and so these results should be considered only as illustrative of the settings in which stratification could become a problem in European samples.
In all our simulations, use of a PC-based correction (refs 12, 14) adequately controlled for P-value inflation (Supplementary Fig. 4). The success of PCA-based correction is not unexpected here, because the PCs are excellent predictors of latitude and longitude, and we used only linear functions of latitude and longitude to determine the means of our simulated phenotypes. For real phenotypes, higher order functions of PC1 and PC2 and/or additional PCs might be necessary to correct for more complex spatial variation in phenotype. We speculate that at the geographic scale of many association studies carried out so far, many phenotypes are relatively uncorrelated with geography, and that this may explain why in many cases PC-based correction has had little impact in practice (refs 3, 13). For phenotypes that are more strongly spatially structured within a sample (for example, height; refs 11, 15, 16), spurious associations due to population stratification should be more of a concern.
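The effect of adjusting for covariates such as PCs can be illustrated with a simple per-SNP linear test. The sketch below residualises both genotype and phenotype on the covariates and converts the partial correlation into an approximate two-sided P-value. It is not the correction method of refs 12 and 14: the function names, the use of ordinary least squares, and the normal approximation to the t distribution are assumptions chosen to keep the example dependency-free.

```python
import numpy as np
from math import erfc, sqrt


def residualize(values, covariates=None):
    """Remove an intercept and optional covariates by least squares.

    Returns the residuals and the number of fitted columns.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[0]
    cols = [np.ones((n, 1))]
    if covariates is not None:
        cols.append(np.asarray(covariates, dtype=float).reshape(n, -1))
    x = np.hstack(cols)
    beta, *_ = np.linalg.lstsq(x, values, rcond=None)
    return values - x @ beta, x.shape[1]


def association_pvalues(genotypes, phenotype, covariates=None):
    """Approximate per-SNP P-values for a linear genotype effect.

    genotypes: (n, m) allele counts, every SNP assumed polymorphic.
    covariates: e.g. an (n, 2) array of PC1 and PC2 scores, or None.
    """
    g_res, k = residualize(genotypes, covariates)
    y_res, _ = residualize(phenotype, covariates)
    num = (g_res * y_res[:, None]).sum(axis=0)
    den = np.sqrt((g_res ** 2).sum(axis=0) * (y_res ** 2).sum())
    r = num / den                              # partial correlation per SNP
    df = g_res.shape[0] - k - 1                # residual degrees of freedom
    t = r * np.sqrt(df / (1.0 - r ** 2))
    # Normal approximation to the t distribution (reasonable for large df).
    return np.array([erfc(abs(v) / sqrt(2.0)) for v in t])
```

Comparing the fraction of P-values below 0.05 with `covariates=None` against the fraction obtained when PC scores are supplied gives a rough sense of how much inflation a geographic trend in phenotype produces and how much of it the adjustment removes.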
Although broad correlations between PCs and geography have been observed previously (refs 3–5, 17, 18), only the large number of loci and dense geographic sampling of individuals used here reveal the clear map-like structure to European genetic variation. Because at any one SNP the average level of differentiation across Europe is small (average F_ST = 0.004 between geographic regions; F_ST is a measure of differentiation between populations that takes a value of 0 when there is no differentiation and 1 when there is maximal differentiation; ref. 19), it is the combined information across many loci and many individuals that reveals fine-scale population structure in this sample.
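To make the scale of F_ST concrete, the sketch below computes a simple heterozygosity-based version for two populations, F_ST = (H_T − H_S) / H_T, combined across SNPs as a ratio of sums. This particular estimator, the equal weighting of the two populations, and the function name are assumptions; the section does not say which estimator was used, and no sample-size correction is applied here.

```python
import numpy as np


def fst_two_populations(p1, p2):
    """Heterozygosity-based F_ST for two populations (illustrative only).

    p1, p2: per-SNP allele frequencies in each population, with at least
    one SNP polymorphic in the pooled sample.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    # Mean expected heterozygosity within populations.
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    # Expected heterozygosity of the pooled population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    return float((h_t.sum() - h_s.sum()) / h_t.sum())
```

Under this definition, allele frequencies of 0.47 and 0.53 at a single SNP give a value of roughly 0.0036, which is of the same order as the average reported in the text and shows how small the per-SNP differences are.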
An important consideration in interpreting our analyses is that, as a result of ascertainment bias (refs 20, 21), current SNP genotyping platforms under-represent variation at low-frequency alleles. Low-frequency alleles tend to be the result of a recent mutation and are expected to geographically cluster around the location at which the mutation first arose; hence, they can be highly informative about the fine-scale population structure (for example, ref. 22). In addition, the PCA-based methods used here are based on genotypic patterns of variation and do not take advantage of signatures of population structure that are contained in patterns of haplotype variation (refs 1, 23–25). Soon-to-be-available whole-genome re-sequencing will give us access to informative low-frequency alleles, and further statistical method development will allow us to leverage patterns of haplotype variation. The prospect of these developments suggests the geographic resolution presented here is only a lower bound on the performance possible in the near future. Thus, our results provide an important insight: the power to detect subtle population structure, and in turn the promise of genetic ancestry tests, may be more substantial than previously imagined.