Using a variety of approaches and algorithms, we have demonstrated that a major aspect of European population genetic structure follows a north–south distribution. Despite the use of over 5,000 SNPs in the initial dataset, the STRUCTURE analyses showed only a modest ability to distinguish other differences in European populations. The Finnish participants were a notable exception in that 11 of 12 individuals showed predominant affiliation with a unique population group (cluster) when the number of groups (k) set in the STRUCTURE analysis was greater than 7. There were some differences in the population group distribution among the different self-identified participants (e.g., see B), but it is unclear whether the proportions of these groups have any correspondence to differences in contributions (admixture) of founding populations. A leave-one-out cross-validation study using a different algorithm similarly showed a limited ability to distinguish within the “northern” or “southern” population groups.
Factor analysis of correspondence showed that the largest component (Factor 1) also aligned with this north–south clustering. This analysis further suggested that individual population groups could be at least partially distinguished when additional smaller factors (lower eigenvalues) were considered. These studies suggest that additional population structure may be discernible within Europe when larger SNP sets and additional “ethnic” or historical population subsets are examined.
The current study has potential limitations in participant selection, including the inclusion of large numbers of RA probands that might bias allele frequencies and the lack of a comprehensive sampling strategy. However, the clear clustering of participants of northern compared to southern European ancestry was consistently observed in this diverse set of participants, including a wide distribution of European Americans and participants from Italy, Spain, and Sweden. In addition, this population genetic structure was observed in ten random sets of 25 individuals selected from the different large population groups (western European Americans, Swedish, central European Americans, European Americans, Italian, and Spanish), providing further evidence that these results cannot be attributed to sample selection bias (unpublished data). The patterns of ancestry in those American participants of multiple diverse European origins also strongly support the current results, as does the ability to identify a much smaller set of SNPs that distinguish between the “northern” and “southern” European populations (using a subset of Spanish and western European participants). Finally, the reproduction of these results using a panel of the most informative markers in additional sample sets provides additional support for our findings of a north–south European distinction.
What is the importance of the current observations? First, the potential for false-positive results in association studies based on unrecognized population stratification is of substantial concern for any candidate gene study using a case-control design. The potential for false-positive associations in studies of European Americans has recently been emphasized [26]. The use of either structured association tests or genomic control strategies has been suggested by several investigators [8,10,40–42]. In the current study we selected three loci that show allele frequency differences in Italians compared with western, eastern, and central European populations. These three selected loci effectively function as surrogates for test alleles in a case-control analysis in which we examined whether these differences could be correctly controlled for by structured association testing. Both the entire set of 2,657 SNPs and a set of 400 SNPs enriched for north–south informativeness controlled each of the loci. In contrast, 400 randomly selected SNPs showed substantial variation in the ability to account for the European population structure in this study. These results suggest potential problems when limited numbers of SNPs are used to control for European population stratification unless a set of more informative SNPs is utilized.
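As an illustration of the genomic control strategy mentioned above (not the structured association test used in this study), the following minimal Python sketch estimates an inflation factor from a set of presumed-null SNPs and uses it to deflate a test statistic. The function names and the allele-count inputs are hypothetical and chosen only for this example; the sketch assumes a simple 1-df allelic chi-square test on 2×2 allele counts.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """1-df chi-square for a 2x2 table of allele counts (cases vs. controls)."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a monomorphic or empty table carries no signal
    return n * (a * d - b * c) ** 2 / denom


def inflation_factor(null_chi2):
    """Genomic control lambda: median null statistic over its expected
    median under no stratification, floored at 1."""
    return max(1.0, median(null_chi2) / CHI2_1DF_MEDIAN)


def corrected_chi2(test_chi2, lam):
    """Deflate a candidate-locus statistic by the estimated inflation."""
    return test_chi2 / lam
```

How well the inflation factor is estimated depends on which SNPs supply the null statistics, which parallels the observation above that 400 randomly selected SNPs varied in their ability to account for the European population structure.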
Second, genetic heterogeneity may be an important factor in decreasing the power of genetic studies. Performing separate analyses on European participants stratified by population genetic structure is worthy of exploration. Although allele frequency differences are generally small between these European populations (Table S1), a comparison of Italian and western European participants showed that 10.0% of SNPs had an allele frequency difference >10%, and 1.9% of the SNPs had an allele frequency difference >15%. Such differences may be important when examining non-Mendelian traits where low and modest relative risks are the general expectation.
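The proportions quoted above are simple tallies over per-SNP allele frequencies. A minimal sketch of that tally, assuming two equal-length lists of allele frequencies for the same SNPs in two populations (the function name and the example values are hypothetical and are not data from this study):

```python
def fraction_exceeding(freq_a, freq_b, threshold):
    """Fraction of SNPs whose absolute allele frequency difference
    between two populations is strictly greater than `threshold`."""
    if len(freq_a) != len(freq_b):
        raise ValueError("frequency lists must cover the same SNPs")
    if not freq_a:
        raise ValueError("at least one SNP is required")
    n_over = sum(1 for p, q in zip(freq_a, freq_b) if abs(p - q) > threshold)
    return n_over / len(freq_a)


# Made-up frequencies for four SNPs in two populations.
pop_a = [0.10, 0.50, 0.30, 0.80]
pop_b = [0.12, 0.38, 0.50, 0.78]
share_over_10 = fraction_exceeding(pop_a, pop_b, 0.10)
share_over_15 = fraction_exceeding(pop_a, pop_b, 0.15)
```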
A third issue is the explicit consideration of whether ancestry differences are associated with differences in phenotypic expression. Although controversial [43,44], some have advocated considering the importance of the ethnicity defined by DNA typing in clinical studies [45,46]. Ethnic or regional geographic differences in disease frequency have been noted for both Mendelian diseases and more complex genetic diseases. A north/south gradient in the incidence of autoimmune diseases has been noted for several continents, and there is some evidence for increased incidence of multiple sclerosis, type 1 diabetes, and Crohn's disease in northern European compared with southern European countries [47]. Do differences in European population structure underlie phenotypic differences with respect to disease, response to therapy, or adverse reaction to particular environmental agents? The answer is unknown, but this study suggests that the ability to discern European population structure may enable testing such possibilities.
The identification of a subset of SNPs informative for European substructure also raises the question of whether these informative SNPs may be in LD with physiologically important functions that were subject to selection events. Therefore, we compared the location of these SNPs with those identified by recent studies examining signals for positive selection using the HapMap data [48]. Although the most informative SNP was in fact closely associated with a known positive selection event within European populations (rs1375131 within 600 kb of the lactase gene), overall we did not find support for the overrepresentation of the most informative SNPs in the chromosomal positions recently shown as having signals for positive selection in the HapMap European participants (no difference in SNP frequency in the 100-kb regions flanking the 250 strongest selection signals comparing the most informative SNPs and random SNP sets). However, it is possible that signals may be present in particular subgroups of European participants (e.g., “southern” Europeans not included within the CEPH [Utah residents with ancestry from northern and western Europe; CEU] samples). Ongoing studies will examine this possibility as well as the distribution of European substructure “informative” SNPs when these are chosen from much larger initial genome-wide SNP screens.
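The comparison described above amounts to asking what fraction of a SNP set falls close to a selection signal, and whether that fraction differs between informative and random SNP sets. The following sketch shows one way such a tally could be computed. It assumes that each signal is represented by a single chromosomal position and that "flanking" means within 100 kb on either side; both are assumptions made for this example, and the function names and coordinates are hypothetical.

```python
from bisect import bisect_left


def build_index(signals):
    """Group signal positions by chromosome and sort them for fast lookup."""
    index = {}
    for chrom, pos in signals:
        index.setdefault(chrom, []).append(pos)
    for positions in index.values():
        positions.sort()
    return index


def near_signal(chrom, pos, index, flank=100_000):
    """True if any signal on `chrom` lies within `flank` bp of `pos`."""
    positions = index.get(chrom, [])
    i = bisect_left(positions, pos - flank)  # first signal >= pos - flank
    return i < len(positions) and positions[i] <= pos + flank


def fraction_near(snps, index, flank=100_000):
    """Fraction of SNPs lying within `flank` bp of at least one signal."""
    if not snps:
        raise ValueError("at least one SNP is required")
    return sum(near_signal(c, p, index, flank) for c, p in snps) / len(snps)


# Made-up coordinates, for illustration only.
signal_index = build_index([("2", 1_000_000), ("5", 500_000)])
example_snps = [("2", 1_050_000), ("2", 1_200_000), ("5", 400_000), ("7", 500_000)]
example_fraction = fraction_near(example_snps, signal_index)
```

Applying the same function to the informative SNPs and to repeated random SNP sets would give the two fractions whose difference is of interest.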
The finding in the current study that individuals of Ashkenazi Jewish descent are predominantly “southern” European further suggests the later migration of this ethnic group from the Mediterranean region. Regardless of the European country of origin, each of those participants with four grandparents of Ashkenazi Jewish heritage showed this predominant “southern” cluster membership. This finding suggests the importance of ascertaining this aspect of ethnic origin in the design of association studies in European populations. As an example of this potential issue, we showed that inclusion of Ashkenazi samples with other participants of northern European origin (based on country of grandparental birth) did in fact cause a type 1 (false-positive) error when population stratification was not considered.
It is interesting to speculate how the ability to distinguish northern and southern European populations relates to ancient as well as more modern differences in migration and admixture patterns. Archeological and skeletal evidence as well as studies of mitochondrial and Y chromosome haplogroups have provided evidence of upper Paleolithic, Neolithic, and more recent settlement and migrations as contributing to the origin of current European populations [12–18,22,49–52]. Phylogenetic analyses of Y haplotypic groups are interpreted to support both separate migrations from the Middle East 4,000 to 7,000 y ago as well as a more recent “Greek” expansion into Italy and the Iberian peninsula occurring closer to 2,500 y ago [16,18]. The earlier migrations would be consistent with waves spreading agricultural techniques from the Middle East and are supported by some mitochondrial DNA studies [13]. However, there is little consensus concerning the association of any of these migrations with agricultural techniques or trading routes [50,51], or for that matter with the spread of Indo-European languages [22,51,53]. Some studies of specific mitochondrial and Y haplogroups [53] are consistent with the demic diffusion hypothesis suggested by Cavalli-Sforza et al. [22], and the work of Sokal et al. [54] and others has provided evidence of different patterns of repopulation from glacial refuges or has suggested a later influence from North Africa in both Italy and Spain [14,15,18]. As recently discussed by Barbujani and Chikhi, the origin(s) of modern European ancestors remains a controversial issue [55]. Other major population events, including the multiple epidemics during the Middle Ages, may also have resulted in genetic bottlenecks contributing to current differences in European population structure.
Regardless of the historical explanations for the north–south genetic differences we have described, our results emphasize the importance of considering population structure in both genetic and epidemiological studies in European populations. Future examination of population structure using larger numbers of SNPs in additional population samples may enable a better definition of the differences between European population groups, and similar studies may provide analogous information in other continental populations.