We sampled 296 individuals from 13 worldwide populations, including populations from West Africa, Central Europe, West Asia, Central Asia, South Asia, Southeast Asia, Polynesia, and America (, populations in bold). All samples were genotyped using the Affymetrix 6.0 array, and we refer to this set of individuals as the “Affy6.0” set in the following analysis. We then combined these samples with 344 individuals from 23 populations in our previous study [7] (, populations in black), in which the Affymetrix 250K NspI array was used (“Affy250K” set), and 210 individuals from four HapMap populations (YRI, CEU, CHB and JPT, “HapMap” set). The final dataset contains 246,554 autosomal loci genotyped in 850 individuals from 40 populations (see Methods for details of SNP selection and merging criteria). To determine the effect of using only this subset of the SNPs, we compared FST between each pair of populations in the HapMap and the Affy6.0 sample set using the 246,554 SNP set and the whole SNP set (~866,000 SNPs). The FST values between all population pairs are virtually identical for the two SNP sets (overall correlation coefficient r = 0.99998, p), suggesting that the 250K SNP set is sufficient for examining inter-population relationships.
Decrease in population differentiation with more uniform sampling
To assess the effect of more even sampling on the degree of population differentiation, we compared the FST values between three major continental groups (Africa, Europe and Asia) from three individual sets: HapMap, HapMap+Affy6.0, and HapMap+Affy6.0+Affy250K. To match the sample sizes of the HapMap set, we randomly sampled 60, 60, and 90 individuals from Africa, Europe and Asia, respectively, in each individual set. Our results show that the overall FST value decreases substantially with the inclusion of geographically intermediate populations, dropping from 15.9% for HapMap to 11.2% for HapMap+Affy6.0+Affy250K, with non-overlapping confidence intervals (). Adding the American and Polynesian individuals to the HapMap+Affy6.0+Affy250K set increased FST slightly (to 11.3%) because of substantial founder effects and genetic drift in these populations. Nevertheless, the FST value in all individuals is still significantly lower than the FST value of the HapMap individual set. These statistically significant FST differences illustrate the important effects of population sampling. A decrease in population differentiation with more even sampling is also demonstrated by an increase in the proportion of SNPs whose minor alleles are shared in all three continental groups. This value increases from 74.9% for HapMap to 88.2% for HapMap+Affy6.0+Affy250K ().
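As an illustration of the quantity compared above, the following is a minimal sketch of a heterozygosity-based FST, (H_T − H_S) / H_T, computed as a ratio of sums over SNPs. The estimator actually used in this study is specified in the Methods and may differ. The function name `fst_from_freqs`, the input layout, and the sample-size weighting are assumptions made only for this example.

```python
def fst_from_freqs(freqs_by_group, sizes):
    """Heterozygosity-based FST for biallelic SNPs (illustrative only).

    freqs_by_group: one list per group, holding the frequency of the same
                    reference allele at each SNP (hypothetical input layout).
    sizes: number of individuals per group, used as weights (an assumption).
    """
    total = sum(sizes)
    weights = [n / total for n in sizes]
    n_snps = len(freqs_by_group[0])
    h_s_sum = 0.0  # summed within-group expected heterozygosity
    h_t_sum = 0.0  # summed expected heterozygosity of the pooled sample
    for j in range(n_snps):
        p = [group[j] for group in freqs_by_group]
        h_s_sum += sum(w * 2 * pi * (1 - pi) for w, pi in zip(weights, p))
        p_bar = sum(w * pi for w, pi in zip(weights, p))
        h_t_sum += 2 * p_bar * (1 - p_bar)
    if h_t_sum == 0:
        return 0.0  # no polymorphic SNPs, so no differentiation is measurable
    return (h_t_sum - h_s_sum) / h_t_sum


# Three groups weighted 60 / 60 / 90, mirroring the subsampling scheme above.
# The allele frequencies here are made up for the example.
example = fst_from_freqs([[0.2, 0.5], [0.5, 0.5], [0.8, 0.5]], [60, 60, 90])
```

Under this definition, adding geographically intermediate groups pulls the group frequencies closer to the pooled frequency, which lowers H_T − H_S and therefore FST.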
FST and proportion of polymorphic SNPs shared among continents
For individuals that were genotyped for more than 866,000 autosomal SNPs using the Affymetrix 6.0 array (HapMap and Affy6.0), we also determined the FST values and proportion of polymorphic SNPs using all genotyped autosomal SNPs. In both individual sets the FST values using all SNPs are comparable to FST values using the ~250,000 SNPs (). The difference between the two individual sets remains significant. The percentage of shared polymorphic SNPs decreases slightly in both datasets (), reflecting a relatively higher proportion of low-frequency SNPs in the Affymetrix 6.0 array.
To compare haplotype diversity across populations, we normalized the sample size across population groups by randomly choosing 20 individuals and excluding populations with fewer than 20 samples (see Methods for details). The average haplotype heterozygosity is significantly higher in African populations than non-African populations (, Wilcoxon rank test p = 1.2×10⁻⁴), and haplotype diversity decreases as geographic distance to east Africa increases (, r = −0.76, p = 4.3×10⁻⁶). Despite the overall significant correlation, there appears to be little correlation within Africa between haplotype diversity and distance to east Africa (r = −0.13, p = 0.78). Indeed, when African populations are excluded from the analysis, a stronger correlation is obtained (r = −0.94, p = 4.3×10⁻¹⁰, upper panel).
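For readers unfamiliar with the statistic, the sketch below computes haplotype heterozygosity for one genomic window using the standard sample-size-corrected form, n/(n − 1) · (1 − Σ pᵢ²). This is an assumption about the estimator; the exact definition and windowing used here are given in the Methods. The function name and the string encoding of haplotypes are hypothetical.

```python
from collections import Counter


def haplotype_heterozygosity(haplotypes):
    """Sample-size-corrected haplotype heterozygosity for one window.

    haplotypes: list of hashable haplotype labels, one per chromosome
                (e.g. 40 entries for 20 diploid individuals).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    # Probability that two chromosomes drawn with replacement are identical.
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


# Two haplotypes at equal frequency among four chromosomes.
h = haplotype_heterozygosity(["ACGT", "ACGT", "ATGT", "ATGT"])
```

Because the correction factor depends on n, fixing the sample at 20 individuals per population, as described above, keeps the values comparable across populations.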
We also compared the SNP and haplotype heterozygosity values in each population (). These two quantities are generally highly correlated, although there are several exceptions: First, SNP heterozygosity is higher than haplotype heterozygosity in European and Central Asian populations. This may reflect a SNP ascertainment bias, since many of these polymorphisms were historically selected to maximize heterozygosity in European populations. Second, the Pygmy sample shows a low SNP heterozygosity despite relatively high haplotype heterozygosity. This unusual pattern could be caused by stronger effects of SNP ascertainment bias in this population than in others. Indeed, a recent study of Khoisan individuals (another hunter-gatherer group from Africa) showed a similar pattern: despite high SNP heterozygosity (~60%) in whole-genome sequence data, a Khoisan individual showed low heterozygosity on the SNP microarray genotypes (~22%) [32]. Alternatively, this difference could also reflect unique attributes of population history.
Genetic structure among populations
To examine inter-population relationships, we first constructed a neighbor-joining tree based on genetic distances (). Populations from major geographic regions are clustered, and most branches have very high (>99%) bootstrap support (Supp. Figure S2). New World populations (Totonac and Bolivian) are placed between Nepalese and Kyrgyzstanis, indicating higher affinity of these American samples to central Asians than to eastern Asians. A second neighbor-joining tree was constructed by adding 40 HGDP populations (46,260 common SNPs), producing similar patterns of population clustering (Supp. Figure S3).
Population relationships between the 40 populations
We then performed a Principal Component Analysis (PCA) based on the pairwise allele-sharing distances among all pairs of individuals (). The majority of the genetic variation is found between African and non-African populations, as the first principal component (PC1) accounts for 78.7% of total variance. PC2 reflects genetic variation in Eurasia, and populations from Central and West Asia occupy the space between East Asia and Europe to form a relatively continuous distribution. The two Polynesian populations (Tongan and Samoan) show a close relationship to Southeast Asian populations (). PC3 distinguishes New World populations (Bolivian and Totonac) from other populations (Supp. Figure S4A).
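The distance matrix underlying such a PCA can be sketched as follows. This is a minimal illustration that assumes genotypes coded as 0, 1, or 2 copies of one allele, `None` for missing calls, and a per-SNP distance of |a − b| / 2, so that values fall between 0 and 1. The coding, the normalization, and the function names are assumptions for this example and are not taken from the Methods.

```python
def allele_sharing_distance(g1, g2):
    """Mean allele-sharing distance between two individuals.

    g1, g2: equal-length genotype lists coded 0/1/2, with None for missing.
    SNPs missing in either individual are skipped.
    """
    total = 0.0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        total += abs(a - b) / 2.0  # 0 = same genotype, 1 = opposite homozygotes
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return total / compared


def asd_matrix(genotypes):
    """Symmetric matrix of pairwise allele-sharing distances."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = allele_sharing_distance(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Three toy individuals typed at three SNPs.
matrix = asd_matrix([[0, 1, 2], [0, 1, 2], [2, 1, 0]])
```

A matrix of this kind would then be passed to a principal component or principal coordinate routine to obtain the per-individual coordinates plotted along PC1, PC2, and PC3.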
At the sub-continental level, we focus first on Eurasia, where most of our samples have been selected (). Overall, PC1 and PC2 mainly reflect the geographic distribution of the populations, with the majority of genetic variation accounted for by their locations. PC1 (accounting for 62.7% of the variance) reflects an east-west gradient, while PC2 (3.3% of the variance) reflects a north-south gradient. Slovenians and Iraqi Kurds show close relationships to European populations. A closer examination (Supp. Figure S4B) shows that Kurds and eastern European Daghestani populations (Urkarah and Stalskoe) are clearly separated from western European populations. On the other hand, Slovenians show very little differentiation from western European populations (Supp. Figure S4B).
Principal components analysis of population structure
Some of our populations form less defined clusters than do the HapMap populations. The Nepalese samples, in particular, are highly diverse, with some individuals showing a closer relationship to East Asian populations, while others are closer to South Asian populations. An examination of the ethnicity of the Nepalese individuals reveals that individuals from the ethnic groups derived from the caste system, including Madhesi, Brahman, and Chhetri, show a closer relationship to South Asian populations (especially Indian Brahmins). Individuals from the two indigenous Nepal ethnic groups (Newar and Magar) are closer to Central/East Asian populations (). Kyrgyzstanis were also widely dispersed along the first PC, although to a lesser extent than the Nepalese samples. This dispersion is expected because Kyrgyzstan is on the trade route between Europe and Asia, where there has long been a high level of migration.
Distinctive patterns can also be observed at the sub-continental level in non-Eurasian populations. Within Africa, the first two PCs separate Mbuti Pygmy and !Kung from other African populations (Supp. Figure S4C). The remaining African populations appear to follow a north-south gradient, and the Dogon and Bambara from Mali show high similarity to the HapMap YRI from Nigeria (Supp. Figure S4C). Within America, the two populations show contrasting patterns: Totonacs from Mexico form a tight cluster, while about half of the Bolivian samples are separated from the Bolivian cluster, which appears to reflect European admixture (Supp. Figure S4D).
Individual group membership
We used the program ADMIXTURE to assess the ancestry of each individual from 3–12 inferred populations (K) (Supplemental Table S1). The results from K=4 and K=12 are illustrated in . When K=4, four groups corresponding to Africa, America, Europe, and Asia are identified. Unlike individuals from Africa and America, who form two relatively distinct groups, individuals from Eurasia show a mixture of Asian and European ancestry components.
When K=12, a number of sub-continental patterns appear. In Africa, Mbuti Pygmy, !Kung, and Dogon are separated into distinct groups. Despite being sampled from neighboring regions in Mali, Bambaran and Dogon individuals show quite different ancestry. Most Dogon individuals appear to be composed of a single western African component, while Bambaran individuals contain more than 30% of a component prevalent in eastern Africa. Polynesian and American populations were separated into two distinct components. In agreement with the PCA result, some Bolivian individuals contain more than 20% European ancestry, suggesting admixture in these samples.
Within Eurasia, the patterns are more complex. To examine the relationships among Eurasian populations in detail, we performed ADMIXTURE analysis on the Eurasian individuals only and calculated the average ancestry components in each population. Major regional groups and geographic clines are best visualized with seven ancestral components (K=7, ). In Europe, a northern/western European component is predominant in HapMap CEU, the Utah Northern European, and the Slovenian samples. One Caucasus/Middle East component is predominant in Daghestani and Iraqi samples and appears to decrease clinally to the east through Pakistan and Nepal and to the west through southern and northern Europe. In southern India, this component is a major genetic signal in two independently sampled Brahmin groups (>20%) but is nearly absent in lower castes and Irula (a tribal group, <1.5%). Notably, the central Asian populations of Nepal and Kyrgyzstan have the most genetic admixture. This result is consistent with our PCA results showing a high level of genetic variation within these two populations. Another interesting observation is that Buryats and Kyrgyzstanis share about 5% ancestry with Native American populations (averages of 4.4% in Kyrgyzstanis and 5.8% in Buryats), while East Asian individuals have very little of the Native American ancestry component (average <1%).
ADMIXTURE analysis of Eurasian individuals with K = 7
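The per-population averaging of ancestry components described above can be sketched as follows. ADMIXTURE writes one row of K ancestry fractions per individual; the sketch assumes those rows have already been read into Python lists and paired with population labels. The function name and the input layout are hypothetical.

```python
def mean_ancestry_by_population(q_rows, labels):
    """Average ancestry fractions per population.

    q_rows: one list of K ancestry fractions per individual.
    labels: population label for each individual, in the same order.
    Returns a dict mapping population -> list of K mean fractions.
    """
    sums = {}
    counts = {}
    for row, pop in zip(q_rows, labels):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
            counts[pop] = 0
        for k, value in enumerate(row):
            sums[pop][k] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}


# Toy example with K = 2 and made-up population labels.
means = mean_ancestry_by_population(
    [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]],
    ["PopA", "PopA", "PopB"],
)
```

Population-level values such as the 4.4% and 5.8% averages quoted above are means of this kind over the individuals in each population.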
Copy Number Variation (CNV) profile
As a complement to our SNP analysis, we also used the same array platform to determine each individual’s CNV profile. To investigate the overall inter-population differences due to CNVs, we determined the number of CNVs per person and the average CNV frequencies in each population (). The African populations (Bambaran and Dogon) have the highest number of CNVs among all populations (median of 44 and 42 CNVs per genome, respectively). Outside of Africa, the median number of CNVs ranges from 38 in Kyrgyzstanis to 30 in Totonacs (). These data are consistent with previous work, which found a higher number of CNVs in African populations [33], suggesting a loss of low-frequency CNV alleles due to population bottlenecks during the out-of-Africa migration and the peopling of the Americas.
Next, we identified CNVs that are specific to each population and then counted the number of individuals within each population sharing the same population-specific CNV (). Within the Dogon, Bambaran, Pakistani, and Totonac populations, we found a high proportion of population-specific CNVs that were common to multiple members, with more than 12% observed in two or more individuals. The remaining populations had few population-specific CNVs in common among their members: more than 90% of detected population-specific CNVs in these populations are present in only one individual. Because most population-specific CNVs are relatively rare within each population, and there are only a small number of total CNV loci, samples from different populations do not form distinct clusters in a PCA (Supp. Figure S5).
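The counting step described above can be sketched as follows. The example assumes that CNV calls have already been merged into shared locus identifiers, so that the same CNV seen in two individuals carries the same ID. The criteria actually used to merge calls are not shown here, and the data layout and function name are hypothetical.

```python
from collections import Counter


def population_specific_cnvs(calls):
    """Find CNVs seen in exactly one population and count their carriers.

    calls: dict mapping population -> dict mapping individual -> set of CNV IDs.
    Returns a dict mapping population -> {CNV ID: number of carriers}.
    """
    # Record which populations carry each CNV.
    populations_with = {}
    for pop, individuals in calls.items():
        for cnvs in individuals.values():
            for cnv in cnvs:
                populations_with.setdefault(cnv, set()).add(pop)

    result = {}
    for pop, individuals in calls.items():
        # Each individual holds a set, so a CNV is counted once per carrier.
        carriers = Counter(cnv for cnvs in individuals.values() for cnv in cnvs)
        result[pop] = {
            cnv: n for cnv, n in carriers.items() if populations_with[cnv] == {pop}
        }
    return result


# Toy calls: "cnv_a" is specific to P1 (two carriers), "cnv_c" to P2 (one carrier).
specific = population_specific_cnvs({
    "P1": {"ind1": {"cnv_a", "cnv_b"}, "ind2": {"cnv_a"}},
    "P2": {"ind3": {"cnv_b", "cnv_c"}},
})
```

The share of population-specific CNVs with two or more carriers, as reported above, follows directly from the carrier counts in each inner dictionary.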
We also investigated CNVs that are common between pairs of populations (). A comparison of the African populations (Dogon and Bambaran) revealed that 23% of CNVs were present in both populations, while both groups had little in common with any other population. There is also a relatively high proportion of CNVs in common between the Slovenian and Iraqi populations. Likewise, the Pakistanis, Kyrgyzstanis, Nepalese and Buryats all have a high percentage of CNVs in common (14–19%, ). This pattern is consistent with the population affinities shown by PCA and ADMIXTURE analysis of the SNP data ( and ). Finally, the Totonac and Bolivian populations have the highest proportion of CNVs in common, with 27.1% of CNVs identified in both populations. This high proportion of CNV sharing and the relatively low number of CNVs identified in these populations may be due to the low genetic diversity in their common founding population.
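A pairwise sharing proportion of the kind reported above can be illustrated with simple set operations. The sketch assumes that each population is represented by the set of CNV locus IDs detected in any of its members, and that the proportion is taken relative to the union of the two sets. The denominator used in this study is not stated in this section, so that choice is an assumption, as is the function name.

```python
def shared_cnv_proportion(cnvs_a, cnvs_b):
    """Fraction of CNVs found in both populations, relative to their union."""
    a, b = set(cnvs_a), set(cnvs_b)
    union = a | b
    if not union:
        return 0.0  # neither population has any CNV calls
    return len(a & b) / len(union)


# Two of four distinct CNVs are seen in both toy populations.
proportion = shared_cnv_proportion({"c1", "c2", "c3"}, {"c2", "c3", "c4"})
```

With a union denominator, a small total number of CNVs in both populations, as in the Totonac and Bolivian samples, can raise the proportion even when the absolute number of shared CNVs is modest.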