High-throughput genotyping data are useful for making inferences about human evolutionary history. However, the populations sampled to date are unevenly distributed, and some areas (e.g., South and Central Asia) have rarely been sampled in large-scale studies. To assess human genetic variation more evenly, we sampled 296 individuals from 13 worldwide populations that are not covered by previous studies. By combining these samples with a data set from our laboratory and the HapMap II samples, we assembled a final dataset of ~250,000 SNPs in 850 individuals from 40 populations. With more uniform sampling, the estimate of global genetic differentiation (FST) substantially decreases from ~16% with the HapMap II samples to ~11%. A panel of copy number variations typed in the same populations shows patterns of diversity similar to the SNP data, with highest diversity in African populations. This unique sample collection also permits new inferences about human evolutionary history. The comparison of haplotype variation among populations supports a single out-of-Africa migration event and suggests that the founding population of Eurasia may have been relatively large but isolated from Africans for a period of time. We also found a substantial affinity between populations from central Asia (Kyrgyzstani and Mongolian Buryat) and America, suggesting a central Asian contribution to New World founder populations.
Every major demographic event in a population’s history (e.g., population bottlenecks, expansions, and migrations) leaves an imprint on the population’s collective assemblage of DNA sequences. Consequently, studies of DNA variation have illuminated many aspects of human population history. Because the genetic variation responsible for disease is a subset of genetic variation in general, these studies are also providing a foundation for important biomedical studies [1; 2]. Large-scale genotyping efforts using high-density SNP microarrays have generated an unprecedented amount of human population genetic data. In addition to their application in whole-genome association studies, these data have been used to address issues such as the evolutionary history of human populations [3; 4; 5; 6; 7; 8; 9; 10], estimation of individual ancestry [3; 11; 12; 13; 14; 15], and patterns of natural selection in populations [16; 17; 18; 19; 20].
In contrast to the rapid pace of technological development, progress in collecting human DNA samples has been slow and uneven. All existing human genetic diversity datasets, including the HapMap collection, the Coriell collection, and the Human Genome Diversity Project (HGDP-CEPH), are only partial representations of worldwide human diversity. For example, the HGDP database, one of the most widely used resources, lacks coverage in India. Other major regions, such as Eastern Europe and central/north Asia, are also under-represented in databases of human genetic variation.
To help achieve a more uniform sampling of world-wide human genetic diversity, we genotyped a sample of 296 individuals from 13 populations using Affymetrix 6.0 microarrays (~900,000 SNPs and 946,000 copy number variation (CNV) probes). We included populations from West Africa (Dogon and Bambaran), Central Europe (Slovenian), West Asia (Iraqi), Central Asia (Kyrgyzstani and Buryat), South/Southeast Asia (Pakistani, Nepalese, and Thai), Polynesia (Tongan and Samoan), and America (Bolivian and Totonac). By adding these populations from previously under-represented regions to existing datasets, we sought to achieve two goals: first, a more comprehensive understanding of the distribution of human genetic diversity; second, a more detailed inference of human demographic history, such as the mode and tempo of the out-of-Africa diaspora, the peopling of South Asia, and the peopling of America.
DNA samples from 13 worldwide populations were collected by the Sorenson Molecular Genealogy Foundation (SMGF) and genotyped (Figure 1, Table 1). Informed consent was obtained from all study subjects at the sampling location, and the Western Institutional Review Board approved all procedures. The sampling locations of these populations are: Bambaran: southwest Mali; Dogon: central Mali in the state of Mopti; Slovenian: several locations in Slovenia; Iraqi: Kurds born in Akra, northern Iraq (collected in Baghdad); Pakistani: Arain agriculturalists from the Punjab region; Nepalese: collected from Kathmandu, Nepal (samples consist of 16 Brahman, 2 Magar, 2 Chhetri, 2 Newar, 1 Madhesi, and 2 Nepalese of unknown ethnicity); Kyrgyzstani: collected from Bishkek, the capital of Kyrgyzstan, having origins in several states in northeast Kyrgyzstan; Thai: 19 samples from the Moken ethnic group and 10 from Phuket, Thailand; Buryat: Buryat ethnic group from northeastern Mongolia; Samoan: ethnic Samoans sampled in Samoa; Tongan: ethnic Tongans sampled in Tonga; Totonac: agriculturalists living near Vera Cruz, Mexico; Bolivian: high-altitude Native American Aymara speakers living near La Paz. Most of these DNA samples were collected from saliva, with the exception of 22 Tongans and Samoans from whom blood samples were obtained.
High-throughput microarray genotyping of approximately 906,000 SNPs was performed using the Affymetrix Genome-Wide Human SNP Array 6.0 (Affymetrix, Santa Clara, CA, USA). Previous comparisons using this array indicate that DNA derived from saliva samples yields SNP genotypes of quality comparable to DNA derived from blood samples. The recommended protocol described by Affymetrix was followed to construct DNA libraries. Samples were then injected into microarray cartridges and hybridized in a GeneChip® Hybridization Oven 640, followed by washing and staining in a GeneChip® Fluidics Station 450. Mapping array images were obtained using the GeneChip® Scanner 3000 7G (Affymetrix).
Genotypes of 302 microarrays that passed the initial QC were called with the Birdseed algorithm (version 2) in the Affymetrix Power Tools package (http://www.affymetrix.com/support/developer/powertools/index.affx) with default parameters. Because our samples contain no females, CEL files from 15 unrelated CEU female samples were included in the calling process following the manufacturer’s recommendation. After genotype calling, we calculated pairwise allele-sharing genetic distances between each pair of individuals. Five comparisons showed unusually small genetic distances, indicating close relatedness between these pairs of individuals. Therefore, one individual was excluded from each pair in order to retain a set of unrelated individuals. One additional individual was removed because of ambiguous population information. The remaining 296 samples from 13 populations compose our dataset for analyses.
Several criteria were applied to select SNPs for the analyses. First, we excluded all SNPs on the X, Y, and mitochondrial chromosomes, as well as SNPs whose chromosomal locations were unknown (38,456 SNPs). Then, SNPs with more than 10% missing data were removed (5,742 SNPs). We next divided all individuals into four major groups (Africa, Asia, Europe, and India) and tested each SNP for deviations from Hardy-Weinberg Equilibrium (HWE) for populations within each group using the hweStrata algorithm. The continent-level HWE p-values were combined using Stouffer’s z-average method, and 213 SNPs that deviated from HWE at p < 5.5×10−8 (Bonferroni correction: 0.05/900,000) were excluded from subsequent analyses. To combine our dataset with HapMap II samples, Affymetrix SNP Array 6.0 genotypes of the 210 unrelated HapMap samples were obtained from the HapMap project website (http://hapmap.ncbi.nlm.nih.gov), and the same SNP selection criteria were applied to HapMap samples. The filtered HapMap dataset was combined with the dataset generated in this study and a dataset from an earlier study using the Affymetrix NspI 250K microarrays, resulting in a final dataset containing 246,554 autosomal loci genotyped in 850 individuals from 40 populations.
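As an illustration of the p-value combination step, the following is a minimal sketch of the unweighted Stouffer z method together with the Bonferroni threshold quoted above. It assumes independent, one-sided p-values and equal weights; whether the original analysis used weights or a different sidedness is not stated, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def stouffer_combined_p(p_values):
    """Combine independent one-sided p-values with unweighted Stouffer's Z.

    Assumption (not from the paper): equal weights for every group.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    z_scores = []
    for p in p_values:
        # Clamp so the inverse CDF is never evaluated at exactly 0 or 1.
        p = min(max(p, 1e-300), 1.0 - 1e-16)
        # A small p-value maps to a large positive z-score.
        z_scores.append(-_STD_NORMAL.inv_cdf(p))
    z_combined = sum(z_scores) / sqrt(len(z_scores))
    # Convert the combined z-score back to a one-sided p-value.
    return _STD_NORMAL.cdf(-z_combined)


if __name__ == "__main__":
    threshold = bonferroni_threshold(0.05, 900_000)
    # Hypothetical per-group HWE p-values for a single SNP.
    combined = stouffer_combined_p([0.04, 0.20, 0.01, 0.50])
    print(threshold, combined, combined < threshold)
```

A SNP would be dropped only if its combined p-value fell below the threshold, so a single marginal group-level deviation is not enough on its own.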
The microarray data for the 296 DNA samples were analyzed for CNVs using two complementary algorithms: a genomic segmentation algorithm (Partek, MO) and Birdsuite. The use of two complementary CNV detection algorithms increases the robustness of CNV detection. To minimize batch variability, an internal baseline was generated from all 296 samples and used in the segmentation CNV detection. A minimum of ten consecutive probes was required to detect a copy number change. CNVs were removed if the probe density was < 1 probe/5,000 bp, in order to remove potentially spurious CNV calls that cover centromeric regions. The Canary and Birdseye algorithms were used in Birdsuite version 1.5.3. We restricted our analysis to autosomal CNV calls that had a LOD score greater than or equal to 10 and that were greater than 1 kb in length. To obtain a conservative set of CNV regions, we removed any CNVs not found by both algorithms, leaving a stringent set of copy number regions for each individual. Genotypes of all samples in the final dataset (including both SNP and CNV genotypes) are available as a supplemental file on our website (http://jorde-lab.genetics.utah.edu/) under Published Data. The pre-filtering raw dataset is available upon request.
To standardize the population sample sizes, we combined several closely related populations into population groups and excluded remaining populations that had fewer than 20 individuals (see Table 1 for details). The combined population groups are: Nilotic (Alur and Hema), Bantu (Nguni, Pedi, and Sotho/Tswana), Daghestani (Stalskoe and Urkarah), Mala/Madiga (AP Madiga and AP Mala), and Tongan/Samoan (Tongan and Samoan). Then, we randomly chose 20 individuals from each population group to equalize the sample sizes. The genome was divided into consecutive 100 kb windows, and the number of SNP loci in the dataset was determined for each window. Windows with fewer than 10 loci in the final dataset were excluded. For windows containing more than ten SNPs, we calculated the haplotype heterozygosity in each population using the MATLAB Population Genetics & Evolution Toolbox.
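The windowing and heterozygosity calculation can be sketched as follows. This is not the MATLAB toolbox code used in the study; it assumes phased haplotypes are already available as strings (the phasing procedure is not described here), uses the common unbiased estimator n/(n−1) × (1 − Σp²), and all names are hypothetical. How windows with exactly ten SNPs were treated is not clear from the text, so the cutoff below is an assumption.

```python
from collections import Counter

WINDOW_BP = 100_000  # window size stated in the text
MIN_SNPS = 10        # assumed cutoff: windows with fewer SNPs are skipped


def snps_per_window(positions, window_bp=WINDOW_BP):
    """Count SNPs in consecutive, non-overlapping windows on one chromosome."""
    return Counter(pos // window_bp for pos in positions)


def haplotype_heterozygosity(haplotypes):
    """Unbiased haplotype heterozygosity for one window in one population.

    `haplotypes` is a list of phased haplotype strings, one per chromosome
    copy (two per diploid individual).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq_freq)


if __name__ == "__main__":
    # Toy SNP positions: 10 SNPs in window 0, only 2 SNPs in window 1.
    positions = [5_000 * i for i in range(1, 11)] + [120_000, 180_000]
    usable = [w for w, k in snps_per_window(positions).items() if k >= MIN_SNPS]
    print(usable)
    # Toy haplotypes from 2 diploid individuals (4 chromosome copies).
    print(haplotype_heterozygosity(["AA", "AA", "AB", "BB"]))
```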
Distances between populations were calculated from allele frequency data as Nei’s genetic distance implemented in the PHYLIP software package. The dataset contains 232,114 SNPs with known ancestral state for 40 world populations. Dendrograms were constructed using the neighbor-joining method. All ancestral allele states were obtained from the orthologous base in chimpanzee, or orangutan plus macaque if chimpanzee was unknown, as obtained from the UCSC database (hg19, snp130). Each dendrogram was rooted by this chimpanzee-orangutan-macaque outgroup. One thousand bootstrap runs were performed for each dataset to generate the consensus tree and obtain the confidence value for each branch.
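For reference, Nei's standard genetic distance between two populations can be sketched for biallelic SNPs as below. This is an illustrative re-implementation under the assumption that each locus is summarized by the frequency of one allele; it is not the PHYLIP code, and the function name is hypothetical.

```python
from math import log, sqrt


def nei_standard_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance D = -ln(J_xy / sqrt(J_x * J_y)).

    `freqs_x` and `freqs_y` hold the frequency of one allele at each
    biallelic locus in populations X and Y (same loci, same order).
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("frequency lists must be non-empty and equal in length")
    n_loci = len(freqs_x)
    # Average within-population identity (homozygosity) for each population.
    j_x = sum(p * p + (1 - p) * (1 - p) for p in freqs_x) / n_loci
    j_y = sum(q * q + (1 - q) * (1 - q) for q in freqs_y) / n_loci
    # Average between-population identity.
    j_xy = sum(p * q + (1 - p) * (1 - q)
               for p, q in zip(freqs_x, freqs_y)) / n_loci
    return -log(j_xy / sqrt(j_x * j_y))


if __name__ == "__main__":
    # Toy frequencies, not data from the study.
    print(nei_standard_distance([0.2, 0.7, 0.5], [0.3, 0.6, 0.1]))
```

A matrix of such pairwise distances is the kind of input a neighbor-joining routine consumes; bootstrap support is typically obtained by resampling loci with replacement and recomputing the matrix.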
FST estimates between populations were calculated by the method described by Weir and Cockerham. To obtain the confidence interval of FST values in each continental group, 60, 60, and 90 individuals were randomly sampled 1000 times (with replacement) from Africa, Europe, and Asia (to match the sample sizes of the HapMap II populations), respectively. Pairwise allele-sharing genetic distance calculation and PCA were performed using MATLAB (ver. r2008a).
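The pairwise allele-sharing distance can be sketched as follows. The exact definition used in the original MATLAB analysis is not given, so this assumes one common form: one minus the mean proportion of alleles shared identical-by-state across loci, with genotypes coded as 0, 1, or 2 copies of a reference allele and missing calls coded as `None`.

```python
def allele_sharing_distance(geno_a, geno_b):
    """Allele-sharing distance between two individuals.

    Genotypes are coded 0/1/2 (copies of one allele); None marks a missing
    call. At each locus the pair shares 2 - |a - b| alleles, so the distance
    contribution is |a - b| / 2. Loci missing in either individual are skipped.
    """
    compared = 0
    total_diff = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue
        compared += 1
        total_diff += abs(a - b)
    if compared == 0:
        raise ValueError("no loci with calls in both individuals")
    return total_diff / (2 * compared)


if __name__ == "__main__":
    # Toy genotypes for two individuals at five loci.
    print(allele_sharing_distance([0, 1, 2, 2, None], [0, 1, 1, 0, 2]))
```

Unusually small values between two individuals, relative to the rest of their population, are what flag a closely related pair of the kind excluded during quality control.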
A model-based algorithm implemented in ADMIXTURE was used to determine the genetic ancestries of each individual in a given number of populations without using information about population designation. To eliminate the effect of SNPs that are in LD, we first filtered out SNPs that had r² > 0.2 within 100 kb using PLINK, as recommended by the authors of ADMIXTURE. The pruned data set contains 86,273 SNPs.
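To make the pruning criterion concrete, here is a simplified greedy sketch that drops a SNP when its genotype correlation with an already retained SNP within 100 kb exceeds r² = 0.2. It is not PLINK's sliding-window procedure and will not reproduce its output exactly; the data layout and function names are assumptions made for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if s_xx == 0 or s_yy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return s_xy * s_xy / (s_xx * s_yy)


def ld_prune(snps, max_r2=0.2, window_bp=100_000):
    """Greedy LD pruning on one chromosome.

    `snps` is a list of (position, genotypes) tuples sorted by position.
    Returns the positions of the SNPs that are kept.
    """
    kept = []
    for pos, geno in snps:
        in_ld = any(
            pos - kept_pos <= window_bp and r_squared(geno, kept_geno) > max_r2
            for kept_pos, kept_geno in kept
        )
        if not in_ld:
            kept.append((pos, geno))
    return [pos for pos, _ in kept]


if __name__ == "__main__":
    toy = [
        (1_000, [0, 1, 2, 1, 0, 2]),
        (2_000, [0, 1, 2, 1, 0, 2]),    # perfectly correlated with the first
        (3_000, [1, 0, 1, 2, 1, 1]),    # uncorrelated with the first
        (500_000, [0, 1, 2, 1, 0, 2]),  # correlated but outside the window
    ]
    print(ld_prune(toy))
```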
CNV data were analyzed using internally developed software (available upon request) and SPSS 15.0 (SPSS, IL). We required a minimum of 75% reciprocal overlap between pairs of CNVs to consider that two individuals shared the same CNV region. A pairwise comparison of shared CNV regions allowed us to identify CNVs that were private to individuals, private to specific populations, or shared across multiple populations. To adjust for outlier effects, individuals above the 95th percentile for CNV number were removed from the analysis. A principal components analysis performed on all individuals indicated that DNA samples from different sources (i.e., blood versus saliva) may yield different CNV calling results (Supp. Figure S1). Because only the DNA from the 22 Tongan and Samoan samples was derived from blood, we excluded these subjects from the CNV analysis.
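The 75% reciprocal-overlap rule can be sketched as below. The authors' internal software is not available here, so this assumes CNVs on the same chromosome represented as half-open (start, end) intervals in base pairs; the function names are hypothetical.

```python
def reciprocal_overlap(cnv_a, cnv_b):
    """Smaller of the two overlap fractions between two CNV intervals.

    Each CNV is a (start, end) tuple on the same chromosome, with end > start.
    """
    (a_start, a_end), (b_start, b_end) = cnv_a, cnv_b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_cnv_region(cnv_a, cnv_b, threshold=0.75):
    """True if the overlap covers at least `threshold` of BOTH CNVs."""
    return reciprocal_overlap(cnv_a, cnv_b) >= threshold


if __name__ == "__main__":
    # 90% of each CNV is covered by the other: treated as the same region.
    print(same_cnv_region((100_000, 200_000), (110_000, 210_000)))
    # The small CNV lies inside the large one, but covers only 25% of it.
    print(same_cnv_region((100_000, 200_000), (100_000, 500_000)))
```

Requiring the overlap to be reciprocal prevents a small CNV nested inside a much larger one from being counted as the same region.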
We sampled 296 individuals from 13 world-wide populations, including populations from West Africa, Central Europe, West Asia, Central Asia, South Asia, Southeast Asia, Polynesia, and America (Figure 1, Table 1 populations in bold). All samples were genotyped using the Affymetrix 6.0 array and we will refer to this individual set as the “Affy6.0” set in the following analysis. We then combined these samples with 344 individuals from 23 populations in our previous study (Figure 1, populations in black), in which the Affymetrix 250K NspI array was used (“Affy250K” set), and 210 individuals from four HapMap populations (YRI, CEU, CHB and JPT, “HapMap” set). The final dataset contains 246,554 autosomal loci genotyped in 850 individuals from 40 populations (see methods for details of SNP selection and merging criteria). To determine the effect of using only this subset of the SNPs, we compared pairwise FST between each pair of populations in the HapMap and the Affy6.0 sample set using the 246,554 SNP set and the whole SNP set (~866,000 SNPs). The FST values between all population pairs are virtually identical for the two SNP sets (overall correlation coefficient r = 0.99998, p < 10−50), suggesting that the 250K SNP set is sufficient for examining inter-population relationships.
To assess the effect of more even sampling on the degree of population differentiation, we compared the FST values between three major continental groups (Africa, Europe and Asia) from three individual sets: HapMap, HapMap+Affy6.0, and HapMap+Affy6.0+Affy250K. To match the sample sizes of the HapMap set, we randomly sampled 60, 60, and 90 individuals from Africa, Europe and Asia, respectively, in each individual set. Our results show that the overall FST value decreases substantially with the inclusion of geographically intermediate populations, dropping from 15.9% for HapMap, to 11.2% for HapMap+Affy6.0+Affy250K with non-overlapping confidence intervals (Table 2). Adding the American and Polynesian individuals into the HapMap+Affy6.0+Affy250K set increased FST slightly (to 11.3%) because of substantial founder effects and genetic drift in these populations. Nevertheless, the FST value in all individuals is still significantly lower than the FST value of the HapMap individual set. These statistically significant FST differences illustrate the important effects of population sampling. A decrease in population differentiation with more even sampling is also demonstrated by an increase in the proportion of SNPs whose minor alleles are shared in all three continental groups. This value increases from 74.9% for HapMap to 88.2% for HapMap+Affy6.0+Affy250K (Table 2).
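The sampling effect described above can be illustrated with a toy calculation. The sketch below uses a simple heterozygosity-based estimator, FST = (HT − HS)/HT, rather than the Weir and Cockerham estimator used in the study, and the allele frequencies are invented for illustration only.

```python
def simple_fst(pop_freqs):
    """Heterozygosity-based FST across populations for biallelic loci.

    `pop_freqs` is a list with one entry per population; each entry is a
    list of allele frequencies at the same loci. Populations are weighted
    equally, and FST is computed as a ratio of sums over loci.
    """
    n_pops = len(pop_freqs)
    n_loci = len(pop_freqs[0])
    h_s_sum = 0.0  # mean within-population expected heterozygosity
    h_t_sum = 0.0  # expected heterozygosity of the pooled population
    for locus in range(n_loci):
        freqs = [pop[locus] for pop in pop_freqs]
        p_bar = sum(freqs) / n_pops
        h_s_sum += sum(2 * p * (1 - p) for p in freqs) / n_pops
        h_t_sum += 2 * p_bar * (1 - p_bar)
    return (h_t_sum - h_s_sum) / h_t_sum


if __name__ == "__main__":
    # Two geographically extreme populations (toy frequencies).
    extremes_only = simple_fst([[0.1], [0.9]])
    # The same two populations plus a geographically intermediate one.
    with_intermediate = simple_fst([[0.1], [0.5], [0.9]])
    print(extremes_only, with_intermediate)
```

In this toy case the estimate falls from 0.64 to about 0.43 once the intermediate population is included, the same qualitative behavior as the drop reported in Table 2.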
For individuals that were genotyped for more than 866,000 autosomal SNPs using the Affymetrix 6.0 array (HapMap and Affy6.0), we also determined the FST values and proportion of polymorphic SNPs using all genotyped autosomal SNPs. In both individual sets the FST values using all SNPs are comparable to FST values using the ~250,000 SNPs (Table 2). The difference between the two individual sets remains significant. The percentage of shared polymorphic SNPs decreases slightly in both datasets (Table 2), reflecting a relatively higher proportion of low-frequency SNPs in the Affymetrix 6.0 array.
To compare haplotype diversity across populations, we normalized the sample size across population groups by randomly choosing 20 individuals and excluding populations with fewer than 20 samples (see methods for details). The average haplotype heterozygosity is significantly higher in African populations than non-African populations (Table 1, Wilcoxon rank test p = 1.2×10−4), and haplotype diversity decreases as geographic distance to east Africa increases (Figure 2A, r = −0.76, p = 4.3×10−6). Despite the overall significant correlation, there appears to be little correlation within Africa between haplotype diversity and distance to east Africa (r = −0.13, p = 0.78). Indeed, when African populations were excluded from the analysis, a stronger correlation is obtained (r = −0.94, p = 4.3×10−10, Figure 2A upper panel).
We also compared the SNP and haplotype heterozygosity values in each population (Figure 2B). These two quantities are generally highly correlated, although there are several exceptions: First, SNP heterozygosity is higher than haplotype heterozygosity in European and Central Asian populations. This may reflect a SNP ascertainment bias, since many of these polymorphisms were historically selected to maximize heterozygosity in European populations. Second, the Pygmy sample shows a low SNP heterozygosity despite relatively high haplotype heterozygosity. This unusual pattern could be caused by stronger effects of SNP ascertainment bias in this population than in others. Indeed, a recent study of Khoisan individuals (another hunter-gatherer group from Africa) showed a similar pattern: despite high SNP heterozygosity (~60%) in whole-genome sequence data, a Khoisan individual showed low heterozygosity on the SNP microarray genotypes (~22%). Alternatively, this difference could also reflect unique attributes of population history.
To examine inter-population relationships, we first constructed a neighbor-joining tree based on genetic distances (Figure 3A). Populations from major geographic regions are clustered, and most branches have very high (>99%) bootstrap support (Supp. Figure S2). New World populations (Totonac and Bolivian) are placed between Nepalese and Kyrgyzstanis, indicating higher affinity of these American samples to central Asians than to eastern Asians. A second neighbor-joining tree was constructed by adding 40 HGDP populations (46,260 common SNPs), producing similar patterns of population clustering (Supp. Figure S3).
We then performed a Principal Component Analysis (PCA) based on the pairwise allele-sharing distances among all pairs of individuals (Figure 3B). The majority of the genetic variation is found between African and non-African populations, as the first principal component (PC1) accounts for 78.7% of total variance. PC2 reflects genetic variation in Eurasia, and populations from Central and West Asia occupy the space between East Asia and Europe to form a relatively continuous distribution. The two Polynesian populations (Tongan and Samoan) show a close relationship to Southeast Asian populations (Figure 3B). PC3 distinguishes New World populations (Bolivian and Totonac) from other populations (Supp. Figure S4A).
At the sub-continental level, we focus first on Eurasia, where most of our samples have been selected (Figure 4A). Overall, PC1 and PC2 mainly reflect the geographic distribution of the populations, with the majority of genetic variation accounted for by their locations. PC1 (accounting for 62.7% of the variance) reflects an east-west gradient, while PC2 (3.3% of the variance) reflects a north-south gradient. Slovenians and Iraqi Kurds show close relationships to European populations. A closer examination (Supp. Figure S4B) shows that Kurds and eastern European Daghestani populations (Urkarah and Stalskoe) are clearly separated from western European populations. On the other hand, Slovenians show very little differentiation from western European populations (Supp. Figure S4B).
Some of our populations form less defined clusters than do the HapMap populations. The Nepalese samples, in particular, are highly diverse, with some individuals showing a closer relationship to East Asian populations, while others are closer to South Asian populations. An examination of the ethnicity of the Nepalese individuals reveals that individuals from the ethnic groups derived from the caste system, including Madhesi, Brahman, and Chhetri, show a closer relationship to South Asian populations (especially Indian Brahmins). Individuals from the two indigenous Nepal ethnic groups (Newar and Magar) are closer to Central/East Asian populations (Figure 4B). Kyrgyzstanis were also widely dispersed along the first PC, although to a lesser extent than the Nepalese samples. This dispersion is expected because Kyrgyzstan is on the trade route between Europe and Asia, where there has long been a high level of migration.
Distinctive patterns can also be observed at the sub-continental level in non-Eurasian populations. Within Africa, the first two PCs separate Mbuti Pygmy and !Kung from other African populations (Supp. Figure S4C). The remaining African populations appear to follow a north-south gradient, and the Dogon and Bambara from Mali show high similarity to the HapMap YRI from Nigeria (Supp. Figure S4C). Within America, the two populations showed contrasting patterns: Totonacs from Mexico form a tight cluster, while about half of the Bolivian samples are separated from the Bolivian cluster, which appears to reflect European admixture (Supp. Figure S4D).
We used the program ADMIXTURE to assess the ancestry of each individual from 3–12 inferred populations (K) (Supplemental Table S1). The results from K=4 and K=12 are illustrated in Figure 3C. When K=4, four groups corresponding to Africa, America, Europe, and Asia are identified. Unlike individuals from Africa and America, who form two relatively distinct groups, individuals from Eurasia show a mixture of Asian and European ancestry components.
When K=12, a number of sub-continental patterns appear. In Africa, Mbuti Pygmy, !Kung, and Dogon are separated into distinct groups. Despite being sampled from neighboring regions in Mali, Bambaran and Dogon individuals show quite different ancestry. Most Dogon individuals appear to be composed of a single western African component, while Bambaran individuals contain more than 30% of a component prevalent in eastern Africa. Polynesian and American populations were separated into two distinct components. In agreement with the PCA result, some Bolivian individuals contain more than 20% European ancestry, suggesting admixture in these samples.
Within Eurasia, the patterns are more complex. To examine the relationships among Eurasian populations in detail, we performed ADMIXTURE analysis on the Eurasian individuals only and calculated the average ancestry components in each population. Major regional groups and geographic clines are best visualized with seven ancestral components (K=7, Figure 5). In Europe, a northern/western European component is predominant in HapMap CEU, the Utah Northern European, and the Slovenian samples. One Caucasus/Middle East component is predominant in Daghestani and Iraqi samples and appears to decrease clinally to the east through Pakistan and Nepal and to the west through southern and northern Europe. In southern India, this component is a major genetic signal in two independently sampled Brahmin groups (>20%) but is nearly absent in lower castes and Irula (a tribal group, < 1.5%). Notably, the central Asian populations of Nepal and Kyrgyzstan have the most genetic admixture. This result is consistent with our PCA results showing a high level of genetic variation within these two populations. Another interesting observation is that Buryats and Kyrgyzstanis share about 5% ancestry with native American populations (averages of 4.4% in Kyrgyzstanis and 5.8% in Buryats), while East Asian individuals have very little of the Native American ancestry component (average <1%).
As a complement to our SNP analysis, we also used the same array platform to determine each individual’s CNV profile. To investigate the overall inter-population differences due to CNVs, we determined the number of CNVs per person and the average CNV frequencies in each population (Figure 6A). The African populations (Bambaran and Dogon) have the highest number of CNVs among all populations (median of 44 and 42 CNVs per genome, respectively). Outside of Africa, the median number of CNVs ranges from 38 in Kyrgyzstanis to 30 in Totonacs (Figure 6A). These data are comparable with previous work, which found a higher number of CNVs in African populations [33; 34; 35; 36], suggesting a loss of low-frequency CNV alleles due to population bottlenecks during the out-of-Africa migration and the peopling of the Americas.
Next, we identified CNVs that are specific to each population and then counted the number of individuals within each population sharing the same population-specific CNV (Figure 6B). Within the Dogon, Bambaran, Pakistani, and Totonac populations, we found a high proportion of population-specific CNVs that were common to multiple members, with more than 12% observed in two or more individuals. The remaining populations had few population-specific CNVs in common among their members. More than 90% of detected population-specific CNVs in these populations are only present in one individual. Because most population-specific CNVs are relatively rare within each population, and there are only a small number of total CNV loci, samples from different populations do not form distinct clusters in a PCA (Supp. Figure S5).
We also investigated CNVs that are common between pairs of populations (Figure 6C). A comparison of the African populations (Dogon and Bambaran) revealed that 23% of CNVs were present in both populations, while both groups had little in common with any other population. There is also a relatively high proportion of CNVs in common between the Slovenian and Iraqi populations. Likewise, the Pakistanis, Kyrgyzstanis, Nepalese and Buryats all have a high percentage of CNVs in common (14–19%, Figure 6C). This pattern is consistent with the population affinities shown by PCA and ADMIXTURE analysis of the SNP data (Figures 4A and 5). Finally, the Totonac and Bolivian populations have the highest proportion of CNVs in common, with 27.1% of CNVs identified in both populations. This high proportion of CNV sharing and the relatively low number of CNVs identified in these populations may be due to the low genetic diversity in their common founding population.
Patterns of human genetic variation are influenced by mating patterns, and the latter are in turn influenced by geographic and cultural factors (e.g., mountain ranges, language, religious practices). Consequently, it is not surprising that human genetic variation, while correlated with geographic location, is not perfectly clinal [37; 38; 39]. However, between-population differences can be seriously exaggerated if human populations are sparsely sampled.
Consistent with previous studies [37; 39; 40], our analyses demonstrate that differentiation among human populations decreases substantially and genetic diversity is distributed in a more clinal pattern when more geographically intermediate populations are sampled. The reduction of FST values with further geographic sampling illustrates the limitations of a global FST estimate to capture the pattern of human genetic diversity. With a more comprehensive population sample, our data have also led to several new observations about human demographic history and genetic relationships among human populations.
As observed in previous studies [4; 5; 7], we find that SNP and haplotype variation is highest in African populations, and that heterozygosity in non-African populations declines with geographic distance from Africa. This decline in heterozygosity has been interpreted as evidence for a worldwide serial founder effect originating in East Africa [4; 41]. While serial founder effects may explain much of the pattern of worldwide variation, we note two interesting deviations from the prediction of a linear decline in heterozygosity. First, as demonstrated in Figure 2A, there appears to be little relationship between heterozygosity within Africa and distance from the hypothesized point of East African origin (r = −0.13, p = 0.78). Second, there is a drastic decrease in diversity for all Eurasian populations immediately outside of Africa. These observations are best explained by a single bottleneck out of Africa rather than by a series of founding emigrations from Africa (Figure 2A).
The out-of-Africa (OoA) hypothesis, proposing a single OoA bottleneck followed by an expansion into Eurasia approximately 50,000 years ago, has gained extensive support from the archaeological record [42; 43] and genetic studies [4; 5; 7]. Nevertheless, many of the historical details of this diaspora remain unclear. A common interpretation is that the OoA bottleneck was the result of a migration of a small founding population into Eurasia. Given the difference in haplotype heterozygosity between African and non-African populations and the relationship between heterozygosity and effective population size, we can estimate the effective population size of such a founding population. Within Africa, the average 100-kb haplotype heterozygosity in our data is 0.91. Immediately outside of Africa in Europe, the Middle East, and Central Asia, the average haplotype heterozygosity is 0.82 (Figure 2). A reduction of heterozygosity from 0.91 to 0.82 in a one-generation bottleneck would require an effective population size of only 5.5 individuals. While a one-generation bottleneck is an oversimplification, these estimates indicate that an OoA bottleneck resulting from the migration of a small founding population would require an extremely small population size. However, given that the archaeological record indicates a rapid expansion of modern humans into Europe and Asia in just a few thousand years [42; 43], it seems unlikely that Eurasia could be populated so quickly by such a small founding population.
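The order of magnitude of this estimate can be checked with the textbook relation for loss of heterozygosity in one generation of drift, H_after = H_before × (1 − 1/(2N)). The exact formula and unrounded inputs behind the 5.5 figure are not given here, so the sketch below is an assumption-laden approximation rather than a reproduction of that calculation.

```python
def bottleneck_effective_size(h_before, h_after):
    """Effective size N implied by a one-generation heterozygosity drop.

    Solves H_after = H_before * (1 - 1 / (2 * N)) for N.
    """
    if not 0 < h_after < h_before:
        raise ValueError("expected 0 < h_after < h_before")
    return 1.0 / (2.0 * (1.0 - h_after / h_before))


if __name__ == "__main__":
    # Rounded heterozygosities quoted in the text.
    print(bottleneck_effective_size(0.91, 0.82))
```

With the rounded values 0.91 and 0.82 this gives roughly five individuals, the same order as the figure quoted above; the small difference presumably reflects unrounded heterozygosities or a slightly different formulation.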
A more likely explanation for the OoA bottleneck is that Eurasia was populated by a larger population that had been relatively isolated from other modern human populations for tens of thousands of years prior to the expansion. The first fossil evidence for modern humans outside of Africa is in the Middle East at Skhul and Qafzeh between 80,000 and 100,000 years ago, which is at least 20,000 years prior to the Eurasian diaspora. If a population of modern humans remained in the Middle East until the expansion into Eurasia, there would have been sufficient time for genetic drift to reduce heterozygosity dramatically before the Eurasian expansion. This “Middle East isolation” hypothesis provides a robust explanation for the homogeneity of European and Asian populations relative to African populations (see Figures 3A–B) and is supported by a recent maximum likelihood estimate of 140,000 years ago for the time of Eurasian-West African population separation. Interestingly, a recent study of the Neandertal genome suggests that non-African individuals, but not Africans, contain similar amounts of admixture (1–4%) with Neandertals. The authors suggest that the admixture must have happened between Neandertals and an ancestral non-African population before the Eurasian expansion. Given the fossil, archaeological, and genetic evidence, the Middle East isolation hypothesis warrants rigorous evaluation as whole-genome sequence data become available.
In the ADMIXTURE analysis of Eurasia, we observed a clinal distribution of a Caucasus/Middle East genetic component (red component, Figure 5) in several South Asian populations. Evidence from mitochondrial DNA, Y-chromosome, and autosomal loci suggests that the genetic composition of India has been influenced by west Eurasians [7; 8; 48; 49]. We find that this ancestry component is most prevalent in West Asians (Iraqi Kurd) and Caucasus populations (Daghestani). The component extends eastward into Central Asia (Pakistan, Nepal, and Kyrgyzstan) and into South India, where it is more prevalent in higher castes than in lower castes. This ancestry component also extends into Europe and is more prevalent in southern Europeans than in northern Europeans. Our results suggest that the northern Indian genetic component proposed by Reich et al. could represent the dispersion of a genetic ancestry component originating near the Caucasus/Middle East region.
Containing more than 100 ethnic groups, Nepal is a geographically small but diverse country. Earlier genetic studies of Nepalese populations have suggested a northern Asian origin with subsequent gene flow from South Asia (e.g., Hindu caste-derived groups) [51; 52; 53]. Our results are in general agreement with this view and suggest that the most prevalent ancestry component in the Nepalese is the primary ancestry component found in Indians and Pakistanis. The Nepalese, however, are highly heterogeneous and also have substantial ancestry components from Central Asia, East Asia, and Southeast Asia. Moreover, individual Nepalese from different ethnic groups differ substantially in genetic composition. Hindu upper-caste Nepalese Brahman and Chhetri individuals cluster in PCA and show affinity to Indian Brahmin samples (Figure 4B). In contrast, samples from the linguistically distinct Magar and Newar groups show affinity to populations from Central and East Asia. These results suggest that substantial population structure may exist between the major population groups of Nepal. Although our limited sample size prevents a detailed analysis of the genetic diversity among Nepalese ethnic groups, our observations suggest high levels of genetic diversity in South and Central Asian populations and underscore the need for additional genetic studies of this region.
The Americas, first peopled during the late Pleistocene, were the last continents to be colonized by modern humans. Despite general agreement that modern humans crossed a land bridge in the current Bering Strait region to populate the Americas (reviewed in [54; 55; 56]), the exact timing, routes of colonization, and origin of the ancestral population(s) remain unclear [57; 58; 59; 60; 61].
Earlier studies suggest that an ancestral American population may have lived in western Siberia, rather than eastern Siberia/Northern Asia [62; 63]. Congruent with this view, the two Native American populations (Totonac and Bolivian) in our samples show closer relationships to Central Asian populations (Kyrgyzstanis and Buryats from Mongolia) than to East Asian populations (e.g., Chinese and Japanese). This result is most apparent in the ADMIXTURE plot (Figure 4B; k=12), where Kyrgyzstani and Buryat individuals share about 5% of the American ancestry component. In contrast, East Asian individuals share very little (< 1%) genetic ancestry with the American populations.
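Per-population ancestry shares of this kind can be summarized by averaging individual ancestry proportions within each population. The sketch below shows one way to do so, assuming a matrix with one row per individual and one column per ancestry component (the layout of an ADMIXTURE Q matrix). The function name, population labels, and proportions are hypothetical toy values, not data from this study.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_ancestry_by_population(
    q_rows: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average each ancestry component over the individuals of each population.

    q_rows: one row per individual, one column per ancestry component.
    labels: population label for each individual, in the same order as q_rows.
    """
    if len(q_rows) != len(labels):
        raise ValueError("q_rows and labels must have the same length")
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for row, pop in zip(q_rows, labels):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        elif len(row) != len(sums[pop]):
            raise ValueError("all rows must have the same number of components")
        for k, value in enumerate(row):
            sums[pop][k] += value
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


# Toy example with K=3 components and two hypothetical populations.
toy_q = [
    [0.5, 0.25, 0.25],   # individual 1, population "PopA"
    [0.75, 0.25, 0.0],   # individual 2, population "PopA"
    [0.0, 0.5, 0.5],     # individual 3, population "PopB"
]
toy_labels = ["PopA", "PopA", "PopB"]
toy_means = mean_ancestry_by_population(toy_q, toy_labels)
```

Comparing one column of the resulting means across populations gives the kind of contrast described above, for example the share of a given component in Central Asian versus East Asian samples.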
In previous studies, we have shown highly consistent patterns of population genetic structure when using different types of polymorphisms, such as restriction site polymorphisms, short tandem repeat polymorphisms, and Alu and L1 insertion polymorphisms [37; 64; 65; 66]. Similarly, despite a very different mutational mechanism, CNVs also reveal overall patterns of genetic structure that are highly similar to those of other types of polymorphisms. First, we find that populations from Africa harbor the greatest number of CNVs, and that the average number of CNVs decreases with increasing distance from Africa. Second, we find that the degree of CNV sharing between groups reflects their population relationships. Notably, the Totonac and Bolivian populations share a high number of CNVs. The Pakistani, Kyrgyzstani, Nepalese, and Buryat populations also exhibit a high number of shared CNVs. Previous studies have also shown general agreement in genetic structure patterns revealed by SNP and CNV data.
In this study, by sampling populations from previously under-sampled regions, we sought to assess the effect of more even sampling on estimates of human genetic diversity and to investigate the evolutionary history of these populations. We found support for a relationship between the initial founding populations of America and Central/North Asian populations. We demonstrated high genetic diversity in Central Asian and South Asian populations, especially in Nepal. We also found that Iraqi Kurds have a closer relationship to European populations than to Asian populations. These results increase our understanding of human population relationships and evolutionary history. In addition, our data provide a resource for understanding patterns of linkage disequilibrium, natural selection, and the differential distributions of SNP and CNV alleles among populations, all of which have important implications for genome-wide association studies and the identification of loci with functional, biomedical significance.
We thank Dr. Dashtseveg Tumen, National University of Mongolia; Dr. Sukkid Yasothornsrikul, Naresuan University; and Dr. Alejandro Escobar, State of Veracruz Department of Health, for their help in collecting the samples. We thank Dr. Dennis O’Rourke for insightful discussion on the peopling of America. We also thank Diane Dunn and Edward Meenen for their technical support during the microarray hybridization and scanning process. This work was supported by grants from the National Institutes of Health (GM-59290 to LBJ) and the Sorenson Molecular Genealogy Foundation. Additional support for this study was provided by grants from the Canadian Institutes for Health Research (DM). C.H. is supported by the University of Luxembourg – Institute for Systems Biology Program and the Primary Children’s Medical Center Foundation National Institute of Diabetes and Digestive and Kidney Diseases (DK069513). A.S. is supported by a Canadian Institutes of Health Research Frederick Banting & Charles H. Best Doctoral Studentship Award.