ENCODE region selection and SNP discoveries
We sequenced one 100-kb ENCODE region, ENr123 (hg18: Chr12 38,826,477-38,926,476), in four Andhra Pradesh ethnic groups: three castes (Brahmin, Yadava, and Mala/Madiga) and one tribal group, the Irula (Figure ). We chose ENr123 because it has a low gene density and should represent a selectively neutral region (gene density of 3.1% and non-exonic conservation rate of 1.7%). Among the 92 individuals that passed quality-control steps, a total of 453 SNPs were identified, corresponding to a SNP density of one SNP per 221 bp. To determine the accuracy of the newly identified SNPs, we carried out additional experiments using the Roche 454 sequencing platform to validate the Indian-specific SNPs in individuals with heterozygous genotypes (see Materials and methods for details). The validation results showed that the genotypes of new SNPs have a high confirmation rate (approximately 80% for heterozygous SNPs). For alleles that have been seen only once in the dataset, the confirmation rate is greater than 85% (Supplemental Table S1 in Additional file 1).
Figure 1 SNP discovery in Indian populations. (a) Population samples. The number of individuals sampled from each Indian population is shown. (b) The number of SNPs found in HapMap non-Indian and Indian populations. (c) The number of SNPs found in south Indian, ...
To generate a comparable dataset, we applied the same SNP calling criteria to 722 HapMap individuals who were sequenced using the same protocol in the ENCODE3 project [29]. We then merged these two datasets (four Indian populations and eight HapMap populations (CEU, CHB, CHD, GIH, JPT, LWK, TSI, and YRI)) to obtain a final data set that consists of 1,484 SNPs in 722 individuals from 12 populations (see Materials and methods for SNP merging and filtering details).
Among the 1,484 total SNPs, 234 (15.8%) are specific to Indian populations (four Andhra Pradesh populations and the HapMap northern Indian GIH; Figure ). For Indian individuals, the average number of specific SNPs per individual is 1.5. This number is lower than in HapMap African individuals (2.4 SNPs), but higher than both HapMap European (1.3 SNPs) and HapMap East Asian individuals (1.1 SNPs). This result suggests that higher autosomal genetic diversity is harbored in Indian samples compared to other HapMap Eurasian samples. Among the 453 SNPs in the four newly sequenced south Indian populations, 137 (30%) are not present in any HapMap populations (Figure ), including one novel non-synonymous singleton variant (Supplemental text in Additional file 1).
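The densities and percentages quoted in this section follow directly from the reported counts. The short check below reproduces them; the variable names are ours and are introduced only for illustration.

```python
# Counts quoted in the text; only the arithmetic is illustrated here.
REGION_BP = 100_000          # length of the ENr123 region (100 kb)
SNPS_SOUTH_INDIAN = 453      # SNPs in the four newly sequenced populations
SNPS_TOTAL = 1_484           # SNPs in the merged 12-population dataset
SNPS_INDIAN_SPECIFIC = 234   # SNPs seen only in Indian populations
SNPS_NOT_IN_HAPMAP = 137     # south Indian SNPs absent from all HapMap populations

bp_per_snp = REGION_BP / SNPS_SOUTH_INDIAN
indian_specific_pct = 100 * SNPS_INDIAN_SPECIFIC / SNPS_TOTAL
novel_pct = 100 * SNPS_NOT_IN_HAPMAP / SNPS_SOUTH_INDIAN

print(f"one SNP per {bp_per_snp:.0f} bp")
print(f"{indian_specific_pct:.1f}% of all SNPs are Indian-specific")
print(f"{novel_pct:.0f}% of south Indian SNPs are absent from HapMap")
```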
Genetic diversity in India
Because many genetic diversity measurements are influenced by sample size, we normalized the sample size of each group by randomly selecting a subset of HapMap individuals to match the sample size of the Indians. For convenience, we denote four groups of populations (African, East Asian, European, and Indian) as 'continental groups'. For continental groups, 152 unrelated individuals were randomly selected from each of the HapMap African, European, and East Asian samples (matching the 152 Indian individuals in the dataset). At the population level, 24 individuals were randomly selected from each HapMap population, and all individuals from south Indian populations were included in the analyses. After sample size normalization, we measured genetic diversity using various summary statistics, including the number of segregating sites (S), Watterson's θ estimator, nucleotide diversity (π), and observed SNP heterozygosity (H) for each population and continental group (Table ). We also evaluated the haplotype diversity in each group by averaging the haplotype heterozygosity in ten 10-kb non-overlapping windows and tested the neutrality of the region using Tajima's D test. The Tajima's D result was consistent with neutrality, providing no evidence for either positive or balancing selection in this region (Table ), as expected given the low gene density in this region.
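As an illustration of how S, Watterson's θ, π, and Tajima's D relate to one another, the sketch below computes them from a small matrix of phased 0/1 haplotypes using the standard textbook formulas. It is a minimal example under our own assumptions (biallelic sites, no missing data, a hypothetical function name and toy input), not the pipeline used for the results reported here, and it omits H and the windowed haplotype heterozygosity.

```python
from math import sqrt


def summary_stats(haplotypes, region_bp):
    """Return S, Watterson's theta per bp, pi per bp, and Tajima's D.

    haplotypes: list of equal-length lists of 0/1 alleles, one per chromosome.
    region_bp:  length of the sequenced region, used to scale theta and pi.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])

    # Count of allele "1" at each site; a site segregates if 0 < count < n.
    counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    S = sum(1 for c in counts if 0 < c < n)

    # Watterson's estimator: S divided by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    theta_w = S / a1

    # pi: average number of differences over all pairs of chromosomes.
    pi = sum(c * (n - c) for c in counts) / (n * (n - 1) / 2)

    # Tajima's D: (pi - theta_w) scaled by its estimated standard deviation.
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    variance = (c1 / a1) * S + (c2 / (a1 ** 2 + a2)) * S * (S - 1)
    D = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")

    return S, theta_w / region_bp, pi / region_bp, D


# Toy input (hypothetical): 4 chromosomes, 5 sites, in a 1,000-bp region.
toy_haplotypes = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
]
S, theta_w, pi, D = summary_stats(toy_haplotypes, region_bp=1000)
print(S, theta_w, pi, D)
```

In this toy sample every segregating site is at intermediate frequency, so π exceeds θ and D is positive; an excess of rare alleles would push D below zero.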
Genetic diversity in continental groups and populations
At the population level, π values indicate that some Indian populations have diversity levels comparable to or even higher than those of HapMap African populations. Specifically, Mala/Madiga, Yadava, and Irula have the highest π among all populations (84.46 × 10⁻⁵, 88.94 × 10⁻⁵, and 82.77 × 10⁻⁵, respectively). In contrast, Brahmins and HapMap GIH have lower diversity levels, comparable to HapMap European and East Asian populations (Table ). Due to small sample sizes, the confidence intervals of π for all populations overlap. However, at the continental level, Indians have significantly higher nucleotide diversity than Europeans and East Asians, although θ and haplotype diversity are similar among the three groups (Table ). Removal of unconfirmed genotypes in Indian individuals does not change the results (Supplemental text and Supplemental Table S3 in Additional file 1).
Several studies have shown that heterozygosity decreases with increasing distance from eastern Africa, presumably due to multiple bottlenecks that human populations experienced during the migration [22]. Among non-Indian populations, we observed a significant negative correlation between H and the distance to eastern Africa (Figure ; r = -0.77, P = 0.04). However, when the Indian populations were included, the correlation became non-significant (r = -0.33, P = 0.29). This lack of correlation is due to large variation in H among the Indian populations (60.02 × 10⁻⁵ in Brahmins to 95.12 × 10⁻⁵ in the Irula). This result demonstrates great variation in diversity among groups within India.
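The reported P values are what the usual t-test for a Pearson correlation would give if the seven HapMap non-Indian populations, and then all 12 populations, are treated as the data points. The sketch below makes that check using only the standard library. The choice of test and the point counts (7 and 12) are our assumptions inferred from the population list, and the function names are hypothetical.

```python
from math import gamma, pi, sqrt


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_p_value(r, n, steps=2000):
    """Two-sided P value for a Pearson correlation r computed from n points."""
    df = n - 2
    t = abs(r) * sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the probability between -t and +t.
    h = 2 * t / steps
    total = t_density(-t, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(-t + i * h, df)
    central_mass = total * h / 3
    return 1 - central_mass


p_non_indian = pearson_p_value(-0.77, 7)   # assumed: 7 HapMap non-Indian populations
p_all = pearson_p_value(-0.33, 12)         # assumed: all 12 populations
print(f"non-Indian: P = {p_non_indian:.2f}; all populations: P = {p_all:.2f}")
```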
Population SNP heterozygosity as a function of geographic distance from eastern Africa. The correlation coefficient of HapMap non-Indian populations is shown.
Demographic history of Eurasian populations
To study the relationship among populations, we first performed principal components analysis (PCA) on the genetic distances between populations using the normalized dataset. When all populations are included in the analysis, the first principal component (PC1) accounts for 93% of the total variance and separates African and non-African populations (Supplemental Figure S1 in Additional file 1). In PCA of only Eurasian populations, PC1 separates Indian populations from European and East Asian populations, and PC2 separates European and East Asian populations (Figure ). Among Indian populations, the tribal Irula and HapMap GIH have the shortest distance to East Asian populations while Brahmin has the largest distance. The northern Indian GIH population diverges from south Indians and is closest to the HapMap TSI population. This observation is consistent with the general genetic cline in India observed in previous studies [13]. We also performed PCA and ADMIXTURE analysis at the individual level (Supplemental Figure S2 in Additional file 1). Because of the relatively small size of our dataset, individuals are not as tightly clustered as in studies with genome-wide data [19]. The African individuals are separated from the Eurasian individuals, but Eurasian individuals from different populations are not separated into distinct clusters.
Principal components analysis of Eurasian populations. The first two principal components (PCs) and the percentage of variance explained by each PC are shown.
Next, we examined the divergence between Indian and non-Indian populations using pairwise FST estimates. In comparing major continental groups, India and Europe have the smallest FST value (Table ). At the individual population level, however, Indian populations show varying affinities to other Eurasian populations: the Indian tribal population (Irula) shows closer affinity to HapMap East Asian populations while the HapMap GIH and the Brahmin show a closer relationship to HapMap European populations. The Mala/Madiga and Yadava show a similar distance to the HapMap European and East Asian populations (Table ). Among Indian populations (Supplemental Table S2 in Additional file 1), the smallest FST value is between Yadava and Mala/Madiga (0.1%), and the largest FST value is between HapMap GIH and the tribal Irula (10.4%).
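For readers unfamiliar with FST, the sketch below shows one simple way to compute it for a pair of populations from allele frequencies, as the proportional reduction in expected heterozygosity within populations relative to the pooled population. This is a generic illustration with equal population weights and no sample-size correction. The estimator used for the values reported here is not specified in this section and may differ, and the function names and frequencies are hypothetical.

```python
def fst_components(p1, p2):
    """Return (H_T - H_S, H_T) for one biallelic SNP with frequencies p1 and p2."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2                               # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                       # heterozygosity of the pooled population
    return h_t - h_s, h_t


def fst_single_snp(p1, p2):
    """FST at a single SNP; defined as 0 when the SNP is monomorphic overall."""
    numerator, h_t = fst_components(p1, p2)
    return numerator / h_t if h_t > 0 else 0.0


def fst_region(frequency_pairs):
    """FST across many SNPs, combined as a ratio of sums rather than a mean of ratios."""
    parts = [fst_components(p1, p2) for p1, p2 in frequency_pairs]
    total_h_t = sum(h_t for _, h_t in parts)
    return sum(num for num, _ in parts) / total_h_t if total_h_t > 0 else 0.0


# Hypothetical allele frequencies for two SNPs in two populations.
print(f"{100 * fst_region([(0.2, 0.4), (0.5, 0.5)]):.1f}%")
```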
Pairwise FST values (%) between and among continental groups
Pairwise FST values (%) between Indian and HapMap non-Indian populations
The complete sequence data allow us to obtain an accurate derived-allele frequency (DAF) spectrum. At both the continental and population levels, the DAF spectra in our dataset are characterized by a high proportion of low-frequency SNPs, as expected for sequencing data (Supplemental text and Supplemental Figure S3 in Additional file 1). Based on the DAF spectra, we are able to infer the parameters associated with Indian population history, such as the divergence time, effective size, and migration rate between populations, using the program ∂a∂i (Diffusion Approximation for Demographic Inference). Because ∂a∂i can simultaneously infer population parameters in models involving three populations, we first estimated the parameters associated with the out-of-Africa event using the African continental group and two continental Eurasian groups. We started from a simplified three-population divergence model based on the out-of-Africa model described in ∂a∂i and assessed the model-fitting improvement of adding different parameters to the model (Supplemental text in Additional file 1). Our results suggest that allowing exponential growth in the Eurasian continental groups substantially improves the model. On the other hand, allowing migrations among groups provides little improvement in the data-model fitting, suggesting that little gene flow occurred between the continental groups (Supplemental Figure S5 in Additional file 1). Therefore, we inferred the parameters from the three-population out-of-Africa model, allowing exponential growth in the Eurasian groups but no migration among groups (Figure ). Under this model, a one-time change in African population size occurs at time TAf before any population divergence, and the population size changes from the ancestral population size NA in Africa. At time TB, the Eurasian ancestral population with a population size of NB diverges from the African population, while the African population size NAf remains constant until the present. The two Eurasian groups split from the ancestral population NB at time T1-2, with initial population sizes of N1_0 and N2_0, respectively. Both populations experience exponential population size changes from the time of divergence to reach the current population sizes N1 and N2.
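The DAF spectrum that serves as input to this kind of inference is a tally of SNPs by the number of chromosomes carrying the derived allele. The sketch below builds such a spectrum for a single population and compares it with the standard neutral, constant-size expectation, in which the count in class i is proportional to 1/i. The function names and the toy counts are hypothetical, and the sketch does not reproduce the multi-population spectra or the diffusion-based fitting performed by ∂a∂i.

```python
def daf_spectrum(derived_counts, n_chromosomes):
    """Tally SNPs by derived-allele count, keeping segregating classes 1..n-1."""
    spectrum = [0] * (n_chromosomes + 1)
    for count in derived_counts:
        spectrum[count] += 1
    return spectrum[1:n_chromosomes]


def neutral_expectation(n_segregating, n_chromosomes):
    """Expected class counts under neutrality and constant size (proportional to 1/i)."""
    a1 = sum(1 / i for i in range(1, n_chromosomes))
    return [n_segregating / (i * a1) for i in range(1, n_chromosomes)]


# Hypothetical derived-allele counts for 9 SNPs in a sample of 6 chromosomes.
toy_counts = [1, 1, 1, 2, 1, 3, 5, 1, 2]
observed = daf_spectrum(toy_counts, n_chromosomes=6)
expected = neutral_expectation(sum(observed), n_chromosomes=6)
for i, (obs, exp_count) in enumerate(zip(observed, expected), start=1):
    print(f"derived count {i}: observed {obs}, neutral expectation {exp_count:.2f}")
```

An excess of observed SNPs in the lowest classes relative to the 1/i expectation is the kind of signal that a growth parameter in the model can absorb.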
Figure 4 Illustration of the ∂a∂i models. (a) Three-population out-of-Africa model. The ten parameters estimated in the model (NA, NAf, NB, N1_0, N1, N2_0, N2, TAf, TB, T1-2) are shown. (b) Four-population out-of-Africa model. The ten parameters ...
The inferred parameters between continental groups, along with confidence intervals (CIs) for each parameter, are shown in Table . When the mutation rate is set at 1.48 × 10⁻⁸ per base pair per generation (see Materials and methods for mutation rate estimate), the ancestral population size is estimated to be between 13,000 and 14,000 for all models (Table ). The African effective population size estimates (NAf, 18,036 to 18,976; CI, 15,077 to 22,673) are comparable to the size of the Eurasian ancestral population (NB, 12,624 to 21,371; CI, 7,360 to 32,843). At the time of the Eurasian population divergence, the population sizes of the two Eurasian continental groups in each model (N1_0 and N2_0) are consistently smaller than the African and the Eurasian ancestral population sizes, with one exception for the estimated European population size (25,543; CI 6,101 to 29,016) in the Africa-East Asia-Europe model. These results suggest that the Eurasian populations experienced bottlenecks at the time of their divergence. Among Eurasians, East Asians have the smallest effective population size at the time of divergence (approximately 1,500; CI, 779 to 3,703; Table ). The divergence time estimates between Africans and non-Africans range from 88.4 to 111.5 kya and the CIs of all three estimates overlap, consistent with the existence of a single ancestral Eurasian population. The three non-African continental groups diverged from each other more recently than 40 kya: East Asians were separated from Indians (39.3 kya; CI, 29.7 to 59.1) and Europeans (39.2 kya; CI, 29.8 to 55.8) before the divergence of Indians and Europeans (26.6 kya; CI, 20.1 to 40.8). Overall, these results support a scenario in which the ancestors of the Indian, European, and East Asian individuals left Africa in one major migration event, and then diverged from one another more than 40,000 years later.
To further examine the population history among Eurasian populations, we constructed a four-population model containing all four continental groups (Figure ). Because parameters from only three populations can be estimated by ∂a∂i at the same time, we fixed the parameters of the out-of-Africa epoch (including NAf and TB) in the model based on the parameters estimated from the three-population model with the highest likelihood (Africa-East Asia-Europe), as described in ∂a∂i. A model comparison again suggests that adding migrations to the model does not substantially improve the model-fitting (Supplemental text and Supplemental Figure S6 in Additional file 1). Therefore, migrations were excluded from the model to reduce the number of inferred parameters and to improve the speed of computation. Among the three population divergence scenarios, two models ('East Asia first' and 'India first') showed similar maximum likelihood values (-1,278.9 and -1,278.7, respectively), indicating comparable fitting to the data. In contrast, the 'Europe first' model has a substantially lower maximum likelihood value (-1,280.7), suggesting that this model is less plausible. The estimated parameters for the 'East Asia first' and the 'India first' models are shown in Table . Consistent with the three-population models, the 'East Asia first' model estimates that East Asians diverged from the ancestral Eurasian population approximately 44 kya, and Europeans and Indians diverged approximately 24 kya. Interestingly, the 'India first' model suggests that the divergence times among the three continental groups are similar, with Indians diverging only 0.2 kya before Europeans and East Asians. Under this model, the initial population size of the Indian population (N1_0, 11,410; CI, 4,568 to 28,665) is comparable to the Eurasian ancestral population size (NB, 12,345), consistent with the high diversity we observed in these Indian samples.
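To put the differences among the three divergence scenarios on a common scale, the reported values can be read as log-likelihoods and converted to likelihoods relative to the best-fitting model. The sketch below does this conversion. Treating the values as natural-log likelihoods is our assumption, and the resulting ratios are only a rough illustration, not the model-comparison procedure used in the analysis.

```python
from math import exp

# Maximum (log-)likelihood values reported for the four-population models.
log_likelihoods = {
    "East Asia first": -1278.9,
    "India first": -1278.7,
    "Europe first": -1280.7,
}

best = max(log_likelihoods.values())
# Likelihood of each model relative to the best-fitting one (1.0 for the best).
relative = {name: exp(ll - best) for name, ll in log_likelihoods.items()}

for name, ratio in sorted(relative.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.2f}")
```

On this scale the 'East Asia first' and 'India first' models differ by 0.2 log units, whereas 'Europe first' lies 2.0 log units below the best model.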
When individual populations are analyzed, the patterns are largely consistent with the results from continental groups (Supplemental text and Supplemental Table S4 in Additional file 1). The CIs around the parameters are generally larger, indicating a loss of power due to the smaller sample sizes of the individual populations compared to the continental groups.