India has been underrepresented in genome-wide surveys of human variation. We analyze 25 diverse groups to provide strong evidence for two ancient populations, genetically divergent, that are ancestral to most Indians today. One, the “Ancestral North Indians” (ANI), is genetically close to Middle Easterners, Central Asians, and Europeans, while the other, the “Ancestral South Indians” (ASI), is as distinct from ANI and East Asians as they are from each other. By introducing methods that can estimate ancestry without accurate ancestral populations, we show that ANI ancestry ranges from 39-71% in India, and is higher in traditionally upper caste and Indo-European speakers. Groups with only ASI ancestry may no longer exist in mainland India. However, the Andamanese are an ASI-related group without ANI ancestry, showing that the peopling of the islands must have occurred before ANI-ASI gene flow on the mainland. Allele frequency differences between groups in India are larger than in Europe, reflecting strong founder effects whose signatures have been maintained for thousands of years due to endogamy. We therefore predict that there will be an excess of recessive diseases in India, different in each group, which should be possible to screen and map genetically.
The first systematic surveys of human variation in India focused on anthropometric traits, and found that India is structured along lines of ethnicity as well as geography1, a result that has since been confirmed by blood group, protein polymorphism2,3 and genetic analysis4. Genetic studies have further documented differences in relatedness to West Eurasians5,6,7,8, while mitochondrial DNA (mtDNA) studies have shown that India harbors deep rooted lineages that share no common ancestry with groups outside of South Asia for tens of thousands of years9. The most comprehensive survey of genetic variation in India to date analyzed 405 single nucleotide polymorphisms (SNPs) in 55 groups and identified distinct clusters correlated to language and geography10, while another study analyzed 1,200 polymorphisms in 15 Indian American groups11. However, neither study analyzed enough data to more finely discern patterns of genetic variation.
We genotyped 132 Indian samples from 25 groups. To survey a wide range of ancestries, we sampled 15 states and 6 language families (including 2 language families from the Andaman Islands12) (Table 1 and Figure 1). To compare traditionally “upper” and “lower” castes after controlling for geography, we focused on castes from two states: Uttar Pradesh and Andhra Pradesh. We genotyped all samples on an Affymetrix 6.0 array, yielding data for 560,123 autosomal SNPs after filtering (Methods). Allele frequency differentiation between groups was estimated with high accuracy (FST had an average standard error of ±0.0011; Tables S1 and S2). For some analyses, we also merged our data with HapMap13 and the Human Genome Diversity Panel (HGDP)14,15 (Methods).
We analyzed these data to address five questions. Does India harbor more substructure than Europe? Has endogamy been long-standing in Indian groups? Do nearly all Indians descend from a mixture of populations? Is the ancestry of tribal groups systematically different from castes? What is the origin of the indigenous Andaman Islanders?
We applied principal components analysis (PCA)16,17 to identify outlier groups (Figure S1). The first PC shows that the Siddi have African ancestry, consistent with their origin involving the Arab slave trade18. The second shows that the Nyshi and Ao Naga cluster with the Chinese (CHB), consistent with them speaking Tibeto-Burman languages. The third and fourth show that the Great Andamanese are dispersed, suggesting gene flow from the mainland in the last few generations19, but that the Onge cluster tightly, making them more useful for studying the relationship of the indigenous Andamanese to groups worldwide (Note S1). The Chenchu are a sixth outlier because of their high minimum FST of 0.052 from all other groups (Table S3).
The average pairwise FST of the remaining 19 groups is 0.0109. This is much larger than the 0.0033 in a recent study of 23 European groups20, although a strict comparison is difficult, since European studies have focused on cosmopolitan samples20,21, which could underestimate differentiation relative to our village-centered sampling. We considered the possibility that the high FST could be an artifact due to marriage between close relatives, which is known to be common in southern India22 and can exaggerate measurements of frequency differentiation. However, when we recalculated FST correcting for consanguinity23 (Appendix), the average differentiation decreased only marginally to 0.0100. We also determined that the high FST was not due to our strategy of sampling diverse groups. Restricting to the 9 pairs of groups that were from the same state and traditional caste level, the average inbreeding-corrected FST was 0.0069, much higher than the analogous 0.0018 in Europe when comparing within regions (Table S3).
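To make concrete how a pairwise FST value of this kind can be obtained from allele frequencies, the following minimal sketch uses Hudson's estimator, combined across SNPs as a ratio of averages. This is an assumption made for illustration: the text does not state which estimator was used, it does not include the consanguinity correction, and the function name `hudson_fst` and the toy frequencies are hypothetical.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST estimator, combined across SNPs as a ratio of averages.

    p1, p2: per-SNP sample allele frequencies in the two groups.
    n1, n2: number of sampled chromosomes in each group (assumed to be the
            same at every SNP, which data with missing genotypes would violate).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, minus the part expected from
        # finite sampling alone.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different groups differ.
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator


# Toy frequencies for two hypothetical groups of 10 individuals
# (20 chromosomes) each.
group_a = [0.10, 0.50, 0.80, 0.30]
group_b = [0.15, 0.45, 0.70, 0.40]
print(hudson_fst(group_a, group_b, n1=20, n2=20))
```

With small samples the sampling correction can make the estimate slightly negative, which is one reason standard errors matter when comparing values as small as those quoted above.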
We propose that the high FST among Indian groups could be explained if many groups were founded by a few individuals, followed by limited gene flow8,24. This hypothesis predicts that within groups, pairs of individuals will tend to have substantial stretches of the genome where they share at least one allele at each SNP. We find signals of excess allele sharing in many groups (Figure S2), which as expected tend to occur in the groups that have the highest FST’s from all others (P=0.002 for a correlation). To estimate the age of founder events, we measured the genetic distance scale over which allele sharing decays, a procedure that we verified by simulation (Figure S3). Six Indo-European and Dravidian speaking groups have evidence of founder events dating to more than 30 generations ago (Figure S2), including the Vysya at more than 100 generations ago. Strong endogamy must have applied since then (average gene flow less than 1 in 30 per generation) to prevent the genetic signatures of founder events from being erased by gene flow. Some historians have argued that “caste” in modern India is an “invention” of colonialism25 in the sense that it became more rigid under colonial rule26. However, our results suggest that many current distinctions among groups are ancient and that strong endogamy must have shaped marriage patterns in India for thousands of years24,27.
The high frequency differentiation among Indian groups is medically significant as it shows that “population stratification” (systematic ancestry differences between cases and controls that can lead to false-positive disease associations) may be a confounder in gene mapping studies. This is superficially at odds with a recent report that in Indian Americans, allele frequency differentiation is lower than among Europeans11. A potential explanation for the discrepancy is that the previous study pooled samples by state of origin, which can mask substructure. For example, when we performed PCA on an independent set of 85 Gujarati Americans28, we found that they separate into two distinct clusters with high differentiation (FST = 0.005) (Figure S4). Similarly, pairs of Uttar Pradesh and Andhra Pradesh groups in our data (excluding the outlying Chenchu) have an average FST of 0.0107, but their differentiation decreases to 0.0033 when we first pool by state. It was recently suggested that to correct for stratification in India, it may be adequate to adjust for membership in five broad genetic clusters10. However, our results show that many Indian groups have a degree of allele frequency differentiation from their neighbors that is at least as large as that between northern and southern Europeans, which is known to be sufficient to cause false-positive associations if uncorrected29.
The widespread history of founder events in India is also medically significant because it predicts a high rate of recessive disease. In Finland, there is a high rate of recessive diseases that has been shown to be due to a founder event, and that has resulted in a minimum FST of 0.005 with other European groups20. Our data show that many Indian groups have a minimum FST with all other groups at least as large (Table 1). Haldane wrote 45 years ago that “if inter-caste marriages in India become common, various… recessive characters will become rarer”30. However, it has not been generally appreciated that this applies to groups throughout India, and not only to groups in the south where consanguinity is common22. We hypothesize that founder effects are responsible for an even higher burden of recessive diseases in India than consanguinity. To test this hypothesis, we used our data to estimate the probability that two alleles from a group share a common ancestor more recently than that group’s divergence from other Indians, and compared this to the probability that an individual’s two alleles share an ancestor in the last few generations due to consanguinity23. Nine of the 15 Indian groups for which we could make this assessment had a higher probability of recessive disease due to founder events than to consanguinity, including all the Indo-European speaking groups (Table 2). It is important to systematically survey Indian groups to identify those with the strongest founder effects, and to prioritize them for studies to identify recessive diseases and map genes.
An additional reason why some diseases are expected to occur at elevated frequencies in India is shared descent from a common Indian ancestral population9,10. An example is a 25 base pair deletion in MYBPC3 that increases heart failure risk by about 7-fold, and occurs at around 4% throughout India but is nearly absent elsewhere31. It has recently been shown that power to discover disease risk variants can be increased by modeling Indian genetic variation using a reference panel of European and Chinese chromosomes32. However, the example of MYBPC3 shows that this is an imperfect solution, since clinically significant alleles that are rare outside of India cannot be imputed by studying non-Indian genetic variation. It is important to specifically characterize Indian variation to permit full powered gene mapping in India, instead of relying on catalogs of variation compiled in distantly related groups.
To better understand the genetic ancestry that is only found in India, we carried out a PCA of Europeans (CEU) and Chinese (CHB) along with 22 Indian groups (Figure 3). The first PC distinguishes CEU from CHB, and the second reflects ancestry that is unique to India31. The most remarkable feature of the PCA is a gradient of proximity to West Eurasians (Figure S5) (an analogous PCA in Europeans did not produce a gradient of proximity to India; Figure S6). We call this the “Indian Cline”, and hypothesize that it reflects the fact that different Indian groups have inherited different proportions of ancestry from “Ancestral North Indians” (ANI) related to West Eurasians, and “Ancestral South Indians” (ASI). To model ANI-ASI mixture, we selected a subset of 18 groups that formed tight clusters along the Indian Cline, and included the Pathan and Sindhi from Pakistan14 since they were consistent with the Indian Cline in the PCA but showed greater proximity to West Eurasians (Note S2), providing additional information about ANI-ASI mixture.
To test whether any of the 18 Indian Cline groups is consistent with all ANI or all ASI ancestry, we applied a novel 3 Population Test (Methods). If group X is related to groups Y and W by a simple tree (through a history of divergence without subsequent mixture), and we define the SNP allele frequencies as pX, pY, and pW, then the quantity (pX-pY)(pX-pW), averaged over SNPs, should be proportional to the variance in allele frequency since group X split from Y and W, and thus should be positive. However, this quantity can be negative if X descends from a mixture event (Note S3 and Appendix). We applied this test to each of the 18 Indian Cline groups in turn using CEU=Y and Santhal=W, and obtained significantly negative scores for 16 groups (Table 2) as assessed by a jackknife analysis33 (Methods). These results do not mean that the Indian groups descend from mixtures of European and Austro-Asiatic speakers, but only that they derive from at least two different groups that are (distantly) related to CEU and Santhal.
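The logic of the 3 Population Test can be sketched in a few lines. The example below averages (pX-pY)(pX-pW) over SNPs and attaches a delete-one-block jackknife standard error. It is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, blocks are simple contiguous runs of SNP indices rather than genomic intervals, no finite-sample correction is applied, and the frequencies are simulated so that X is an exact 60/40 mixture of Y and W.

```python
import math
import random


def block_jackknife_mean(values, n_blocks):
    """Mean of per-SNP values with a delete-one-block jackknife standard error.

    Blocks are contiguous runs of SNP indices of (nearly) equal size, which
    assumes `values` is ordered along the genome.
    """
    n = len(values)
    total = sum(values)
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for start, end in zip(bounds, bounds[1:]):
        block_sum = sum(values[start:end])
        leave_one_out.append((total - block_sum) / (n - (end - start)))
    loo_mean = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (t - loo_mean) ** 2 for t in leave_one_out
    )
    return total / n, math.sqrt(variance)


def f3(p_x, p_y, p_w, n_blocks=20):
    """Average of (pX - pY)(pX - pW) over SNPs, with its jackknife standard error."""
    per_snp = [(x - y) * (x - w) for x, y, w in zip(p_x, p_y, p_w)]
    return block_jackknife_mean(per_snp, n_blocks)


# Simulated toy data: X is built as a 60/40 mixture of Y and W at every SNP.
rng = random.Random(0)
p_y = [rng.uniform(0.05, 0.95) for _ in range(1000)]
p_w = [rng.uniform(0.05, 0.95) for _ in range(1000)]
p_x = [0.6 * y + 0.4 * w for y, w in zip(p_y, p_w)]

estimate, std_err = f3(p_x, p_y, p_w)
print("f3 =", estimate, "SE =", std_err, "Z =", estimate / std_err)
```

In this toy case each per-SNP term equals -0.24(pY-pW)^2, so the statistic is necessarily negative, which is the signature of mixture described above. Real data add drift in X after mixture, which pushes the statistic upward and can hide the signal.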
We verified the evidence of mixture by carrying out a 4 Population Test34. For any four groups there are 3 possible simple trees. If ((A,B),(C,D)) is correct, the allele frequency differences between A and B should be uncorrelated with those between C and D, which we can assess by averaging the quantity (pA-pB)(pC-pD) across SNPs (Appendix) and testing for consistency with 0 (Methods). No Indian Cline group could be related simply to CEU, Onge and West Africans (YRI) after testing all trees (Table S4).
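As a small illustration of the 4 Population Test, the sketch below evaluates the average of (pA-pB)(pC-pD) for each of the three possible pairings of four groups. The function names and the four-SNP frequencies are hypothetical, chosen by hand so that only the pairing ((A,B),(C,D)) gives a value of zero. In a real analysis, consistency with zero would be judged against a jackknife standard error rather than read directly from the point estimate.

```python
def f4(p_a, p_b, p_c, p_d):
    """Average of (pA - pB)(pC - pD) across SNPs."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d)]
    return sum(products) / len(products)


def f4_for_all_trees(freqs):
    """f4 for the three ways of pairing four groups.

    freqs: dict mapping exactly four group names to per-SNP allele frequencies.
    """
    a, b, c, d = freqs  # dict iteration follows insertion order
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return {
        f"(({w},{x}),({y},{z}))": f4(freqs[w], freqs[x], freqs[y], freqs[z])
        for (w, x), (y, z) in pairings
    }


# Hand-built toy frequencies: differences between A and B are uncorrelated
# with differences between C and D.
toy = {
    "A": [0.6, 0.4, 0.6, 0.4],
    "B": [0.5, 0.5, 0.5, 0.5],
    "C": [0.3, 0.3, 0.2, 0.2],
    "D": [0.2, 0.2, 0.3, 0.3],
}
results = f4_for_all_trees(toy)
for tree, value in results.items():
    print(tree, round(value, 6))
```

Only the first pairing yields a value indistinguishable from zero here, so it is the only simple tree these toy data would be consistent with.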
We developed a model to study the historical relationship of Indian groups to those worldwide, based on the hypothesis that most groups can be approximated as a mixture of two ancestral populations followed by group-specific drift. To fit the model to the data, we computed the squared allele frequency difference between all pairs of groups, and chose parameters by minimizing the difference between observation and expectation (Note S4). The idea of fitting allele frequency differentiation to historical models was first explored by Cavalli-Sforza and Edwards35 and here we extend it to trees with mixture. This approach contrasts with the STRUCTURE algorithm, which fits data without a tree36, or with a tree in which many groups split simultaneously from an ancestral population followed by mixture37. While STRUCTURE is accurate for estimating individual mixture proportions in recently mixed groups, it is not clear whether its estimates of ancient mixture are biased because it does not model hierarchical relationships among groups, leading to inaccurate modeling of allele frequencies in ancestral populations. By contrast, we use a more realistic tree model, and provide a test of fit.
Applying our model-fitting procedure, we find that the tree (YRI,((CEU,ANI),(ASI,Onge))) provides an excellent fit to the data from Indian groups. In particular, when the Pathan, Vaish, Meghawal and Bhil are modeled as mixtures of ANI and ASI (Figure 4), the observed allele frequency differentiation statistics are all consistent with the theoretical expectation within three standard deviations (Note S4).
Two features of the inferred history are of special interest. First, the ANI and CEU form a clade, and further analysis shows that the Adygei, a Caucasian group, are an outgroup (Note S4). Many Indian and European groups speak Indo-European languages, while the Adygei speak a Northwest Caucasian language. It is tempting to hypothesize that the population ancestral to ANI and CEU spoke “Proto-Indo-European”, which has been reconstructed as ancestral to both Sanskrit and European languages38, although we cannot be certain without a date for ANI-ASI mixture.
Second, our analysis shows that the Onge form a clade with the ASI (Note S4), which we verified by running the 4 Population Test on ((YRI,Papuan),(Dai,X)), and finding that it is consistent when X=Onge (Z=1.7) but inconsistent for all Indian Cline groups (Z < -9) (Table S4). Previous mtDNA analyses suggested that the Onge do not share any maternal ancestry with groups outside India within the last ~48,000 years19,39. While they do share ancestry with some rare haplogroups in some Indian tribal populations within the last ~24,000 years39,40, this is consistent with our inferred Onge-ASI clade, as long as the gene flow predated the ASI-ANI mixture that later occurred on the mainland.
“Models” in population genetics should be treated with caution. While they provide an important framework for testing historical hypotheses, they are oversimplifications. For example, the true ancestral populations of India were probably not homogeneous as we assume in our model but instead were likely to have been formed by clusters of related groups that mixed at different times. However, modeling them as homogeneous fits the data and appears to capture meaningful features of history.
Estimating the proportions of ANI and ASI ancestry in India is challenging, since we are unaware of any published methods that produce unbiased estimates of mixture proportion in the absence of accurate ancestral groups. We developed three methods for estimating ancestry, which we verified were accurate even in the face of SNP ascertainment bias and some inaccuracies in our phylogenetic model (Note S5), and which we found provided consistent estimates (Table S5). The 18 Indian Cline groups all have between 39% and 71% ANI ancestry based on f3 Ancestry Estimates (Methods), which we quote because it has the smallest standard errors (Table 2). ANI ancestry is significantly higher in Indo-European than Dravidian speakers (P=0.013 by a 1-sided test)5,6,7,8,41, suggesting that the ancestral ASI may have spoken a Dravidian language before mixing with the ANI42. We also find significantly more ANI ancestry in traditionally upper than lower or middle caste groups (P=0.0025)5,6,7,8,41, and find that traditional caste level is significantly correlated to ANI ancestry even after controlling for language (P=0.0048), suggesting a relationship between the history of caste formation in India and ANI-ASI mixture.
We compared our autosomal estimates of ANI ancestry to Y chromosome and mtDNA haplogroup frequencies. Y chromosome analysis has shown that traditionally upper caste and Indo-European speaking groups have elevated frequencies of alleles that are also common in West Eurasians5,6. However, mtDNA analysis shows elevated frequencies of haplogroups common in West Eurasians only in northwest India7,8,43. Comparing the autosomal estimates of ANI ancestry to the frequencies of haplogroups characteristic of West Eurasians, we find a significant correlation on the Y chromosome (P=0.04) and a more marginal correlation in mtDNA (P=0.08) (Table S6 and Figure S7). The stronger gradient in males, replicating previous reports, could reflect either male gene flow from groups with more ANI relatedness into ones with less, or female gene flow in the reverse direction. However, extensive female gene flow in India would be expected to homogenize ANI ancestry on the autosomes just as in mtDNA, which we do not observe. Supporting the view of little female ANI ancestry in India, Kivisild et al.44 reported that mtDNA ‘haplogroup U’ splits into two deep clades. ‘U2i’ accounts for 77% of copies in India but ~0% in Europe, and ‘U2e’ accounts for 0% of all copies in India but ~10% in Europe. The split is ~50,000 years old, indicating low female gene flow between Europe and India since that time.
We have documented a high level of population substructure in India, and have shown that the model of mixture between two ancestral populations ASI and ANI provides an excellent description of genetic variation in many Indian groups. A priority for future work should be to estimate a date for the mixture, which may be possible by studying the length of stretches of ANI ancestry in Indian samples45,46, and will shed light on the process leading to the present structure of Indian groups. A second priority should be to discern the details of the history of the ANI and ASI before they mixed, including the date of their separation and their history of expansion and contraction; this may be possible by analyzing allele frequency spectrum47 and linkage disequilibrium data45,48,49. Finally, our findings have medical implications. By showing that a large proportion of Indian groups descend from strong founder events, these results highlight the importance of identifying recessive diseases in these groups and mapping causal genes.
Blood samples were collected with informed consent from volunteers. We designate groups by their anthropological name as well as their geographic location, since it has been shown that both are required to specify an effectively endogamous group in India1. All DNA samples were genotyped on Affymetrix 6.0 arrays. We restricted most analyses to samples that appeared to be unrelated, and to 560,123 autosomal SNPs for which there was good genotyping completeness and for which there were no signs of problematic genotyping. For some analyses we also intersected our data with Illumina 650Y genotyping of the Human Genome Diversity Panel14 and HapMap13,28, which produced a merged data set of 119,744 autosomal SNPs14. We carried out PCA using the EIGENSOFT software17, assessed allele frequency differentiation among groups using FST, assessed inbreeding in each group using Wright’s Fixation Index F23, and computed standard errors using a Block Jackknife33. To detect the signature of founder events in linkage disequilibrium data, we studied all possible pairs of samples for each group, and recorded whether they share 0, 1 or 2 alleles at each SNP (at SNPs where both individuals were heterozygous, we recorded 1 allele to be shared to account for the ambiguity in the haplotype phase). Long stretches of allele sharing can reflect regions that are shared identical by descent from a common founder, and by measuring the exponential decay of allele sharing with distance, we inferred the age of the founder event (Figure S3). To test for a history of mixture, we applied 3 and 4 Population Tests (Note S3). To infer the proportion of ancestry in each Indian Cline group in the absence of accurate ancestral populations, we used f3 Ancestry Estimation (Note S5).
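The allele-sharing bookkeeping described above can be sketched as follows. Genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, and stretches are measured in number of consecutive SNPs rather than in genetic distance, which is a simplification. The function names and the two example individuals are hypothetical.

```python
def shared_alleles(g1, g2):
    """Alleles shared at one SNP by two unphased genotypes coded 0, 1 or 2."""
    if g1 == 1 and g2 == 1:
        return 1  # both heterozygous: phase is ambiguous, so record 1
    return 2 - abs(g1 - g2)


def sharing_runs(genotypes_1, genotypes_2):
    """Lengths, in SNPs, of maximal runs where at least one allele is shared."""
    runs = []
    current = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if shared_alleles(g1, g2) >= 1:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


person_1 = [0, 1, 2, 2, 0, 1, 1, 2]
person_2 = [0, 1, 1, 0, 2, 1, 2, 2]
print(sharing_runs(person_1, person_2))  # [3, 3]
```

Only opposite homozygotes (0 versus 2) break a run, so long runs between pairs from the same group are the candidate regions for identity by descent from a common founder.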
We thank the volunteers from throughout India who donated DNA; A.G. Reddy, A. Shah and R. Tamang for generating the Y chromosome and mtDNA data; J. Neubauer for sample preparation; and A. Tandon for data curation. We thank B.N. Sarkar and A.G. Roy for helping with group census size estimates, and D. Falush, J. Novembre, A. Ruiz-Linares, and S. Watkins for comments on the manuscript. D.R., N.P. and A.L.P. were supported by NIH grant HG004168, and D.R. was supported by a Burroughs Wellcome Career Development Award in the Biomedical Sciences. K.T. and L.S. were supported by grants from the Council of Scientific and Industrial Research of the Government of India, and K.T. was supported by a UKIERI Major Award (RG-4772).
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Author Contributions All authors collaborated in designing the study. K.T. and L.S. collected the DNA samples, D.R., K.T. and L.S. collected the genetic data, N.P. developed the mathematical theory for f-statistics, and D.R., K.T., N.P. and A.L.P. analyzed the data. D.R. wrote the manuscript and supplementary information with input from all authors.
Author Information The data used in this study are available on request from D.R. or L.S. Reprints and permission information is available at www.nature.com/reprints. The authors declare no competing financial interests.