We have analysed Y-chromosomal data from Indian caste, Indian tribal and East Asian populations in order to investigate the impact of the caste system on male genetic variation. We find that variation within populations is lower in India than in East Asia, while variation between populations is overall higher. This observation can be explained by greater subdivision within the Indian population, leading to more genetic drift. However, the effect is most marked in the tribal populations, and the level of variation between caste populations is similar to the level between Chinese populations. The caste system has therefore had a detectable impact on Y-chromosomal variation, but this has been less strong than the influence of the tribal system, perhaps because of larger population sizes in the castes, more gene flow or a shorter period of time.
“The caste system in India was the grandest genetic experiment ever performed on man” wrote Theodosius Dobzhansky in his book Genetic Diversity and Human Equality (1973, page 31). The wording – ‘man’ instead of ‘human’ – now seems outdated, but perhaps remains applicable to this review since it will be restricted to the male-specific variation carried by the Y chromosome. What were the genetic consequences of the caste system for Y-chromosomal variation? Of course, every experiment requires a control. A ‘control’ for this ‘experiment’ would need to be a population of similar size that does not have a caste system. In practice, caste populations can be compared with the somewhat less numerous non-caste populations in India or the slightly more numerous populations in the adjacent region of East Asia – China and its neighbours.
In this review, we therefore begin by considering the relevant properties of the caste system and Y-chromosomal genetics in order to identify effects that we might look for. We then consider the datasets from India, China and other nearby countries that are available in the literature. Finally, we present new analyses of these data and discuss the insights they provide into the comparative male genetics of India and East Asia, and the limitations of the conclusions that can be drawn.
The caste system divides society into endogamous groups. Key issues for this review are:
None of these questions is easy to answer in a precise way. The 2001 census provided a figure of ~1,028,700,000 for the population of India (Census of India 2001) while the People of India project has identified 4,635 communities (Singh 1993), suggesting an average size of around 220,000 for each of these communities. These communities can even be thought of as occupying distinct ecological niches (Gadgil and Malhotra 1983). However, the variation in size between different communities is enormous, and the communities defined in this project are not necessarily equivalent to the endogamous groups that the geneticist would be interested in. Nevertheless, these figures show that the Indian population is socially highly substructured.
Gene flow between castes is rare, and when it does occur it consists principally of hypergamy, where a woman marries a man of higher caste and is absorbed into the new caste (Misra 2001). This does not result in any movement of Y chromosomes between castes. The equivalent practice for men, in which a man would marry a woman of higher caste and be absorbed into the higher caste (and a Y chromosome would thus move between castes), appears not to have been documented (Bhattacharyya et al. 1999). On the basis of social rules and historical records, therefore, Y chromosomes would be expected to remain strictly within their castes. Genetic data can provide independent insights into the level of undocumented hyperandry, and interpretations have ranged from low levels to the possibility of quite high levels (Reddy et al. 2005; Wooding et al. 2004; Zerjal et al. 2007).
The origins of the caste system are associated with the entry of Indo-Aryan speakers ~3,500 years ago (Thapar 1990; Wolpert 1997). Fortunately, a significant source of information about their society is available in the form of the Rig-Veda, a collection of over 1,000 hymns dating perhaps from as early as 1,500 BC. Indo-Aryan tribal society was organised into priests, warriors and commoners, who formed the basis of the Brahmin, Kshatriya and Vaishya castes; a fourth caste, the Sudras, was added in India, and further developments occurred later. One line from the Rig-Veda illustrates the fluidity of the early caste boundaries: “I am a poet, my father is a physician and my mother is a grinder of corn”, and there is debate about how rigid the system has really been over long periods of history (Thapar 1990). The caste system was abolished by the Government of India in 1949. Nevertheless, 3,500 years would represent ~117 generations at 30 years per generation and provide a timescale over which significant genetic changes could accumulate.
The properties of the Y chromosome that make it particularly suitable for such analyses have been reviewed elsewhere (e.g. Jobling and Tyler-Smith 2003) and need only a brief mention here. In addition to its male-specific inheritance, the lack of recombination over most of the length of the chromosome results in long stable haplotypes, which change only by accumulating mutations. The abundant Single Nucleotide Polymorphism (SNP) and Short Tandem Repeat (STR) markers now available allow these to be characterised in detail. As a result, haplotypes can be both clustered into haplogroups that usually reflect shared ancestry thousands or tens of thousands of years ago, and resolved into family-specific (if not individual-specific) haplotypes. In addition, the large variance in the number of children fathered by different men results in strong genetic drift, leading to large differences between populations.
If the influence of natural selection on Y-chromosomal haplotypes can be ignored (Jobling and Tyler-Smith 2000), the pattern of variation found within a set of populations that are largely isolated from one another will be dominated by loss of variation due to random genetic drift, counterbalanced to some extent by increases due to gene flow and mutation. The amount of genetic drift is measured by the long-term effective population size, which depends on the census number of males in the population, the proportion who father children, the variance in number of children, the generation time and the extent to which these factors are correlated between generations. While some of these factors can readily be measured or modelled, others, such as the correlations between generations, are poorly understood. We therefore take an empirical approach and consider next some examples of isolated populations outside India as guides to the amount of Y-chromosomal drift that can be found in different circumstances.
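The loss of lineages by drift described above can be sketched with a toy simulation. This is only an illustration under strong assumptions that are not taken from the studies discussed here: a constant number of males, non-overlapping generations, every founder carrying a distinct lineage, each son choosing a father uniformly at random (a haploid Wright-Fisher model), and no mutation or gene flow. The function name and population sizes are hypothetical.

```python
import random


def simulate_lineage_loss(n_males, generations, seed=0):
    """Count surviving Y lineages per generation in a toy Wright-Fisher model.

    Assumptions (illustrative only): constant male population size,
    no mutation, no gene flow, every founder starts with a unique lineage.
    """
    rng = random.Random(seed)
    lineages = list(range(n_males))  # one distinct lineage per founder
    surviving = [len(set(lineages))]
    for _ in range(generations):
        # Each male in the next generation inherits a random father's lineage.
        lineages = [rng.choice(lineages) for _ in range(n_males)]
        surviving.append(len(set(lineages)))
    return surviving


# Hypothetical population sizes, followed for 100 generations.
small = simulate_lineage_loss(n_males=50, generations=100)
large = simulate_lineage_loss(n_males=500, generations=100)
print("lineages left (50 males):", small[-1])
print("lineages left (500 males):", large[-1])
```

Because nothing in this model reintroduces variation, the number of lineages can only fall or stay constant; the real factors listed above (variance in offspring number, correlations between generations, gene flow) would change how fast this happens.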
Tristan da Cunha lies in the South Atlantic Ocean and has been described as ‘the remotest island in the world’. Its population was established in 1816 by seven females and eight males, and currently numbers 269 with seven surviving surnames (Wikipedia: Tristan da Cunha 2007). A genetic survey published in 2003 identified eight main Y-chromosomal lineages, and a one-STR-step variant of one (Soodyall et al. 2003). Seven of these corresponded to seven of the eight founding males; the eighth founding male's surname and Y lineage had been lost by drift. The eighth extant Y lineage appeared to represent gene flow from outside. The Samaritans are a distinct religious and cultural community in the Middle East who split from mainstream Judaism around 2,500 years ago and numbered several thousand during the Roman period. A genetic survey, also published in 2003, found just four main Y lineages (and close STR variants of some) (Bonné-Tamir et al. 2003). Two of the four lineages shared a common ancestor estimated to date to approximately the time when the population was established, so it seems likely that all surviving Y lineages trace back to three founders ~2,500 years ago: a striking illustration of genetic drift. A rather larger population, that of Iceland, was established in approximately 870 AD by between 8,000 and 20,000 individuals, and now numbers around 280,000, mostly as a result of endogenous growth since there has been little subsequent immigration. Y-chromosomal diversity is grossly comparable to that of nearby European countries, but enhanced genetic drift is detectable by some measures (Helgason et al. 2003b), and large-scale genealogical studies reveal that the 71% of the contemporary male population whose ancestry can be traced back three hundred years (approximately eight generations) descend from only 10% of the population (Helgason et al. 2003a). 
We thus see lineage loss in all populations, but most markedly in the Samaritans, whose size, degree of endogamy and timeframe could provide a model for some Indian populations.
We sought datasets from India and East Asia that reported both Y-STR and Y-SNP genotypes from reasonably-sized population samples (Table 1). It was necessary to strike a balance between the number of markers and number of populations included. When we set a requirement for a minimum sample size of 17 males typed with 31 Y-SNPs and 9 Y-STRs, we were able to analyse 1,764 individuals: 784 from 31 populations in India (Sengupta et al. 2006; Zerjal et al. 2007) and 980 from 27 populations in East Asia, mainly China (Xue et al. 2006).
Diversity within individual populations or groups of populations was summarised by (i) Nei's gene (= STR haplotype) diversity (Nei 1987), (ii) the average squared distance between haplotypes (ASD), and (iii) the population mutation parameter θk (Ewens 1972); if the average mutation rate is similar in different populations, variation in θk will reflect variation in the male effective population size. Genetic distance measures between pairs of populations were (i) FST (for Y-SNPs), (ii) RST (for Y-STRs; Slatkin 1995), (iii) ASD and (iv) ρ (the distance between a haplotype in one population and the closest haplotype in the second population, averaged over all haplotypes). In comparisons of these measurements, we report the median value rather than the mean, because the measurements were often not normally distributed. Medians were compared using Mann-Whitney U tests and in some cases multidimensional scaling (MDS) plots were constructed; both analyses were performed using SPSS 14.0. Analysis of Molecular Variance (AMOVA) was carried out with Arlequin (Schneider et al. 2000).
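As a rough illustration of three of these statistics, the sketch below computes haplotype diversity, a within-population ASD and a one-directional ρ for toy data. Several details are assumptions rather than the procedure used for the analyses reported here: haplotypes are tuples of STR repeat counts, ASD sums squared repeat differences over loci (some definitions average over loci instead), and ρ uses the summed absolute repeat difference as its distance and is computed from the first population to the second only. All function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def nei_diversity(haplotypes):
    """Unbiased haplotype diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_p2)


def squared_distance(hap_a, hap_b):
    """Sum over loci of squared repeat-count differences (assumed convention)."""
    return sum((x - y) ** 2 for x, y in zip(hap_a, hap_b))


def asd_within(haplotypes):
    """Average squared distance over all pairs of sampled chromosomes."""
    pairs = list(combinations(haplotypes, 2))
    return sum(squared_distance(a, b) for a, b in pairs) / len(pairs)


def rho_one_way(pop_a, pop_b):
    """Mean distance from each haplotype in pop_a to its nearest match in pop_b."""
    def steps(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return sum(min(steps(a, b) for b in pop_b) for a in pop_a) / len(pop_a)


# Hypothetical three-locus Y-STR haplotypes (repeat counts).
pop_1 = [(14, 12, 23), (14, 12, 23), (15, 12, 23), (14, 13, 24)]
pop_2 = [(14, 12, 23), (16, 12, 23)]

print(round(nei_diversity(pop_1), 4))  # 0.8333
print(asd_within(pop_1))               # 1.5
print(rho_one_way(pop_1, pop_2))       # 0.75
```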
Data were available from 19 caste and 12 tribal populations within India and 27 populations from East Asia. In considering these data, we concentrate mainly on Y-STRs because they are less affected by marker ascertainment bias than Y-SNPs; ‘variation’ thus implies ‘STR variation’ unless otherwise stated.
Variation within a population can be summarised by several statistics, and we used haplotype diversity, θk and ASD (Table 1). Median values of all these measures were lower in tribes than castes, and were lower in both Indian groups than in East Asia, except that ASD was slightly lower in East Asia than in castes (Table 2). A Mann-Whitney U test was used to assess the significance of the differences between the caste, tribal and East Asian groups and they were found to be significant in all comparisons, except for the East Asia-caste ASD difference mentioned above (Table 3). The measures used reflect related, but slightly different, features of the population variation. ASD takes into account the molecular differences between haplotypes and it is likely that the caste populations, who have significantly lower haplotype diversity than the East Asians but similar ASD, contain some highly divergent haplogroups and these molecular differences contribute more to the ASD statistic than to the diversity value. Indeed, a single predominant haplogroup, O, was noted in East Asia (Xue et al. 2006), but there was more variety of haplogroups in India (Sengupta et al. 2006; Zerjal et al. 2007). Overall, there is thus a strong and clear pattern of within-population variation: tribes<castes<East Asia.
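For readers unfamiliar with the test, a Mann-Whitney U comparison of two sets of per-population values can be sketched as follows. This is not the SPSS procedure used for Table 3: it is a minimal version that counts pairwise wins, uses a normal approximation without tie or continuity corrections, and runs on invented diversity values. An exact test would be preferable for samples this small.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from a normal approximation).

    Simplifications (assumptions of this sketch): no tie correction and
    no continuity correction.
    """
    u_a = 0.0
    for x in sample_a:
        for y in sample_b:
            if x > y:
                u_a += 1.0      # sample_a value ranks above sample_b value
            elif x == y:
                u_a += 0.5      # ties count as half
    n_a, n_b = len(sample_a), len(sample_b)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Hypothetical haplotype diversities for two groups of populations.
group_low = [0.90, 0.92, 0.88, 0.95]
group_high = [0.99, 0.98, 0.97, 1.00]

u_stat, p_value = mann_whitney_u(group_low, group_high)
print(u_stat, round(p_value, 3))
```

Because the test uses only the ranks of the values, it suits measurements that are not normally distributed, which is the reason given above for comparing medians.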
For comparisons of variation between populations within the caste, tribal and East Asian groups, we used the measures FST, RST, ASD and ρ. We emphasise that all the comparisons are of genetic distances between one caste population and another, between one tribal population and another, or between one East Asian population and another; none of the distances are between the groups. ASDs were similar within all groups perhaps reflecting the presence of diverse ancient haplogroups within each group. All the other measures, however, differed between the groups and showed a common trend, but one that was different from the within-population trend: castes<East Asia<tribes (Table 4). Apart from ASD, all of these differences were significant (Table 5). The FST and RST results are illustrated in the MDS plots (Fig. 1), where the wide scatter of tribal Indian populations is particularly apparent.
While it is unsurprising to find that the patterns of within-population and between-population variation differ, we might expect that strong genetic drift would lead to both low variation within populations and large differences between populations. According to this simple model, if genetic drift were highest in tribal populations, intermediate in caste populations and lowest in East Asian populations, we would see the observed tribes<castes<East Asians pattern of within-population variation, but would see the converse pattern of East Asians<castes<tribes for between-population variation. We therefore investigated whether the observed between-population variation order of castes<East Asians<tribes, which did not fit this simple expectation, could result from the sampling strategy used. We repeated the analyses using only caste samples from local regions (i.e. excluding all of the mixed ‘Indian’ samples of Zerjal et al.) and restricting the East Asian samples to populations who have been resident in China for a long time (i.e. excluding the Mongolian, Korean and Japanese populations and also the Uygur and Hui who have entered China within historical times). These changes had no substantial effect on either the relative within-population variation (Table 2, lower section) or the comparison between either group and tribal populations, but did lead to the ‘local caste’ and ‘local Chinese’ groups being similar, except with the ρ measure (Tables 4 and 5). This result seems the most reliable one: we therefore conclude that between-population variation follows the order [East Asians/castes]<tribes.
AMOVA analysis allows variation to be apportioned between categories in a quantitative way. We first analysed data from India and China separately, and calculated the percentage of variance within and between populations in each country (Table 6). With both Y-SNPs and Y-STRs, India shows more than twice the amount of variation between populations that is seen in China. With Y-STRs, for example, these results correspond to an FST of 0.21 in India, compared with 0.08 in China. When the Indian populations were grouped in caste and tribal groups, and compared with the control group from East Asia/China, substantial variation was seen both between populations within a group and between groups. The results were broadly similar for the different markers and group compositions, and always showed more variation in the ‘between-population within-group’ category than in the ‘between-group’ category (Table 6).
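The link between a variance percentage and the fixation index can be written out for the simplest case. The notation below is introduced here for illustration and is not taken from Table 6: in a two-level AMOVA (populations within a single country), with \sigma^2_a the variance component among populations and \sigma^2_b the component within populations,

```latex
\[
F_{ST} = \frac{\sigma^2_{a}}{\sigma^2_{a} + \sigma^2_{b}}
\]
```

so that, under this standard definition, an among-population share of 21% of the total variance is the same statement as an FST of 0.21, and a share of 8% corresponds to 0.08.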
In summary, a simple pattern of Y-chromosomal variation emerges when Indian populations are compared with East Asian ones: in India, variation within populations is lower, and variation between populations is, on average, higher. The effect is more marked for the tribal samples analysed here than for the caste samples; indeed the variation between caste populations was similar to the variation between Chinese populations. Our results thus emphasise the unusual nature of the genetic structure in India and show that the so-called grandest genetic experiment has had detectable effects in this part of the world.
The caste system created social substructure for millennia within the Indian population. If the resulting subpopulations were sufficiently small and genetically isolated, and existed for long enough for genetic drift to be effective, this social substructure would lead to detectable genetic substructure (Fig. 2). Previous work has suggested that the conditions necessary for significant genetic drift were likely to have been met, at least for some populations. A study of caste populations from the Jaunpur district, for example, estimated male effective population sizes as small as 800 and 690 for Brahmins and Kshatriyas, respectively, although estimated sizes for Vaishyas and Panchamas were larger: 2,300 and 2,500 (Zerjal et al. 2007), and may well be different for all castes in other regions. The same study estimated gene flow from the Kshatriyas into all other castes in the same location at approximately 0.7% per generation, similar to the value of 1-2% per generation estimated by other workers (Wooding et al. 2004). In the present broader analysis, we found that individual caste populations generally contained significantly less variation than East Asian populations, as would be expected if they had experienced more genetic drift, but this did not lead to them being more distinct from other caste populations, which would also be expected from a simple model of drift in a subdivided population (Fig. 2). Interestingly, the observations of low within-population variation combined with high between-population variation were much more striking in the tribal population samples examined. This could reflect smaller effective population sizes in the tribes, less gene flow, a longer time period of population subdivision or any combination of these factors. However, another factor also needs to be taken into account when considering these results: the sampling strategy.
The criteria for choosing particular samples are often unclear, and may be opportunistic, reflecting the individuals and populations who wished to participate in a study. The sampling strategy adopted in any genetic survey is always very important, but can have a far greater influence on the conclusions in a highly substructured population than in one with low levels of structure. Consider the six hypothetical current populations illustrated in Figure 2. In sampling strategy 1, the investigators do not take account of the subdivision between populations 1-4, but combine them into a single population and compare them with populations 5 and 6. They conclude that all populations contain high levels of within-population variation and that differences between them are low. In contrast, in sampling strategy 2, investigators sample populations 1-4 separately and consequently detect the low levels of variation within some populations and high levels of variation between populations.
To illustrate the magnitude of this effect in an Indian context, where there is clear geographical structure (e.g. Gutala et al. 2006; Reddy et al. 2005), we re-analyse the published data from individual castes in Jaunpur (Zerjal et al. 2007) by pooling them into a single artificial ‘Jaunpur caste’ sample of 35, consisting of an arbitrary seven individuals from each of the castes combined into a single pseudo-population. The within-population variation measures of haplotype diversity, θk and ASD are no longer exceptionally low (Table 1, last row). The individual Jaunpur castes were very distinct from some other caste populations: for example, the RST distances between Jaunpur Brahmins, Kshatriyas, Vaishyas and Panchamas and the Vellalar middle caste sample of Sengupta et al. (2006) were 0.289, 0.550, 0.100 and 0.097 respectively. In contrast, the distance between the artificial Jaunpur caste sample and the Vellalar was 0.135. The overall effect on between-population distances can be seen in the MDS plots (Fig. 1): the artificial sample lies well within the cluster of caste populations, while some of the individual castes are extreme outliers. In this illustration, the populations considered were different castes, but the same effects could potentially be seen with tribes or breeding isolates within a caste.
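The direction of this pooling effect can be reproduced with entirely artificial data. The sketch below is not a re-analysis of the Jaunpur samples: it assumes four hypothetical endogamous groups, each dominated by a different haplotype (labelled by letters), and compares haplotype diversity within each group with the diversity of a pooled sample of seven individuals per group. The selection of every third individual is an arbitrary, deterministic stand-in for "an arbitrary seven individuals".

```python
from collections import Counter


def nei_diversity(haplotypes):
    """Unbiased haplotype diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_p2)


# Hypothetical groups of 20 males, each strongly affected by drift.
groups = {
    "group1": ["A"] * 18 + ["B"] * 2,
    "group2": ["C"] * 17 + ["D"] * 3,
    "group3": ["E"] * 19 + ["A"] * 1,
    "group4": ["F"] * 16 + ["G"] * 4,
}

# Sampling strategy 2: analyse each group separately.
separate = {name: nei_diversity(haps) for name, haps in groups.items()}

# Sampling strategy 1: ignore the subdivision and pool seven males per group
# (every third individual, an arbitrary deterministic choice).
pooled = [h for haps in groups.values() for h in haps[::3][:7]]

for name, value in separate.items():
    print(name, round(value, 3))
print("pooled", round(nei_diversity(pooled), 3))
```

Each separate group shows low diversity, while the pooled pseudo-population appears far more diverse, mirroring the way the artificial sample described above no longer looks exceptional.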
The comparison of Indian with East Asian populations thus reveals several, but not all, of the features expected from a simple increase in genetic drift if the Indian population is more subdivided: variation within populations is lower in India, and variation between tribal populations is higher, as expected, but variation between caste populations is not higher than between East Asian populations. Sampling strategy is rarely described in detail and may have influenced this conclusion, for example if the caste samples do not correspond to true endogamous groups. Sampling procedures should be described in detail. Alternatively, from a Y-chromosomal perspective, the ‘grandest experiment ever performed’ may in fact have been the one which produced the tribal social and genetic structure, rather than the caste system.
We thank Mohan Reddy for helpful comments on the manuscript. DRC-S was supported by funds from the Arts and Humanities Research Board and the EC Sixth Framework Programme under Contract no. ERAS-CT-2003-980409. CT-S was supported by The Wellcome Trust.