Conceived and designed the experiments: KT LS. Performed the experiments: AN VS VKS AGR. Analyzed the data: AN VS VKS ME. Contributed reagents/materials/analysis tools: LS. Wrote the paper: KT AN VS ME. Sample collection: PKP SS SR MD NV.
The phylogeny of the indigenous Indian-specific mitochondrial DNA (mtDNA) haplogroups has been determined and refined in previous reports. As with mtDNA superhaplogroups M and N, a profusion of reports is also available for superhaplogroup R. However, there is a dearth of information on its South Asian subhaplogroups, including R8. Therefore, we sought to assess the genealogy and pre-historic expansion of haplogroup R8, which is considered one of the autochthonous lineages of South Asia.
Upon screening the mtDNA of 5,836 individuals belonging to 104 distinct ethnic populations of the Indian subcontinent, we found 54 individuals with the HVS-I motif that defines the R8 haplogroup. Complete mtDNA sequencing of these 54 individuals revealed two deep-rooted subclades: R8a and R8b. Furthermore, these subclades split into several fine subclades. An isofrequency contour map detected the highest frequency of R8 in the state of Orissa. Spearman's rank correlation analysis suggests significant correlation of R8 occurrence with geography.
The coalescent ages of the newly characterized subclades R8a (15.4±7.2 Kya) and R8b (25.7±10.2 Kya), together with that of R8 as a whole (41.7±7.3 Kya), indicate that the initial maternal colonization of this haplogroup occurred during the Middle to Upper Paleolithic period, roughly 40 to 45 Kya. These results suggest that the southern part of Orissa, currently inhabited by Munda speakers, is likely the origin of these autochthonous, deep-rooted maternal haplogroups. Our high-resolution study on the genesis of the R8 haplogroup provides ample evidence of its deep-rooted ancestry among the Orissa (Austro-Asiatic) tribes.
India is a melting pot of multi-lingual populations with a unique and complex genomic diversity. The linguistic diversity prevalent among Indian populations is associated with the presence of four linguistic families: Dravidian (DR), Indo-European (IE), Austro-Asiatic (AA) and Tibeto-Burman (TB). Of these four groups, AA tribes, which comprise about 30 endogamous tribal populations, are considered to be the first settlers of the Indian subcontinent. The AA linguistic family is traditionally divided into two basic subfamilies: Mon-Khmer and Mundari. Of these two subfamilies, Mundari speakers, the traditional hunter-gatherers, are found exclusively in the Indian subcontinent. Because Mundari populations are considered to be the earliest inhabitants of the Indian subcontinent, their migration during the demic expansion of agriculturalists in the Neolithic era, as has been suggested for the Mon-Khmer speaking Nicobarese, appears doubtful.
Numerous studies employing evolutionarily informative markers have demonstrated the origin of various linguistic populations in India. The phylogeny of Indian mitochondrial DNA (mtDNA) is characterized predominantly by several indigenous haplogroups dispersed exclusively throughout the Indian subcontinent and partially by West Eurasian lineages. The autochthonous mtDNA haplogroups in Indian populations include U2a-c, R5-R8, R30, R31, N1d and N5 in haplogroup N, and M2-M6 and M30-50 in haplogroup M. Of the two founder haplogroups, M and N, the former is more prevalent than the latter in the Indian populace. An extrapolation of studies on the N haplogroup led to the discovery of R and several other haplogroups such as U2a-c, R5-R8, R30, R31, N1d and N5. Haplogroup R is estimated to have first appeared in India ~65 Kya and is the third most frequent haplogroup there after M and N, encompassing 11% of the total haplogroups in India. Significantly, haplogroups R6 and R7 are more frequent among AA speakers than among other linguistic groups.
Though numerous studies have been carried out on the phylogenetic characterization of haplogroup R, there is a dearth of research on its subhaplogroups. To the best of our knowledge, only eight complete mtDNA sequences of haplogroup R8 are available in the database. Therefore, we aimed to trace more accurately the genealogy and pre-historic expansion of haplogroup R8 in the Indian subcontinent.
We analyzed a total of 5,836 samples from 104 populations across the Indian subcontinent (Figure 1) and identified 54 samples belonging to haplogroup R8 (Figures 2 and 3). The R8 haplogroup is defined by sites 13215-9449-7759-3384-2755 in the coding region and a single site (195) in the control region. HVS-I motifs of Indian populations that were previously defined as West Eurasian haplogroup H when matched with the revised Cambridge Reference Sequence (rCRS) are now redefined as haplogroup R8. The topology of the previously characterized R8 samples A165, A190 and S4, and of the recently classified samples Ko74, CoB41, Ko30, Ko37 and Lam10, deviates significantly from that of our samples. A190 grouped with our samples of Panika, Mudiraj, Dommari and Sugali, whereas S4 grouped with Lam10. Upon complete sequencing of the 54 samples, we identified 9 novel sub-haplogroups of haplogroup R8. The coalescent age of haplogroup R8 is 41.7±7.3 Kya, while those of the two subclades R8a and R8b are 15.4±7.2 Kya and 25.7±10.2 Kya, respectively (Figure 2). R8a is characterized by the motif 709-5510-13782, whereas R8b is characterized by 16390-15326-13194-12007-6485-456. One of the most diverse subclades of R8a, R8a1a1, is separated from the other R8a subclades by a transition at position 8646. Both R8a and R8b split into various fine branches (Figure 2). Most R8 subclades are present predominantly in members of the AA language family. The spatial autocorrelation analysis revealed that the highest frequency of this haplogroup occurred towards East India, especially within Orissa (12%), whereas lower frequencies occurred in the Chhattisgarh (6%), Gujarat (1.8%), Jharkhand (1.04%), Andhra Pradesh (0.9%), Madhya Pradesh (0.53%), Uttar Pradesh (0.22%) and Tamil Nadu (0.18%) populations (Figure 4). Spearman's rank correlation analysis demonstrated a significant correlation of R8 haplogroup frequency with latitude (r=−0.398) and longitude (r=0.241) (p<0.05).
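The motif-based definitions above can be illustrated with a minimal sketch. It assumes that each sample is represented simply as the set of nucleotide positions at which it differs from the rCRS; the function name `classify_r8` and the example variant sets are hypothetical, and the sketch ignores the specific base change at each site, back mutations and the finer subclades, all of which a real classification against the full phylogeny would have to consider.

```python
# Defining positions are those quoted in the text; everything else is illustrative.
R8_CODING = {2755, 3384, 7759, 9449, 13215}
R8_CONTROL = {195}
R8A_MOTIF = {709, 5510, 13782}
R8B_MOTIF = {456, 6485, 12007, 13194, 15326, 16390}


def classify_r8(variant_positions, require_control=True):
    """Return 'R8a', 'R8b', 'R8' or None for a set of positions differing from rCRS.

    require_control: if True, the control-region site 195 must also be present.
    """
    variants = set(variant_positions)
    required = R8_CODING | R8_CONTROL if require_control else R8_CODING
    if not required <= variants:
        return None  # basal R8 motif incomplete
    if R8A_MOTIF <= variants:
        return "R8a"
    if R8B_MOTIF <= variants:
        return "R8b"
    return "R8"  # basal motif only, no subclade motif matched


# Hypothetical sample: basal R8 sites plus the R8a motif.
sample = {195, 709, 2755, 3384, 5510, 7759, 9449, 13215, 13782}
print(classify_r8(sample))
```

Treating the motifs as plain position sets keeps the check to a few subset tests, at the cost of not distinguishing transitions from transversions at a site.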
HVS-I sequences of the individuals carrying the R8 haplogroup, who belonged to 30 different ethnic populations, were used to estimate intra-population diversity. The diversity indices and neutrality test values are presented in Table 1. Tajima's D and Fu's Fs were significantly negative in 18 and 26 populations, respectively (Table 1). Most of the populations showed similar sequence diversity values, ranging from 0.8995 (0.051) in Malayan to 0.9940 (0.009) in Kanwar. Orissa populations showed relatively higher values than other populations: Savara 0.9810 (0.022), Bhumia 0.9708 (0.027), Gadaba 0.9667 (0.035), Dhurva 0.9619 (0.039) and Bonda 0.9631 (0.023). A similar trend was also observed in the mean number of pairwise differences: Savara 6.561 (3.23), Bhumia 6.269 (3.11), Gadaba 4.808 (2.47), Dhurva 4.933 (2.54) and Bonda 4.837 (2.43).
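The two diversity indices reported here can be sketched as follows, on the assumption that "sequence diversity" corresponds to Nei's unbiased gene diversity, H = n/(n−1)·(1 − Σpᵢ²), and that the mean number of pairwise differences is the average Hamming distance over all pairs of aligned sequences. The function names and the toy alignment are illustrative and are not taken from the study; standard errors, which Arlequin also reports, are not computed.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Nei's unbiased gene diversity for a list of aligned sequences (n >= 2)."""
    n = len(seqs)
    counts = Counter(seqs)  # number of copies of each distinct haplotype
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


# Toy alignment (hypothetical): one base haplotype and three singletons.
toy = ["AAA", "TAA", "ATA", "AAT"]
print(haplotype_diversity(toy), mean_pairwise_differences(toy))
```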
We carried out principal component analysis (PCA), based on the haplogroup frequency distributions, to explore the affinities among the populations possessing haplogroup R8. The PCA plot identified close affinities among the Orissa tribes belonging to the Austro-Asiatic linguistic family (Figure 5). Combined, PC1 and PC2 accounted for 63.70% of the variance in the data.
High-resolution analysis of the R8 haplogroup in a total of 5,836 HVS-I (16000–16400) and 54 complete mtDNA sequences characterized two subclades: R8a and R8b. We have further refined these subclades into several subhaplogroups (R8a1, R8a1a1, R8a1a2, R8a1a3, R8a1b, R8a2, R8b, R8b1 and R8b2) based on 38 novel R8 sequences.
The comparatively high frequency of R8 in Orissa populations, especially among the AA-speaking Mundari tribes, strongly suggests that this haplogroup might have originated among the maternal ancestors of the contemporary AA speakers of the region. To substantiate this hypothesis, we estimated the coalescence time and compared it with archeological evidence. The times to the most recent common ancestor (TMRCA) of R8 (41.7±7.3 Kya) and of its subclades R8a (15.4±7.2 Kya) and R8b (25.7±10.2 Kya) reveal the ancient demographic history of this haplogroup (Figure 2).
This haplogroup (R8) is also present at low frequency among the Dravidian and Indo-European speaking families, which can be explained by a language shift or by local admixture with AA-speaking populations. Interestingly, this haplogroup was not found in any of the Tibeto-Burman populations analyzed in the present study.
A contour map of the R8 haplogroup revealed its distribution in different geographical regions (Figure 4). It is evident from the map that this haplogroup is concentrated in Orissa, Gujarat, Chhattisgarh and Jharkhand, with the highest frequency in Orissa (12%). Spearman's rank correlation analysis demonstrated a significant correlation of R8 haplogroup frequency with latitude and longitude (p<0.05), which supports a relation between genes and geography for this haplogroup.
The significantly negative values obtained from the neutrality tests support the hypothesis of population growth. The PCA plot (Figure 5) showed close affinities among the Orissa (AA tribe) populations, perhaps due to the high frequency and influence of the R8 haplogroup.
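To make the link between negative neutrality statistics and population growth concrete, the following is a minimal sketch of Tajima's D using the standard constants from Tajima's 1989 formulation. It is not the Arlequin implementation used in the study: it does not assess significance, treats every variable alignment column as one segregating site, and the function name and toy alignment are hypothetical. An excess of rare variants, as expected after an expansion, pushes the mean pairwise difference below S/a1 and makes D negative.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned sequences; returns None if it is undefined."""
    n = len(seqs)
    # Number of segregating sites: alignment columns with more than one state.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if n < 4 or s == 0:
        return None  # variance term vanishes for n < 4; D undefined without variation
    # Mean number of pairwise differences (pi).
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy star-like alignment (hypothetical): every variant is a singleton.
print(tajimas_d(["AAA", "TAA", "ATA", "AAT"]))
```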
Our high-resolution study on the origin of the R8 haplogroup provides abundant evidence of its deep-rooted ancestry among the Orissa (AA) tribes. The TMRCA estimates indicate that the initial maternal colonization of this haplogroup occurred during the mid-to-late Paleolithic period, roughly 40 to 45 Kya. The spatial analysis of this haplogroup supports a significant relation between genes and geography. Moreover, the absence of haplogroup R8 and its subhaplogroups among the Tibeto-Burman speaking populations studied suggests that socio-cultural practices among these populations are the principal factor in genetic demarcation. Thus, the phylogeographic reconstruction of 54 complete mitochondrial sequences belonging to haplogroup R8 furnishes a better understanding of this partially characterized haplogroup. Our high-resolution analysis again underscores the value of detailed coding-region information for the proper classification of a sample, especially in the case of the South Asian haplogroups, which contain several deep-rooted lineages sharing identical coding region mutations with the exception of the HVS-I.
All DNA samples analyzed in the present study were derived from blood samples collected with informed written consent according to protocols approved by the Institutional Ethical Committee of CCMB, Hyderabad.
The samples used in this study were obtained from the DNA bank of CCMB. We screened a total of 5,836 individuals belonging to 104 ethnic populations from 17 states of India (see Figure 1; Supplementary Information, Table S1), first for HVS-I (nucleotide positions 16000 to 16400) and then for the nucleotide at position 3384. Among the 5,836 mtDNAs screened, 54 were found to carry the basal mutations 13215-9449-7759-3384-2755, which define haplogroup R8. Twenty-four sets of primers were used to sequence the complete mtDNA. Sequencing of PCR amplicons was performed using the BigDye terminator cycle sequencing kit and an ABI 3730XL DNA analyzer (Applied Biosystems, Foster City, USA). The sequences were edited and assembled using AutoAssembler (version 1.4) software (Applied Biosystems, Foster City, USA) to obtain a consensus sequence. These sequences were aligned with the rCRS and the mutations were noted.
NETWORK (version 4.5) software (www.fluxusengineering.com) was used for phylogenetic reconstruction. The phylogeny obtained was confirmed by means of a neighbor-joining tree (1,000 bootstrap replicates) built with MEGA (version 4.0) software. We followed the nomenclature system of Richards et al. for reconstructing the phylogenetic tree of haplogroup R8. The isofrequency map for haplogroup R8 was constructed using the Kriging method in the Surfer (version 8.0) program (Golden Software Inc., Golden, Colorado). Spearman's rank correlation coefficients between mtDNA haplogroup frequency and latitude and longitude were calculated in StatistiXL (version 1.8) software (StatistiXL, Nedlands, Western Australia), with a p-value<0.05 considered statistically significant.
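As an illustration of the correlation step, Spearman's coefficient can be sketched as a Pearson correlation computed on ranks, with tied values given their average rank. This is a generic implementation, not the StatistiXL routine; it returns only the coefficient and no p-value, and it assumes neither variable is constant. The function names and the numbers in the example are hypothetical placeholders, not data from this study.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Placeholder values only: haplogroup frequency (%) against latitude (degrees).
frequency = [12.0, 6.0, 1.8, 1.0, 0.5]
latitude = [19.0, 21.0, 22.5, 23.5, 27.0]
print(spearman_rho(frequency, latitude))
```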
Principal component (PC) analysis of the R5-R8, R30 and R31 lineages in different Indian populations was performed using SPSS (version 11) software (SPSS Inc., Chicago, IL, USA) with mtDNA haplogroup frequencies as the input vector. Coalescence time was calculated from sequence positions 577 to 16023, assuming one base substitution per 5,140 years and excluding insertions and deletions. The standard deviation (σ) of the rho estimate was calculated following Saillard et al. Descriptive statistical indices and neutrality tests (Tajima's D, Fu's Fs) for HVS-I sequences were calculated using Arlequin (version 2.0) software. Complete mtDNA genome sequences generated in this study were submitted to GenBank (accession numbers FJ467940–FJ467993).
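The coalescence calculation can be sketched as follows. The sketch assumes that rho is the average number of substitutions separating the sampled sequences from the root of the clade, and that its standard deviation follows the expression commonly attributed to Saillard et al., σ = √(Σ nᵢ²·lᵢ)/n, where lᵢ is the number of substitutions on branch i and nᵢ the number of sampled sequences descending from it. The text itself does not spell out these formulas, so they are assumptions here; the function name and the toy tree are hypothetical, and only the rate of one substitution per 5,140 years is taken from the text.

```python
from math import sqrt

YEARS_PER_SUBSTITUTION = 5140  # rate quoted in the text for positions 577 to 16023


def rho_age(branches, n_samples, years_per_substitution=YEARS_PER_SUBSTITUTION):
    """Return (rho, sigma, age_years, age_sd_years) for one clade.

    branches: iterable of (n_descendants, n_substitutions), one pair per branch
              below the clade root.
    n_samples: total number of sampled sequences in the clade.
    """
    rho = sum(n_i * l_i for n_i, l_i in branches) / n_samples
    sigma = sqrt(sum(n_i ** 2 * l_i for n_i, l_i in branches)) / n_samples
    return rho, sigma, rho * years_per_substitution, sigma * years_per_substitution


# Hypothetical toy tree with four sampled sequences.
toy_branches = [(2, 1), (1, 1), (1, 0), (1, 2), (1, 0)]
rho, sigma, age, age_sd = rho_age(toy_branches, n_samples=4)
print(rho, sigma, age, age_sd)
```

Under the stated rate, an age of 41.7 Kya corresponds to a rho of about 8.1 substitutions (41,700 / 5,140).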
List of the caste and tribal populations studied.
(0.03 MB XLS)
We thank all voluntary donors for providing blood samples, and all students and institutions that contributed to sample collection.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was supported by the Council of Scientific and Industrial Research (CSIR) and Indian Council of Medical Research (ICMR), Government of India. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.