|Home | About | Journals | Submit | Contact Us | Français|
We analyzed 40 SNP and 19 STR Y-chromosomal markers in a large sample of 1,525 indigenous individuals from 14 populations in the Caucasus and 254 additional individuals representing potential source populations. We also employed a lexicostatistical approach to reconstruct the history of the languages of the North Caucasian family spoken by the Caucasus populations. We found a different major haplogroup to be prevalent in each of four sets of populations that occupy distinct geographic regions and belong to different linguistic branches. The haplogroup frequencies correlated with geography and, even more strongly, with language. Within haplogroups, a number of haplotype clusters were shown to be specific to individual populations and languages. The data suggested a direct origin of Caucasus male lineages from the Near East, followed by high levels of isolation, differentiation and genetic drift in situ. Comparison of genetic and linguistic reconstructions covering the last few millennia showed striking correspondences between the topology and dates of the respective gene and language trees, and with documented historical events. Overall, in the Caucasus region, unmatched levels of gene-language co-evolution occurred within geographically isolated populations, probably due to its mountainous terrain.
Since the Upper Paleolithic, anatomically modern humans have been present in the Caucasus region, which is located between the Black and Caspian Seas at the boundary between Europe and Asia. While Neolithization in the Transcaucasus (south Caucasus) was stimulated by direct example and possible in-migration from the Near East, in the North Caucasus archaeologists have stressed the cultural succession from the Upper Paleolithic to the Mesolithic and Neolithic (Bader and Tseretely, 1989; Bzhania, 1996). Neolithic cultures developed in the North Caucasus ~7,500 years before present (YBP) from the local Mesolithic cultures (microlithic stone industries indicate a gradual transition), and, once established, domesticated local barley and wheat species (Masson et al., 1994; Bzhania, 1996). Only in the Early Bronze Age (5,200-4,300 YBP) did cultural innovations from the Near East become more intensive with the emergence of the Maikop archaeological culture (Munchaev, 1994). These could have occurred alongside migratory events from the area between the Tigris River in the east and northern Syria and adjacent East Anatolia in the west (Munchaev, 1994: 170). The Late Bronze Age Koban culture (3,200-2,400 YBP), which predominated across the North Caucasus until Sarmatian times (2,400-2,300 YBP), became the common cultural substrate for most of the present-day peoples in the North Caucasus (Melyukova, 1989: 295).
After approximately 1,500 YBP, this pattern changed, and most migration came to the North Caucasus from the East European steppes to the north, rather than from the Near East to the south. These new migrants were Iranian speakers (Scythians, Sarmatians and their descendants, the Alans) who arrived around 3,000-1,500 YBP, followed by Turkic speakers about 500-1,000 YBP. The new migrants forced the indigenous Caucasian population to relocate from the foothills into the high mountains. Some of the incoming steppe dwellers also migrated to the highlands, mixing with the indigenous groups and acquiring a sedentary lifestyle (Abramova, 1989; Melyukova, 1989; Ageeva, 2000). The defeat of the Alans by the Mongols, and then by Tamerlane in the 14th century, stimulated the expansion of the indigenous populations of the West Caucasus into former Alan lands (Fedorov, 1983). A later expansion of Turkic-speaking Nogais took place about 400 YBP.
The genetic contributions of these three components (indigenous Upper Paleolithic settlement; Bronze Age Near Eastern expansions; Iron Age migration from the steppes) is unknown. Physical anthropologists have traced the continuity of local anthropological (cranial) types from the Upper Paleolithic, but data reflecting the influences from the Near East and Eastern Europe are contradictory (Abdushelishvili, 1964; Alexeev, 1974; Gerasimova, Rud’, and Yablonskij, 1987).
Linguistically, the North Caucasus is a mosaic consisting of more than 50 languages, most of which belong to the North Caucasian language family. Iranian languages (Indo-European family) are represented by Ossets, and there are a variety of Turkic-speaking groups (Altaic family) as well (Ruhlen, 1987). A number of studies of North Caucasian languages (Trubetzkoy, 1930; Gigeneishvili, 1977; Shagirov, 1977; Talibov, 1980; Bokarev, 1981; Chirikba, 1996) collected linguistic data and established regular phonetic correspondences between North Caucasian languages. The resulting classification (Nikolaev and Starostin, 1994) became generally accepted with some modifications (Kuipers, 1963; Comrie, 1987; Ruhlen, 1987; www.ethnologue.com).
This classification was based on the common innovation method and particularly on the glotto-chronological method (Starostin, 1989), which is now widely used by the Evolution of Human Languages Project coordinated by the Santa Fe Institute (http://ehl.santafe.edu/intro1.htm, http://starling.rinet.ru/main.html). For example, application of this method to the modern Romance languages (Spanish, Italian, Romanian, etc.) produced a date for their split to about 1,600 years BP. This date corresponds to the time of the disintegration of the Roman Empire, when Latina Vulgata (the common language in the Empire’s provinces) became subdivided into regional dialects (Blazhek and Novotna, 2008). Thus, this result validates the methodology, at least for this example.
The present work employs Starostin’s methodology, and we made special efforts to create the high-quality linguistic databases required for this analysis. Thus, based on significantly extended and revised linguistic databases, we have applied a glotto-chronological approach to the North Caucasian languages. As a result, our study provides a unique opportunity to make direct comparisons of linguistic and genetic data from the same populations. Lexico-statistical methods have also been applied to a number of language families using a Bayesian approach to increase the statistical robustness of language classification (Gray and Atkinson, 2003; Kitchen et al., 2009; Greenhill et al., 2010). Using these methods with the Caucasus languages under study here will be the focus of future work.
Previous studies of genetic diversity in the Caucasus (Nasidze et al., 2003, 2004a, b) noted that geography, rather than language, provides a better (but statistically non-significant) explanation for the observed genetic structure. However, while the sample sizes for Y-chromosomal markers in those studies were large for southern populations, they were substantially smaller for North Caucasus populations (average n = 28, with the exception of the Ossets). Several subsequent papers have explored the genetic composition of the Dagestan in the east Caucasus (Bulaeva et al., 2006; Tofanelli et al., 2009; Caciagli et al., 2009). A later survey of the Y-chromosomal composition of the Caucasus was published in Russian only, with phylogenetic resolution no deeper than the designation of haplogroups G-M201, J1-M267 and J2-M172 (Kutuev et al., 2010). Some data can also be retrieved from papers that focused other regions (Rosser et al., 2000; Semino et al., 2000; Wells et al., 2001; Zerjal et al., 2002; Di Giacomo et al., 2004; Semino et al., 2004; Cruciani et al., 2007; Battaglia et al., 2009). To the best of our knowledge, no other Y-chromosome data from the Caucasus have been published, and, except for Georgians and the Ossets, reliable data sets are very few.
Our study presents a much more extensive survey of Y-chromosomal variation in the Caucasus. All geographic subregions are covered and all large ethnic groups are represented by large sample sizes (navg = 109). We did not include Turkic-speaking populations, as their recent immigration from Eastern Europe and Central Asia could possibly blur the deeper genetic patterns. Instead, we subtyped previously analyzed samples from the Near East (El-Sibai et al., 2009; Haber et al., 2010) and new samples from Eastern Europe for comparative purposes. Our genotyping strategy included the deepest known level of phylogenetic resolution for the common Caucasus haplogroups, as well as 19 STRs to facilitate the dating of these genetic lineages.
Overall, the present study sets out to draw a precise and reliable portrait of the Y-chromosomal and linguistic variation in the Caucasus, and to use this information to generate a more comprehensive history of the peoples of this area.
A total of 1,525 blood samples from fourteen Caucasus populations (Table 1) were collected in 1998-2009 under the supervision of Elena Balanovska using a standardized sampling strategy. All sampled individuals identified their four grandparents as members of the given ethnic group, and were unrelated at least up to the third degree of relation. Informed consent was obtained under the control of the Ethics Committee of the Research Centre for Medical Genetics, Russia.
DNAs were extracted from white cells using an organic extraction method (Powell and Gannon, 2002). DNA concentration was evaluated via quantitative real-time PCR using the Quantifiler DNA Quantification Kit (Applied Biosystems) and normalized to 1 ng/μl. Samples were SNP-genotyped using the Applied Biosystems 7900HT Fast Real-Time PCR System with a set of 40 custom TaqMan assays (Applied Biosystems). The samples were additionally amplified at 19 Y-chromosomal STR loci in two multiplexes and read on an Applied Biosystems 3130xl Genetic Analyzer. The first multiplex was the 17 STR loci Y-filerTM PCR Amplification Kit (Applied Biosystems). The remaining two STR loci, DYS388 and DYS426, along with six insertion-deletion polymorphisms (M17, M60, M91, M139, M175, M186) were genotyped in a separate custom multiplex also provided by Applied Biosystems. Quality control procedures included checking SNP genotypes for phylogenetic consistency, comparing with the haplogroup predicted from STR profiles (http://www.hprg.com/hapest5/index.html), and independent replication of 20 samples at the University of Arizona.
Haplogroup frequency maps were created with the GeneGeo software using algorithms described previously (Balanovsky et al., 2008). Nei’s genetic distances between populations were calculated using the DJ software (Balanovsky et al., 2008) and visualized using multidimensional scaling plots and tree diagrams constructed with Statistica 6.0 (StatSoft. Inc., 2001). Geographic distances between populations were obtained from geographic coordinates in DJ software using spherical formulae.
To estimate correlation and partial correlation coefficients between matrices of genetic, geographic and linguistic distances, we conducted Mantel tests using Arlequin 3.11 (Schneider et al., 2000). The same software was used for the hierarchical analysis of molecular variation (AMOVA). Genetic boundaries were identified by Barrier 2.2 software (Manni and Guerard, 2004).
The phylogenetic relationships between haplotypes within a haplogroup were estimated with the Reduced Median (RM) network algorithm in the program Network 126.96.36.199 (Bandelt et al., 1995), with the reduction threshold equal to 1. RM networks were visualized with Network Publisher (Fluxus Engineering, Clare, U.K.). No available algorithm automatically identifies haplotype clusters within the network; doing this by hand (the common practice) is inevitably arbitrary to some degree. Therefore, we applied the following rules to minimize variation between individuals when identifying the clusters: i) Because all the networks used had a clear center (the probable root), we considered as clusters only those groups of haplotypes that were linked to the root via an individual nodal haplotype (put differently, monophyletic branches in the network); ii) This cluster-specific shared node was considered to be the founder haplotype (selecting the founder is important for the age calculation using the ρ estimator); iii) To avoid using small sample sizes, we only considered clusters consisting of 10 or more samples; iv) Finally, we required that a cluster should be highly (above 80%) specific to a given population or group of closely-related populations.
To estimate the age of particular Y-chromosomal lineages in Caucasus populations, we applied four commonly used methods. First, the ρ (rho) estimator (Forster et al., 1996; Saillard et al., 2000) was used to date haplotype clusters. Second, BATWING (Wilson et al., 2003) was used to obtain independent dates for these haplotype clusters. Third, BATWING was also used to estimate the possible sequence and dates of population splits. Fourth, the standard deviation (SD) estimator (Sengupta et al., 2006) was used to estimate the age required to accumulate the observed diversity within populations for entire haplogroups (not for clusters within the haplogroup as in the first and second analyses).
The Time to the Most Recent Common Ancestor (TMRCA) of the clusters of STR haplotypes that appeared to have evolved within specific populations, and which were identified in the networks, was estimated with the ρ statistic according to Saillard et al. (2000). Since haplotype clusters are population-specific, the resulting age estimations serve as lower bounds for the time that a population may have been isolated following a split.
BATWING provides a mechanism to identify bounds, established by coalescent events, between which a unique event polymorphism (UEP) may have emerged. UEPs are typically identified by SNPs, but can be generalized to include “virtual UEPs” that mark clusters of phylogenetically related STR haplotypes identified by Network, following the methods of Cruciani et al. (2004, 2006). Populations within which the clusters are observed were identified, and BATWING computations included all samples representing those populations in order to estimate UEP dates. Markers DYS385a and DYS385b were excluded from calculations. We used prior distribution parameters obtained from genealogical mutation rates reported in Ge et al. (2009). BATWING was configured to start with an ancestral effective population size that began an exponential expansion at a date that BATWING estimated.
BATWING was also employed to model the sequence of the population splits and to estimate the split times among populations. These estimates did not employ the virtual UEPs identified using Network. As such, this procedure provides an independent check for the times of the population splits.
Standard deviations (SD) of microsatellite variances for four major haplogroups, G2a3b1-P303, G2a1a-P18, J2a4b*-M67(xM92) and J1*-M267(xP58), were calculated for each population with a sample size of at least five individuals from a given haplogroup. The confidence interval was estimated based on the standard error of the SD. This method is based on average squared difference (ASD) in STR variation. It does not estimate the population divergence time but instead the age (amount of time) required to produce the observed microsatellite variation within the haplogroup at each population, under the assumption of limited gene flow. All haplotypes of the haplogroup were used in the calculations and not just specific clusters. Calculations were carried out both including and excluding DYS385a and DYS385b because these markers were not typed in a locus-specific manner in our data. In the majority of the cases, the inclusion or the exclusion of these markers did not influence the estimates. The DYS385 and DYS388 loci were thus only excluded from the reported values when the microsatellite variances associated with them were 10-fold higher than the average variance.
When using ρ and SD estimators, we applied both an evolutionary mutation rate (6.9 × 10−4 per locus per generation; Zhivotovsky et al. 2004) and a genealogical rate (2.1 × 10−3; Gusmao et al., 2005; Sanchez-Diz et al., 2008; Ge et al., 2009) to convert the observed variation into a number of generations. The results obtained by using these two rates were compared with linguistic and historical evidence. The BATWING prior distribution parameters were based on only the genealogical rate because BATWING models mutations in each generation of a genealogy and the genealogical rate seems therefore to be most suitable for this analysis. In all methods, when converting the number of generations into calendar years we had to use a different generation time for each: 25 years with the evolutionary rate because this rate was initially estimated in years and converted into generations using 25 years/generation (Zhivotovsky et al., 2004), and 30 years for the genealogical rate because this is the approximate male generation time measured in demographic and anthropological studies (Fenner, 2005) and shown for the Caucasus populations in the genetico-demographic study (Pocheshkhova, 2008).
Linguistic dates were calculated from the number of word substitutions that have accumulated after a language split (Starostin, 1989; Starostin, 2000; Embleton, 2000). The basic principles of this method, its applications, and the formulas used in our study are described in the Supplementary Note 1. This glottochronological approach was first used by Starostin with North Caucasus languages (Nikolaev and Starostin 1995). The present study continues this analysis with updated linguistic databases (word lists). The Caucasian word lists used in our study were significantly modified by Mudrak, while the Ossetian word lists were recorded by Ershler and analyzed by Dybo (this study).
Note that both the linguistic and genetic dating methods used in our study provide a most recent (lower) estimate of the population split time (Supplementary Note 1).
We analyzed 1,525 Y-chromosomal haplotypes from fourteen Caucasus populations. Table 2 presents the haplogroup frequencies, while the Y-STR genotypes are provided in the Supplementary Table 1. We additionally subtyped 121 haplogroup G-M201 samples and 133 haplogroup J-M304 samples from previously-analyzed Near Eastern populations (Supplementary Table 2). Overall, the most frequent haplogroups in the Caucasus were G2a3b1-P303 (12%), G2a1a-P18 (8%), J1*-M267(xP58) (34%), and J2a4b*-M67(xM92) (21%), which together encompassed 73% of the Y chromosomes, while the other 24 haplogroups identified in our study comprise the remaining 27% (Table 2).
However, these average frequencies masked the real pattern that became apparent when regional populations were considered (Figure 1). Each of these four haplogroups had its own focus within the Caucasus. More specifically, haplogroup G2a3b1-P303 comprised at least 21% (and up to 86%) of the Y chromosomes in the Shapsug, Abkhaz and Circassians (Figure 1 and Table 2). These populations live in the western part of the Caucasus, and linguistically belong to the Abkhazo-Adyghe language group. The frequency of this haplogroup was below 10% (average 2%) in all other populations investigated from the Caucasus. Similarly, haplogroup G2a1a-P18 comprised at least 56% (and up to 73%) of the Digorians and Ironians (both from the Central Caucasus Iranic linguistic group), while not being found at more than 12% (average 3%) in other populations. Again, haplogroup J2a4b*-M67(xM92) comprised 51-79% of the Y chromosomes in the Ingush and three Chechen populations (North-East Caucasus, Nakh linguistic group), while, in the rest of the Caucasus, its frequency was not higher than 9% (average 3%). Finally, haplogroup J1*-M267(xP58) comprised 44-99% of the Avar, Dargins, Kaitak, Kubachi, and Lezghins (South-East Caucasus, Dagestan linguistic group) but was less than 25% in Nakh populations and less than 5% in the rest of Caucasus.
A genetic boundary analysis revealed the same pattern (Figure 1). This methodology (Womble, 1951; Rosser et al., 2000) identifies zones of abrupt changes in haplogroup frequencies. The first (most significant) boundary A separated Nakh-Dagestan speaking populations of the east Caucasus from other populations of the region. Boundary B separated the four Nakh-speaking populations from the five Dagestan-speaking ones, while Boundary C separated the Iranian-speaking Ossets (Central Caucasus) from Abkhazo-Adyghe speakers of the west Caucasus. Overall, the first three boundaries divided the Caucasus into four regions, each of which coincided with areas of prevalence of one of the four major Caucasus haplogroups and with areas of different major linguistic groups. The other genetic boundaries D, E and F subdivided Abkhazo-Adyghe speaking populations and separated the Lezghins from populations speaking Dargin languages.
The observed pattern of genetic variation can be interpreted in two ways. On the one hand, it shows excellent correlation with geography, as each of four major haplogroups is prevalent in a specific region of the Caucasus. On the other hand, this pattern also closely fits the classification of Caucasus languages (with no exceptions).
To test these patterns, we computed three matrices of pairwise distances between all studied populations: genetic (from haplogroup frequencies), geographic (in kilometers) and linguistic (percentage of words in common). The Mantel test results showed a significant (p ≤ 0.002) correlation both between genetics and language (r = 0.64) and between genetics and geography (r = 0.60) (Table 3). Partial correlations were calculated between genetics and both factors (language or geography) separately, holding the alternative factor constant. Unfortunately, the analysis was complicated by the fact that linguistics and geography in Caucasus are closely linked with each other (r = 0.78). For this reason, genetic distances exhibited insignificant partial correlation with both, although the correlation with linguistics was almost twice as strong (Table 3).
AMOVA was used to further investigate which factor might be the major driving force behind this degree of differentiation (Table 4). When populations were grouped geographically, the proportion of variation in haplogroup frequencies between geographic groups was 0.146. Linguistic classification of the same populations provided nearly two times the extent of variation between linguistic groups (0.268). Therefore, linguistics explained a larger part of Y-chromosomal variation in the Caucasus.
These analyses indicated that linguistic diversity is at least as important as geography in shaping the Y-chromosomal landscape, and suggested that the pronounced genetic structure of the Caucasus might have evolved in parallel with the diversification of the North Caucasus languages.
Figure 2 compares the Y-chromosomal pool of the Caucasus with its neighboring regions (Balkans, South-East Europe and the Near East) by presenting frequency maps for haplogroups which predominate in any of these regions. Four haplogroups, G2a3b1-P303, G2a1a-P18, J2a4b*-M67(xM92) and J1*-M267(xP58), exhibit their highest documented frequencies in the Caucasus. Haplogroup G2a3b1-P303 predominates in the West Caucasus (Table 2), although we also found it in the Near East (Table 2) and in one Russian population (data not shown), and it has been reported in Western Europeans (www.ysearch.org). The second haplogroup G2a1a-P18 is almost absent outside of the Caucasus (Figure 2), although its ancestral clade, G2a1-P16, is present in the Near East (Cinnioglu et al., 2004; Flores et al., 2005). Similarly, J1*-M267(xP58) was found mainly in the Caucasus (Figure 2) with the ancestral J1-M267 being common in the Near East. The fourth haplogroup J2a4b*-M67(xM92) is prevalent in a region spanning the Near East and the Caucasus (Figure 2). Note that only a few Near Eastern haplogroups are represented in the Caucasus, and other major components of the Near Eastern Y-chromosomal pool (e.g., J2b-M12) are virtually absent there.
When haplogroups common in Europe were examined, we observed that the typical Balkan haplogroup I2a-P37.2 was virtually absent in the Caucasus (Figure 2). The recently-defined sub-branch R1a1a7-M458 (Underhill et al., 2010) was found among only the Circassians and Shapsug (Table 2). R1a*-M198(xM458) has an average frequency in the Caucasus as low as 5%, but was found in 20% of the Circassians and 22% of the Dargins, two populations that occupy opposite parts of the Caucasus. STR haplotypes from these Circassian and Dargins samples formed distinct clusters in a network (Supplementary Figure 1). Similarly, two different haplotype clusters within R1b1b2-M269 (Supplementary Figure 1) were found in the Lezghins (30%) and in Ossets-Digor (16%). These concentrations of (presumably European) haplogroups R1a*-M198(xM458), R1a1a7-M458 and R1b1b2-M269 found in few locations in the Caucasus might indicate independent migrations from Europe that were too small to make any significant impact on Caucasus populations.
The MDS plot of pairwise genetic distances (Figure 3) showed strong regional clustering separating populations from Europe, the Near East, and the Caucasus, with samples from the Caucasus grouping closer to the Near East samples than to the European ones. Of those three clusters, the Caucasus appeared to be the most diverse, with distinguishable subgroup variation within it. The first subgroup included Dagestan speakers (Avar, Dargins, Kubachi, Kaitak), the second Nakh speakers (three Chechen populations and the Ingush), and the third Abkhazo-Adyghe speakers (Abkhaz, Circassians, Shapsug). Ossets also joined this cluster because we combined their predominant haplogroup G2a1a-P18 with Abkhazo-Adyghe predominant haplogroup G2a3b1-P303 to achieve compatibility with the less phylogenetically resolved European and Near Eastern data. This plot illustrates both the common Near Eastern background of all Caucasus populations and the pronounced intra-Caucasus genetic differentiation into groupings that corresponded well with their linguistic affiliation.
More insights into the relationships between Caucasus populations were obtained from a tree based on haplogroup frequencies (Figure 4, left). Another tree (Figure 4, right), representing the linguistic classification, had the same topology except for the Dargins, who joined the Kubachi/Kaitak cluster before the Avar did. The Indo-European-speaking Ossets were outliers in the Caucasus linguistic tree, and the genetic tree also placed them separately, with slight similarity to the Abkhaz. Generally, the tree based on genetic distances mirrored the linguistic tree in its overall pattern and in most details.
To analyze this parallelism further, we constructed phylogenetic networks for Caucasus haplogroups (Figure 5, Supplementary Figure 1). While all of the results presented above were obtained using only haplogroup frequencies, in this and the following section we analyze the variation of STR haplotypes within haplogroups.
Haplogroup G2a1a-P18 (Figure 5), which was found almost exclusively in the Caucasus, consisted of distinct branches of STR haplotypes rooted in the central reticulated zone. The larger cluster α includes mainly Ossets-Iron and demonstrated a star-like pattern with a central founder haplotype and a few subfounders. The cluster β comprised many samples of Ossets-Digor, while smaller cluster γ comprised both Ossetian populations. The diverse cluster in the upper part of the network comprised different Caucasus populations, and a few non-Caucasus G2a1a-P18 samples also belonged to this branch.
Generally, haplogroup G2a1a-P18 seemed to have a long history in the Caucasus, being spread across the region and forming many branches. However, two of them (clusters α and β, found in the Ironians and Digorians, respectively) showed in their star-like structure signs of relatively recent expansion. The average number of mutation steps (ρ estimator) was similar for both clusters (1.46 for α and 1.41 for β), indicating that both clusters had expanded at around the same time, possibly because of the same event.
Reduced median networks for other haplogroups (Supplementary Figure 1) revealed similar patterns of branching. Some branches were shared between different Caucasus populations (and often included also Near Eastern samples), while many others were absent from the Near East and, moreover, specific to individual Caucasus populations (e.g. clusters α, β, and γ in Figure 5 are specific to the Ossets).
Applying the formal criteria, we identified 18 population-specific clusters (average specificity 95%). Table 5 lists these clusters with their ages, suggests the population event relevant to each, and indicates the linguistic date of this event (the tree of the North Caucasian languages obtained in our study is presented in Supplementary Figure 2).
Because of the controversy between “evolutionary” and “genealogical” mutation rates, we set out to reconstruct the population history of the Caucasus in two phases. The first was based solely on genetic evidence without consideration of any mutation rates, while the second converted genetic diversity into time using both rates and then compared them with the linguistic times.
In phase 1, we grouped populations according to the predominant haplogroup, which yielded four branches (Supplementary Figure 3). To explore the intra-branch relationships, we examined the population-specific STR-clusters (Table 5). Cluster P303-β was shared between Shapsug and Circassians, while no cluster linked any of these populations with Abkhaz. We concluded that Abkhaz separated first, while Shapsug and Circassians maintained a shared ancestry for a longer period, during which the P303-β cluster originated. The cluster P303-α, which is present in Shapsug but absent in Circassians, marks the next split on the population tree. In terms of mutation steps, the first split (Abkhaz vs Shapsug-Circassians) occurred 1.5 “mutations before present”, while the second split (Shapsug vs Circassians) took place 0.6 “mutations before present” (Table 5).
Applying the same methodology to another three branches of populations led to a tree of population splits (Supplementary Figure 3). This tree is based solely on genetic data and yet shows good agreement with the topology of both linguistic trees: the classical way of grouping North Caucasian languages (Figure 4, right) and the quantitative lexico-statistical tree (Supplementary Figure 2). Up to this point, we have avoided using any mutation rate. As a result, we compare the topology of the trees (sequence of splitting events) without reference to a time scale.
In phase 2, to introduce time estimates, we used both “genealogical” and “evolutionary” mutation rates. The ρ estimator using the genealogical rate provided a good fit between genetics and linguistics, as the genetic dates were similar to, or younger than, the linguistic dates. (Because clusters can expand at any time after the split, they are expected to be younger than the respective languages; however, they should not be older, as described in Supplementary Note 1. Estimates based on the “evolutionary” mutation rate were too old to be in agreement with the linguistic dates (Table 5).
BATWING computations of the ages of the same clusters based on genealogical rates showed similar results to those indicated by the linguistic analysis. The age for the four major haplogroups in individual populations obtained by using SD estimator (Supplementary Table 3) are close to the Neolithic epoch, and might be interpreted as signs of population expansion due to the shift to a farming economy.
The BATWING tree of population splits (Supplementary Figure 4) was based on all STR data from Supplementary Table 1 disregarding the haplotype clusters to which they belong. This tree therefore provides an independent test of the other genetic trees presented in this study (the tree on the left of Figure 4 is based on the haplogroup frequencies, while trees in Figure 6 and Supplementary Figure 3 are based on the ages of haplotype clusters). This tree again showed a striking resemblance to the linguistic tree: one observes an initial split into west and east Caucasus populations, and then the separation of Abkhaz from Circassian-Shapsug on the western branch and Nakh populations from Dagestan ones on the eastern branch; disagreements could be found only within the Dagestan group. Ossets (linguistic outliers) showed a slight similarity to Abkhaz, as they did on the haplogroup-based tree, as well (Figure 4 left). When the topology of this tree is similar to the linguistic one, the BATWING dates were on average 1.5 times younger. If the evolutionary mutation rate were applied (data not shown), the topology of the BATWING tree would remain the same but the dates would become on average 1.5 times older than corresponding linguistic dates.
Four haplogroups are predominant in the Caucasus (Table 2), and each of them has its own domain (recognizable geographically and also linguistically), where it represents the lion’s share of the regional gene pool. In all other domains, the given haplogroup is infrequent or absent. The robustness of this conclusion is enhanced by the fact that each domain is occupied by more than one population in our data set whose characteristic haplogroup is prevalent in each population of the domain (Table 1). This pronounced structuring of the Y-chromosomal pool of the Caucasus has not previously been reported. In the study of Nasidze et al. (2004), AMOVA computation revealed a lack of correlation with language, while the correlation with geography did not reach the significance threshold. The increased phylogenetic resolution and large sample sizes used in our study were necessary to reveal the links between haplogroups, regions and language groups (Figure 1). Methodologically, the analysis of correlations between geography, language and genetics (Table 3) parallels an earlier study performed with European populations (Rosser et al., 2000), where geography was shown to be the leading factor in Europe. In our analysis, the correlation was much stronger (maximum value 0.64) than in the European study (0.39).
Strikingly, language rather than geography tended to have a larger influence on the genetic structuring in the Caucasus (Tables 3 and and4).4). Language and geography are also closely linked with each other, probably because of the mountainous nature of the Caucasus region where languages are often restricted to a few valleys. In this context, the slightly higher dependence of genetic structure on language could be explained by marriage and individual migration practices, linking linguistically similar populations in preference. For example, Circassians, who are geographically situated between Adyghes and Ossets, might receive more gene flow from Adyghes, who speak a similar language, than from Ossets who differ in their language and culture.
Ossets, who speak an Indo-European language, find their place among populations of the North Caucasus language family. This genetic association is consistent with the physical anthropological evidence (Abramova, 1989; Melyukova, 1989) that Ossets are mainly descendants of indigenous Caucasus populations, who were assimilated by Alans and received from them the present language. Little is known about the language that these populations initially spoke. Note that, on the genetic trees, Ossets join the western (Abkhaz-Adyghe) branch of the North Caucasus family.
Although occupying a boundary position between Europe and the Near East, all four major Caucasus haplogroups show signs of a Near Eastern rather than European origin (Figure 2, Supplementary Figure 1). These four haplogroups reach their maximum (worldwide) frequencies in the Caucasus (Table 2, Figure 2). They are either shared with Near East populations (G2a3b1-P303 and J2a4b*-M67(xM92)) or have ancestral lineages present there (G2a1*-P16(xP18) and J1*-M267(xP58)). Typical European haplogroups are very rare (I2a-P37.2) or limited to specific populations (R1a1a-M198; R1b1b2-M269) in the Caucasus.
This pattern suggests unidirectional gene flow from the Near East towards the Caucasus, which could have occurred during the initial Paleolithic settlement or the subsequent Neolithic spread of farming. Archaeological data do not indicate a Near Eastern influence on the Neolithic cultures in the North Caucasus (Bader and Tseretely, 1989; Bzhania, 1996; Masson et al., 1994), while Neolithization in the Transcaucasus was part of a Neolithic expansion that perhaps paralleled those occurring in Europe (Balaresque et al. 2010) and North Africa (Arredi et al. 2004). However, the current genetic evidence does not allow us to distinguish between Paleolithic and Neolithic models in shaping the genetic landscape of the North Caucasus.
All of these genetic findings are based solely on Y-chromosomal data. This choice was prompted by the high inter-population variation in the data set (and therefore the best detection of the differences) compared with mtDNA and autosomal markers. However, one may wonder if the pattern of the entire gene pool is different from its Y chromosomal subset. In the context of this study, languages are typically learned from the maternal side (we say “mother tongue”; Beauchemin et al., 2010). Thus the observed similarity between the distributions of languages and genes might become even more evident if full-genome data, incorporating maternally-inherited information as well, become available; this possibility may be explored in future studies.
To estimate the ages of population splits, we employed both “evolutionary” and ”genealogical” mutation rates for calibration and used four different methods, namely the ρ estimator, BATWING dating of the clusters and the population splits, and ASD microsatellite variation.
The reliability of the ρ estimator has been explored by Cox (2008) using simulations under a number of demographic models. He found that the mean age is biased only slightly, but the confidence intervals might not contain the true value in 34% cases for the simplest model (constant effective population size, Ne = 1000). In some demographic conditions, the error rate increases, particularly when samples sizes are below 25, or when Ne is large, unstable (bottlenecks) or growing. The Ne estimates available for the Caucasus populations are small (Ne = 187 on average; Pocheshkhova, 2008), which might indicate that the error rate should be low. However, Caucasus populations did grow and bottlenecks could not be excluded. In fact, the most pronounced demographic feature of the Caucasus populations is their high degree of subdivision; fortunately, “error rates of molecular dating with the ρ statistic are unaffected by simple population subdivision” (Cox, 2008). Therefore, we might expect ~34% of our clusters (Table 5, Figure 6) to have actual ages falling outside the indicated confidence intervals. This factor, which randomly affects only one-third of the clusters, would not eliminate the overall agreement with linguistics seen from the Figure 6, although it highlights the fact that genetic dates for each particular branch should be taken with caution.
We found that “evolutionary” estimates of most clusters fall far outside the range of the respective linguistic dates, while “genealogical” estimates gave a good fit with the linguistic dates. At least two population events in the Caucasus are documented archaeologically, which allows additional comparison with these “historical” dates. In both cases, the historical (archaeological) date is similar to a genetic estimate based on the “genealogical” mutation rate (Supplementary Note 2). In this regard, a study of the link between Y-chromosomes and British surnames working with time intervals close to those analyzed here obtained a mutation rate of 1.5 × 10−3 (King and Jobling, 2009). This rate is similar to the “genealogical” rather than the “evolutionary” rate, and provided good agreement with the historical dates for the surname ages.
The “evolutionary” rate (Zhivotovsky et al., 2004) was calibrated using two contrasting populations (Maori and Roma). The fact that, in the Caucasus, the genealogical rate provides a better fit with history and linguistics might be partly explained by the dependence of the estimated intra-lineage variance (and therefore age) on the way that clusters are selected in the network. We selected larger clusters, containing 11 haplotypes on average. However, when the evolutionary calibration was performed (Zhivotovsky et al., 2004), large data sets were not available and clusters contained only 2-4 haplotypes. For example, the Zhivotovsky et al. (2004) study chose two founders in the Polynesian network (Figure 5B), while, in our study, we considered similar topologies as a single cluster (see cluster γ in Figure 5A for comparison). These subclusters (Figure 5B) might be justified in the case of the peopling of New Zealand because they could originate in the homeland before the migration to New Zealand. In our case, however, we considered clusters within the networks because we were interested in the whole history of the clusters that had arisen in situ within the Caucasus. To avoid arbitrarily identifying the clusters, we followed a set of formal rules, as described above.
It should be mentioned here that, for the BATWING tree (which does not require identifying the clusters), applying the genealogical rate underestimates the dates, while applying evolutionary rates overestimates the dates. These comparisons were made with 14 linguistic dates, but more sophisticated modeling and calibrations in other regions are needed to find the most appropriate way to incorporate mutation rate estimates into population-genetic applications. Our study shows that the results could be affected by the method of identifying the clusters and particularly by the chosen methods of dating (ρ, BATWING, SD).
Combining genetic and linguistic findings, we now propose a model of the evolution of the Caucasus populations. The final tree (Figure 6) was obtained by merging the genetic clusters with the background linguistic tree. We conclude that the Caucasus gene pool originated from a subset of the Near Eastern pool due to an Upper Paleolithic (or Neolithic) migration, followed by significant genetic drift, probably due to isolation in the extremely mountainous landscape. This process would result in the loss of some haplogroups and the increased frequency of others. The Caucasus meta-population underwent a series of population (and language) splits. Each population (linguistic group) ended up with one major haplogroup from the original Caucasus genetic package, while other haplogroups became rare or absent in it. The small isolated population of the Kubachi, in which haplogroup J1*-M267(xP58) became virtually fixed (99%, Table 2), exemplifies the influence of genetic drift there. During population differentiation, haplotype clusters within haplogroups emerged and expanded, often becoming population-specific. The older clusters became characteristic of groups of populations. Many younger clusters were specific to individual populations (typically speaking different languages).
We note that the method of inferring the topology of the genetic tree of the Caucasus populations does not require the inference of any mutation rates, and the result is strikingly concordant with the topology of the linguistic tree. Mutation rates are required only for adding a time scale to both trees. Based on the topologies of trees generated from both the genetic and linguistic data, the inference of the parallel evolution of genes and languages in Caucasus is supported, despite controversies about the mutation rates. This study of Caucasus Y-chromosomal variation demonstrates that genetic and linguistic diversification were two parallel processes or, perhaps more precisely, two sides of the same process of evolution of the Caucasus meta-population over hundreds of generations.
We thank the people from the Caucasus populations who provided their DNA for the present analysis, Andrey Pshenichnov and Roman Sychev for help in compiling Y-base, Vadim Urasin for collecting STR data from DNA-genealogical websites, Valery Zaporozhchenko for a pilot study of the STR variation of haplogroup G-M201 in the West Caucasus, Kazima B. Bulaeva for consultation about endogamy levels in the East Caucasus, Vladimir Kharkov for help with first version of the phylogenetic network, Matt Kaplan for replication of 20 samples as part of the quality control procedures, Janet Ziegle for technical assistance and two anonymous reviewers for helpful comments. The first author thanks Richard Villems for instruction in phylogenetic methods and helpful discussions.
FUNDING This work received primary support from the Genographic Project and additional support from the Russian Foundation of Basic Research (grants 10-07-00515, 10-06-00451, 10-04-01603). CTS was supported by The Wellcome Trust.
Genographic Consortium members: Syama Adhikarla (Madurai Kamaraj University, Madurai, Tamil Nadu, India), Christina J. Adler (University of Adelaide, South Australia, Australia), Danielle A. Badro (Lebanese American University, Chouran, Beirut, Lebanon), Jaume Bertranpetit (Universitat Pompeu Fabra, Barcelona, Spain), Andrew C. Clarke (University of Otago, Dunedin, New Zealand), David Comas (Universitat Pompeu Fabra, Barcelona, Spain), Alan Cooper (University of Adelaide, South Australia, Australia), Clio S. I. Der Sarkissian (University of Adelaide, South Australia, Australia), Matthew C. Dulik (University of Pennsylvania, Philadelphia, Pennsylvania, United States), Christoff J. Erasmus (National Health Laboratory Service, Johannesburg, South Africa), Jill B. Gaieski (University of Pennsylvania, Philadelphia, Pennsylvania, United States), ArunKumar GaneshPrasad (Madurai Kamaraj University, Madurai, Tamil Nadu, India), Angela Hobbs (National Health Laboratory Service, Johannesburg, South Africa), Asif Javed (IBM, Yorktown Heights, New York, United States), Li Jin (Fudan University, Shanghai, China), Matthew E. Kaplan (University of Arizona, Tucson, Arizona, United States), Shilin Li (Fudan University, Shanghai, China), Begoña Martínez-Cruz (Universitat Pompeu Fabra, Barcelona, Spain), Elizabeth A. Matisoo-Smith (University of Otago, Dunedin, New Zealand), Marta Melé (Universitat Pompeu Fabra, Barcelona, Spain), Nirav C. Merchant (University of Arizona, Tucson, Arizona, United States), R. John Mitchell (La Trobe University, Melbourne, Victoria, Australia), Amanda C. Owings (University of Pennsylvania, Philadelphia, Pennsylvania, United States), Laxmi Parida (IBM, Yorktown Heights, New York, United States), Ramasamy Pitchappan (Madurai Kamaraj University, Madurai, Tamil Nadu, India), Lluis Quintana-Murci (Institut Pasteur, Paris, France), Daniela R. Lacerda (Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil), Ajay K. Royyuru (IBM, Yorktown Heights, New York, United States), Fabrício R. Santos (Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil), Himla Soodyall (National Health Laboratory Service, Johannesburg, South Africa), Pandikumar Swamikrishnan (IBM, Somers, New York, United States), Kavitha Valampuri John (Madurai Kamaraj University, Madurai, Tamil Nadu, India), Arun Varatharajan Santhakumari (Madurai Kamaraj University, Madurai, Tamil Nadu, India), Pedro Paulo Vieira (Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil), Janet S. Ziegle (Applied Biosystems, Foster City, California, United States).