Given the X chromosome's disproportionate sensitivity to female demography, it is possible that X-linked genomic variation has a different underlying population structure than autosomal variation. To investigate this, we analyzed the X chromosome data with frappe
], a maximum likelihood based method that establishes K ancestry groups based on allele frequency patterns and then assigns each individual K percentages that correspond to his or her proportional membership in each group. As about two-thirds of the individuals in our sample are haploid for the X chromosome while one-third are diploid, we ran frappe
on individual X chromosomes rather than on individuals. The results of this analysis with K = 7 are shown in Figure 1a. The X chromosomes are partitioned into seven clusters that correspond to the seven major continental cohorts - Africa, the Middle East, Europe, Central Asia, East Asia, Oceania, and America - represented in the CEPH-HGDP sample set. (In contrast, data from 20 X-linked microsatellites could resolve the CEPH-HGDP samples into only 5 distinct groups [2]). These are the same 7 groups that were observed when frappe was run on 640,698 autosomal markers [3]. The major difference between these previous autosomal results and Figure 1a is the failure of the Eurasian X chromosomes to cleanly separate into Middle Eastern, European, and Central Asian groups. While most Middle Eastern, European, and Central Asian X chromosomes have their largest contributions from their respective continents of origin, most also have sizable contributions from the other two Eurasian continents. This suggests a lack of clear genetic distinction between chromosomes originating from these three continents. Nonetheless, the X chromosome still carries sufficient genetic information to reveal certain details of population structure that were previously noted for the large autosomal dataset. For instance, in both figures the Adygei have significant European and Central Asian contributions, the Hazara and Uygur have primarily East Asian ancestry, and a handful of Sindhi, Makrani, Brahui, and Balochi individuals have sizable African contributions. Next, to assess whether the differences we observed between our X chromosome frappe
results and the Li et al.
] autosomal results were due to the number of markers used in each analysis, we ran frappe
on just the 19,632 markers found on chromosome 16. As with the X chromosome, the analysis was conducted using haploid chromosomes as opposed to diploid individuals. The results of the frappe
run for K = 7 are shown in Figure 1b. Overall, the results for chromosome 16 appear very similar to those for the X chromosome. There are some minor differences between the two figures, particularly in the way some admixed populations are partitioned among the seven groups (note, for instance, the larger European component in the Yakuts and the larger Middle Eastern component in the Adygei for the autosomes). However, these differences may be artifacts of the failure of both datasets (the X chromosome and chromosome 16) to cleanly separate into the three Eurasian continental groups rather than robust differences in the population structure of autosomal and X-linked SNP genotypic variation. For completeness, we also ran frappe
on diploid individuals. We did this first for the X chromosome by running frappe
on all CEPH-HGDP females plus additional 'pseudofemales' created by randomly pairing two male X chromosomes from the same population. We then ran frappe
on diploid individuals for chromosome 16 using the same number of females and 'pseudofemales'. This time 'pseudofemales' were created by randomly selecting one chromosome 16 from each male and then pairing these chromosomes within populations. The results of these analyses are shown in Additional file 1
and are quite similar to the results from running frappe
on individual chromosomes. Also, to ensure that our choice of chromosome 16 to represent the autosomes did not bias our results, we ran frappe
on individual chromosomes for chromosome 17. The results are largely the same (Additional file 1), except that we observe less resolution between Middle Eastern and European chromosomes for chromosome 17. We conclude from this analysis that there are no major differences in the population structure suggested for the CEPH-HGDP populations by approximately 16,000 X-linked SNPs and a similar number of autosomal SNPs.
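The 'pseudofemale' construction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used for the analysis: the function name `make_pseudofemales`, the dictionary layout, the toy haplotypes, and the decision to drop an unpaired leftover chromosome when a population has an odd number of males are all hypothetical.

```python
import random

def make_pseudofemales(male_haplotypes_by_pop, seed=0):
    """Pair male chromosomes at random within each population.

    male_haplotypes_by_pop maps a population name to a list of haplotypes
    (each haplotype is a list of alleles, one per SNP).
    Assumption: with an odd number of males, one chromosome stays unpaired
    and is dropped.
    """
    rng = random.Random(seed)
    pseudofemales = {}
    for pop, haps in male_haplotypes_by_pop.items():
        shuffled = haps[:]          # copy so the input is not reordered
        rng.shuffle(shuffled)
        pseudofemales[pop] = [
            (shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)
        ]
    return pseudofemales

# Hypothetical toy data: five males in PopA, two in PopB, three SNPs each.
males = {
    "PopA": [[0, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0]],
    "PopB": [[1, 1, 1], [0, 0, 1]],
}
pseudo = make_pseudofemales(males)
```

Pairing only within a population keeps each pseudofemale's two chromosomes drawn from the same allele frequency pool, which is what the diploid frappe runs require.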
Figure 1 Structure of the CEPH-HGDP Populations as estimated using frappe. Figures drawn using Distruct. (a) Population structure estimated using 16,297 X chromosome SNP genotypes with K = 7. (b) Population structure estimated using 19,632 chromosome 16 SNP ...
Pairwise allele frequency differences
The AMOVA scores calculated above provide an estimate of how differentiated populations within a particular continental or supracontinental group are from one another. We would expect, though, that the effects of drift and selection would be most pronounced between two genetically distant populations, given the time that these forces have had to affect allele frequencies in each population independently. Because of this, we selected three pairs of distantly related populations - Yoruba-Han, Yoruba-French, and French-Han - and calculated for each autosomal and X-linked marker the pairwise allele frequency difference (termed 'delta' or 'δ' by Shriver et al
]) for each pair. We found that the average delta value was higher for X-linked than for autosomal markers for all three pairs. We also noted that the distributions of X-linked delta values all have a longer 'tail' region than the autosomal distribution for the same population pair. To examine these tail regions more closely, we tallied the number of SNPs for which delta exceeded 0.9 (hereafter referred to as high-delta SNPs) for each population pair (Table ). On the X chromosome, there were no SNPs for which delta > 0.9 in the French-Han comparison, so for this population pair we tallied the number of X-linked SNPs for which delta > 0.8. High-delta SNPs on the autosomes and on the X chromosome often occur in clusters, with each cluster presumably representing a single event, be it drift or selection. To gain a rough estimate of the number of such events, we divided the autosomes into 13,395 200-kb regions; each region containing at least one high-delta SNP was deemed a high-delta region. While some high-delta regions did contain only one high-delta SNP, many contained multiple high-delta SNPs. We carried out the same process with the X chromosome, where there were a total of 744 200-kb regions.
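The tallying procedure described above (compute delta per SNP, flag SNPs above the threshold, then count the 200-kb regions that contain at least one flagged SNP) can be sketched as follows. The function names, the tuple layout, and the toy allele frequencies are hypothetical, and the sketch assumes fixed, non-overlapping windows starting at position 0 of each chromosome, which the text does not specify.

```python
def delta(freq_a, freq_b):
    """Absolute allele frequency difference between two populations."""
    return abs(freq_a - freq_b)

def high_delta_summary(snps, threshold=0.9, region_size=200_000):
    """snps: list of (chrom, position_bp, freq_pop1, freq_pop2).

    Returns (number of high-delta SNPs, number of high-delta regions).
    A region is identified by (chrom, position // region_size).
    """
    high = [(c, pos) for c, pos, fa, fb in snps if delta(fa, fb) > threshold]
    regions = {(c, pos // region_size) for c, pos in high}
    return len(high), len(regions)

# Hypothetical toy data: two high-delta SNPs share a region on chr1,
# one SNP is not high-delta, and one high-delta SNP sits alone on chr2.
example = [
    ("chr1", 150_000, 0.98, 0.03),
    ("chr1", 180_000, 0.97, 0.02),
    ("chr1", 450_000, 0.50, 0.45),
    ("chr2", 1_250_000, 0.01, 0.96),
]
n_snps, n_regions = high_delta_summary(example)
```

In this toy input three SNPs exceed the threshold but they fall in only two regions, which mirrors why region counts are used as a rough proxy for the number of independent events.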
Results of the delta analysis for three population comparisons
Overall, we observed that there were proportionally more high-delta SNPs on the X chromosome than on the autosomes for population pairs with one African and one non-African population (25 out of 16,297 compared to 62 out of 640,698 and 159 out of 16,297 compared to 265 out of 640,698 for the Yoruba-French and Yoruba-Han comparisons, respectively; Table ). For the French-Han comparison, this excess of high-delta SNPs on the X chromosome was not observed. This apparent disparity between the three population pairs could be explained by a female-specific bottleneck during the out of Africa migrations as recently suggested by Keinan et al
]. When there are equal numbers of males and females, the X chromosome is more heavily influenced by drift than the autosomes due to its smaller population size; this effect is exaggerated when there are fewer females than males. But is drift alone sufficient to explain the excess X-linked high-delta SNPs found for the Yoruba-Han and Yoruba-French pairs? To address this question, we utilized an equation developed by Segurel et al.
] that expresses the expected relationship between X-linked and autosomal Fst values in terms of Nf/N, the female proportion of the effective population size, and mf/m, the female proportion of the total migration rate. This equation was derived from known relationships between Fst values and male and female migration rates and effective population sizes under the infinite island model with populations of equal and constant size. We used the equation to obtain expected delta values for the X-linked SNPs from the observed autosomal delta values. If autosomal and X-linked markers differed collectively only by the relative effects of drift, transformed autosomal delta values (expected X-linked values) should not differ statistically from observed X-linked values. We applied this transformation to our three lists of autosomal delta values, varying Nf/N and mf/m from 0.01 to 0.99. As the female proportion of the effective population size and migration rate in humans has likely varied widely across time and geographical distance, we wanted to test across all possible values of Nf/N and mf/m, including (Nf/N, mf/m) pairs where Nf/N < 0.5, as such pairs represent female-specific bottlenecks (that is, more than half of the effective population is male).
Having transformed each of our three lists of autosomal delta values for all possible pairs of Nf/N and mf/m such that 0.01 ≤ Nf/N, mf/m ≤ 0.99, we tabulated the number of high-delta SNPs in each of the resulting lists of transformed autosomal/expected X chromosome (hereafter referred to as TA/EX) values (for the French-Han pair, we tabulated the number of SNPs with delta exceeding 0.8). The results are shown in Figure 2 for the Yoruba-Han population pair and in Additional file 3 for the Yoruba-French and French-Han pairs. (In the transformation, the values of Nf/N and mf/m are combined into a single term, given by (1 + mf/m)/(2 - Nf/N). Because Nf/N and mf/m are combined this way, there are multiple (Nf/N, mf/m) value pairs that produce the same TA/EX delta values. This feature of the Segurel et al
] transformation creates the diagonal bands of color in Figure 2 and Additional file 3). We see that for the TA/EX delta values to contain the same number of high-delta SNPs (or SNPs where delta exceeds 0.8 for the French-Han pair) as were observed on the X chromosome, extreme values must generally be used for both Nf/N and mf/m (for the Yoruba-French pair, Nf/N must be less than 0.08 and mf/m must be less than 0.05, and for the Yoruba-Han pair, there are, in fact, no such values). Having transformed our three lists of autosomal delta values for all pairs of Nf/N and mf/m such that 0.01 ≤ Nf/N, mf/m ≤ 0.99 and re-tabulated the number of high-delta SNPs in each (alternatively, the number of SNPs with delta exceeding 0.8 for the French-Han pair), we also assigned these SNPs to one of the 13,395 autosomal regions. The resulting tallies of high-delta regions represented by each list of TA/EX delta values are shown in Figure 2 for the Yoruba-Han pair and in Additional file 3 for the Yoruba-French and French-Han pairs. Again we see that for the TA/EX delta values to contain the same number of high-delta regions (or regions containing a SNP where delta exceeds 0.8 for the French-Han pair) as were observed on the X chromosome, low values must generally be used for both Nf/N and mf/m (for the Yoruba-French pair, Nf/N must be less than 0.52 and mf/m must be less than 0.36, while for the Yoruba-Han pair, Nf/N must be less than 0.29 and mf/m must be less than 0.18).
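The diagonal banding noted above follows directly from the combined term (1 + mf/m)/(2 - Nf/N): different parameter pairs can give exactly the same value. The sketch below only evaluates that term over the 0.01 to 0.99 grid; it does not implement the full transformation of delta values, which is not reproduced in this section. The function and variable names are hypothetical, and exact fractions are used so that equal terms compare equal without rounding.

```python
from collections import defaultdict
from fractions import Fraction

def combined_term(nf_over_n, mf_over_m):
    """Single term through which Nf/N and mf/m enter the transformation."""
    return (1 + mf_over_m) / (2 - nf_over_n)

# Grid of 0.01, 0.02, ..., 0.99 for both parameters, as exact fractions.
grid = [Fraction(i, 100) for i in range(1, 100)]

# Group parameter pairs by the value of the combined term.
bands = defaultdict(list)
for nf in grid:
    for mf in grid:
        bands[combined_term(nf, mf)].append((nf, mf))

# All pairs with Nf/N + mf/m = 1 give a combined term of exactly 1,
# so they would yield identical TA/EX values (one diagonal band).
unit_band = bands[Fraction(1)]
```

For example, (Nf/N, mf/m) = (0.5, 0.5) and (0.2, 0.8) both give a term of 1, so the grid contains far fewer distinct outcomes than parameter pairs.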
Figure 2 Number of high-delta SNPs and regions represented by TA/EX values for the Yoruba-Han comparison. (a) The female proportion of the effective population size and the female proportion of migration were both varied over a range from 0.01 to 0.99. For each ...
It is possible, of course, that we observe a large number of X-linked high-delta SNPs because the populations under study here were characterized by low values for Nf/N and mf/m (due to, for instance, population bottlenecks; Additional file 4). To assess which values of Nf/N and mf/m are most consistent with the distributions of autosomal and X-linked delta values that we observe, we again varied Nf/N and mf/m from 0.01 to 0.99. We then compared each resulting list of TA/EX delta values to the observed X-linked values using a two-sided Wilcoxon test. The results of this analysis are shown in Figure 3 for the Yoruba-Han pair and Additional file 5 for the Yoruba-French and French-Han pairs. By comparing the results shown in Figures 2 and 3, one can see that the overall distributions of TA/EX and observed X-linked delta values are most similar for sets of TA/EX delta values with proportionally fewer high-delta SNPs than were observed on the X chromosome. This indicates that while there are (Nf/N, mf/m) value pairs that produce TA/EX delta values with proportionally similar numbers of high-delta SNPs compared to what was observed for the X chromosome, these (Nf/N, mf/m) pairs are not consistent with the distributions of autosomal and X-linked delta values that we observe. Overall, our results here suggest that even after accounting for the differential effects of drift on the X chromosome and the autosomes, there have been proportionally more events affecting the X chromosome that cause significant allele frequency changes resulting in high-delta SNPs. The above analyses were also carried out using pairwise Fst values in place of delta with similar results (Additional files 6); an excess of high Fst SNPs and regions was observed on the X chromosome for the Yoruba-Han and Yoruba-French pairs and an excess of SNPs with Fst > 0.8 was observed on the X chromosome for the French-Han pair.
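The comparison of a TA/EX list against the observed X-linked values can be illustrated with a rank-based two-sample test. The sketch below assumes that the 'two-sided Wilcoxon test' is the rank-sum (Mann-Whitney) form for two independent samples, and it uses a normal approximation with a tie correction and no continuity correction; both choices are assumptions, and the function name `rank_sum_test` and the toy inputs are hypothetical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2      # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t            # used in the tie-corrected variance
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical toy inputs: identical samples, then clearly shifted samples.
u_same, p_same = rank_sum_test([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
low = [0.01 * i for i in range(50)]
high = [0.5 + 0.01 * i for i in range(50)]
u_shift, p_shift = rank_sum_test(low, high)
```

In the grid analysis, one such test would be run per (Nf/N, mf/m) pair, with larger p-values indicating parameter values whose expected X-linked distribution is harder to distinguish from the observed one.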
Figure 3 Comparison of TA/EX and observed X-linked delta values. The female proportion of the effective population size and the female proportion of migration were both varied over a range from 0.01 to 0.99. For each of the 9,800 possible pairs of these values, ...
Previous studies (Coop et al
]; Barreiro et al
]) have also noted that a disproportionate number of high-delta and high-Fst SNPs lie within coding regions. We did not necessarily expect to make the same observation for the X chromosome, since the hitchhiking of non-coding variants on selected genic alleles is likely to be more common on the X chromosome. Indeed, while 32% (5,213 out of 16,297) of our X-linked markers are in genes, we found that only 26.4% of all Yoruba-Han high-delta SNPs were located within genes on the X chromosome. However, after removing a large cluster of high-delta SNPs (one that contained 68 high-delta SNPs, including 65 non-coding ones) from consideration, this percentage jumped to 44.4%. SNPs with large allele frequency differences in the other two population comparisons were also commonly found in genes. Of the SNPs with delta > 0.8 in the French-Han comparison, 52% were genic, as were 76% of the high-delta SNPs from the Yoruba-French comparison (Table ). In general, we observed that bins of X-linked high-delta SNPs were enriched for genic SNPs, while bins of X-linked SNPs with delta values closer to 0 were not (Additional file 8). This observation could be explained by an excess of genic SNPs with a minor allele frequency ≤ 0.1. However, we detected no such excess but noted that high-delta SNPs simply occur more frequently among genic SNPs where the minor allele frequency ≤ 0.1 than among non-genic SNPs meeting the same criterion. These findings suggest that at least some of the high-delta regions we have identified on the X chromosome have undergone selective sweeps, as selection is more likely to have targeted coding variants than non-coding variants; drift acting alone would be expected to influence coding and non-coding variation equally.
Characteristics of X-chromosomal high-delta SNPs
For each of the X chromosomal high-delta SNPs, we determined which allele was derived and which ancestral using information from two chimpanzees that were genotyped along with the HGDP samples in Li et al.
] and information from the NCBI website [18]. We were able to determine the ancestral state for the majority of the autosomal and X-linked high-delta SNPs. For the Yoruba-French comparison, 3 out of 25 (12%) high-delta SNPs had a high derived frequency in the Yorubans, and for the Yoruba-Han comparison, 26 out of 159 (16.4%) high-delta SNPs had a high derived frequency in the Yorubans. For the autosomes, we found that only 5 out of 58 (8.6%) high-delta SNPs had a high derived allele frequency in Africa in the Yoruba-French comparison; that figure was 18 out of 247 (7.3%) in the Yoruba-Han comparison (Table ). The percentage of X-linked high-delta SNPs with high derived allele frequency in Africa significantly exceeds (chi square test, P < 0.001) that for the autosomes in the Yoruba-Han comparison; this could be explained by a higher incidence of hitchhiking on the X chromosome compared to the autosomes. An alternative, and intriguing, possibility is that the X chromosome has been affected by a disproportionate number of selective sweeps or drift events (for example, bottlenecks) involving derived alleles in Africa. Looking back to our identification of genic and non-genic high-delta SNPs, we found some evidence that selection may indeed be a player in this observation. Recall that for the Yoruba-Han comparison (when we excluded the one exceptional high-delta region, 65.5 to 67 Mb), 44.4% of all high-delta SNPs were in genic regions. If we take only those high-delta SNPs that have high derived allele frequency in Africa, this increases to 50%. Similarly, all three high-delta SNPs from the Yoruba-French comparison with high derived frequency in the Yorubans are found in genes.
Tests of selection (iHS, CLR, XP-EHH)
To investigate the relative importance of drift and selection in creating large interpopulation allele frequency differences on the X chromosome, we wanted to ascertain whether X-linked high-delta SNPs tend to occur in regions where the haplotype structure is consistent with the past influence of selection. We subjected our dataset to three tests - integrated haplotype score (iHS), combined likelihood ratio (CLR), and cross population extended haplotype homozygosity (XP-EHH) - that were designed to produce high scores in chromosomal regions that have been involved in selective sweeps. Although we will refer to iHS, CLR, and XP-EHH as 'tests of selection', it should be remembered that these tests identify regions where selection may have influenced allele frequencies or haplotype patterns; demographic forces are always a possible explanation for one high iHS, CLR, or XP-EHH score or an entire set of elevated scores, including the scores we report below. CLR and XP-EHH are most sensitive to nearly completed sweeps [19], while iHS is useful for detecting on-going, partial sweeps [21]. iHS, CLR, and XP-EHH were run on each of the eight continental groups - African agriculturists, African hunter-gatherers, Middle Eastern, European, Central Asian, East Asian, Oceanian, and American - individually (CLR and XP-EHH were also calculated for selected individual populations; Additional file 9). Then, following recommendations from previous work [10], we divided the X chromosome into 372 400-kb regions and, for each continental group, calculated one iHS, CLR, and XP-EHH score for each region using the raw scores from that region (see Materials and methods for details). In order to briefly characterize the results of these calculations, we selected the top ten regions with respect to test value for each test in each continental group and displayed them in Figure 4. As can be seen, the distribution of top regions across the X chromosome and the relationship between the lists of top regions across continents is rather different for iHS than for CLR and XP-EHH. Top iHS regions are rarely consecutive for any given continent and the same region is typically not highlighted for more than one continent. The observation that high iHS signals are often not shared across geographical regions has been commented on previously [10]. As top iHS signals do not tend to cluster in adjacent chromosomal regions, and as iHS results do not generally overlap with CLR and XP-EHH scores (since iHS alone detects sweeps in progress), we suggest that it is difficult to use iHS by itself to detect targets of past selection; here we use iHS results only as additional, complementary evidence to argue for past selection at a given site on the X chromosome. Unlike iHS, sharing of top signals between certain continents is noticeable with CLR and XP-EHH. Despite a deep phylogenetic split between the two groups (see Figure 1B in [3]), the top CLR and XP-EHH signals for African agriculturists and hunter-gatherers cluster in the same two regions, 62.2 to 63 Mb and 91.4 to 92.2 Mb, respectively. Neither of these regions produces top CLR or XP-EHH signals for any of the other continental groups. The sharing of these top signals despite long-standing genetic separation could suggest a genomic response to a selective pressure produced by a common African environment. Eurasian groups also tend to produce top CLR and XP-EHH scores in the same X-chromosomal regions. This is not surprising given their close genetic relationship, and was also observed for autosomal CLR and XP-EHH scores [10]. However, for several consecutive chromosomal regions, 109.4 to 111.4 Mb, top CLR and XP-EHH signals appear not only in Eurasia, but also in East Asia, Oceania, and America. Specifically, this segment of the X chromosome produces the top two CLR scores for Europe, Central Asia, and East Asia, the top two XP-EHH scores for Central Asia, East Asia, Oceania, and America, and top iHSs in the Middle East and Central Asia; it will be discussed in more depth in a subsequent section.
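The binning and ranking step described above can be sketched as follows. The way raw scores are summarized within a 400-kb region is given in Materials and methods and is not reproduced here, so this sketch assumes, purely for illustration, that each region takes the maximum raw score it contains. The function name `top_regions` and the toy scores are hypothetical.

```python
from collections import defaultdict

def top_regions(raw_scores, region_size=400_000, n_top=10):
    """raw_scores: iterable of (position_bp, score) for one test and group.

    Returns the n_top regions as (region start in Mb, region score),
    highest score first.
    Assumption: a region's score is the maximum raw score inside it.
    """
    by_region = defaultdict(list)
    for pos, score in raw_scores:
        by_region[pos // region_size].append(score)
    region_score = {idx: max(vals) for idx, vals in by_region.items()}
    ranked = sorted(region_score, key=region_score.get, reverse=True)
    return [(idx * region_size / 1e6, region_score[idx]) for idx in ranked[:n_top]]

# Hypothetical toy scores spanning three 400-kb regions.
raw = [(100_000, 1.2), (350_000, 3.4), (450_000, 0.5), (900_000, 2.8)]
top_two = top_regions(raw, n_top=2)
```

Running the same function once per test and per continental group yields the lists of top regions that are compared across continents.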
Figure 4 Top ten regions for iHS, CLR, and XP-EHH scores and all high-delta regions. The numbers on the left side of each figure represent the beginning position of a region in megabases. Each chromosomal region is 400-kb in length. The letters across the top ...
Figure 4 displays all of the 61 400-kb X chromosome regions that contain either a high-delta SNP in the Yoruba-French or Yoruba-Han comparison, or a SNP with a delta value > 0.8 in the French-Han comparison. We found that 31 of these regions also produced a top iHS, CLR, or XP-EHH score for at least one continent. As iHS and XP-EHH are based on haplotype frequencies, scores for these two tests and delta values are not expected to be totally independent of one another (although the overall correlation between delta values and test scores seems to be fairly low; for example, the Pearson correlation between Yoruba-Han delta values and raw XP-EHH scores in East Asia is only 0.1764). However, the presence of a high Fst SNP in a genomic region producing a high XP-EHH score has previously been taken as evidence that the region is a true target of selection rather than a false positive [20]. It is also interesting to note that these 31 regions were not a random sample of the 61 high-delta regions. There were 4 clusters along the X chromosome - 18.8 to 20.8 Mb, 65 to 67.4 Mb, 72.2 to 74.2 Mb, and 108.6 to 110.6 Mb - of 4 or more consecutive high-delta regions, and of the 23 individual 400-kb regions in these clusters, 20 produced top iHS, XP-EHH, or CLR scores. Conversely, of the 19 high-delta regions that occurred in isolation (that is, they were not bordered on either side by another high-delta region), only 3 produced a top iHS, CLR, or XP-EHH score. It seems then that X-linked high-delta SNPs, particularly those that occur in clusters along the chromosome, tend to be found in chromosomal regions where iHS, CLR, and XP-EHH suggest that the haplotype structure is consistent with selection at that site.
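The distinction drawn above between clustered and isolated high-delta regions amounts to finding runs of consecutive region indices. The sketch below uses hypothetical region indices and a hypothetical set of top-scoring regions; it does not reproduce the study's data, and the function name `consecutive_runs` is an assumption.

```python
def consecutive_runs(region_indices):
    """Group region indices into runs of adjacent regions."""
    runs = []
    for idx in sorted(set(region_indices)):
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)     # extends the current run
        else:
            runs.append([idx])       # starts a new run
    return runs

# Hypothetical indices of 400-kb regions containing a high-delta SNP,
# and hypothetical regions with a top iHS, CLR, or XP-EHH score.
high_delta = [3, 10, 11, 12, 13, 14, 20, 21, 30]
top_scoring = {11, 12, 13, 14, 30}

runs = consecutive_runs(high_delta)
clusters = [r for r in runs if len(r) >= 4]          # 4 or more consecutive
isolated = [r[0] for r in runs if len(r) == 1]       # no high-delta neighbor
cluster_hits = sum(idx in top_scoring for r in clusters for idx in r)
isolated_hits = sum(idx in top_scoring for idx in isolated)
```

Comparing `cluster_hits` with `isolated_hits`, each relative to the number of regions in its class, gives the kind of contrast reported in the text (20 of 23 clustered regions versus 3 of 19 isolated ones).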
Finally, we wanted to evaluate whether our results support the conclusions of any of the previous studies of selection on the X chromosome. As most of these studies have been conducted with a few populations at the most, we were particularly interested in whether there was evidence for selection at previously implicated chromosomal sites but in populations not previously studied. Several X-linked genes have been suggested as selection targets in earlier studies (Table ). Many of them belong to a class known as cancer/testis, or CT genes. While the molecular functions of many CT gene products are not well understood, most are believed to play a role in spermatogenesis [22]. The remaining genes (listed in Table as 'other genes') were investigated in single gene studies and are associated with a particular Mendelian trait of interest. The numbered regions listed in Table were the top X-linked regions based on XP-EHH score identified as part of a full genome survey [20]. These XP-EHH scores were calculated using 3 million SNPs typed in the HapMap samples. To determine whether our work supported the hypothesis of selection acting on these regions and the genes discussed above, we tabulated and averaged all the CLR scores, XP-EHH scores, and delta values occurring within a given gene or region. We then compared this average score to averages of all other sets of consecutive scores of the same size (for instance, if there are x CLR scores in region A, we calculated the average CLR score for all other x-sized sets of consecutive CLR scores). If the average score for our region or gene of interest was higher than the averages of 95% of the other such regions, we considered this evidence of selection on this chromosomal region. The results of this analysis are outlined in Figure 5. We saw no evidence of selection (by our criteria) for most of the non-CT genes from our literature survey. The only two exceptions were DMD, which produced high XP-EHH scores in both African groups, and G6PD, which produced high CLR scores in Oceania. Several CT genes did contain high CLR and/or XP-EHH scores, most notably MAGEA10, which contained high XP-EHH scores in all three Eurasian continents and Oceania, although these were not accompanied by significantly elevated CLR scores. We also found some evidence for selection in seven of the regions outlined by Sabeti et al
]. Importantly, for six out of seven of these regions, evidence for selection was found in continental groups not represented in the HapMap samples, allowing us to more fully define the geographic extent of these putative selective events.
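The empirical criterion described above (average the scores inside a candidate gene or region, then ask whether that average exceeds the averages of 95% of all other equally sized sets of consecutive scores) can be sketched as follows. The sketch assumes that 'all other' sets means every window of the same size along the chromosome except the candidate itself, including windows that overlap it; the function name and the toy scores are hypothetical.

```python
def window_exceeds_95pct(scores, start, size, cutoff=0.95):
    """scores: list of scores ordered by chromosomal position.

    Returns (candidate average, fraction of other windows with a lower
    average, whether that fraction reaches the cutoff).
    """
    target = sum(scores[start:start + size]) / size
    others = [
        sum(scores[i:i + size]) / size
        for i in range(len(scores) - size + 1)
        if i != start
    ]
    frac_below = sum(avg < target for avg in others) / len(others)
    return target, frac_below, frac_below >= cutoff

# Hypothetical toy scores: a flat background with one elevated stretch.
scores = [1.0] * 40
scores[20:23] = [9.0, 9.5, 9.2]

peak_avg, peak_frac, peak_flag = window_exceeds_95pct(scores, start=20, size=3)
flat_avg, flat_frac, flat_flag = window_exceeds_95pct(scores, start=0, size=3)
```

Because the reference distribution is built from the same chromosome and the same window size, the criterion adjusts for the number of scores in each candidate region without assuming any particular score distribution.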
Chromosome position of X-linked genes and regions found to be under selection in previous studies
Figure 5 Previously implicated X-linked selection targets with elevated CLR, XP-EHH, or delta scores. Each row represents a particular delta comparison or a particular test as labeled. Each column represents an X-chromosomal region or locus that previous research ...