We analyzed genome-wide SNP data from Phase 1 of the International HapMap Project [18]. These data consist of ~800,000 polymorphic SNPs in a total of 209 unrelated individuals. For the purpose of our analyses, we grouped the data into three distinct population samples of unrelated individuals, as follows (see Materials and Methods): 89 Japanese and Han Chinese individuals from Tokyo and Beijing, respectively, henceforth denoted as East Asian (ASN); 60 individuals of northern and western European origin (CEU); and 60 Yoruba (YRI) from Ibadan, Nigeria. Except where stated, our analyses focused on the autosomes only. Our analysis was based on haplotypes estimated by the HapMap Consortium using the program Phase 2 [18]. We estimated genome-wide, high-resolution LD-based recombination maps separately for all three samples, using our implementation of the Li and Stephens algorithm [24].
The goal of our study is to identify loci where strong selection has driven new alleles up to intermediate frequency. Such alleles might be on their way to fixation, or might become balanced polymorphisms. The classic signal of strong directional selection is that, because the favored allele increases in frequency very rapidly, it tends to sit on an unusually long haplotype of low diversity. Meanwhile, chromosomes that do not carry the selected allele have levels of diversity and LD that are more typical of the genome as a whole. This type of signal has been used in the past to argue for selection in a number of genes in humans and in Drosophila.
Since our data consist of pre-ascertained SNPs, they do not directly contain information about the underlying levels of nucleotide diversity. Nonetheless, we can expect that favored alleles will generally sit within large shared haplotypes, and that these haplotypes will be in sharp contrast with the more variable haplotypes on the unselected background (panel A of the EHH decay figure). In order to pursue this type of signal for genome-wide SNP data, we have developed a new test statistic that we denote iHS (integrated haplotype score). The iHS was chosen after performing extensive simulations to determine the most powerful statistic from a number of new and previously published test statistics (see below and Figure S1).
Decay of EHH in Simulated Data for an Allele at Frequency 0.5
Our new test begins with the EHH (extended haplotype homozygosity) statistic proposed by Sabeti et al. [5]. The EHH measures the decay of identity, as a function of distance, of haplotypes that carry a specified “core” allele at one end. For each allele, haplotype homozygosity starts at 1, and decays to 0 with increasing distance from the core site (panel B of the EHH decay figure). As shown in the figure, when an allele rises rapidly in frequency due to strong selection, it tends to have high levels of haplotype homozygosity extending much further than expected under a neutral model. Hence, in plots of EHH versus distance, the area under the EHH curve will usually be much greater for a selected allele than for a neutral allele. In order to capture this effect, we compute the integral of the observed decay of EHH away from a specified core allele until EHH reaches 0.05. This integrated
EHH (iHH) (summed over both directions away from the core SNP) will be denoted iHHA or iHHD, depending on whether it is computed with respect to the ancestral or derived core allele. Finally, we obtain our test statistic iHS using

\[ \text{unstandardized iHS} = \ln\left(\frac{\mathrm{iHH}_A}{\mathrm{iHH}_D}\right). \]

When the rate of EHH decay is similar on the ancestral and derived alleles, iHHA/iHHD ≈ 1, and hence the unstandardized iHS is ≈ 0. Large negative values indicate unusually long haplotypes carrying the derived allele; large positive values indicate long haplotypes carrying the ancestral allele. Since in neutral models, low frequency alleles are generally younger and are associated with longer haplotypes than higher frequency alleles, we adjust the unstandardized iHS to obtain our final statistic, which has mean 0 and variance 1 regardless of allele frequency at the core SNP:

\[ \mathrm{iHS} = \frac{\ln\left(\mathrm{iHH}_A/\mathrm{iHH}_D\right) - E_p\left[\ln\left(\mathrm{iHH}_A/\mathrm{iHH}_D\right)\right]}{SD_p\left[\ln\left(\mathrm{iHH}_A/\mathrm{iHH}_D\right)\right]} \]
The expectation and standard deviation of ln(iHHA/iHHD) are estimated from the empirical distribution at SNPs whose derived allele frequency p matches the frequency at the core SNP. The iHS is constructed to have an approximately standard normal distribution and hence the sizes of iHS signals from different SNPs are directly comparable regardless of the allele frequencies at those SNPs. Since iHS is standardized using the genome-wide empirical distributions, it provides a measure of how unusual the haplotypes around a given SNP are, relative to the genome as a whole, and it does not provide a formal significance test.
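The definitions above can be illustrated with a minimal Python sketch. It is not the implementation used in the study. It assumes haplotypes are lists of 0/1 alleles (0 = ancestral, 1 = derived), positions are genetic-map coordinates, the integral is approximated with the trapezoid rule, and frequency matching is done with fixed-width bins. The function names and the 0.05 bin width are hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def ehh(haplotypes, core, site):
    """Chance that two distinct haplotypes are identical between core and site."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = min(core, site), max(core, site)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def ihh(haplotypes, positions, core, cutoff=0.05):
    """Trapezoid-rule area under EHH in both directions, stopping at the cutoff."""
    area = 0.0
    for step in (-1, 1):
        prev_ehh, prev_pos = 1.0, positions[core]
        site = core + step
        while 0 <= site < len(positions):
            cur = ehh(haplotypes, core, site)
            area += 0.5 * (prev_ehh + cur) * abs(positions[site] - prev_pos)
            if cur <= cutoff:
                break
            prev_ehh, prev_pos = cur, positions[site]
            site += step
    return area


def unstandardized_ihs(haplotypes, positions, core):
    """ln(iHH_A / iHH_D), with alleles coded 0 = ancestral, 1 = derived (assumed)."""
    anc = [h for h in haplotypes if h[core] == 0]
    der = [h for h in haplotypes if h[core] == 1]
    return math.log(ihh(anc, positions, core) / ihh(der, positions, core))


def standardize(scores, derived_freqs, bin_width=0.05):
    """Subtract the mean and divide by the SD of scores in the same frequency bin."""
    def bin_of(freq):
        return math.floor(freq / bin_width + 1e-9)

    bins = defaultdict(list)
    for score, freq in zip(scores, derived_freqs):
        bins[bin_of(freq)].append(score)
    stats = {b: (mean(v), pstdev(v)) for b, v in bins.items()}
    out = []
    for score, freq in zip(scores, derived_freqs):
        m, sd = stats[bin_of(freq)]
        out.append((score - m) / sd if sd > 0 else 0.0)
    return out


# Toy data: four identical derived haplotypes, four diverse ancestral ones.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
derived = [[1, 1, 1, 1, 1] for _ in range(4)]
ancestral = [[0, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 1], [1, 1, 0, 1, 1]]
toy_score = unstandardized_ihs(derived + ancestral, positions, core=2)
```

In this toy example the derived allele sits on a single long haplotype, so `toy_score` is negative, matching the sign convention described above.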
In our data analysis, iHS is computed for every SNP with minor allele frequency > 5%, treating each SNP in turn as a core SNP. The iHS at each SNP measures the strength of evidence for selection acting at or near that SNP. However, in simulations we have found that instead of treating each SNP separately, it is more powerful to look for windows of consecutive SNPs that contain numerous extreme iHS scores (Figure S2). This is because selective sweeps tend to produce clusters of extreme iHS scores across the sweep region, while under a neutral model, extreme iHS scores are scattered more uniformly (unpublished data).
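The window-based idea can be sketched in Python as follows. This is an illustrative sketch, not the study's code: it counts, for every window of consecutive SNPs, how many scores exceed a threshold in absolute value, and compares the fraction of windows above some count with a fraction taken from neutral simulations. The defaults (50 SNPs, |iHS| > 2) follow values used elsewhere in the text; the function names are hypothetical.

```python
def extreme_counts(ihs, window=50, threshold=2.0):
    """Number of SNPs with |iHS| > threshold in each window of consecutive SNPs."""
    flags = [1 if abs(score) > threshold else 0 for score in ihs]
    if len(flags) < window:
        return []
    count = sum(flags[:window])
    counts = [count]
    for i in range(window, len(flags)):
        # Slide the window by one SNP: add the new flag, drop the oldest.
        count += flags[i] - flags[i - window]
        counts.append(count)
    return counts


def fold_enrichment(counts, min_count, null_fraction):
    """Observed fraction of windows with more than min_count extreme SNPs,
    divided by the fraction seen in neutral simulations (supplied by the caller)."""
    observed = sum(1 for c in counts if c > min_count) / len(counts)
    return observed / null_fraction


toy_counts = extreme_counts([0.1, 2.5, -3.0, 0.0, 2.1, 0.5], window=3)
```

With real data, `null_fraction` would come from simulations matched to the population being analyzed.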
In principle, we might expect that large negative iHS scores, indicating that a derived allele has swept up in frequency, are of the most interest. However, in simulations, a sweep can also produce large positive iHS values at nearby SNPs if ancestral alleles hitchhike with the selected site. Furthermore, it is plausible that selection may sometimes switch to favor an ancestral allele that has been segregating in the population. For these reasons, we will treat both extreme positive and extreme negative iHS scores as potentially interesting.
It is now well-known that recombination rates are extremely heterogeneous across the genome, even at fine scales [30]. Such rate variation is a potential source of false positives when looking for regions with unusual haplotype structure, as we are here. Our test is designed to control for rate variation in two ways. First, we estimated high-resolution genetic maps based on LD patterns and used the estimated genetic distances when calculating iHS. By basing analysis on the fine scale genetic maps, the lengths of haplotypes that extend across large recombination coldspots are appropriately downweighted, and haplotypes that cross hotspots are upweighted. Second, since iHS is based on a ratio of haplotype homozygosities, the two alleles serve as internal controls for each other [5]. Hence, inaccuracies in the estimated genetic map will tend to cancel out of the ratio, as will any other factors that cause the extent of haplotype homozygosity to be heterogeneous across the genome.
The power figure plots the power of iHS and of two standard tests of selection based on summaries of the frequency spectrum. These simulations, as well as the simulations shown below, are designed to match the properties of the data as closely as possible (Materials and Methods). As shown, the iHS outperforms the frequency spectrum tests, as well as other EHH-based statistics, across a broad range of frequencies of the selective sweep (power figure; Figure S2). Furthermore, iHS is robust to regional variation in SNP ascertainment while tests of the frequency spectrum may not be. Nonetheless, iHS has limited ability in the HapMap data to detect low frequency sweeps and to detect sweeps that are very near fixation.
Power to Detect Sweeps-in-Progress at a p-Value of 0.01, Using Various Statistics
The iHS statistic is constructed to provide a tool for identifying SNPs, or genomic regions, that are unusual relative to the genome as a whole, and not to provide formal significance testing relative to a theoretical model. We will show that in all populations there is an excess of extreme iHS signals relative to simulated models. However, since there is considerable uncertainty in simulated models, we prefer not to assign formal p-values to the signals that we find.
Widespread Signals of Recent Selection
We calculated iHS for all SNPs with minor allele frequency >5%. The Chromosome 2 figure shows a summary of the extreme values on Chromosome 2, with similar plots across all autosomes and the X chromosome provided in Figures S3. The plotted points are those with |iHS| > 2.5, and correspond to the most extreme 1% of iHS values.
Plots of Chromosome 2 SNPs with Extreme iHS Values Indicate Discrete Clusters of Signals
There is clear clustering of extreme values into distinct regions where many SNPs show evidence for selection. One such region, at 135–138 Mb in Europeans, contains the lactase gene (LCT), previously noted as a target of strong selection [6]. In principle, clustering of unusual iHS scores might occur even under a completely neutral model. However, several lines of evidence indicate that selection is indeed producing widespread signals in the data.
First, simulations show that we observe more extreme values of the unstandardized iHS scores than expected under a range of neutral models (see the figure on the central 99% range of unstandardized iHS). For each population, we performed neutral simulations that matched the observed SNP density and allele frequency spectrum, included extensive recombination rate variation (including hotspots), and utilized a range of demographic models consistent with previous studies of demographic parameters (see Materials and Methods). The demographic models included a variety of bottleneck models for East Asians and Europeans, and models of constant size with recent growth for Yoruba [34]. Since previous genetic studies indicate that the Yoruba are likely to have the least complex demographic history [34], we focus mainly on simulation results for that population. We find that extreme values of the unstandardized iHS are more frequent in the real Yoruba data than in any of the simulated models, as expected if the largest iHS values are frequently due to selection. There is also an excess of extreme values in the Europeans, but in the East Asians some demographic models show as much variance as the real data (unpublished data).
Central 99% Range of Unstandardized iHS for SNPs in the Yoruba Data and for SNPs in Matched Neutral Simulations
Second, in all three populations, there is greater clustering of extreme (standardized) iHS values in the real data than in neutral simulations with heterogeneous recombination rates. For example, in simulations matching aspects of the Yoruba data, only 0.1% of windows of 50 consecutive SNPs had more than 16 SNPs with |iHS| > 2.0. In the actual YRI data, we observed a 14-fold enrichment of such windows. Since this calculation is based on empirically standardized scores, the signal of extra clustering is distinct from the previous signal of overdispersion of the unstandardized scores. Simulations designed to match the East Asian and European datasets also indicate that there is excess clustering of extreme iHS values in the real data, though for these populations the relative enrichment is quantitatively smaller due to the extra variance seen in neutral bottleneck models (the data show 2.3-fold and 2.7-fold enrichment of the top 0.1% of windows, respectively). In summary, the visual sense of clustering of high |iHS| scores in the Chromosome 2 plots does indeed exceed the level of clustering expected under neutrality, supporting a model in which distinct selective events produce large |iHS| scores across discrete regions.
Third, in our data, extreme iHS scores frequently occur in regions where the frequency spectrum also indicates the action of selection (see the figure on the correlation between iHS and Hasc). As shown in the power figure, a version of Fay and Wu's H for ascertained SNPs (Hasc) provides a useful method for detecting sweeps where the selected site has reached high frequency. In simulations we find that high frequency selected sites tend to have both strongly negative iHS scores and strongly negative values of Fay and Wu's Hasc. However, in neutral simulations, iHS and Hasc are essentially uncorrelated. In the Yoruba data, as many as one half of high frequency–derived SNPs with large iHS scores fall into the most extreme 1% of windows for Hasc, genome-wide. This correlation argues strongly that many of our extreme iHS scores are in fact the result of positive selection. In our data, an excess of significant Hasc scores is also seen for low-frequency alleles with positive iHS scores. This probably results from ancestral alleles that have hitchhiked to high frequency. Similar, but weaker, correlations are present in both the European and East Asian data (unpublished data).
Strong Correlation between iHS and Hasc for the Yoruba Data
Fourth, there is a highly significant enrichment for extreme iHS values in genic regions in all populations (Table S1). This is to be expected if selection occurs most often in (or near) genes, though one might not expect a dramatic difference in rates, since simulated selective events tend to produce signals over quite wide regions which would include both genic and non-genic SNPs. The proportion of SNPs with |iHS| > 2 is 1.23-fold higher in genic SNPs than non-genic SNPs in Yoruba, 1.16-fold higher in Europeans, and 1.13-fold higher in East Asians. A further enrichment in extreme iHS values is found in SNPs in overlapping genes compared with SNPs in non-overlapping genes. To check that these results are not an artifact of the higher HapMap SNP density in genes, we thinned genic SNPs at random so that the proportion of genic SNPs matched the proportion of the genome containing genes. After reanalysis, the difference between genic and nongenic iHS values remained about the same as before (unpublished data).
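The fold-enrichment figures and the thinning check can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the procedure used in the study: each SNP carries a boolean genic flag, the genic fraction of the genome is supplied by the caller, and thinning is a simple random sample with a fixed seed. All function and parameter names are hypothetical.

```python
import random


def genic_fold_enrichment(ihs, is_genic, threshold=2.0):
    """Ratio of the proportion of SNPs with |iHS| > threshold, genic vs non-genic."""
    def proportion(flag):
        hits = [abs(s) > threshold for s, g in zip(ihs, is_genic) if g == flag]
        return sum(hits) / len(hits)

    return proportion(True) / proportion(False)


def thin_genic(is_genic, genic_genome_fraction, seed=0):
    """Indices of SNPs kept after randomly dropping genic SNPs until the genic
    share of SNPs matches the genic share of the genome."""
    genic = [i for i, g in enumerate(is_genic) if g]
    nongenic = [i for i, g in enumerate(is_genic) if not g]
    target = round(genic_genome_fraction * len(nongenic) / (1 - genic_genome_fraction))
    kept_genic = random.Random(seed).sample(genic, min(target, len(genic)))
    return sorted(kept_genic + nongenic)


toy_ihs = [2.5, 0.1, 3.0, 0.2, 2.1, 0.0, 0.3, 0.4]
toy_genic = [True, True, True, True, False, False, False, False]
toy_fold = genic_fold_enrichment(toy_ihs, toy_genic)
toy_kept = thin_genic(toy_genic, genic_genome_fraction=0.2)
```

After thinning, the enrichment would be recomputed on the kept SNPs only and compared with the original value.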
Last, various regions identified previously as likely targets of sweeps-in-progress are detected by our survey, including signals in Europeans in the lactase region [6], in the 17q21 inversion [20], and at CYP3A5 [7]; in the ADH cluster on Chromosome 4 in East Asians [38]; and in olfactory receptor clusters on Chromosomes 11p15 and 11q11 in Yoruba [39]. However, we do not detect all previously identified selection candidates, for example failing to find signals at G6PD in Yoruba [3], and in two genes involved in brain development that were recently reported to be under recent positive selection [8].
Overview of Selection Regions
To identify the strongest signals of selection, we divided the genome into non-overlapping windows of 100 kb. In this analysis, windows were defined by physical location to facilitate the comparison of signals across populations. For each population, we identified the windows in the highest 1% of the empirical distribution for proportion of SNPs with |iHS| > 2. The positions of these windows on Chromosome 2 are indicated by vertical bars below each panel of the Chromosome 2 figure. Henceforth, we consider these windows to be candidates for containing selective sweeps. We find that 8 of 14 genomic regions listed as selection candidates by the HapMap Project are among the top 5% of our signals (Table 10 in [18]). A summary of some of our strongest signals is given in the summary of strongest iHS signals, and a complete list is provided online in Protocols S1. As an illustration, the candidate-region figure depicts the haplotype patterns, decay of EHH with distance, and plot of iHS scores for three strong candidate regions identified by our genome-wide scan.
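The window ranking described above can be sketched in Python as follows. This is a minimal illustration rather than the study's pipeline: it bins SNPs by physical position into non-overlapping windows, computes the proportion of SNPs with |iHS| > 2 in each, and returns the top fraction of windows. It does not filter windows with few SNPs, which a real analysis might need to do. Function names are hypothetical.

```python
import math
from collections import defaultdict


def window_proportions(positions_bp, ihs, size=100_000, threshold=2.0):
    """Proportion of SNPs with |iHS| > threshold in each non-overlapping window."""
    totals = defaultdict(int)
    extreme = defaultdict(int)
    for pos, score in zip(positions_bp, ihs):
        window = pos // size
        totals[window] += 1
        if abs(score) > threshold:
            extreme[window] += 1
    return {w: extreme[w] / totals[w] for w in totals}


def top_windows(proportions, fraction=0.01):
    """Window indices in the top `fraction` of the empirical distribution."""
    k = max(1, math.ceil(len(proportions) * fraction))
    ranked = sorted(proportions, key=proportions.get, reverse=True)
    return sorted(ranked[:k])


toy_props = window_proportions(
    [10, 50_000, 150_000, 160_000, 250_000],
    [2.5, 0.1, 3.0, -2.2, 0.0],
)
toy_top = top_windows(toy_props, fraction=0.5)
```

Because windows are indexed by physical position, the same window index refers to the same genomic interval in every population, which is what makes cross-population comparison straightforward.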
Summary of Some of the Strongest iHS Signals Genome-Wide
Signals of Selection for Three Candidate Selection Regions Discussed in the Text
Analysis of the haplotype structure in candidate selection regions indicates that the events detected are typically very recent. We calculated the lengths of the haplotypes around the SNP with the largest negative iHS value in each region (Table S2). The average total distance between the first points to the left and to the right of the core SNP at which EHH on the selected haplotypes drops below 0.25 is 0.52 cM in both East Asians and Europeans, and 0.32 cM in Yoruba. Hence, candidate sweep regions tend to be narrower in Yoruba than in the non-African populations, indicating that typical sweep events may be substantially younger in the non-African populations. (The size of the area affected by a strong sweep depends only weakly on the effective population size [40], and so the larger effective population size in Yoruba is not an explanation of the smaller average sweep size.)
A fully rigorous estimation of the ages of the candidate sweeps is difficult with the current data. However, making the simplistic assumption of a star-shaped genealogy for the favored haplotypes, and assuming a generation time of 25 y, suggests average ages of approximately 6,600 years and 10,800 years in the non-African and African populations, respectively (Materials and Methods). Simulations using SelSim (Materials and Methods) suggest that these haplotype spans are consistent with selection coefficients of 0.01–0.04, assuming a population size of 10,000. These estimated ages should not be taken to imply a burst of selection at a particular time; instead, these ages and selection coefficients might represent areas in the parameter space in which we have good power. The longest haplotypes around derived alleles at >50% frequency extend 1.39 cM in East Asians (near the Gaucher disease gene, GBA), 1.25 cM in Europeans (near NKX2–2, which is involved in insulin regulation), and 0.97 cM in Yoruba (in a gene desert on Chromosome 5p15). These long haplotypes indicate extremely strong selection on recent mutations, though it is difficult to be confident about the actual genetic target of the selective events. In summary, the selection events that we detect are generally very recent, substantially postdating the separation times of these populations, and falling mainly within the agricultural phase of human evolution.
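As a rough check on the quoted ages, the following Python sketch turns a haplotype span into an age. It rests on an assumption made here for illustration: under a star-shaped genealogy, two haplotypes remain identical out to genetic distance r (in Morgans) after t generations with probability exp(-2rt), so setting this equal to the 0.25 EHH threshold gives t = ln(4) / (2r). The exact calculation used in the study is described in its Materials and Methods; the function name below is hypothetical.

```python
import math


def sweep_age_years(total_span_cm, ehh_threshold=0.25, generation_years=25):
    """Age implied by a haplotype span, assuming EHH decays as exp(-2 * r * t)."""
    # Half of the total span lies on each side of the core SNP; convert cM to Morgans.
    r = (total_span_cm / 2) / 100
    generations = -math.log(ehh_threshold) / (2 * r)
    return generations * generation_years


non_african_age = sweep_age_years(0.52)  # span reported for East Asians and Europeans
yoruba_age = sweep_age_years(0.32)       # span reported for Yoruba
```

Under this assumption the two spans give ages of roughly 6,700 and 10,800 years, close to the values quoted in the text. Shorter spans imply older events, because recombination has had more time to break down the haplotype.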
The figure on sharing of iHS signals shows the extent of overlap of sweep regions across the three populations. Most of the sweep regions are found in only one of the three populations, consistent with the estimates indicating that these events generally postdate population separation. Nonetheless, there is a clear excess of sweeps that are shared between pairs of populations, or among all three populations. In principle, sharing of signals between populations might also be due to haplotypes that are inherited from the ancestral populations. However, this is probably a small effect since such unusually long haplotypes would be unlikely to survive the effects of recombination for >1,000 generations, separately in each population. Instead, the data suggest that most of the selective events that we detect are local to a single population, but that a significant fraction of the selective events are experienced by more than one population.
Sharing of iHS Signals between Populations
This view is further supported by the relationship between iHS and allele frequency divergence measured using FST. SNPs with high |iHS| in one population but low |iHS| in another are likely to have high FST, indicating that the SNP has changed frequency rapidly since population separation. Among the modest number of SNPs that have extreme positive iHS or extreme negative iHS in both populations, there is not an excess of high FST, perhaps due to recent gene flow of alleles (and haplotypes) that are favored in multiple populations.
Types of Genes under Selection
Next, we modified our analysis to study what types of genes are most commonly involved in recent adaptation. For every gene we determined the number of SNPs with high |iHS| in a 50 SNP window centered on the gene (Materials and Methods). The genes in the upper 10% of the empirical distribution for number of significant SNPs were then considered to be candidate targets of selection. Our procedure was designed to be robust to variation in gene size and SNP density across gene categories.
The PANTHER Gene Ontology database provides a classification of genes into 222 nested categories according to biological process [41]. We tested whether any of these categories showed an enrichment of genes with signals of recent selection (Materials and Methods). In this analysis, one might be concerned that there is low power to detect enrichment, since the expected counts for many categories are low, despite considering such a large fraction (10%) of the genes as candidates for selection.
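To make the idea of category enrichment concrete, here is a minimal Python sketch using a hypergeometric upper tail. This is a stand-in chosen for illustration, since it is one common way to test for over-representation; the test actually used in the study is described in its Materials and Methods and may differ. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total_genes, candidates, category_size, hits):
    """P(X >= hits) when category_size genes are drawn without replacement
    from total_genes, of which `candidates` are selection candidates."""
    denom = comb(total_genes, category_size)
    upper = min(candidates, category_size)
    tail = sum(
        comb(candidates, i) * comb(total_genes - candidates, category_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / denom


# Toy numbers: 10 genes, 3 candidates, a category of 3 genes that are all candidates.
toy_p = hypergeom_upper_tail(10, 3, 3, 3)
```

When the expected count for a category is small, only a handful of hits is possible, which is the low-power concern raised above.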
Nonetheless, several categories show up as strongly significant in one or more populations (see the p-values for enrichment of GO categories), including the related categories of chemosensory perception and olfaction, as well as gametogenesis, spermatogenesis, and fertilization. These types of processes have been identified as targets of selection in previous studies of human-chimpanzee divergence [10]. Overall, there is a modest enrichment for genes that show signals both of very recent selection in our study and of selection on the human lineage as a whole [10] (Table S3).
p-Values for Enrichment of GO Categories among Genes Showing Evidence for Partial Sweeps
In addition to the areas of overlap, we find enrichment in new categories not previously identified as targets of selection, including categories related to metabolism of carbohydrates, lipids, and phosphates, as well as vitamin transport. For some categories, the p-values are imprecise as there is clustering of related genes that are all significant (Table S4). We now describe in greater detail some of the categories of gene functions that show enrichment of selective signals. Except when noted, OMIM or EntrezGene provide references for the gene information given below [42]. The genes listed below are generally in the top 1%, and all are in the top 5% of signals in the relevant populations. However, some caution is required since the strongest signals often span both the target of selection as well as neighboring genes.
Recent reports have shown that genes involved in fertility and reproduction are subject to rapid adaptive evolution in primates due to sexual competition and perhaps defense against pathogens [10]. We observe signals for selection targeting several aspects of fertility and reproduction, including the basic protein structure of sperm (RSBN1 in East Asians and Yoruba), sperm motility (SPAG4 in Europeans and East Asians; ODF2 in Europeans), sperm and egg viability (ACVR1 in Europeans, CPEB2 in Yoruba), regulation of the female immune response to sperm (TGM4), egg fertilization (the CRISP gene cluster near 6p21.3 in Europeans), and testis determination (NR0B1).
Some of the strongest signals of recent selection appear in various types of genes related to morphology. For example, four genes involved in skin pigmentation show clear evidence of selection in Europeans (OCA2, MYO5A, DTNBP1, TYRP1). All four genes are associated with Mendelian disorders that cause lighter pigmentation or albinism, and all are in different genomic locations, indicating the action of separate selective events. One of these genes, OCA2, is associated with the third longest haplotype on a high frequency SNP anywhere in the genome for Europeans. A fifth gene, SLC24A5, has recently been shown by another group to impact skin pigmentation and to have a derived, selected allele near fixation in Europeans [45]. Though iHS has reduced power for alleles near fixation, SNPs near this gene also show strong iHS signals in Europeans (Table S2).
Various genes involved in skeletal development have also been targets of recent selection. Three related proteins involved in bone morphogenesis show signals of selection in Europeans (BMP3 and BMPR2) and in East Asians (BMP5). In addition, GDF5, a gene in which mutations cause skeletal malformations, shows strong signals of selection in both Europeans and East Asians. Other morphological features also appear to be targets of selection, including hair formation and patterning in Yoruba (the keratin cluster near 17q12; and FZD6).
An important type of selective pressure that has confronted modern humans is the transition to novel food sources with the advent of agriculture and the colonization of new habitats [19]. As noted above, we see a strong signal of selection in the alcohol dehydrogenase (ADH) cluster in East Asians, including the third longest haplotype around a high frequency allele in East Asians. A variety of genes involved in carbohydrate metabolism have evidence for recent selection, including genes involved in metabolizing mannose (MAN2A1 in Yoruba and East Asians), sucrose (SI in East Asians), and lactose (LCT in Europeans). Processing of dietary fatty acids is another system with signals of strong selection, including uptake (SLC27A4 in Europeans), oxidation (SLC25A20 in East Asians), and regulation (NCOA1 in Yoruba and LEPR in East Asians). The latter gene (LEPR) is the leptin receptor and plays an important role in regulating adipose tissue mass.
Recent articles have proposed that genes involved in brain development and function may have been important targets of selection in recent human evolution [8]. While we do not find evidence for selection in the two genes reported in those studies (including MCPH1), we do find signals in two other microcephaly genes, namely, CDK5RAP2 in Yoruba, and CENPJ in Europeans and East Asians [46]. Though there is not an overall enrichment for neurological genes in our gene ontology analysis, several other important brain genes also have signals of selection, including the primary inhibitory neurotransmitter GABRA4, an Alzheimer's susceptibility gene PSEN1, in Yoruba; the serotonin transporter SLC6A4 in Europeans and East Asians; and the dystrophin binding gene SNTG1 in all populations.
Several other biological processes that have not previously been proposed as targets of selection also show an enrichment for signals of selection. For example, the category of electron transport genes is significant in Europeans, due in large part to selection in CYP genes. CYP genes are mainly expressed in the liver and catalyze many reactions involved in breaking down foreign compounds, including the majority of pharmaceutical agents. Genes in this class with evidence for selection include four genes in the CYP450 gene cluster on Chromosome 1p33, as well as CYP genes in other genomic locations including CYP3A5, CYP2E1, and CYP1A2. Another category showing enrichment of selection signals is phosphate metabolism in East Asians and Europeans. Genes in the phosphatidylinositol pathway seem to be particularly overrepresented among the significant genes in this category, including INPP5E, PI4K2B, IHPK1, IHPK2, IHPK3 in East Asians and IMPA2 and SYNJ1 in Europeans.
Curiously, Yoruba appears to have a greater number of signals on the X chromosome that map to genes compared with the other two populations. For example, the top 1% of 100-kb windows contain 15 genes in YRI (the largest of these windows containing only four genes), but six genes and two genes in Europeans and East Asians, respectively.