An increasing amount of information about genetic variation, together with new analytical methods, is making it possible to explore the recent evolutionary history of the human population. The first phase of the International Haplotype Map, including ~1 million single nucleotide polymorphisms (SNPs)7, allowed preliminary examination of natural selection in humans. Now, with the publication of the Phase 2 map (HapMap2)1 in a companion paper, over 3 million SNPs have been genotyped in 420 chromosomes from three continents (120 European (CEU), 120 African (YRI) and 180 Asian from Japan and China (JPT + CHB)).
In our analysis of HapMap2, we first implemented two widely used tests that detect recent positive selection by finding common alleles carried on unusually long haplotypes2. The two, the Long-Range Haplotype (LRH)8 and the integrated Haplotype Score (iHS)9 tests, rely on the principle that, under positive selection, an allele may rise to high frequency rapidly enough that long-range association with nearby polymorphisms (the long-range haplotype8) will not have time to be eliminated by recombination. These tests control for local variation in recombination rates by comparing long haplotypes to other alleles at the same locus. As a result, they lose power as selected alleles approach fixation (100% frequency), because there are then few alternative alleles in the population (Supplementary Fig. 2 and Supplementary Tables 1–2).
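The idea behind the long-haplotype tests can be illustrated with a minimal sketch of extended haplotype homozygosity (EHH): the probability that two randomly chosen chromosomes carrying a given core allele are identical across all markers from the core out to a chosen marker. This is an illustration under simplifying assumptions, not the implementation used in this study; the function name `ehh`, the string encoding of haplotypes and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """Extended haplotype homozygosity for one core allele.

    haplotypes: sequences of alleles (e.g. strings of '0'/'1'), one per chromosome.
    Returns the fraction of chromosome pairs carrying core_allele that are
    identical at every marker between core_index and end_index (inclusive).
    """
    lo, hi = sorted((core_index, end_index))
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # homozygosity is undefined for fewer than two chromosomes
    class_sizes = Counter(carriers).values()
    return sum(comb(c, 2) for c in class_sizes) / comb(n, 2)


# Toy data (hypothetical): six chromosomes, four markers, core SNP at index 0.
toy_haplotypes = ["1111", "1111", "1110", "1101", "0010", "0001"]

# EHH decays with distance from the core as recombination and mutation break up haplotypes.
decay_allele_1 = [ehh(toy_haplotypes, 0, "1", j) for j in range(1, 4)]
decay_allele_0 = [ehh(toy_haplotypes, 0, "0", j) for j in range(1, 4)]
print(decay_allele_1, decay_allele_0)
```

In this toy example the haplotypes carrying allele "1" stay identical over a longer distance than those carrying allele "0", which is the kind of contrast between alleles at the same locus that the tests exploit.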
We next developed, evaluated and applied a new test, Cross Population Extended Haplotype Homozygosity (XP-EHH), to detect selective sweeps in which the selected allele has approached or achieved fixation in one population but remains polymorphic in the human population as a whole (Methods, and Supplementary Fig. 2 and Supplementary Tables 3–6). Related methods have recently also been described10-12.
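As a rough illustration of a cross-population comparison, the sketch below assumes that the score is the natural log of the ratio of the areas under the haplotype-homozygosity decay curves of two populations at the same core SNP. The precise definition (which chromosomes are pooled, where integration stops, how scores are normalized) is given in Methods and is not reproduced here; the function names and the toy curves are hypothetical.

```python
from math import log


def integrate_trapezoid(positions, values):
    """Area under a piecewise-linear curve sampled at the given positions."""
    total = 0.0
    for i in range(1, len(positions)):
        width = positions[i] - positions[i - 1]
        total += 0.5 * (values[i] + values[i - 1]) * width
    return total


def unstandardized_cross_population_score(positions, ehh_pop_a, ehh_pop_b):
    """Log ratio of integrated haplotype homozygosity in population A versus B.

    Positive values mean haplotypes stay homozygous over longer distances in A.
    This toy version is not normalized against a genome-wide distribution.
    """
    area_a = integrate_trapezoid(positions, ehh_pop_a)
    area_b = integrate_trapezoid(positions, ehh_pop_b)
    return log(area_a / area_b)


# Toy decay curves (hypothetical), sampled at increasing genetic distance from the core.
distance_cm = [0.0, 0.1, 0.2, 0.3]
ehh_swept_population = [1.0, 0.9, 0.8, 0.7]      # slow decay, as after a near-complete sweep
ehh_reference_population = [1.0, 0.5, 0.2, 0.1]  # fast decay, as under neutrality

score = unstandardized_cross_population_score(
    distance_cm, ehh_swept_population, ehh_reference_population
)
print(score)
```

Because the comparison is made between populations rather than between alleles within one population, a score of this kind remains informative even when the selected allele is fixed and no alternative allele is left to compare against.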
Our analysis of recent positive selection, using the three methods, reveals more than 300 candidate regions1
(Supplementary Fig. 3 and Supplementary Table 7
), 22 of which are above a threshold such that no similar events were found in 10 Gb of simulated neutrally evolving sequence (Methods). We focused on these 22 strongest signals, which include two well-established cases, SLC24A5 and LCT, and 20 other regions with signals of similar strength.
The twenty-two strongest candidates for natural selection
The challenge is to sift through genetic variation in the candidate regions to identify the variants that were the targets of selection. Our candidate regions are large (mean length, 815 kb; maximum length, 3.5 Mb) and often contain multiple genes (median, 4; maximum, 15). A typical region harbours ~400–4,000 common SNPs (minor allele frequency >5%), of which roughly three-quarters are represented in current SNP databases and half were genotyped as part of HapMap2 (Supplementary Table 8).
We developed three criteria to help highlight potential targets of selection (Supplementary Fig. 1
): (1) selected alleles detectable by our tests are likely to be derived (newly arisen), because long-haplotype tests have little power to detect selection on standing (pre-existing) variation14
; we therefore focused on derived alleles, as identified by comparison to primate outgroups; (2) selected alleles are likely to be highly differentiated between populations, because recent selection is probably a local environmental adaptation2
; we thus looked for alleles common in only the population(s) under selection; (3) selected alleles must have biological effects. On the basis of current knowledge, we therefore focused on non-synonymous coding SNPs and SNPs in evolutionarily conserved sequences. These criteria are intended as heuristics, not absolute requirements. Some targets of selection may not satisfy them, and some will not be in current SNP databases. Nonetheless, with ~50% of common SNPs in these populations genotyped in HapMap2, a search for causal variants is timely.
We applied the criteria to the regions containing SLC24A5 and LCT, each of which already has a strong candidate gene, mutation and trait. At SLC24A5, the 600 kb region contains 914 genotyped SNPs. Applying the filters progressively, we found that 867 SNPs are associated with the long-haplotype signal; of these, 233 are high-frequency derived alleles, 12 of which are highly differentiated between populations, and only 5 of those are common in Europe and rare in Asia and Africa. Among these five SNPs, only one is implicated as functional by current knowledge; it has the strongest signal of positive selection and encodes the A111T polymorphism associated with pigment differences in humans and thought to be the target of positive selection5. Our criteria thus uniquely identify the expected allele.
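The progressive filtering described above can be sketched as a chain of simple filters over annotated SNP records. This is a minimal illustration, not the pipeline used in this study. The `Snp` record, the thresholds `high_freq` and `min_diff`, and the toy data are all hypothetical, and a plain difference in derived-allele frequency stands in for the differentiation measure.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    name: str
    on_long_haplotype: bool  # associated with the long-haplotype signal
    derived_freq: dict       # population label -> derived-allele frequency
    functional: bool         # non-synonymous or in a conserved element


def progressive_filter(snps, selected_pop, high_freq=0.5, min_diff=0.5):
    """Apply the heuristics in order and record the survivors of each stage."""
    stages = {}
    kept = [s for s in snps if s.on_long_haplotype]
    stages["long haplotype"] = kept
    kept = [s for s in kept if s.derived_freq[selected_pop] > high_freq]
    stages["high-frequency derived"] = kept
    kept = [
        s for s in kept
        if all(
            s.derived_freq[selected_pop] - freq >= min_diff
            for pop, freq in s.derived_freq.items()
            if pop != selected_pop
        )
    ]
    stages["differentiated"] = kept
    kept = [s for s in kept if s.functional]
    stages["putatively functional"] = kept
    return stages


# Toy records (hypothetical values, not HapMap2 data).
toy_snps = [
    Snp("snp_a", True, {"CEU": 0.98, "YRI": 0.02, "JPT+CHB": 0.01}, True),
    Snp("snp_b", True, {"CEU": 0.95, "YRI": 0.05, "JPT+CHB": 0.03}, False),
    Snp("snp_c", True, {"CEU": 0.90, "YRI": 0.80, "JPT+CHB": 0.85}, True),
    Snp("snp_d", True, {"CEU": 0.10, "YRI": 0.05, "JPT+CHB": 0.02}, True),
    Snp("snp_e", False, {"CEU": 0.97, "YRI": 0.01, "JPT+CHB": 0.02}, True),
]

stages = progressive_filter(toy_snps, selected_pop="CEU")
for label, survivors in stages.items():
    print(label, len(survivors))
```

Keeping the survivors of every stage, rather than only the final list, makes it easy to report how strongly each criterion narrows the candidate set.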
Localizing SLC24A5 and EDAR signals of selection
At the LCT locus, we found similar degrees of filtration. Within the 2.4 Mb selective sweep, 24 polymorphisms fulfil the first two criteria (Supplementary Fig. 4), with the polymorphism thought to confer adult persistence of lactase among them. However, this SNP was only identified as functional after extensive study of the LCT locus. Thus LCT shows both the utility and the limits of the heuristics.
Given the encouraging results for SLC24A5 and LCT, we performed a similar analysis on all 22 candidate regions. Filtering the 9,166 SNPs associated with the long-haplotype signal, we found that 480 satisfied the first two criteria. We identified 41 out of the 480 SNPs (0.2% of all SNPs genotyped in the regions) as possibly functional on the basis of a newly compiled database of polymorphisms in known coding elements, evolutionarily conserved elements and regulatory elements (Methods; B.F., unpublished), together containing ~5.5% of all known SNPs.
Eight of the forty-one SNPs encode non-synonymous changes (Supplementary Table 9
). Apart from the well-known case of SLC24A5
, they are found in EDAR
. The remaining 33 potentially functional SNPs lie within conserved transcription factor motifs, introns, UTRs and other non-coding regions.
To identify additional candidates, we reversed the process by taking non-synonymous coding SNPs with highly differentiated high-frequency derived alleles; these SNPs comprise a tiny fraction of all SNPs and have a higher a priori probability of being targets of selection. Of the 15,816 non-synonymous SNPs in HapMap2, 281 (Supplementary Table 10
) have both a high derived-allele frequency (frequency >50%) and clear differentiation between populations (FST
is in the top 0.5 percentile). We examined these 281 SNPs to identify those embedded within long-range haplotypes16
, and identified 26 putative cases of positive selection. These include the eight non-synonymous SNPs identified in the genome-wide analysis above.
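To make the differentiation criterion concrete, the sketch below computes a textbook FST for a single biallelic SNP from per-population allele frequencies, as the proportional reduction in expected heterozygosity within populations relative to the pooled population. The estimator used for the HapMap2 screen is not specified in this section, so this is an assumption for illustration: it weights populations equally and applies no sample-size correction. The function name `fst` is hypothetical.

```python
def fst(freqs):
    """Simple FST for one biallelic SNP.

    freqs: derived-allele frequency in each population (equal weights assumed).
    Returns (H_T - H_S) / H_T, or 0.0 when the SNP is monomorphic overall.
    """
    mean_p = sum(freqs) / len(freqs)
    h_total = 2.0 * mean_p * (1.0 - mean_p)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


# Derived-allele frequencies reported later in this section for EDA2R R57K
# (Asia, Europe, Africa).
eda2r_fst = fst([1.0, 0.7, 0.0])
print(round(eda2r_fst, 2))  # about 0.71 under these simplifying assumptions

# A SNP at the same frequency everywhere shows essentially no differentiation.
print(fst([0.3, 0.3, 0.3]))
```

Ranking all non-synonymous SNPs by a statistic of this kind and keeping the extreme tail is one way to implement a percentile cutoff such as the one described above.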
Interestingly, analysis of the top regions and the non-synonymous SNPs together revealed three cases of two genes in the same pathway both having strong evidence of selection in a single population.
In the European sample, there is strong evidence for two genes already shown to be associated with skin pigment differences among humans. The first is SLC24A5
, described above. We further examined the global distribution and the predicted effect on protein activity of the SLC24A5 A111T polymorphism (Supplementary Figs 5 and 6
). The second, SLC45A2
, has an important role in pigmentation in zebrafish, mouse and horse4
. An L374F substitution in SLC45A2
is at 100% frequency in the European sample, but absent in the Asian and African samples. A recent association study has shown that the Phe-encoding allele is correlated with fair skin and non-black hair in Europeans4
. Together, the data support SLC45A2
as a target of positive selection in Europe10,17.
Global distribution of SLC24A5 A111T and EDAR V370A
In the African sample (Yoruba in Ibadan, Nigeria), there is evidence of selection for two genes with well-documented biological links to the Lassa fever virus. The strongest signal in the genome, on the basis of the LRH test, resides within a 400 kb region that lies entirely within the gene LARGE
. The LARGE protein is a glycosylase that post-translationally modifies α-dystroglycan, the cellular receptor for Lassa fever virus (as well as other arenaviruses), and the modification has been shown to be critical for virus binding3
. The virus name is derived from Lassa, Nigeria, where the disease is endemic, with 21% of the population showing signs of exposure18
. We also noted that the DMD
locus is on our larger candidate list of regions, with the signal of selection again in the Yoruba sample. DMD
encodes a cytosolic adaptor protein that binds to α-dystroglycan and is critical for its function. We hypothesize that Lassa fever created selective pressure at LARGE
. This hypothesis can be tested by correlating the geographical distribution of the selected haplotype with endemicity of the Lassa virus, studying infection of genotyped cells in vitro
, and searching for an association between the selected haplotype and clinical outcomes in infected patients.
In the Asian samples, we found evidence of selection for non-synonymous polymorphisms in two genes in the ectodysplasin (EDA) pathway, which is involved in development of hair, teeth and exocrine glands6
. The genes are EDAR and EDA2R, which encode the key receptors for the ligands EDA A1 and EDA A2, respectively. Notably, the EDA signalling pathway has been shown to be under positive selection for loss of scales in multiple distinct populations of freshwater stickleback fish19
. A mutation encoding a V370A substitution in EDAR
is near fixation in Asia and absent in Europe and Africa. An R57K substitution in EDA2R
has derived-allele frequencies of 100% in Asia, 70% in Europe and 0% in Africa.
The EDAR polymorphism is notable because it is highly differentiated between the Asian and other continental populations (the third most differentiated among 15,816 non-synonymous SNPs), and also within Asian populations (in the top 1% of SNPs differentiated between the Japanese and Chinese HapMap samples). Genotyping of the EDAR polymorphism in the CEPH (Centre d'Etude du Polymorphisme Humain) global diversity panel20 shows that it is at high but varying frequency throughout Asia and the Americas (for example, 100% in Pima Indians and in parts of China, and 73% in Japan) (Supplementary Fig. 7
). Studying populations like the Japanese, in which the allele is still segregating, may provide clues to its biological significance.
EDAR has a central role in generation of the primary hair follicle pattern, and mutations in EDAR cause hypohidrotic ectodermal dysplasia (HED) in humans and mice, characterized by defects in the development of hair, teeth and exocrine glands6. The V370A polymorphism, proposed to be the target of selection, lies within EDAR's highly conserved death domain (Supplementary Fig. 8), the location of the majority of EDAR polymorphisms causing HED21. Our structural modelling predicts that the polymorphism lies within the binding site of the domain.
Structural model of the EDAR death domain
Our analysis only scratches the surface of the recent selective history of the human genome. The results indicate that individual candidates may coalesce into pathways that reveal traits under selection, analogous to the alleles of multiple genes (for example, HBB
) that arose and spread in Africa and other tropical populations as a result of the partial protection they confer against malaria2,12
. Such endeavours will be enhanced by continuing development of analytical methods to localize signals in candidate regions, generation of expanded data sets, advances in comparative genomics to define coding and regulatory regions, and biological follow-up of promising candidates. True understanding of the role of adaptive evolution will require collaboration across multiple disciplines, including molecular and structural biology, medical and population genetics, and history and anthropology.