Patterns of DNA sequence variation in the genome contain a record of past selective events. The ability to collect increasingly large data sets of polymorphisms has allowed investigators to perform hypothesis-driven studies of candidate genes as well as genome-wide scans for signatures of adaptations. This genetic approach to the study of natural selection has identified many signals consistent with predictions from anthropological studies. Selective pressures related to variation in climate, diet, and pathogen exposure have left strong marks on patterns of human variation. Additional signals of adaptations are observed in genes involved in chemosensory perception and reproduction. Several ongoing projects aim to sequence the complete genome of 1000 individuals from different human populations. These large-scale projects will provide data for more complete genome scans of selection, but more focused studies aimed at testing specific hypotheses will continue to hold an important place in elucidating the history of adaptations in humans.
From a genetic perspective, evolution can be defined as changes in allele frequencies over time due to mutation, genetic drift, migration, and natural selection. The investigation of natural selection and genetic adaptations in humans has held a central place in biological anthropology as well as in disciplines such as human genetics and evolutionary biology. Although these disciplines differ slightly in the questions they address and the approaches they use, they intersect in the field of human population genetics, which uses genetic variation data to learn about the past demographic events and the history of natural selection in human populations.
Traditional approaches to identifying adaptations to local environments have often relied on comparing the distribution of a phenotype to an environmental variable hypothesized to mirror a selective pressure. This was, for example, pertinent to the malaria hypothesis, which explains the geographic distribution of the β-thalassemia (Haldane 1949) and HbS (Allison 1954) alleles in terms of the resistance they confer to malaria. Additional notable examples of this approach focused on quantitative phenotypes, such as the relationship between body mass and temperature (Katzmarzyk & Leonard 1998, Roberts 1978), skin pigmentation and solar radiation (Jablonski & Chaplin 2000), and oxygen saturation of arterial hemoglobin and altitude (Beall et al. 1997). If the phenotypes are heritable, the correlation between the distribution of phenotype and environmental variable constitutes evidence for an adaptation at the genetic level (as opposed, for example, to acclimatization). The argument for positive selection is especially strong if a correlation can also be shown between the phenotype of interest (e.g., high oxygen saturation at high altitude) and a measure of fitness (e.g., number of surviving offspring) (e.g., Beall et al. 2004).
Although these traditional approaches directly test the hypothesis that natural selection acted on a given genetic variant, their application on a large scale is hampered by several limitations. They require investigators to collect phenotype data, which is expensive and time consuming, and analyze large samples to achieve adequate statistical power. More recently, an approach based on the genetic signature of natural selection on patterns of variation has found wide applicability in studies of adaptations in humans and other species. This approach is based on the idea that natural selection introduces a local perturbation in the patterns of neutral genetic variation surrounding an advantageous allele relative to regions of the genome where variation is shaped only by genetic drift. This approach has two major advantages: First, it does not require investigators to collect phenotype information, and second, it can detect adaptive changes resulting from selection coefficients, e.g., 1%, that would be very hard to detect by traditional phenotype-based approaches (Gillespie 1991). The recent development of technologies and resources for studying human genetic variation on an unprecedented scale has allowed investigators to scan the entire human genome for signals of positive natural selection. Moreover, as the same technologies are being applied to the study of the genetic bases of common diseases, signals of natural selection can be connected to genotype-phenotype association signals, where both signals arise from unbiased genome-scale analyses.
Despite the somewhat separate trajectories of anthropological, human genetic, and evolutionary genetic studies of human adaptations, by weaving together strands from these disciplines, it may soon be possible to reconstruct the history of selective events that occurred during human evolution and to infer the role of human adaptations in health and disease. Here, we briefly review models of selection and methods for detecting its signature. In describing the main signals of selection reported so far, we focus on those that pertain to variation that is still segregating in humans, rather than to mutations that became fixed between humans and closely related primates. Our goal is to provide a starting point for the synthesis of findings from the different disciplines that are contributing to our understanding of human adaptations.
Natural selection can be defined as the process by which beneficial heritable traits increase in frequency over time and unfavorable heritable traits become less common. This process can occur according to a variety of models; the resulting signatures vary considerably across selection models and for different values of the relevant parameters (e.g., strength of selection, mode of inheritance, age of the selective pressure). Therefore, consideration of these models is crucial for interpreting empirical patterns of variation in regions of the genome that experienced positive selection.
Standard selection models involve loci that have two alleles, and selection occurs if the fitness differs across the three genotypes. Directional selection occurs when one of the two alleles is favored over the other so that its frequency may increase until it reaches fixation. The time needed for a new beneficial allele to become fixed is a function of the selection coefficient and mode of inheritance, with favored dominant and additive alleles increasing in frequency much faster than recessive ones (see Figure 1a). Under balancing selection, the heterozygote has fitness greater than that of both homozygotes; in this case, a new allele quickly reaches a stable equilibrium frequency and fluctuates around that frequency for as long as the selective pressure is present (see Figure 1b). Therefore, although both directional and balancing selection will result in the rapid increase in frequency of a new beneficial allele, the former ultimately eliminates variation from a population while the latter tends to maintain it. If balancing selection is due to a long-standing selective pressure, a polymorphism may be maintained in the population well beyond the expected lifetime of a neutral polymorphism (i.e., 4Ne generations for an autosomal locus, where Ne is the effective population size). Another classical model is diversifying selection in which two or more phenotypes are simultaneously favored. This model generally predicts an increase in levels of genetic variation.
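The deterministic dynamics behind these models can be sketched with the standard one-locus, two-allele recursion for an infinite, randomly mating population. The fitness values, the starting frequency, and the function names below are illustrative assumptions, not parameters taken from Figure 1.

```python
def next_freq(p, w_AA, w_Aa, w_aa):
    """One generation of selection on allele A (frequency p) under random mating."""
    q = 1.0 - p
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa  # mean fitness
    return (p * p * w_AA + p * q * w_Aa) / mean_w


def trajectory(p0, w_AA, w_Aa, w_aa, generations):
    """Allele frequency in every generation, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_freq(freqs[-1], w_AA, w_Aa, w_aa))
    return freqs


s = 0.01   # assumed selection coefficient (1%)
p0 = 0.01  # assumed starting frequency of the favored allele

# Directional selection: additive versus recessive favored allele.
additive = trajectory(p0, 1 + s, 1 + s / 2, 1.0, 2000)
recessive = trajectory(p0, 1 + s, 1.0, 1.0, 2000)

# Balancing selection: the heterozygote is fitter than both homozygotes.
balanced = trajectory(p0, 0.9, 1.0, 0.8, 2000)
```

Under these assumed values the additive allele is close to fixation after 2000 generations while the recessive allele is still rare, and the overdominant allele settles at the equilibrium (w_Aa - w_aa) / (2 w_Aa - w_AA - w_aa) = 2/3. Genetic drift is deliberately left out of this sketch.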
In addition to the models above, scenarios with temporally and spatially varying selection may be particularly relevant to humans (Gillespie 1991). Temporal variation in selective pressures probably resulted from the climatic changes of the Ice Ages, the diversity of habitats to which humans became exposed during their dispersal, and the environmental changes arising from the impact of human activities, e.g., the spread of agriculture. The new selective pressures may favor new alleles—as described in the models above—or standing (previously neutral) alleles, which may afford a faster adaptive response to environmental shifts. These two models make different predictions about the population dynamics of the favored allele and the expected signature of selection. The difference in predictions is mainly because a beneficial allele that was previously neutral underwent an initial phase of random drift before being driven to high frequency by selection; this initial neutral phase has important consequences for the expected patterns of linked variation. In addition to environmental and selective changes over the time span of human evolution, there is also ample evidence for spatial variation in selective pressures; this may be the consequence of variation in features of the physical environment, such as climate, or of the different subsistence strategies adopted by humans, which in turn influence other aspects of the environment, e.g., diet or pathogen exposure. In some cases, selective pressures may vary spatially in a graded fashion, e.g., climate or ultraviolet (UV) exposure, with important implications for the expected geographic distribution of favored alleles. 
Finally, differences across environments may lead to spatial variation in the intensity of purifying selection (i.e., selection removing deleterious mutations); for example, a gene that was under strong purifying selection in a given environment may undergo relaxation of selective constraints when the population migrates to a different location or adopts a different lifestyle.
As shown in Figure 1, natural selection acting on a beneficial variant has a dramatic impact on its rate of frequency change over time, resulting in a trajectory that is distinct from that expected for neutral variants. Under directional selection or during the initial phase of a balanced polymorphism, beneficial alleles are characterized by a rapid rise in frequency. Because of the low recombination distance, the histories of the beneficial allele and the nearby neutral alleles are strongly correlated. Because of this correlation, natural selection generates a local perturbation in the pattern of variation tightly linked to the selected site. In contrast, patterns of variation in neutrally evolving genomic regions are influenced only by genetic drift and, therefore, by properties of the population, including the history of population size and population structure. Therefore, detecting the signature of natural selection essentially consists in distinguishing the patterns of variation that are shaped solely by demographic history from those that are influenced by natural selection, in addition to demography.
Briefly, as shown in Figure 2, when a new advantageous allele is introduced into the population by the mutation process, it will be associated with a particular haplotype background. As this allele is driven quickly to an intermediate or high frequency, the neutral alleles that define the haplotype and that are tightly linked to the selected site will also tend to increase in frequency. Owing to the rapid increase in frequency of the advantageous allele, recombination does not erode the association between the selected site and the surrounding neutral sites, thereby generating a local pattern of extended identical haplotypes [referred to as extended haplotype homozygosity (EHH)] that occur at intermediate to high frequencies. This process is often referred to as a partial or incomplete selective sweep. A number of statistical tests have been developed to detect this haplotype pattern (Hanchard et al. 2006, Hudson et al. 1994, Innan et al. 2005, Sabeti et al. 2002, Voight et al. 2006). If selection is directional, the advantageous allele may go to fixation. In this case, which is referred to as a complete selective sweep, all variation near the selected site will also be fixed and only new mutations arising during the sweep will segregate in the population at low frequencies. Therefore, the expected pattern near a fixed advantageous allele consists of a reduction in polymorphism levels and a relative abundance of rare alleles. At a greater distance from the selected site, recombination events will tend to uncouple the advantageous allele from the neutral alleles. This may result in a pattern characterized by high frequency derived alleles (i.e., new mutations); because such alleles tend to be rare under neutrality, the occurrence of multiple high frequency derived alleles within a small region constitutes a strong signal of selection. 
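As a rough illustration of the haplotype signal, EHH for a core allele can be computed as the probability that two randomly chosen chromosomes carrying that allele are identical across the interval from the core site to a given marker. The sketch below assumes phased haplotypes encoded as equal-length strings of 0/1 alleles; the toy data and the function name are made up for illustration.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, end):
    """EHH of `allele` at site `core`, measured out to site `end` (inclusive)."""
    lo, hi = sorted((core, end))
    # Keep only chromosomes carrying the core allele, cut to the interval.
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return float("nan")  # undefined with fewer than two carriers
    # Count pairs of carriers that are identical over the whole interval.
    identical_pairs = sum(comb(c, 2) for c in Counter(carriers).values())
    return identical_pairs / comb(n, 2)


# Toy phased haplotypes (hypothetical); site 2 is the core site.
haps = [
    "00100",
    "00100",
    "00100",
    "00101",
    "01000",
    "10001",
    "11010",
    "00011",
]

ehh_selected = ehh(haps, core=2, allele="1", end=4)  # carriers mostly identical
ehh_other = ehh(haps, core=2, allele="0", end=4)     # carriers all differ
```

In this toy sample the chromosomes carrying allele 1 at the core site retain much more homozygosity at the distant marker than those carrying allele 0, which is the qualitative contrast that haplotype-based tests look for.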
Therefore, different aspects of the data (i.e., polymorphism levels, allele frequency spectrum, haplotype structure) will be informative to detect complete versus incomplete selective sweeps. A number of statistical tests have been developed to capture the impact of positive natural selection under different models (see Appendix); we refer the reader to several recent reviews for details (Biswas & Akey 2006, Nielsen 2005, Nielsen et al. 2007, Sabeti et al. 2006).
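One widely used summary of the allele frequency spectrum is Tajima's D. This section does not name it, but it captures the relative abundance of rare alleles expected after a complete sweep: D tends to be negative when rare alleles are in excess and positive when intermediate-frequency alleles are. A minimal sketch, assuming the input is the count of one allele at each segregating site in a sample of n chromosomes (the example counts are made up):

```python
from math import sqrt


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts in a sample of n chromosomes."""
    S = len(allele_counts)  # number of segregating sites
    if S == 0 or n < 4:
        raise ValueError("need at least one segregating site and n >= 4")

    # Mean number of pairwise differences.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Constants for the expected value and variance under neutrality.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


d_rare = tajimas_d([1] * 10, n=10)          # only singletons
d_intermediate = tajimas_d([5] * 10, n=10)  # only 50% frequency alleles
```

The sign of D is the informative part here; judging significance requires a null distribution, which, as the following sections discuss, depends on demographic assumptions.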
When selection acts on a variant that is advantageous only in a subset of populations, the frequency of that variant may differ across populations to a greater extent than predicted for variants evolving neutrally in all populations. Several approaches have been devised to detect such adaptations to local environments. Historically, the most widely used is based on the statistic FST and its modifications (Beaumont & Balding 2004, Consortium 2005), which simply summarize allele frequency differences between pairs of populations or among multiple populations (Weir 1996). Variants with unusually large FST values are typically interpreted as being the targets of local selective pressures (Lewontin & Krakauer 1973).
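For a single biallelic variant, the idea behind FST can be sketched as the proportional reduction in expected heterozygosity within populations relative to the pooled population. This is a simplified, equal-weight version that ignores sample-size corrections; it is not the specific estimator used in the studies cited above, and the frequencies are hypothetical.

```python
def fst(freqs):
    """Simple FST for one biallelic variant from per-population allele frequencies."""
    p_bar = sum(freqs) / len(freqs)      # mean allele frequency
    h_total = 2 * p_bar * (1 - p_bar)    # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                       # variant is monomorphic everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


no_difference = fst([0.5, 0.5])  # identical frequencies
strong = fst([0.1, 0.9])         # large frequency difference
fixed = fst([0.0, 1.0])          # fixed for alternative alleles
```

Identical frequencies give 0 and alternative fixation gives 1; a variant would be flagged only if its value is unusually large relative to other variants in the same populations.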
If the intensity of selection varies spatially in a graded fashion, advantageous allele frequencies may vary across populations following the geographic distribution of the selective pressure. Therefore, if the selective pressure—or a good proxy for it—is known, a test of the correlation between the advantageous allele frequency and the value of the environmental variable may provide evidence for spatially varying selection. This approach is particularly appropriate when the advantageous phenotype is known to have a clinal distribution; for example, this is true for human skin pigmentation as well as body mass and proportionality, which are correlated with UV radiation and temperature, respectively (Jablonski & Chaplin 2000, Katzmarzyk & Leonard 1998).
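The correlation test described here can be sketched as follows. The latitudes and allele frequencies are invented for illustration, and the sketch reports only the correlation coefficient: populations share ancestry and are not independent observations, so a naive significance test on these values would be misleading.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical population samples: absolute latitude (degrees) and the
# frequency of a candidate allele in each sample.
latitude = [0, 10, 25, 40, 55, 65]
allele_freq = [0.10, 0.15, 0.30, 0.45, 0.60, 0.70]

r = pearson_r(latitude, allele_freq)
```

A coefficient near 1 or -1 across many populations is the pattern of interest; whether it exceeds what shared population history alone would produce has to be assessed against other variants genotyped in the same samples.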
Additional information regarding the action of natural selection on genetic variation may be extracted by comparing the pattern of variation within species and among species (i.e., humans and a close outgroup) for synonymous and nonsynonymous variants within a coding region. If a gene is evolving neutrally, the ratios of nonsynonymous to synonymous mutations within species and among species are expected to be the same. However, if natural selection drives multiple advantageous mutations to fixation within the same gene, this action may generate an excess of nonsynonymous relative to synonymous changes among species compared with the ratio observed within species. Alternatively, an excess of nonsynonymous mutations may be observed in the variation within species relative to among species. One possible interpretation for this pattern is that diversifying selection maintains a higher number of nonsynonymous variants; this explanation likely applies to the HLA genes (see below). However, a more common interpretation of this pattern, which is often observed in human genes, is that weak purifying (rather than positive) selection acts on nonsynonymous polymorphisms. Under this scenario, the nonsynonymous variants in excess are slightly deleterious mutations that reach nontrivial frequency in populations but are unlikely to become fixed.
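This comparison is commonly set up as a 2x2 table of nonsynonymous and synonymous counts for fixed differences between species (Dn, Ds) and polymorphisms within species (Pn, Ps), in the style of the McDonald-Kreitman test. The sketch below uses made-up counts, computes the ratio of ratios (often called the neutrality index), and applies a two-sided Fisher's exact test written out by hand so that it needs only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def neutrality_index(Pn, Ps, Dn, Ds):
    """(Pn/Ps) / (Dn/Ds); assumes Ps, Dn and Ds are all nonzero."""
    return (Pn / Ps) / (Dn / Ds)


# Hypothetical counts for one gene.
Dn, Ds = 20, 20  # fixed differences between species
Pn, Ps = 5, 30   # polymorphisms within species

ni = neutrality_index(Pn, Ps, Dn, Ds)
p_value = fisher_exact_two_sided(Dn, Ds, Pn, Ps)
```

An index below 1, as in this invented example, corresponds to an excess of nonsynonymous fixed differences; an index above 1 corresponds to the excess of nonsynonymous polymorphism that the text attributes to diversifying or weak purifying selection.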
From the statistical standpoint, two main approaches have been used to detect the signature of natural selection: model-based and empirical approaches. In the model-based approach, a theoretical model of population history in which all variation is neutral is used to develop quantitative expectations about patterns of variation (quantitative expectations for different models of demography under neutrality can be readily obtained using simulations). If a test locus exhibits variation patterns that are significantly different from those expected under the model, the null hypothesis of neutrality is rejected. A variant of this approach is to compare the likelihood of the data under neutrality with that under a specified model of selection, using the statistical framework of the likelihood ratio test (Kim & Stephan 2002, Nielsen et al. 2005, Przeworski 2003). The drawback of model-based methods is that, if the assumptions about population history are violated, one may falsely reject neutrality. Although much progress has been made in reconstructing the broad picture of human population history (Schaffner et al. 2005, Stajich & Hahn 2005, Voight et al. 2005), it is widely recognized that the details of the true history are too complex to be captured by simple models. Unlike the model-based method, the empirical method is agnostic about population history, and, under the assumption that most loci in the human genome evolve neutrally, it aims at identifying the loci with the most unusual patterns compared with large-scale data sets of genetic variation. For example, empirical patterns of variation may be summarized by means of one or more test statistics, which are used to rank loci in the genome; those loci falling above an arbitrary cut-off (typically the top 5%) are identified as unusual and are often referred to as outliers.
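The empirical (outlier) procedure amounts to ranking loci by a test statistic and flagging the upper tail. A minimal sketch, assuming larger scores are more extreme; the locus names and scores are placeholders, not real data.

```python
def top_outliers(scores, tail=0.05):
    """Return the loci in the top `tail` fraction of the ranked statistic."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # most extreme first
    k = max(1, round(len(ranked) * tail))                  # size of the tail
    return ranked[:k]


# Placeholder scores for 100 hypothetical loci.
scores = {f"locus_{i}": float(i) for i in range(100)}
outliers = top_outliers(scores)
```

By construction this flags the same fraction of loci whether or not any of them were targets of selection, which is why the cut-off is described as arbitrary.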
A recently developed composite likelihood method attempts to integrate the advantages of both approaches by comparing a model without selection that is fitted to genome-wide variation data with a model in which selection acted in a particular region of the genome (Nielsen et al. 2005). This method, which is tailored to the detection of complete selective sweeps, is remarkably robust to deviations from the demographic assumptions (Williamson et al. 2007).
Because it relies on large-scale variation data, the outlier approach has gained wider applicability since the completion of genome-wide projects, such as the International HapMap Project (Consortium 2005, Frazer et al. 2007) and the Perlegen Project (Hinds et al. 2005). This approach has already generated many interesting candidate targets of selection. However, to reconstruct the history of selective events accurately, all or the vast majority of adaptations must be detected, and this is unlikely to occur with outlier approaches alone. This is because there is a trade-off between false positives and false negatives, the extent of which is sensitive to a number of factors that affect the statistical power of neutrality tests (see below) (Teshima et al. 2006). To complement outlier approaches, investigators can compare the strength of the evidence for positive selection across groups of genes in different biological processes; a significant excess in one group of genes suggests that the biological process as a whole evolved genetic adaptations, even though the signals at individual genes may not reach genome-wide significance (Akey et al. 2002, Tang et al. 2007, Voight et al. 2006, Wang et al. 2006).
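One simple way to test for an excess of signals in a group of genes is a hypergeometric upper-tail probability. The sketch below assumes that genes are independent and equally likely to be flagged, which ignores gene length and physical clustering; the cited studies may have used other procedures, and all counts are hypothetical.

```python
from math import comb


def enrichment_p(total_genes, genes_in_category, outlier_genes, outliers_in_category):
    """P(at least this many outliers fall in the category) under random draws."""
    N, K, n, k = total_genes, genes_in_category, outlier_genes, outliers_in_category
    upper = min(K, n)
    # Number of ways to draw n genes with i of them from the category, i >= k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical scan: 20000 genes, 200 in the category of interest,
# 1000 outlier genes overall, 25 of which are in the category.
p_excess = enrichment_p(20000, 200, 1000, 25)
```

With 10 category genes expected among the outliers by chance in this invented example, observing 25 gives a small tail probability even though no single gene need stand out on its own.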
The power of neutrality tests based on polymorphism levels, the spectrum of allele frequency, and haplotype structure is affected by a number of factors, most notably the age of the selective event and the strength of selection, whereby younger and more strongly advantageous alleles are more easily detected. In addition, it is important to note that the process illustrated in Figure 2 applies most directly to mutations that are immediately advantageous and that are dominant or codominant. If selection acts on standing—previously neutral—mutations or on recessive new mutations, the expected pattern of variation may not be distinguishable from that typical of neutrally evolving regions of the genome, leading to a marked reduction in power (Hermisson & Pennings 2005; Pennings & Hermisson 2006a,b; Przeworski et al. 2005). The similarity in expected variation patterns under neutrality and under the above selection models is due to the fact that the frequency trajectory of previously neutral or recessive new mutations is characterized by a longer initial phase, prior to the rapid rise in frequency, during which mutation and recombination events can generate substantial diversity in the chromosomal background of the mutation. For the case of selection on standing variation, the signature of selection is unlikely to be detectable if the frequency of the variant at the onset of selection is higher than 5% (Przeworski et al. 2005).
The above factors affect the power of different tests to a varying degree. Therefore, the results of genome-wide selection scans are inherently biased toward the type of signals most powerfully detected by the test used in the scan. This bias complicates the analysis of the overlap of signals across studies and potentially distorts the reconstruction of the overall history of selective pressures. Approaches that combine different summaries of the data have also been implemented in hopes of achieving greater power and accuracy (Tang et al. 2007, Zeng et al. 2006).
Here, we review briefly the main signals of selection observed in studies of genetic variation within and among human populations. Some of these studies tested specific hypotheses arising from the anthropological literature on phenotypic variation and on its adaptive significance (see, for example, Biasutti 1959; Roberts 1953, 1978; Roberts & Kahlon 1976). With the advent of genome-wide scans for selection signals, investigators can evaluate the evidence for positive selection in an unbiased manner. Genome-wide scans have identified many selection targets that are consistent with previous anthropological studies. In addition, these scans implicated other, perhaps less expected signals of selection.
That climate and the physical environment played an important role in shaping human phenotypic variation is evident from the strong correlations between traits such as body mass, basal metabolic rate, or skin reflectance and variables such as temperature, latitude, and UV radiation (Henry & Rees 1991, Jablonski & Chaplin 2000, Leonard et al. 2005, Roberts & Kahlon 1976). Other phenotypes likely to have similar geographic distributions include those related to sodium homeostasis, salt and water retention, and thermogenesis. Overall, studies of phenotypic variation have provided strong support for the notion that heat and cold stress as well as UV radiation have exerted strong selective pressures on humans; genetic studies of selection have largely supported these predictions.
Among the phenotypes with complex patterns of inheritance, skin pigmentation is one of the more completely characterized phenotypes at the genetic level (Rees 2003 and references therein), with contributions from studies of animal models [e.g., Kelsh et al. 1996, Mammalian phenotype ontology database (see Related Resources)], Mendelian pigmentation disorders, and genotype-phenotype association (Bonilla et al. 2005; Duffy et al. 2004, 2007; Flanagan et al. 2000; Graf et al. 2005, 2007; Kanetsky et al. 2002; Lamason et al. 2005; Palmer et al. 2000; Rebbeck et al. 2002; Smith et al. 1998; Stokowski et al. 2007; Sulem et al. 2007).
An early study of the genetic signature of selection on skin pigmentation examined variation at the melanocortin 1 receptor (MC1R) locus, which influences variation in skin and hair color. The authors found a larger proportion of nonsynonymous variants in non-African compared with African samples, which was interpreted as evidence for strong purifying selection against nonsynonymous mutations in Africa and relaxation of constraints outside Africa (Harding et al. 2000). A recent study focused on a strong candidate gene for skin pigmentation, SLC24A5, which, when mutated, causes a pigmentation defect in zebrafish that can be rescued by the human gene (Lamason et al. 2005). In humans, a nonsynonymous alanine to threonine polymorphism at position 111 is associated with variation in skin pigmentation among African American and African Caribbean individuals. The light pigmentation allele is fixed or nearly fixed in Europeans, and, consistent with the expectations for a complete selective sweep, the genomic region spanning SLC24A5 exhibits low levels of polymorphism.
The availability of extensive variation data allowed investigators to search for selection signatures in candidate genes for skin pigmentation on the basis of strong population differentiation (i.e., high FST) and EHH (Izagirre et al. 2006, Lao et al. 2007, Myles et al. 2007, Norton et al. 2007). The geographic distribution of some of these candidate single nucleotide polymorphisms (SNPs) was characterized in further detail; these analyses found evidence for variation leading to lighter pigmentation at OCA2 in both Europe and Asia and at ASIP worldwide, whereas the SLC24A5, MATP and TYR genes showed evidence for selection only in Europeans. Because signatures of selection for lighter pigmentation are not found at the same genes in Europe and Asia, convergent evolution for reduced pigmentation appears to be relatively common for this trait (Norton et al. 2007).
Unbiased scans for signals of adaptations using genome-wide data have also provided strong evidence for selection on skin pigmentation. Strong signals were detected using haplotype structure [OCA2, TYRP1, DTNBP1, SLC24A5 and MYO5A by Voight et al. (2006) and Tang et al. (2007)], EHH and strong population differentiation [SLC24A5, SLC45A2 by Sabeti et al. (2007)], strong population differentiation alone [ABCC11, SLC45A2, SLC24A5 by Barreiro et al. (2008)], and frequency spectrum [KITLG, MATP, SILV, OCA2, TRPM1, SLC24A5, RAB27A, MC2R, ATRN by Williamson et al. (2007)].
The role of heat stress as a selective pressure in humans has been summarized in the sodium-retention hypothesis, which posits that, in hot, humid environments, selection for high sodium retention was strong because salts are lost quickly through sweat but are important for maintaining temperature homeostasis (Denton 1982, Gleibermann 1973). One prediction of this hypothesis is that genetic variation underlying adaptations to heat stress is correlated with climatic variables (e.g., temperature and potential evaporation) or latitude as a proxy for climate. In addition, signals consistent with complete or partial selective sweeps may be observed within individual populations. Consistent with these predictions, the frequency of genetic variants implicated in interindividual variation in sodium retention and risk of hypertension was significantly correlated with latitude. Researchers observed this pattern in a number of genes including those coding for angiotensinogen (AGT), which plays a vital role in the renin-angiotensin pathway, cytochrome P450 3A5 (CYP3A5), which activates cortisol in the kidney, and the G-protein beta-3 subunit, which is involved in signal transduction in a number of tissues (Thompson et al. 2004, Young et al. 2005). In the case of the AGT and the CYP3A5 genes, investigators also observed signals of natural selection using tests of the frequency spectrum or haplotype structure (Nakajima et al. 2004, Thompson et al. 2004).
In addition to heat tolerance, selective pressures related to climate probably acted to increase cold tolerance by shaping genetic variation in energy metabolism. Consistent with a role for positive selection on cold tolerance–enhancing variation, Mishmar et al. (2003) and Ruiz-Pesini et al. (2004) showed that the number and frequencies of nonsynonymous mutations in mitochondrial genes increase with distance from Africa. Furthermore, we recently investigated a large number of candidate susceptibility genes for common metabolic disorders (e.g., type 2 diabetes, obesity) to determine whether there was widespread evidence for spatially varying selection in energy metabolic genes (Hancock et al. 2008). As with the sodium-retention genes, we found that variation in these genes was significantly correlated with climate variables in worldwide population samples. Several of the strongest signals included variants previously associated with disease, and several of these fell in thermogenesis pathway genes.
The results of genome-wide selection scans have generally been consistent with those from hypothesis-driven studies. Carlson et al. (2005) scanned the human genome for signals of selection using the frequency spectrum and found evidence for selection at CYP3A5. Voight et al. (2006) used the genotype data from the HapMap project to test for evidence of an incomplete selective sweep by detecting regions of EHH across the genome; this study found significant signals at the CYP3A5 and the leptin receptor (LEPR) genes, both showing strong correlations between allele frequency and climate variables. In addition to individual signals, they found a significant excess of signals in genes involved in biological processes related to energy metabolism, such as carbohydrate metabolism and electron transport (in Asians and Europeans, respectively). Tang et al. (2007) used a different measure of EHH that preferentially targeted signals in high frequency alleles; this study found signals for the CYP3A5 region and for a group of genes with similar function (i.e., oxidoreductase activity). Williamson et al. (2007) used a composite likelihood test to scan the genome for signals of complete selective sweeps; this study identified several strong signals in genes that may be important for heat or cold stress adaptations, including those coding for sterol carrier protein 2 (SCP2), farnesoid X receptor (NR1H4), and activator of heat shock 90kDa protein ATPase homolog 1 (AHSA1).
Lactase persistence is the production, after infancy, of the enzyme lactase, which breaks down the milk sugar lactose into glucose and galactose so that it can be further processed in the intestines. Although uncommon in nonhuman species, in some human populations lactase persistence is relatively common. Prevalence of lactase persistence varies greatly among human populations and has been shown to cooccur with cultural traits that involve the inclusion of dairy products in the diet. A cline in the lactase persistence phenotype exists within Europe such that persistence is highest among populations in the Northwest and lowest among those in the Southeast (Swallow 2003 and references therein). But the observed pattern is not restricted to Europe: Pastoral populations in the Middle East (e.g., Bedouin) and sub-Saharan Africa (e.g., Fulani and Tutsi) also have high prevalence of lactase persistence. These findings led to the hypothesis that alleles causing lactase persistence were advantageous when populations adopted dairy farming and shifted to a diet in which milk is a major adult staple (Kretchmer 1972, McCracken 1971, Simoons 1970).
In Europeans, lactase persistence was shown to be due to a polymorphism about 14 kb upstream of the gene coding for the lactase enzyme (LCT). This polymorphism, denoted C/T−13910, is likely to affect lactase production via changes in transcription levels. The genomic region containing this polymorphism was investigated for signatures of positive natural selection in ethnically diverse population samples, with a special focus on the haplotype associated with persistence in Europeans (Bersaglieri et al. 2004). Consistent with the idea that lactase persistence was advantageous in European populations, multiple polymorphisms displayed unusually large differences in allele frequencies between European and non-European populations, and the haplotype structure exhibited extremely high EHH in European populations. The time since expansion of the C/T−13910 variant in Europeans was estimated to be 2188–20650 years, roughly consistent with the likely time of onset of dairy farming. The above findings were bolstered by genome-wide scans for selection, which showed that the LCT gene in Europeans contains the strongest signal of a partial selective sweep in the human genome (Consortium 2005, Sabeti et al. 2007, Voight et al. 2006, Williamson et al. 2007).
Although it is common in Europeans, the C/T−13910 variant is absent or very rare in some African and Middle Eastern populations where lactase persistence is high, which led to the hypothesis that different variants underlie the lactase persistence phenotype across populations. Indeed, genotype-phenotype association studies in African pastoral populations found that several polymorphisms near C/T−13910 (G/T−14010, T/G−13915, and C/G−13907) may cause lactase persistence. Furthermore, in vitro studies showed that these alleles affected LCT transcription levels. These findings suggested that convergent evolution occurred for lactase persistence in Europeans and in African pastoralist populations that regularly drink milk (Tishkoff et al. 2007). The high frequency of the lactase persistence allele(s) in African pastoralists compared with nonpastoralists and the high EHH among pastoralists are consistent with the notion that these alleles were driven quickly to high frequency by positive natural selection. As with European populations, the time estimated since the expansion of the lactase persistence allele in Africa, i.e., 3000–7000 years ago, is consistent with selection beginning around the time when a pastoral subsistence strategy was adopted in Africa. Enattah et al. (2008) described a high-frequency haplotype in Saudi Arabians that is defined by two polymorphisms (G−13915 and C−3712) upstream of the LCT gene and is correlated with lactase persistence; in vitro assays showed that the variants defining this haplotype influenced LCT transcription. As with the European and African lactase persistence haplotypes, they found evidence that this haplotype had been driven quickly to high frequency by positive selection. They estimated the age of the haplotype to be ~4000 years, broadly consistent with the timing of camel domestication in the region.
Starch consumption, like adult milk consumption, varies greatly across populations with different subsistence, most notably across agricultural versus nonagricultural populations. Therefore, the ability to digest and absorb starch is likely to have become advantageous since the spread of agriculture. Perry et al. (2007) recently presented evidence supporting this hypothesis. They examined variation in the number of copies of the amylase gene (AMY1), which encodes an enzyme important in starch hydrolysis, between agricultural and nonagricultural populations. They found that the agricultural populations had, on average, more copies of AMY1 than did the combined hunter-gatherer and pastoralist populations, whose copy numbers were intermediate between those of the agricultural populations and chimpanzees and bonobos.
Alcohol dehydrogenase genes in the ADH1 cluster, which play an important role in ethanol metabolism, harbor variation associated with susceptibility to alcoholism. This includes a nonsynonymous polymorphism (R47H) in the ADH1B gene, which has near-fixation frequency in Asian populations and is rare elsewhere (Osier et al. 2002a,b). Indeed, a genome-wide selection scan based on population differentiation found that the differentiation value for ADH1B R47H was exceptionally high (Consortium 2005). Osier et al. (2002b) reported that several variants on the same haplotype showed high population differentiation, such that the haplotype was common in East Asian populations and was rare or absent elsewhere. Additionally, in their genome-wide scan, Voight et al. (2006) found evidence for selection at the ADH1 gene cluster on the basis of strong EHH in East Asian populations. These results strongly argue for a selective advantage conferred by the ADH1 polymorphism(s). However, whereas ADH1 polymorphisms clearly affect variation in the processing of a common dietary component, i.e., ethanol, it is unclear whether ethanol consumption itself was the selective pressure underlying the observed population genetic pattern. Because the high-activity ADH1 alleles result in an accumulation of acetaldehyde in response to alcohol load and because acetaldehyde has antiprotozoal activity, Goldman & Enoch (1990) proposed that these alleles are selectively advantageous because they protect against severe infectious diseases caused by protozoans.
Other specific dietary components may also be important. For example, two genome scans found evidence for selection from EHH for genes important in vitamin transporter activity and cofactor transporter activity (Tang et al. 2007, Voight et al. 2006). In addition, Wang et al. (2006) found signatures of positive selection for several genes involved in protein metabolism (ADAMTS19–20, APEH, PLAU, HDAC8, UBR1, and USP26) and a significant excess of signals in the group of genes in this biological category compared with other groups.
In addition to variation in dietary components, fluctuation in food availability likely exerted strong selective pressures on the human genome. More than 40 years ago, Neel proposed the thrifty genotype hypothesis to explain why variants that increase risk to type 2 diabetes and obesity may be at high frequency in human populations (Neel 1962). He reasoned that because ancestral populations underwent seasonal cycles of feast and famine, they would have benefited from having extremely efficient fat and carbohydrate storage. When food production and storage resulted in more reliable food availability, this ancestral thriftiness became detrimental, and in contemporary populations, it contributes to the increased prevalence of diabetes and obesity. Consistent with the notion that thriftiness is an ancestral state, many alleles that increase risk to type 2 diabetes and other metabolic disorders are ancestral (i.e., shared with chimpanzee), whereas the alternative alleles at those polymorphic sites protect against the disease (Chimpanzee Seq. Anal. Consort. 2005, Di Rienzo & Hudson 2005).
Although researchers have well documented that the transition to a Western lifestyle and diet results in major prevalence increases of type 2 diabetes (O'dea 1991; Szathmary 1986, 1990; Weiss et al. 1984), it is unclear whether this is also true for the transition from hunting-gathering to agriculture. In a revision of the original thrifty genotype hypothesis, Neel proposed that the changes that likely accompanied the spread of agriculture, including the reduction in dietary diversity to a diet composed mainly of carbohydrates, represented an important step in the shift to overall environmental conditions that favor the development of type 2 diabetes in individuals carrying the thrifty genotype (Neel 1999). However, whether food availability and reliability were indeed higher in agriculturalists compared with foragers is arguable (Cohen & Armelagos 1984, Larsen 2003). For example, Benyshek & Watson (2006) compared the quantity of available food as well as the frequency and intensity of food shortages between contemporary agricultural and hunter-gatherer populations and found no significant differences. However, some interesting trends were observed: Occasional and mild-to-moderate food shortages were more common among agricultural populations, whereas frequent and severe shortages were more common among hunter-gatherers.
A corollary of the thrifty genotype hypothesis is that populations with reliable, steady access to food resources may have evolved adaptations that slowed the insulin response and decreased the storage of energy as fat, thus resulting in a decrease in type 2 diabetes prevalence over time. In this context, Diamond (2003) proposed that the low prevalence of type 2 diabetes in Europeans reflects a longer history of stable food supply in these populations. Under this scenario, alleles that protect against type 2 diabetes are expected to have increased in frequency and to carry a signature of positive selection (e.g., be associated with high EHH). It is important to note that changes of allele frequencies, even if driven by strong positive selection, take at least hundreds of generations. Therefore, the above scenario applies only if the change in diet and lifestyle and the resulting shift in selective pressures occurred long before Westernization, possibly with the transition to agriculture as suggested by Neel (1999).
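The time scale mentioned above can be illustrated with a minimal sketch under a deliberately simple model. The model is an assumption made here for illustration, not one used in the studies cited: selection is deterministic (no drift) and haploid, and the favored allele has constant relative fitness 1 + s, so the odds p/(1 − p) are multiplied by 1 + s each generation. The function name and parameter values are hypothetical.

```python
import math


def generations_to_reach(p_start, p_end, s):
    """Generations for a favored allele to rise from p_start to p_end.

    Assumes a deterministic haploid model (no drift, constant s) in which
    the odds p / (1 - p) grow by a factor of (1 + s) per generation.
    """
    if not (0 < p_start < p_end < 1) or s <= 0:
        raise ValueError("need 0 < p_start < p_end < 1 and s > 0")
    odds_ratio = (p_end / (1 - p_end)) / (p_start / (1 - p_start))
    return math.log(odds_ratio) / math.log(1 + s)


# Hypothetical values: a 1% selective advantage, rising from 1% to 90%.
t = generations_to_reach(0.01, 0.90, s=0.01)
print(round(t))
```

With these illustrative values the answer is on the order of several hundred generations; drift, dominance in diploids, and changing selection intensity would all alter the details.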
Until now, a comprehensive test of the thrifty genotype hypothesis has not been possible because only a few susceptibility variants are known; however, studies of individual genes have suggested some signals of selection. Fullerton et al. (2002) found that susceptibility variants in the calpain 10 (CAPN10) gene had large differences in allele frequencies between African and non-African populations and suggested that natural selection had favored different alleles across populations. Another study of the same gene found both a significant reduction in variability for the haplotypes carrying the putative protective allele, consistent with positive selection acting on this subset of haplotypes, and a region of unusually high polymorphism and decay of linkage disequilibrium (LD), consistent with balancing selection in the region (Vander Molen et al. 2005). More recently, Grant et al. (2006) showed that variation in the TCF7L2 gene affects risk of type 2 diabetes. The haplotype carrying the protective allele at TCF7L2 is associated with a signature of positive selection on the basis of tests of population differentiation and haplotype homozygosity (Helgason et al. 2007).
Response to pathogens was likely an important selective pressure throughout human evolutionary history, but selection likely intensified over the past 10,000–20,000 years as a result of a warming climate, more recent localized changes in environments, and the shifts to agriculture and animal husbandry.
Whereas selection for resistance to several pathogens has been studied, response to malaria pathogens has received an exceptional amount of attention owing to the large number of people affected and the high resultant mortality. The most deadly species by far, Plasmodium falciparum, is likely to have gained prominence rather recently; its prevalence is thought to have increased as a consequence of the transition to agriculture. A less deadly but widespread species, Plasmodium vivax, is associated with high levels of morbidity; it is unclear whether the spread of P. vivax occurred after the spread of agriculture, as for P. falciparum. Several common polymorphisms confer resistance to malaria infection, including the hemoglobin variants HbS, HbC, and HbE, as well as β-thalassemia, α-thalassemia, G6PD deficiency, and the null allele of the Duffy blood group (Kwiatkowski 2005). Most of these variants carry a genetic signature of selection.
Deficiency of the enzyme G6PD is correlated with reduced risk of malaria infection, and signatures of positive selection have been observed for alleles that confer this deficiency (Sabeti et al. 2002, Saunders et al. 2005, Tishkoff et al. 2001, Verrelli et al. 2002). Tishkoff et al. (2001) and Saunders et al. (2005) found reduced variability in the subset of haplotypes carrying the deficiency alleles, evidence for EHH associated with the haplotypes carrying these alleles, and long branches separating the deficiency alleles from other alleles. The ages of the G6PD deficiency alleles were estimated using different methods and yielded values ranging from 2.5 to 45 kya (Coop & Griffiths 2004, Saunders et al. 2005, Tishkoff et al. 2001, Verrelli et al. 2002); the younger estimates are consistent with an increase in malaria infection with the transition to agriculture, but the older ones are not.
Carriers of the Duffy null (FY*0) mutation do not express a protein on the red blood cells to which P. vivax binds to invade the cells. As a result, FY*0 homozygotes are resistant to this variety of malaria (Livingstone 1984); Kasehagen et al. (2007) recently showed that FY*0 heterozygosity may also afford some degree of protection against vivax malaria. This allele has a striking geographic distribution in that it is fixed or nearly fixed in most sub-Saharan African populations and is virtually absent elsewhere. This degree of differentiation in the frequency of the FY*0 allele is highly unusual relative to genome-wide patterns (Consortium 2005) and is consistent with the notion that this allele underwent a complete selective sweep in sub-Saharan African populations. Consistent with this hypothesis, Hamblin and colleagues showed that polymorphism levels in sub-Saharan Africans are low compared with divergence and that there is an excess of high-frequency derived alleles near the FY gene (Hamblin & Di Rienzo 2000, Hamblin et al. 2002). Interestingly, the FY*0 allele is found on two major haplotypes, which suggests that this is a case of selection on standing neutral variation.
Several hemoglobin variants, the most common of which include HbS, HbC, and HbE, represent well-documented cases of balanced polymorphism whereby heterozygotes are resistant to P. falciparum, whereas homozygotes have reduced fitness. Accordingly, signatures of positive selection have been observed at the HBB locus on the basis of extensive haplotype homozygosity and strong population differentiation for HbS (Hanchard et al. 2006, Sabeti et al. 2007) and strong haplotype structure for HbC (Wood et al. 2005) and HbE (Ohashi et al. 2004). Consistent with the idea that the onset of malaria selection coincided with the spread of agriculture, the ages of the HbS, HbE, and HbC variants have been estimated to be 1350–2100 years (Currat et al. 2002), 1240–4440 years (Ohashi et al. 2004), and 1875–3750 years (Wood et al. 2005), respectively.
Genes in the HLA region have received a great deal of attention in studies of mechanisms and selection for pathogen defense as a result of their roles in response to a broad range of pathogens. Balancing selection appears to have been pervasive in this region owing to the need for different HLA peptides to confer resistance to disease, but there is also evidence for strong directional selection on specific alleles. Although the HLA region contains at least 120 genes, many of which are involved in immune function, most studies have focused on nine, which are considered to be the classical HLA genes. Several lines of evidence suggest a history of natural selection at the classical HLA genes. As a group, these genes show exceptionally high levels of polymorphism especially in functional (peptide binding) regions (Meyer & Thomson 2001, Satta et al. 1994), and many genes have an excess of nonsynonymous compared with synonymous mutations (Hughes & Nei 1988). In addition to the high levels of polymorphism and nonsynonymous variation, the frequency spectrum tends to be skewed toward intermediate frequency variants relative to neutral expectation. Large sequence differences are observed among the HLA haplotypes, suggesting that the alleles are old, consistent with a model of long-term balancing selection (Meyer & Thomson 2001).
In addition to studies directed at single genes and small groups of related genes, multigene and genome-wide studies of selection have detected significant signals in immune response genes, either individually [CD226, IGJ (Akey et al. 2002, Carlson et al. 2005, Williamson et al. 2007); ABO, IL1A, IL1RN, KEL (Akey et al. 2002), ABO, KEL (Carlson et al. 2005)] or as a group. Tang et al. (2007) found signals of positive selection for interleukin-1 receptor antagonist activity and cytokine activity. Using a test based on haplotype structure, Voight et al. (2006) found an enrichment for MHC-I-mediated immunity genes, and Wang et al. (2006) found signals in several genes (CSF2, CCNT2, DEFB118, STAB1, SP, Zap70) and an enrichment for pathogen response genes. Sabeti et al. (2007) found signals of positive selection based on haplotype structure and population differentiation at LARGE and DMD.
The ability to sense aspects of the environment is necessary for survival and reproduction, and the array of cues encountered in different environments may be especially diverse for some senses, such as olfaction and taste. Olfaction may be important to detect and identify food and mates, whereas taste is likely crucial to identify suitable foods. Evidence that genes responsible for olfaction and taste evolved under positive selection is beginning to emerge.
Gilad et al. (2003) found evidence of positive selection on olfactory receptor genes in the human lineage compared with chimpanzees on the basis of the ratio of nonsynonymous to synonymous substitutions, low levels of polymorphism, and an excess of rare variants. Additional evidence for adaptation in olfactory receptor genes was detected in genome-wide scans. One study identified a signal for a complete selective sweep in a region within a cluster of olfactory receptor genes (Williamson et al. 2007); other studies found evidence for selection based on the frequency spectrum in several olfactory receptor genes (Carlson et al. 2005) and an enrichment of signals based on haplotype structure in genes involved in olfaction (Voight et al. 2006).
Inter-individual variation in the ability to taste phenylthiocarbamide (PTC) has been known for more than 70 years, but the genetic basis for this trait was only recently elucidated (Drayna et al. 2003 and references therein). Identification of the region that influences the PTC phenotype allowed testing for evidence of positive selection at the genetic level (Wooding et al. 2004). Two high-frequency haplotypes, defined by three nonsynonymous variants, were observed, consistent with the action of balancing selection at this locus. An additional 21 bitter taste receptor genes have been analyzed for evidence of selection: the average ratio of nonsynonymous to synonymous substitutions was high compared with that observed in 151 other genes; in addition, polymorphism levels tended to be high in the bitter taste receptor genes (Kim et al. 2005).
Genes that affect fertility and reproduction are expected to be subject to especially strong selection owing to their direct effects on fitness-related traits. This class of proteins evolves rapidly both within and among species (Swanson & Vacquier 2002). Significant variation in fertility exists among human populations with different subsistence patterns (Bentley et al. 1993). To the extent that variation in fertility within populations is heritable (Pettay et al. 2005), it is possible that natural selection has acted on this trait to influence variation among populations.
Stefansson et al. (2005) showed that female carriers of a large inversion polymorphism, which shows evidence of selection based on population differentiation and long-range haplotype structure, have higher fertility than do noncarriers. Follicle stimulating hormone is important for fertility in both males (for Sertoli cell proliferation and in maintaining sperm quality in the testes) and females (for stimulation of ovarian follicles). Grigorova et al. (2007) examined sequence patterns for the follicle stimulating hormone receptor-binding beta-subunit (FSHB) in several human populations and found low overall diversity combined with an excess of intermediate frequency variants, consistent with balancing selection at this locus. The authors also found that one of the two high-frequency haplotypes was associated with increased fertility and hypothesized that balancing selection at this locus may act on birth intervals.
Although relatively few single gene studies have been conducted for genes involved in fertility and reproduction, several signatures of selection have been found for genes in this class from genome-wide scans for selection. Williamson et al. (2007) found evidence for a complete sweep in European Americans and Han Chinese for SPAG6, a gene involved in sperm motility, and others found evidence for an excess of signals of partial selective sweeps in genes related to fertility, gametogenesis, spermatogenesis, sperm motility, and fertilization (Voight et al. 2006, Wang et al. 2006).
Early studies examined evidence for positive selection from the distribution of phenotypic traits and their cooccurrence with variables that were expected to represent underlying selective pressures. With the advent of DNA sequencing and genotyping technologies and the development of methods to detect evidence of selection from sequence variation data, testing for evidence of genetic adaptations in single genes became feasible. More recently, the availability of dense, genome-wide genotype data for multiple populations and the development of methods for detecting selection using SNP data have elicited many genome-wide scans for evidence of positive selection in human populations. Now, the HapMap project is expanding to include genotype data for additional populations as well as genome-wide resequencing data for some HapMap individuals. In addition, the 1000 Genomes Project was recently launched, which aims to resequence completely the genomes of 1000 individuals from diverse worldwide populations. The continued collection of data from large-scale projects will be useful for conducting more complete genome scans to detect evidence for positive selection. At the same time, more focused studies to test specific hypotheses or to follow up on results from genome-wide scans will continue to have an important place in reconstructing the overall history of selective pressures among human populations.
We thank Cynthia Beall and William Leonard for helpful comments on the manuscript. We gratefully acknowledge funding from National Institutes of Health grants DK56670 and GM79558. A.M.H. was partially supported by a predoctoral fellowship from the American Heart Association.
Several methods have been developed to evaluate statistically whether empirical patterns of variation are consistent with the expectations of the null hypothesis of evolutionary neutrality. These tests, which are briefly described here, can be grouped on the basis of the aspect of variation that they use.
The most widely used test to detect departures from neutrality based on polymorphism levels is the so-called HKA (Hudson-Kreitman-Aguade) test (Hudson et al. 1987), which is based on the notion that polymorphism levels within species and sequence divergence between species are proportional to the same underlying mutation rate. Therefore, by comparing polymorphism and divergence at two or more loci, one can test for departures from neutrality due to a deficit or an excess of polymorphism.
The standard neutral model provides quantitative expectations for the spectrum of allele frequencies, whereby the expected fraction of alleles that occur i times in a sample is proportional to 1/i. The most widely used test statistic summarizing information about the frequency spectrum is Tajima's D (TD) (Tajima 1989), which is based on the standardized difference between two estimators of the same parameter, 4Neμ (where Ne is the effective population size and μ is the mutation rate per site per generation). Under the standard neutral model, TD is expected to be near zero; an excess of intermediate frequency variants will generate a positive TD value, whereas an excess of rare variants will give rise to a negative TD value (because TD does not use information about the ancestral state of the polymorphisms, it does not distinguish between rare and high-frequency new mutations). A different test, referred to as H (Fay & Wu 2000), uses information from an outgroup sequence to infer the ancestral allele at each polymorphic site and tests specifically for an excess of high-frequency nonancestral alleles.
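As an illustration of how TD summarizes the frequency spectrum, the following is a minimal sketch of the standard calculation (Tajima 1989). It assumes biallelic sites coded 0/1, no missing data, and at least four chromosomes; the function name and the input layout (one list of alleles per chromosome) are choices made here for illustration.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D from equal-length 0/1 allele lists, one per chromosome."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 chromosomes")
    # Count of allele '1' at each site; keep only segregating sites.
    counts = [sum(site) for site in zip(*haplotypes)]
    seg = [c for c in counts if 0 < c < n]
    s = len(seg)
    if s == 0:
        return float("nan")  # D is undefined without segregating sites

    # Estimator 1: mean number of pairwise differences (pi).
    pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in seg)
    # Estimator 2: number of segregating sites scaled by a1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    # Constants for the variance of (pi - S / a1).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Ten chromosomes, five sites: all singletons versus all at frequency 0.5.
rare = [[1 if i == j else 0 for j in range(5)] for i in range(10)]
common = [[1] * 5 if i < 5 else [0] * 5 for i in range(10)]
print(tajimas_d(rare), tajimas_d(common))
```

In this toy input, the sample made only of singletons yields a negative value and the sample made only of intermediate-frequency variants yields a positive value, matching the qualitative behavior described above. Because the calculation uses only the counts c and n − c symmetrically, recoding which allele is "1" leaves the result unchanged.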
Tests based on haplotype structure rely on the property of partial sweeps to generate an extended region of haplotype homozygosity surrounding the selected site (Figure 2) (Hudson et al. 1994). In general, the extent of haplotype homozygosity is a function of the age of the mutation, with younger mutations exhibiting homozygosity over larger regions than older ones. Under neutrality, young mutations are expected to be rare, whereas equally young advantageous mutations tend to occur at higher frequency. Therefore, tests based on the haplotype structure aim to capture the discrepancy between the extent of haplotype homozygosity in a region and the frequency of that haplotype in the population. The extended haplotype homozygosity (EHH) in a region can be calculated for a core haplotype or a core SNP allele and then compared with that for other core haplotypes or the alternative allele at the core SNP. To perform a statistical test of neutrality, the relative EHH at a candidate region is compared with that for other core haplotypes or core SNPs within the same frequency range, obtained either by simulations or from empirical patterns (Sabeti et al. 2002, Voight et al. 2006).
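The EHH calculation for a core SNP allele can be sketched as follows. This is a minimal illustration, assuming phased haplotypes with no missing data and taking EHH at a given marker to be the probability that two randomly chosen chromosomes carrying the core allele are identical across the whole interval from the core to that marker. The function name, argument names, and toy data are hypothetical.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, marker):
    """EHH for chromosomes carrying `allele` at site index `core`,
    evaluated over the interval between `core` and `marker` (inclusive)."""
    carriers = [h for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two chromosomes with the core allele")
    lo, hi = sorted((core, marker))
    # Group carriers by their extended haplotype over the interval.
    classes = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    # Fraction of chromosome pairs that are identical over the interval.
    return sum(comb(c, 2) for c in classes.values()) / comb(n, 2)


# Toy phased data: each string is one chromosome, site 0 is the core SNP.
haps = ["1111", "1111", "1110", "1101", "0000", "0010"]
for m in range(1, 4):
    print(m, ehh(haps, core=0, allele="1", marker=m), ehh(haps, core=0, allele="0", marker=m))
```

Comparing the decay of EHH with distance for the two alleles at the core SNP gives the kind of relative measure described above; assessing significance still requires a reference distribution from simulations or from other loci in the same frequency range.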
Tests to detect the impact of selection on the extent of allele frequency differentiation have historically relied on the summary statistic FST, which is the proportion of the total genetic variance that occurs among populations. The spatial distribution of allele frequencies can also be tested for a correlation with an environmental variable, e.g., latitude or temperature, which is likely to reflect a selective pressure. For both FST and the correlation between allele frequency and an environmental variable, a statistical test of neutrality can be constructed by comparing the value of the test statistic at a candidate locus with the distribution of values expected for a null neutral model of subdivided populations. Because the models of human population structure investigated so far are too simplistic to provide an appropriate null model, an empirical approach is usually taken in which the value of the test statistic at the candidate locus is compared with the distribution of the statistic observed for a large set of independent loci.
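The empirical approach can be sketched in a few lines. The example below is a simplified illustration, not the estimator used in any of the studies cited: it assumes a biallelic locus, known population allele frequencies, and equal weighting of populations, and it ignores the sampling-error corrections that published FST estimators apply. Function names and the numerical values are hypothetical.

```python
def fst(freqs):
    """FST at one biallelic locus: among-population variance in allele
    frequency divided by the total variance p_bar * (1 - p_bar)."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("locus is monomorphic across populations")
    var_among = sum((p - p_bar) ** 2 for p in freqs) / k
    return var_among / (p_bar * (1.0 - p_bar))


def empirical_tail(candidate, background):
    """Fraction of background loci whose statistic is at least as large
    as the value observed at the candidate locus."""
    if not background:
        raise ValueError("background set is empty")
    return sum(1 for x in background if x >= candidate) / len(background)


# Hypothetical allele frequencies in two populations at a candidate locus,
# and hypothetical FST values from a background set of independent loci.
candidate = fst([0.1, 0.9])
background = [0.05, 0.10, 0.20, 0.70]
print(candidate, empirical_tail(candidate, background))
```

A small tail fraction indicates that the candidate locus is unusually differentiated relative to the rest of the genome; because the background set may itself include selected loci, this is better read as an outlier ranking than as a formal p-value.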
DISCLOSURE STATEMENT The authors are not aware of any biases that might be perceived as affecting the objectivity of this review.