One of the central problems in evolutionary biology is to understand the genetic and ecological mechanisms that drive adaptation. With the advent of large-scale SNP and DNA sequence data it is now possible to study selection and adaptation at a genome-wide scale. In recent years there has been considerable progress in identifying potential signals of selection in a wide variety of species
[1]–
[4].
In this study, we focus on recent adaptation in human populations. In particular, we examine the role of geography and population history in the spread of selectively favored alleles. The methods that we use provide information about adaptive events that have occurred since the divergence of African and non-African populations, i.e., over the last 50–100 thousand years (KY)
[5]–
[8]. During this time period the environment and ecology of humans have changed profoundly. Humans have spread out of Africa to colonize almost all of the world's land mass, and in the process have experienced a vast range of new climates, diets and ecosystems
[6],
[8]. Humans have also encountered new pathogens as they spread around the globe, came into close proximity with domesticated animals, and reached higher population densities.
These changes in human ecology suggest that there has been ample scope for the action of natural selection in recent human evolution. Moreover, most species, including humans, probably face various additional selection pressures on a persistent basis, for example from sexual competition, viability selection and resistance to evolving pathogens. Hence, it seems reasonable that our genomes would show evidence for recent selection, and there is great interest in understanding what types of environmental pressures and biological processes show the strongest signals of adaptation
[1],
[10],
[11].
Some of the strongest evidence for recent adaptation comes from candidate genes where there is both a strong biological hypothesis for selection as well as evidence for selection from unusual haplotype patterns, homozygosity, or extreme values of FST [1]. Examples include genes involved in malaria resistance such as G6PD and the Duffy antigen gene [12]–[14]; genes involved in lighter skin pigmentation in non-Africans (e.g., SLC24A5, SLC45A2 and KITLG) [15]–[21]; and a pair of genes involved in dietary adaptations (lactase and salivary amylase) [22]–[25].
Recent studies have also cast a wider net to identify signals of selection using genome-wide SNP data
[16],
[17],
[26]–
[31], or large-scale resequencing data
[32],
[33]. Most of these studies report many candidate signals of positive selection. However, for most of the signals detected in this way, we do not yet know how the variation affects phenotypes or the nature of the selective pressures; indeed even the target genes are often uncertain. It is difficult to assess what fraction of the candidate signals are genuinely due to selection, rather than being extreme outliers in the neutral distribution
[34]; however, simulations generally show that extreme values of various test statistics are more abundant in the real data than would be expected under neutral models
[16],
[17],
[28],
[35]. Some studies have also reported enrichment of selection signals in and around genes, as might be expected if selection is concentrated near genes
[16],
[31],
[36], and a recent study has provided robust genome-wide evidence of selection shaping patterns of diversity
[37].
While most recent papers on selection in humans have focused on identifying genes and phenotypes involved in selection, our paper aims to learn more generally about the nature and prevalence of positive selection in humans. We also highlight some of the conceptual and methodological challenges in studies of selection. A separate companion paper focuses more closely on individual selection signals of particular interest
[21], and a genome browser of our results is available (http://hgdp.uchicago.edu/).
Data and Populations Studied
We analyzed genome-wide SNP data from two primary sources: the Human Genome Diversity Panel CEPH (HGDP) and the Phase II HapMap. Together, these two data sets provide the best available combination of dense geographic sampling (HGDP) and dense SNP data (Phase II HapMap) and hence provide complementary information for our analysis.
The HGDP data reported by Li et al.
[38] consist of 640,000 autosomal SNPs genotyped in 938 unrelated individuals. These individuals include samples from 53 different human populations. They represent much of the span of human genetic diversity
[39],
[40], albeit with notable sampling gaps in Africa and elsewhere
[41],
[42]. Using these samples, Rosenberg et al.
[40] identified five major genetic clusters corresponding to native populations from sub-Saharan Africa, west Eurasia, east Asia, Oceania and the Americas. There is also an overall relationship between genetic differentiation and geographic distance
[43],
[44], suggesting that human population history is likely a complex mixture of population splits and gene flow
[45].
The HapMap data consist of over 3 million SNPs genotyped in 210 unrelated individuals
[26],
[36]. These individuals include 60 Yoruba from Ibadan, Nigeria (YRI), 60 individuals of northwest European ancestry from Utah (CEU) and 90 individuals from east Asia (from Beijing and Tokyo) that we analyzed as a single “analysis panel” (here denoted ASN). For those analyses in which uniform SNP ascertainment is most important, we used a subset of the HapMap data consisting of 900,000 SNPs identified by Perlegen Sciences
[46]. These SNPs were detected using array-based resequencing in a multiethnic panel, and subsequently genotyped in the HapMap. This screen should have good power to detect high-FST SNPs, since both alleles of a high-FST SNP are likely to be present in a multiethnic sample (see
Methods for further details). Throughout this paper we consider only the autosomes since the smaller effective population size and the smaller sample sizes in the X chromosome data make it inappropriate to merge the X and autosomal data.
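The ascertainment argument above (that both alleles of a highly differentiated SNP are likely to appear in a multiethnic discovery panel) can be illustrated with a short calculation. This is a sketch under stated assumptions, not a description of the actual Perlegen discovery design: chromosomes are assumed to be sampled independently from each population, and the panel sizes and allele frequencies below are hypothetical.

```python
from typing import Sequence


def prob_both_alleles_sampled(freqs: Sequence[float], n_chroms: Sequence[int]) -> float:
    """Probability that a discovery panel contains both alleles of a biallelic SNP.

    freqs[i]    : frequency of one allele in population i
    n_chroms[i] : number of chromosomes sampled from population i
    Assumption: chromosomes are drawn independently within each population.
    """
    if len(freqs) != len(n_chroms):
        raise ValueError("freqs and n_chroms must have the same length")

    only_first = 1.0   # probability every sampled chromosome carries the first allele
    only_second = 1.0  # probability every sampled chromosome carries the second allele
    for p, n in zip(freqs, n_chroms):
        only_first *= p ** n
        only_second *= (1.0 - p) ** n

    # The SNP is detected as polymorphic unless the panel is fixed for one allele.
    return 1.0 - only_first - only_second


if __name__ == "__main__":
    # Hypothetical high-FST SNP: allele frequency 0.05 in one population, 0.95 in another.
    multiethnic = prob_both_alleles_sampled([0.05, 0.95], [10, 10])
    single_pop = prob_both_alleles_sampled([0.05], [20])
    print(multiethnic, single_pop)
```

With these hypothetical numbers, a panel of 20 chromosomes split across the two populations detects the SNP with probability very close to 1, whereas the same 20 chromosomes drawn from a single population detect it with probability of about 0.64.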
Overview of the Paper
As noted above, we now know of several genes in which recent selection appears to have been very strong, driving new alleles to high frequencies in particular populations or groups of populations
[48]–
[50]. Some genome-wide studies have estimated that strong selection, with selection coefficients above 1%, is widespread in the genome (e.g.,
[16],
[47]). Similarly, studies of other organisms have identified cases in which selection has created large allele frequency differences between populations, even in the presence of high rates of gene flow
[48],
[49],
[50]. Together, these studies suggest that selection in humans might be a strong force that allows for local adaptation via large allele frequency shifts at individual loci.
If this were the case, then we might expect to find SNPs whose frequency distributions in the HGDP differ dramatically from neutral patterns. For example, some SNPs might show extreme allele frequency differences between closely related populations due to divergent selective pressures
[51]. More broadly, we might expect to find alleles whose geographic distributions differ dramatically from the expectations of neutral population structure, if their frequencies are driven by factors such as diet or climate
[24],
[52]. However, neutral forces including migration and admixture would tend to work against selection, reducing frequency differences between geographically close populations
[53],
[54]. Hence it is unclear whether selection pressures in humans are strong enough, and sufficiently divergent over short geographic scales, to produce large frequency differences at individual loci.
In this paper, we begin to answer some of these questions by examining the distributions of potentially selected SNPs at a variety of geographic scales. Our approach combines the complementary strengths of the HGDP and HapMap data sets: we use the HGDP to study the geographic distributions of putatively selected alleles at fine scales, and the much denser HapMap data to study differences between continental populations. We aim to learn whether selection in humans is strong enough to generate highly divergent allele frequencies between closely related populations, and geographic distributions that diverge strongly from neutral patterns. At the largest geographic scales, we ask: How effective has selection been at driving allele frequency differentiation between continental groups?