The promise of the genomic age for elucidating human evolution has not yet been realized, in part due to the large size of regions identified as targets of selection, each of which can contain thousands of candidate causal variants, and in part due to the incompleteness of genotype data. Drawing on full genome sequence data from 1000G and on the CMS method, this paper presents the first comprehensive catalog of potential human adaptive mutations, instead of genomic regions. Each fine-mapped region contains 20-100 candidate variants, a small enough number to be tractable for functional characterization. As causal variants under selection typically have 10-50 perfect proxy variants, we are already near the limit of the power of population genetic tests to pinpoint the causal variant (Grossman et al., 2010). We computationally annotated all candidates and provide a proof-of-principle example of functional validation, creating a rich resource for future studies of human adaptations.
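To make the notion of a "perfect proxy" concrete, the following is a minimal sketch, not the procedure used in this study. It assumes phased biallelic haplotypes coded as 0/1 and treats a variant as a perfect proxy when its squared correlation (r²) with the focal variant equals 1. The function names, the tolerance, and the toy haplotypes are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two biallelic sites.

    Each argument holds one allele (0 or 1) per phased haplotype.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(a) / n                              # frequency of allele 1 at site a
    p_b = sum(b) / n                              # frequency of allele 1 at site b
    p_ab = sum(x & y for x, y in zip(a, b)) / n   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site has undefined r^2; treated here as no LD
    d = p_ab - p_a * p_b                          # linkage disequilibrium coefficient D
    return d * d / denom


def perfect_proxies(sites: List[Sequence[int]], focal: int, tol: float = 1e-9) -> List[int]:
    """Indices of sites whose r^2 with the focal site is 1 (within tol)."""
    return [
        i
        for i, site in enumerate(sites)
        if i != focal and r_squared(sites[focal], site) >= 1 - tol
    ]


# Toy data (hypothetical): four sites observed on six haplotypes.
TOY_SITES = [
    [1, 1, 0, 0, 1, 0],  # focal site
    [1, 1, 0, 0, 1, 0],  # identical pattern
    [0, 0, 1, 1, 0, 1],  # complementary pattern (alleles swapped)
    [1, 0, 0, 0, 1, 0],  # partially correlated
]

if __name__ == "__main__":
    print(perfect_proxies(TOY_SITES, focal=0))
```

Because perfect proxies carry identical haplotype information, frequency- and haplotype-based statistics cannot separate them from the causal variant, which is why functional evidence is needed beyond this point.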
Many of the variants thus identified are associated with pathways that have emerged as targets of the strongest selective pressures on humans in recent history; the relevant traits include skin color, metabolism, and infectious disease resistance. In addition to the phenotypic associations and gene enrichment in these pathways discussed above, several of the eQTL SNPs regulate the expression of genes in these pathways, such as IVD, ACAS2, and CTNS (involved in metabolism) and BLK (involved in immune function). Many mutations fall in and around genes encoding the receptors, or enzymes that modify the receptors, for some of the most devastating pathogens in human history, including RHOA ) (Edelmann et al., 2010) and DAG1 and others), LARGE (Lassa virus) (Kunz et al., 2005 malaria) (Sabeti et al., 2006), PVRL4 (Measles virus), VDR (HIV) (Farzan et al., 1999), and CXCR4 (HIV). New pathways under selection are also coming to light: for example, in this issue of Cell, Kamberov et al. elucidate selection on a nonsynonymous mutation in EDAR, which leads to a number of pleiotropic traits including altered hair and sweat gland formation.
Characterization of Candidate Regions and Variants
Our data support the mounting evidence that a great deal of recent human adaptation and phenotypic variation is based in regulatory regions (Hindorff et al., 2009; Lindblad-Toh et al., 2011; Vernot et al., 2012; Wang et al., 1995). Fewer than 10% of our fine-mapped regions contained high-scoring nonsynonymous SNPs; candidate selected SNPs are enriched for eQTLs and include many mutations that disrupt transcription factor motifs in enhancers and promoters. Motifs for transcription factors involved in a number of different processes are disrupted, including STATs, Jun, GATAs, C/EBP, PPARγ, ETS, and IRFs. In several cases, the motif for a cell-specific transcription factor is disrupted in a cell-specific enhancer, for example an LXR:RXR motif in a hepatocyte-specific enhancer or a PU.1 motif in a monocyte-specific DNase HS site. The magnitude of the change in binding affinity varied from a minimum change in LOD score of 0.3 to a maximum of 12. More complete characterization of the regulatory variants using high-throughput cellular assays and eQTL studies in additional individuals may be illuminating.
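As an illustration of how a change in motif score between two alleles can be quantified, the sketch below scores a reference and an alternate sequence against a position weight matrix and reports the difference in log-odds. It is a simplified example under stated assumptions, not the annotation pipeline used here: the matrix is a made-up four-position motif, the background is uniform, and the logarithm is base 10 (the base and background underlying the LOD values reported above are not specified in this section).

```python
import math
from typing import Dict, List

# One dict of base probabilities per motif position.
Pwm = List[Dict[str, float]]


def log_odds(seq: str, pwm: Pwm, background: float = 0.25) -> float:
    """Log-odds (base 10, an assumption) of seq under the motif vs. a uniform background."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match motif length")
    return sum(math.log10(column[base] / background) for base, column in zip(seq, pwm))


def delta_log_odds(ref_seq: str, alt_seq: str, pwm: Pwm) -> float:
    """Change in motif log-odds caused by the variant (alternate minus reference)."""
    return log_odds(alt_seq, pwm) - log_odds(ref_seq, pwm)


# Hypothetical motif with consensus AGCT; all probabilities are non-zero,
# so no pseudocounts are needed in this toy example.
TOY_PWM: Pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

if __name__ == "__main__":
    # A C>T change at the third motif position weakens the match.
    print(round(delta_log_odds("AGCT", "AGTT", TOY_PWM), 3))
```

A negative value indicates that the alternate allele weakens the predicted match to the motif, and a positive value indicates that it strengthens it. In practice the sequence window would be taken from the reference genome around each candidate SNP and scanned on both strands.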
Given the bounds on population genetic approaches to fine-map signals of selection and the limitations of current functional annotations, the true adaptive mutation must ultimately be distinguished using functional approaches. This is a challenge, especially for regions identified through genome-wide scans rather than from a prior hypothesis about an adaptive pressure (e.g., malaria and lactose tolerance). It is impracticable to assay each variant in every possible cell type and process; furthermore, even functional variants need not be causal. While there is no way to prove what evolution did, even if we could go back in time to observe it, the standard in the evolutionary genomics field for establishing a mutation as having caused selection is strong statistical evidence of selection plus a phenotypic effect likely to enhance survival.
We chose one of the candidate variants, a nonsynonymous mutation in TLR5, to characterize experimentally. The derived allele in TLR5 with evidence of selection leads to diminished NF-κB signaling during bacterial infections. Intriguingly, another allele that decreases the function of TLR5 (a nonsense variant, TLR5-392STOP) has previously been reported to reach a frequency of 10% in European populations (Barreiro et al., 2009). The existence of common variants that decrease TLR5 signaling suggests that modulating TLR5 signaling may be advantageous in certain environments. Indeed, decreasing NF-κB signaling can have a protective effect in several bacterial infections, most significantly in bacterial sepsis (Koedel et al., 2000; Okugawa et al., 2006). Furthermore, the pathogen Salmonella typhimurium requires activated lamina propria cells (LPCs) in the intestinal epithelium to invade a host, and is consequently unable to infect mice with deficient TLR5 signaling (Uematsu et al., 2006). In a human population constantly exposed to high levels of bacterial antigens, a TLR5 variant with reduced NF-κB activation may well confer a fitness advantage.
An accompanying article from Kamberov et al. models an adaptive human variant of EDAR in mice, and characterizes its phenotype and evolutionary origins in humans. EDARV370A, one of the 35 nonsynonymous variants detected by CMS in 1000G data, likely emerged in central China ~30,000 years ago and leads to increased sweat gland number and scalp hair thickness in mice and humans. TLR5L616F and EDARV370A demonstrate the power of our framework to move from genomic scans to the characterization of a novel adaptive mutation and elucidation of distinct mechanisms of evolution.
This paper, in conjunction with the accompanying paper on EDAR, represents a decisive shift for the field of evolutionary genomics, moving from hypothesis-driven to hypothesis-generating science. We further provide a comprehensive list of candidate adaptive mutations driving recent human selective sweeps that lays the foundation for myriad future functional studies. The data from the 1000G Project, along with functional annotations, are available on a genome-wide browser, together with software to compute CMS on any dataset (http://www.broadinstitute.org/mpg/cms). In the years ahead, unprecedented data availability and collaborations across multiple disciplines, from molecular, developmental, and computational biology to history and anthropology, promise to bring to light key recent events that have shaped our species.