|Home | About | Journals | Submit | Contact Us | Français|
For decades, it has been hypothesized that gene regulation has had central role in human evolution, yet much remains unknown about the genome-wide impact of regulatory mutations. Here we use whole-genome sequences and genome-wide chromatin immunoprecipitation and sequencing data to demonstrate that natural selection has profoundly influenced human transcription factor binding sites since the divergence of humans from chimpanzees 4–6 million years ago. Our analysis uses a new probabilistic method, called INSIGHT, for measuring the influence of selection on collections of short, interspersed noncoding elements. We find that, on average, transcription factor binding sites have experienced somewhat weaker selection than protein-coding genes. However, the binding sites of several transcription factors show clear evidence of adaptation. Several measures of selection are strongly correlated with predicted binding affinity. Overall, regulatory elements seem to contribute substantially to both adaptive substitutions and deleterious polymorphisms with key implications for human evolution and disease.
It has long been argued that mutations affecting mechanisms of gene regulation must have had a prominent role in the evolution of humans and other mammals1–3. For several theoretical reasons, transcription factor binding sites and other cis-regulatory sequences may be particularly important in evolutionary adaptation4–7. For example, mutations in such sequences might help to minimize the functional tradeoffs associated with evolutionary changes because these elements often primarily influence the expression of a single gene in a particular cell type or under a particular condition, whereas proteins tend to have broader effects. In addition, cis-regulatory mutations are often co-dominant (neither allele dominates), which may allow natural selection to act on them more efficiently than on protein-coding mutations. Accordingly, several examples of phenotypic changes driven by cis-regulatory mutations have been identified in recent years, including pigmentation and bristle patterns in Drosophila melanogaster, skeletal reduction in stickleback fish and lactose persistence in humans7,8. In addition, some genome-wide analyses have found bulk statistical evidence of natural selection in noncoding regions near genes, presumably due to cis-regulatory elements9–12.
Nevertheless, evidence in support of the overall prominence of cis-regulatory mutations in evolutionary adaptation remains largely anecdotal and indirect, and there is continuing controversy about the relative roles of regulatory and protein-coding sequences in evolution8. Large-scale genomic studies of the evolution of transcription factor binding sites have the potential to advance this debate, but a major limitation of such studies so far has been a lack of precisely annotated binding sites across the genome. The analysis of conserved noncoding sequences and/or promoter regions rather than experimentally identified transcription factor binding sites tends to dilute the signature of natural selection and makes it more difficult to connect DNA mutations with fitness-influencing phenotypes. In addition, because polymorphisms are sparse and provide weak information at individual binding sites, most studies have either pooled polymorphism counts across genomic regions9,13, which can produce significant statistical biases14, or they have relied on divergence patterns over longer evolutionary time scales10,12,15, which can be influenced by binding site turnover, alignment errors or misidentifications of orthology. As a result, many questions remain about the manner in which transcription factor binding sites evolve and the general functional consequences of noncoding mutations.
We sought to address these questions using two types of data that have recently become available: whole-genome sequences for dozens of human individuals16,17 and genome-wide chromatin immunoprecipitation and sequencing (ChIP-seq) data identifying binding sites for dozens of transcription factors in multiple human cell types18. We also made use of whole-genome sequences for several nonhuman primates, which enable patterns of human variation to be contrasted with patterns of molecular evolution since the divergence of humans and their closest living relatives 4–6 million years ago. To interpret these data, we made use of a new probabilistic method, called Inference of Natural Selection from Interspersed Genomically coHerent elemenTs (INSIGHT), that characterizes the effects of natural selection on collections of short transcription factor binding sites. Our analysis of these data using INSIGHT sheds new light on the evolution of transcription factor binding sites in the human lineage.
Full mathematical details of the INSIGHT model and inference procedure are presented separately19, but we summarize the approach here to aid in the interpretation of our results. Similar to McDonald-Kreitman–based methods for identifying departures from neutrality9,20–22, INSIGHT obtains information about natural selection by contrasting patterns of polymorphism and divergence in transcription factor binding sites with those in flanking neutral regions, thereby mitigating biases from demography, variation in mutation rate and differences in coalescence time. However, INSIGHT improves substantially on these methods by making use of a full generative probabilistic model, directly accommodating weak negative selection23, allowing information from many short (~10-bp) elements to be combined in a rigorous manner and integrating phylogenetic information from multiple outgroup species with genome-wide population genetic data.
The INSIGHT model (Supplementary Fig. 1) assumes that nucleotides within a transcription factor binding site evolve by a mixture of four selective modes: (i) neutral drift, (ii) weak negative selection, (iii) strong negative selection and (iv) positive selection. All flanking nucleotides are assumed to evolve neutrally. This coarse-grained, categorical approach to modeling the distribution of fitness effects (DFE) is inspired by a simpler method developed for protein-coding genes in Drosophila24 and by observations indicating that the data contain only limited information about a full, continuous DFE25,26. Three key assumptions allow these selective modes to be disentangled. First, strong selection (positive or negative) is assumed to eliminate polymorphism because it causes mutations to rapidly reach fixation or be lost24. Second, weak negative selection permits polymorphism, but does not allow derived alleles to reach high frequencies27. Third, positive selection tends to favor the fixation of derived alleles (such events are denoted ‘adaptive substitutions’), whereas negative selection (strong or weak) guarantees that derived alleles will eventually be lost. Consequently, information about the overall prevalence of selection derives primarily from rates of high-frequency derived alleles, information about positive selection comes from levels of divergence, and information about weak negative selection comes from relative rates of low- and high-frequency derived alleles, all in transcription factor binding sites relative to flanking neutral regions.
Maximum-likelihood estimates of the model parameters yield three quantities of particular interest: (i) the fraction of sites under selection (ρ) and the expected numbers of (ii) adaptive substitutions (E[A]) and (iii) weakly deleterious polymorphisms (E[W]) (Table 1 and Online Methods). The model allows for likelihood ratio tests for negative and/or positive selection and the straightforward estimation of confidence intervals for all parameters. Model parameters can also be used to obtain estimates of two ratios that have been of interest in previous studies: the fraction of fixed differences driven by positive selection (α)22,27 and the fraction of polymorphisms subject to weak negative selection (here denoted τ)28,29. However, because the denominators of α and τ are strongly influenced by both negative and positive selection, we focus on the more readily interpretable quantities E[A] and E[W] in our analysis.
To test our methods, we generated synthetic data sets consisting of 10,000 instances of a 10-bp transcription factor binding site with various mixtures of selective effects, flanked by 5,000 neutral bases on each side (Supplementary Note). We then estimated all model parameters (Table 1 and Online Methods) from each data set and compared our estimates with the ‘true’ values used in the simulation (Supplementary Note). We found that our model-based estimates substantially improved on estimators based on divergence alone, polymorphism alone or the McDonald-Kreitman framework22. In particular, divergence-based estimators tended to be distorted by combinations of positive and negative selection (which have opposing influences on substitution rates; Fig. 1, rows 2 and 3), whereas polymorphism- and McDonald-Kreitman–based estimators were biased in the presence of weak negative selection23,27 (Fig. 1, rows 1 and 3). Our model-based estimates, by contrast, could accommodate combinations of selective effects with minimal bias. In addition, our estimates were robust to the assumption of a realistically complex demographic history, unlike methods that estimate a full DFE from the site frequency spectrum25,30–32, which typically require corrections for demography when applied to real data. Our methods did slightly underestimate ρ in the presence of moderate positive selection because many positively selected mutations are lost to drift and leave no signature in the data. Nevertheless, the effect on estimates of adaptive substitutions (E[A] and α) was small.
Next, we applied our methods to real data describing transcription factor binding sites, human polymorphism and patterns of divergence in primate genomes. First, we developed a pipeline for identifying high-confidence binding sites using ChIP-seq data from the Encyclopedia of DNA Elements (ENCODE) Project18. Briefly, this pipeline involved de novo motif discovery, manual inspection of motifs and binding-site prediction at ChIP-seq peaks, followed by filtering (Supplementary Note). Data were combined across cell types. Our pipeline yielded a total of 1.4 million binding sites for 78 transcription factors, with 582–106,113 binding sites per transcription factor (median of 9,748; Supplementary Table 1). Each transcription factor binding site was associated with a collection of putatively neutral nucleotide positions in a 20-kb block containing the binding site. These neutral positions excluded known protein-coding and RNA genes, conserved non-coding elements, their immediate flanking sites and the predicted transcription factor binding sites, leaving an average of 7,315 neutral sites per block. Finally, we summarized patterns of polymorphism and divergence within transcription factor binding sites and flanking neutral sites using high-coverage human genome sequence data for 54 unrelated individuals of diverse ancestry from Complete Genomics17 and synteny-based alignments of the chimpanzee33, orangutan34 and rhesus macaque35 genomes (Supplementary Note).
We applied our model and inference procedure to the complete set of binding sites for each transcription factor and obtained transcription factor–specific estimates of all parameters and expected values (Fig. 2). For comparison, we also applied our methods to second codon positions (CDS2 sites, at which all mutations cause amino acid substitutions) in 15,864 protein-coding genes that were carefully filtered to avoid errors from alignment, orthology identification and gene annotation (Supplementary Note). We obtained an average estimate across transcription factors (weighted by the number of nucleotides considered) of ρ = 0.33 (Fig. 2a and Supplementary Table 2). In comparison, our estimate for CDS2 sites was ρ = 0.80 (Supplementary Table 3), in reasonable agreement with estimates of 0.70–0.76 from comparative genomic analyses15,36. Thus, we detect a strong signature of natural selection in transcription factor binding sites, with about a third of nucleotides estimated to be under selection, but the fraction is considerably lower than at CDS2 sites. Notably, our comparison excluded synonymous sites in protein-coding genes but included analogous positions in transcription factor binding sites at which transcription factors exhibit, at most, weak base preferences.
We observed considerable variation across transcription factors in the estimates of key parameters (Fig. 2). Not unexpectedly, many of the transcription factors showing the strongest evidence of natural selection in their binding sites, such as ZEB1, SP2, FOXP2, ATF3 and BAF170, have fairly short binding sites (6–9 bp) with strong base preferences and few degenerate positions. Many of these transcription factors have relatively few binding sites in our set (700–2,000). Basic helix-loop-helix (bHLH) proteins were significantly over- represented among the transcription factors associated with strong selection, with five of six bHLH proteins ranking in the top third by ρ (P = 0.014, one-sided Fisher's exact test). Transcription factors having roles in apoptosis and development were slightly underrepresented (Supplementary Tables 4–6). A substantial fraction of the variation in estimates of ρ was explained by the information content of the associated motifs (R2 = 0.27; Fig. 3a), a measure of the average binding affinity of the transcription factor for its binding sites. This finding suggests that the same forces that constrain the sequences of many binding sites across a genome also influence patterns of evolution at each individual transcription factor binding site (as has been observed over longer evolutionary time scales15,37). For transcription factors that have many binding sites, it was possible to estimate a separate ρ value for each position within the motif, and, in several of these cases, we observed a clear correlation between position-specific estimates of ρ and information content (Fig. 3b).
Notably, a substantial minority of transcription factors showed significant evidence of positive selection in their binding sites (Fig. 2). We estimated that, on average, an adaptive substitution occurred about once for every ~8,300 nucleotides in transcription factor binding sites (E[A] per kilobase = 0.12) (Fig. 2b), and about 1 in 20 recent nucleotide substitutions in binding sites had been driven by positive selection (α = 0.05). By contrast, CDS2 sites showed little evidence of adaptive evolution on average (E[A] per kilobase ≈0, α ≈0 for all sites pooled). Classes of genes previously described as being under positive selection38,39 showed evidence of adaptation by our methods, but seven transcription factors had estimated values of E[A] per kilobase that exceeded those for positively selected genes (Fig. 2b and Supplementary Fig. 2). We also found substantial evidence of weak negative selection in transcription factor binding sites, with an average estimate of E[W] per kilobase of 1.3, 30% greater than the estimate for CDS2 sites. Overall, our results support previous findings that negative selection is dominant in the evolution of human protein-coding sequences, with positive selection primarily influencing a relatively small subset of genes25,32,38, but they indicate that positive selection has had a somewhat more pronounced influence on binding-site evolution, at least for some transcription factors.
The two transcription factors whose binding sites showed the strongest evidence of positive selection, as measured by E[A] per kilobase, were the GATA-binding zinc-finger proteins GATA2 and GATA3, both key regulators of gene expression in hematopoietic cells. GATA2 and GATA3 are unusual among the transcription factors associated with strong selection in having fairly large numbers of binding sites (27,475 and 15,617, respectively), which together contribute an expected 312 adaptive substitutions, 19% of the total from all transcription factor binding sites in our study. By contrast, binding sites for the third member of this family, GATA1, showed much weaker evidence of selection. The 50,389 binding sites for MAFF, a basic leucine-zipper protein best known for enhancing expression of the oxytocin receptor gene (OTR) but also thought to have broad roles in the cellular stress response, contributed an additional 286 adaptive substitutions (18% of the total). Also of interest was the presence of the forkhead-box protein P2 (FOXP2) among transcription factors whose binding sites are under strong selection, given apparent positive selection in this gene's protein-coding sequence and its possible role in the development of human speech40. However, FOXP2 has relatively few binding sites in our set (743), and evidence for selection was not statistically significant.
To allow for variability across transcription factor binding sites, we estimated the local binding affinity of each binding site in our set (Supplementary Note) and then partitioned all binding sites by binding affinity and estimated ρ values separately for each group. We observed a strong correlation between binding affinity and ρ (R2 = 0.87; Fig. 3c), indicating that the strength of natural selection at individual binding sites is well predicted by local binding affinity. We also compared the signatures of selection at transcription factor binding sites that have experienced recent affinity-increasing and affinity-decreasing mutations and found that affinity-increasing mutations showed an enrichment for adaptive substitutions, whereas affinity-decreasing mutations showed an enrichment for weakly deleterious polymorphisms (Supplementary Fig. 3).
We examined several other possible covariates of natural selection on transcription factor binding sites and identified a few other trends of interest. The fraction of sites under selection, ρ, was positively correlated with the expression levels of associated genes, consistent with observations in protein-coding sequences39 (R2 = 0.36, P = 0.002; Supplementary Fig. 4a). We also observed a positive correlation between ρ and the number of cell types in which each transcription factor binding site was bound as determined from our ChIP-seq data (R2 = 0.22, P = 0.0004; Supplementary Fig. 4b) and a negative correlation between the prevalence of weak negative selection (E[W] per kilobase) and distance to the nearest coding exon (Supplementary Fig. 5). We found no significant correlations with the tissue specificity of gene expression or distance to the transcription start site. In tests for elevated or reduced values of ρ, E[A] per kilobase and E[W] per kilobase near subsets of transcription factor binding sites associated with putative target genes of particular Gene Ontology (GO)41 categories, several transcription factors showed elevated values of E[A] per kilobase (indicating adaptive evolution) near genes involved in neural processes, consistent with a recent analysis of the promoter regions of primate genomes10. Other transcription factors showed elevated values of ρ near genes involved in development or cellular differentiation (Supplementary Tables 7–9 and Supplementary Note). To enable other researchers to explore this data set further, we created a UCSC Genome Browser track that displays all analyzed transcription factor binding sites and summarizes all relevant parameter estimates (Supplementary Fig. 6).
We applied our model to the 15,864 filtered protein-coding genes described, treating each coding exon as a ‘locus’ and drawing neutral sites from flanking regions. We then compared estimates of the total expected number of adaptive substitutions (E[A]) for these coding sequences and our 1.4 million transcription factor binding sites. (We experimented with several approaches for fitting the model to coding sequences and here report results for the approach that was most sensitive to adaptation; Supplementary Note.) This analysis produced cumulative estimates of E[A] = 4,786 for coding sequences and E[A] = 1,635 for transcription factor binding sites (Fig. 4a and Supplementary Table 3) or about 1.5 and 0.5 adaptive substitutions per hundred generations (ASPHG), respectively, assuming an average genomic divergence time of 6.5 million years42 and an average generation time of 20 years43. Our estimate for coding sequences implies that the fraction of fixed differences driven by adaptation in coding regions is ~20%, in reasonable agreement with estimates of 10–35% from previous studies25,27,44. When we extrapolated to all annotated protein-coding genes, our coding sequence estimate rose to E[A] = 8,954 or 2.8 ASPHG (Supplementary Note). Although our filtered gene set may not be representative of genome-wide coding sequences in all respects, it provides an approximate benchmark against which to compare the estimate for our 1.4 million transcription factor binding sites. Despite the incompleteness of our transcription factor binding site annotations, their estimated cumulative contribution to adaptive substitutions is nearly one-fifth that estimated for all protein-coding sequences (1,635 versus 8,954 adaptive substitutions). We also compared total expected numbers of weakly deleterious polymorphisms, obtaining estimates of E[W] = 16,937 for coding sequences and E[W] = 17,024 for transcription factor binding sites (Fig. 4b). Extrapolating to a full set of genes, as above, yielded E[W] = 31,687 for coding sequences, suggesting that the contribution of weakly deleterious polymorphisms from our transcription factor binding sites is more than half that from all coding sequences.
Using the minor allele frequency at each site, we could further calculate the expected numbers of weakly deleterious mutations per haploid genome (E[D]) for coding sequences and transcription factor binding sites. Here a deficiency of rare alleles and a slight enrichment for more common low-frequency alleles in transcription factor binding sites relative to coding sequences (Fig. 4c) disproportionally increased the estimates for transcription factor binding sites, yielding E[D] = 386.1 and E[D] = 431.1 for coding sequences and transcription factor binding sites, respectively (Supplementary Fig. 7). The extrapolated estimate for a full set of genes was E[D] = 722.3, which is fairly similar to previous estimates of 500 (ref. 45) and 674 (ref. 46) (Supplementary Note). Thus, we estimated the contribution of deleterious mutations per haploid genome from our transcription factor binding sites to be well over half the total contribution from coding sequences. Notably, common low-frequency alleles accounted for a substantially larger fraction of deleterious mutations in transcription factor binding sites than in coding sequences (Fig. 4d), which may have implications for human disease.
Since the discovery by Jacob and Monod of the cis-regulatory control of transcription more than 50 years ago47, there has been a great deal of speculation about the role of cis-regulatory elements in evolution1–7. Complete genome sequences and genome-wide ChIP-seq data make it possible to begin to examine these issues empirically on a genomic scale. Using a novel statistical approach designed to exploit these new data, we have shown that natural selection has indeed exerted substantial influence on transcription factor binding sites in the human genome. The transcription factor binding sites we have analyzed have some limitations—for example, some may represent sites of nonfunctional protein binding or may be present only in the immortalized cell lines examined by ENCODE—but our use of experimentally defined binding sites leads to a much clearer signature of selection than has been observed in studies based on less precisely defined noncoding regions proximal to genes9–11. Notably, we obtain qualitatively similar results when applying our methods to alternative genome-wide sets of transcription factor binding sites (Supplementary Note).
We estimate that, overall, the transcription factor binding sites analyzed in this study have contributed nearly a fifth as many adaptive substitutions (E[A]) and more than half as many weakly deleterious polymorphisms (E[W]) and weakly deleterious mutations per haploid genome (E[D]) as coding sequences. However, our analysis considers only a relatively small subset of all transcription factor binding sites, corresponding to perhaps 5% of all transcription factors48, a limited set of cell types and conditions, and fairly conservative predictions of binding sites. In addition, we have not considered regulatory elements involved in splicing, post-transcriptional regulation and other forms of gene regulation. It is not possible, at present, to estimate with any accuracy the total number of bases that function in gene regulation. However, if we assume that noncoding functional elements outnumber coding bases by at least 2:1, as suggested by patterns of long-term evolutionary conservation49–51, that these elements predominately function in gene regulation52 and, further, that mutations in these elements have similar distributions of fitness effects as in our transcription factor binding sites, then we can obtain rough estimates of the genome-wide contributions of regulatory sequences. This extrapolation yields a genome-wide estimate of E[A] = 9,017 (2.8 ASPHG), roughly equal to the extrapolated estimate for coding sequences (Fig. 4a and Supplementary Note). The projections for weakly deleterious polymorphisms and mutations per haploid genome from transcription factor binding sites rise to values 3.0 and 3.3 times as large as the corresponding estimates for coding sequences. Although these calculations are crude, they nevertheless highlight the substantial genome-wide contribution from regulatory sequences at both the positive and negative ends of the distribution of fitness effects.
Our observations raise the question of what impact regulatory mutations have on the genetic load associated with segregating deleterious mutations53,54. This question cannot be addressed directly using our methods because they do not yield estimates of the selection coefficient s. However, simulations suggest that the weakly deleterious mutations detectable by our methods have average population-scaled selection coefficients (2Nes, where Ne is the effective population size and s is the selection coefficient) of between about –16.7 and –7.7 (Supplementary Table 10 and Supplementary Note). Assuming Ne = 10,000, the average reduction in fitness per haploid genome due to our estimates for coding sequences–or, equivalently, the expected number of lethal equivalents per gamete–would therefore be 0.3–0.6 (Supplementary Note). This estimate is somewhat lower than the estimates of 0.7–2.5 lethal equivalents per gamete obtained from reductions in survival in the presence of human inbreeding54,55. However, if transcription factor binding sites are included and we extrapolate to a larger set of regulatory elements, our estimates rise to 1.2–2.6, in better agreement with inbreeding studies. These calculations are also crude, but they hint that most deleterious mutations segregating in human populations may be regulatory in nature. Furthermore, our finding that these regulatory mutations occur at higher frequencies, on average, than those in coding sequences (Fig. 4d) suggests somewhat different genetic architectures for regulatory and coding deleterious variations. For example, elevated derived allele frequencies most likely correspond to increased average allele ages and greater sharing of deleterious alleles across populations, differences that may have implications for the roles of these mutations in human diseases and their detectability in association studies.
Our results may be influenced in some respects by the simplifying assumptions underlying our model. Our estimates of the fraction of nucleotides under selection (ρ) depend on the assumption that a negligible fraction of high-frequency derived alleles is under selection. If modes of selection that cause alleles to remain at elevated frequencies or produce a steady influx of high-frequency derived alleles–such as balancing selection or recurrent positive selection–are common, the parameter ρ will be underestimated, and other parameters could also be influenced. To have a substantial influence on our results, however, these phenomena would have to be more prevalent than suggested by current evidence39,56,57. Another possible concern is that our estimates of ρ (and derived quantities) could be artificially elevated by reduced diversity in flanking neutral sites due to background selection or hitchhiking. Similarly, it is conceivable that fine-scale differences between binding sites and flanking neutral regions in mutation, fixation or SNP detection rates could lead to biased parameter estimates. However, follow-up experiments indicate that our approach adequately controls for these effects (Supplementary Figs. 5 and 8–10 and Supplementary Note). Evolutionary turnover of transcription factor binding sites58,59 is another possible confounding factor in our analysis. For example, losses of transcription factor binding sites in chimpanzee could lead to the overcounting of substitutions in the human lineage. However, simulations indicate that our model is effective in mitigating this problem by making use of outgroup sequences only to infer ancestral alleles (Supplementary Fig. 11 and Supplementary Note). Finally, our analysis ignores structural variants, focusing on point mutations, because they are most common, easiest to detect and easiest to model. Structural variants and point mutations generally show qualitatively similar patterns of selection51,60, but it will be of interest in future work to consider broader classes of mutation.
The INSIGHT model is detailed elsewhere19 and outlined in the Supplementary Note. Briefly, it incorporates a statistical phylogenetic model for outgroup genomes, a Jukes-Cantor62 model for divergence along the human branch and simple Bernoulli models for the occurrence of low- and high-frequency polymorphisms in human samples. It assumes a mixture of selected and neutral positions in binding sites and assumes that all flanking sequences evolve neutrally. The parameter ρ is the coefficient for the mixture model. The other two key global free parameters are η, which scales the rate of divergence, and γ, which scales the rate of low-frequency polymorphisms, both in selected sites relative to neutral flanking sites (Table 1). The phylogenetic parameters are pre-estimated using RPHAST63, and the neutral parameters (β1, β2 and β3, λi and θi for each locus i) are pre-estimated from the flanking sites. The remaining parameters are estimated by maximum likelihood using an expectation maximization algorithm.
Synthetic data were generated by forward simulation using SFS CODE64. We considered two demographic scenarios: one with a single human population of constant size (Ne = 10,000) and another with separate African, East Asian and European populations and recently estimated demographic parameters65. For each of the assumed distributions of fitness effects and each demographic scenario (Fig. 1 and Supplementary Figs. 12 and 13), we generated 10,000 independent loci consisting of a transcription factor binding site of 10 bp in size with 5,000 neutral flanking sites on each side. Nucleotides within the transcription factor binding sites were assigned selective modes by sampling from a multinomial distribution corresponding to each assumed mixture of fitness effects. We assumed mutation and recombination rates of μ = 1.8 × 10–8 and ρ = 1.1 × 10–8 events per nucleotide per generation, respectively. The recombination rate was held constant, but mutation rates were allowed to vary across loci by sampling from a normal distribution with the given mean and a standard deviation equal to one-tenth of the mean. Data were generated for 50 individuals (100 chromosomes). The true values of all parameters were based on the selective modes assigned during simulation. For comparison, we used the simple divergence-based estimator of Kondrashov and Crow66, an analogous polymorphism-based estimator and the McDonald-Kreitman–based estimator of Smith and Eyre-Walker22. Complete details appear in the Supplementary Note.
The transcription factor binding site identification pipeline is detailed in the Supplementary Note. Briefly, precomputed ChIP-seq peaks for 122 transcription factors from the Hudson Alpha Institute for Biotechnology and the Stanford-Yale-USC-Harvard consortium in the ENCODE Project were obtained from the UCSC Genome Browser (see URLs), excluding time-course experiments, chemically treated cell types, controls and data sets with release dates after June 2012. Motifs were identified for each transcription factor in multiple rounds of analysis with MEME67 using subsampling strategies similar to those used in MEME-ChIP68. Motifs were manually inspected and compared with those in motif databases, and a single best motif was selected for each transcription factor. Transcription factors were discarded if a high-quality motif could not be identified. Binding sites at ChIP-seq peaks were then identified using MAST (P < 0.0001, E value < 10). Binding sites were merged across cell lines and, where applicable, from the two data providers. Transcription factors having fewer than 500 binding sites were discarded, leaving 78 transcription factors. Motifs and corresponding binding sites were trimmed by eliminating edge positions with information content of <0.5.
Information about human polymorphisms came from the 69 Genomes data set from Complete Genomics (see URLs). Although larger data sets are available16, this one was selected for its high coverage, which allows singleton variants to be characterized with fairly high confidence. Our simulation results indicate that the sample size is large enough for our purposes. We eliminated data from the child in each of 2 trios and all but the 4 grandparents in the 17-member CEPH (Centre d'Etude de Polymorphisme Humain) pedigree, leaving 54 individuals. Genotype calls were extracted from the masterVar files. Outgroup data were obtained from the alignments in the UCSC Genome Browser of the chimpanzee (panTro2), orangutan (ponAbe2) and rhesus macaque (rheMac2) genomes with the human reference genome (hg19). Filters were applied to eliminate repetitive sequences, recent duplications, CpG sites and regions not showing conserved synteny with outgroup genomes. Our analysis considered only the autosomes (chromosomes 1–22). Complete details appear in the Supplementary Note.
The INSIGHT program was run separately on the full set of binding sites for each of the 78 transcription factors analyzed. Binding sites with fewer than 100 flanking nucleotides (after filtering) were excluded. To avoid overfitting due to sparse data, we added a single small ‘pseudolocus’ to each data set. We assumed a low frequency threshold of f = 15% for most analyses but also experimented with f = 10% and 20% (Supplementary Fig. 14). After parameter estimation, the expected values E[A] and E[W] were obtained by summing over the appropriate conditional distributions across all non-filtered nucleotide positions in the set, given the estimated parameters and observed data. Site frequency spectra (Fig. 4 and Supplementary Fig. 15) represent counts of derived alleles out of 108 chromosomes. Ancestral alleles were determined by parsimony at sites where the chimpanzee allele was shared by at least one of the other outgroups (orangutan or macaque). Details appear in the Supplementary Note.
Variances in parameter estimates of ρ, η and γ were obtained by calculating the 3 × 3 negative Hessian matrix for the log-likelihood function at its maximum (an approximation of the Fisher information matrix) and then inverting this matrix. The square roots of the diagonal elements of the resulting matrix were used as approximate standard errors for these parameters. These variances were propagated to the quantities E[A] and E[W] using an approximation. The error bars in the figures extend one standard error above and below the maximum-likelihood estimates of the parameters. Complete details appear in the Supplementary Note.
Likelihood ratio tests (LRTs) were used to assess the significance that parameters of interest have values greater than zero (Fig. 2) and that the parameters differ significantly between two sets of elements belonging to different classes (GO analysis for ρ). The first type of test was carried out by fitting the model to the data with all parameters free, fitting it again with a parameter of interest fixed at zero and then comparing twice the difference in the maximized log likelihoods to an appropriate asymptotic null distribution (all variants of χ2 distributions). The second type of test was accomplished by fitting the model separately to two partitions of a data set, fitting it once to the complete data set and again comparing twice the difference in maximized log likelihoods to an appropriate null distribution. Complete details appear in the Supplementary Note.
Information content was calculated from the inferred motif models as
The predicted binding affinity of a binding site having sequence X = (x1,..., xk) was calculated as
where is the probability of observing base xi at position i in a binding site and πxi is the background frequency of base xi (estimated across the genome)71,72. Additional details appear in the Supplementary Note.
The potential impact of binding site turnover was assessed by simulation. We simulated both losses of chimpanzee binding sites and gains of human binding sites, at various rates, under various mixtures of selective effects. We observed almost no sensitivity to relaxation of constraint on the chimpanzee lineage, even at fairly high rates of binding site loss (Supplementary Fig. 11). The model also behaves reasonably when gains in humans involve genuine positive selection. We did observe some overestimation of adaptive substitutions in a scenario in which binding sites emerge fully formed and immediately come under negative selection, with no intermediate adaptive interval, but this scenario is unlikely to have a major impact on our results (Supplementary Note).
We thank R. Blekhman, C. Danko, A. Boyko, K. Pollard, N. Goldman and A. Clark for comments on the manuscript. This research was supported by a Packard Fellowship, a Sloan Research Fellowship, US National Science Foundation grant DBI-0644111 and US National Institutes of Health (National Institute of General Medical Sciences, NIGMS) grant GM102192 (to A.S.). In addition, L.A. was supported in part by a postdoctoral fellowship award from the Cornell Center for Vertebrate Genomics, and B.A.A. was supported by US National Institutes of Health training grant T32-GM083937.
L.A., I.G. and A.S. conceived and designed the study. L.A., I.G., B.A.A., M.J.H., B.G. and A.S. analyzed the data. L.A., I.G., B.A.A. and M.J.H. contributed materials and analysis tools. A.S. and A.K. supervised the research. L.A., I.G. and A.S. wrote the manuscript with review and contributions from all authors.
URLs. UCSC Genome Browser, http://genome.ucsc.edu/; Complete Genomics human variation data, http://www.completegenomics.com/public-data/69-Genomes; INSIGHT source code and webserver, http://compgen.bscb.cornell.edu/INSIGHT/.
Supplementary information is available in the online version of the paper.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.