Our non-coding dataset included 6.9 Mb of autosomal conserved non-coding sites (CNCs) resequenced in 15 African Americans (AAs), 20 European Americans (EAs), and aligned to a single chimpanzee (PanTro2), while our coding dataset included 17.5 Mb of autosomal sites resequenced across the same panel of individuals, see
[32],
[33]. Conserved noncoding sites were defined as strictly non-coding sites that fall within human and mouse conserved sequences (sequences with at least 70% identity and 100 nucleotides in length). We found at least one human polymorphism or fixed difference to the chimpanzee in CNCs flanking a total of 11,334 autosomal genes (75.4%), with 7,826 genes having at least one human polymorphism across the 35 individuals (52.9%). CNCs that are flanking genes tend to have lower divergence and polymorphism compared to those in intergenic regions, however all categories of CNCs exhibit lower polymorphism and divergence as compared to synonymous sites (). In AAs, CNCs in 5′ upstream regions have a smaller ratio of divergence to polymorphism when compared to synonymous sites, consistent with the presence of selective constraint in the upstream regions of genes (
p
=

0.029, ). However, the overall ratio of divergence to polymorphism is generally similar between pooled CNCs and synonymous sites (
p>0.05, ), and all categories of CNCs exhibit a higher ratio of divergence to polymorphism when compared to nonsynonymous sites.
| Table 1The total number of fixed and segregating sites in 15 African Americans (AA) and 20 European Americans (EA) as compared to the chimpanzee (PanTro2), along with the ratio of fixed/segregating (F/S) sites, the odds ratio (or neutrality index), and a chi-square (more ...) |
The site frequency spectra of CNCs in both upstream and downstream regions of genes show a significant excess of lower frequency derived alleles (SNPs at a frequency of 1/16) as compared to both synonymous and intergenic sites ( and ), suggesting an excess of weakly deleterious alleles in CNCs in the flanking regions of genes. A study by Veyrieras
et al. [34] has found that the majority of eQTLs lie either within or close to genes, suggesting that the excess of low frequency derived alleles observed in the flanking regions of genes may reflect the past action of negative selection on gene regulation. A shift in the distribution toward more rare alleles in CNCs is consistent with previous findings
[8],
[10],
[13],
[14], and confirmed by significantly smaller values of Tajima's
D in CNCs as compared to synonymous sites (
D
=

−0.52 vs. −0.45 in AAs [Mann-Whitney
U test,
p<10
−4],
D
=

−0.31 vs. −0.21 in EAs [Mann-Whitney
U test,
p<10
−6]) (
Figure S1). Therefore, patterns in the site frequency spectrum suggest that CNCs in the flanking regions of genes have been subject to selective constraint.
| Table 2Fisher's Exact tests comparing the proportion of low frequency derived alleles between different categories of CNCs compared to synonymous and intergenic sites. |
Estimates of the population scaled selection coefficient (γ

=

2
Nes) from the site frequency spectrum were calculated for different categories of CNCs using the program prfreq
[32]. All point estimates of γ were negative for CNCs in the flanking regions of genes as compared to intergenic CNCs, which share the same ascertainment scheme and show no evidence for selection when compared to synonymous sites (). We find a significantly better fit for a single estimate of γ when compared to a neutral model for all categories of CNCs, suggesting the presence of selective constraint that cannot be explained by the ascertainment of CNCs alone. Furthermore, all but the 5′ upstream and 5′ UTR regions of genes show a significantly better fit to a model with a distribution of γ, implying there is variation in the deleterious effects of mutations in CNCs. We also observe an excess of fixed differences that cannot be explained by a Gamma model of deleterious fitness effects, suggesting the presence of positive selection within CNCs in the flanking regions of genes. We estimate that 4.67% of the human-chimp fixed differences in CNCs can be attributed to positive selection, with UTRs showing more adaptive evolution (5′: 14.3%, 3′: 23.3%) than upstream or downstream regions (5′: 12.2%, 3′: 12.0%). Intronic CNCs show less evidence for positive selection (2.22%), but because they make up the majority of sites, pooled CNCs in the flanking regions of genes show modest evidence for adaptive divergence relative to intergenic CNCs.
| Table 3Maximum likelihood estimates of γ under various models of fitness effects for mutations in CNCs in the flanking regions of genes as compared to intergenic sites. |
Gene-specific tests of selection
We defined a candidate
cis-regulatory region as all CNCs within the introns, UTRs, or within 5 kb up- or downstream of the transcription start or stop site of a gene. CNCs were pooled in this way to increase the power to detect selection, and to capture signals of selection without regards to a specific mode of
cis-regulation. However, we note there is a significant correlation between the probability of selection estimated from the 5′ upstream regions and the combined set of CNCs (
Figure S2). To identify genes whose candidate
cis-regulatory regions may be under selection, contingency tables were constructed containing the counts of polymorphic sites and fixed differences to the chimpanzee in the pooled flanking CNCs of a gene. The total number of synonymous polymorphisms and fixed differences in coding regions (without respect to human and mouse sequence conservation) were pooled to use as a neutral standard as in
[35]; the use of pooled vs. local synonymous sites has little effect on our estimates of selection (
Figure S3). In order to identify loci showing signatures of natural selection, we implemented the program mkprf
[36] by conservatively setting no fixed variance on the prior distribution of the population scaled selection coefficient (γ

=

2
Nes), which shrinks the estimates of γ (
Figure S4). For each gene we quantified the probability that the estimate of γ falls within five bins: γ<−1 (strong negative selection), −1<γ<−0.5 (weak negative selection), −0.5<γ<0.5 (nearly neutral), 0.5<γ<1 (weak positive selection), and γ>1 (strong positive selection) (). In order to compare different modes of selection, the probability of negative selection was defined as Pr(γ<−0.5), and the probability of positive selection was defined as Pr(γ>0.5).
In order to control for population size changes that may affect our estimates of positive and negative selection, we incorporated demographic parameters when calculating the likelihood of our observed data in mkprf, the effect of which can be seen in
Figure S5. A population expansion is expected to increase levels of polymorphism at neutral sites to varying extents due to differences in local mutation rates. In a set of neutral loci simulated under a model of population expansion, failing to correct for demography results in a neutral locus showing strong signatures of negative selection with credibility intervals on the mean estimate of γ below 0 (), and a slight inflation in the probability that γ<0 at higher values (
Figure S6). Demographic models including a population expansion in African Americans and a single population bottleneck in European Americans were fitted to the complete set of autosomal synonymous SNPs using the program prfreq
[32] and incorporated into mkprf. However, a single bottleneck model was found to be a poor fit to the European American sample, and even a complex, multi-bottleneck model could not account for the excess of high frequency derived alleles at synonymous sites
[32]. Therefore, we focus the majority of our conclusions on the African American sample. A complete list of genes including McDonald-Kreitman tables and mkprf results are available in
Table S1,
Table S2,
Table S3,
Table S4,
Table S5,
Table S6,
Table S7, and
Table S8.
| Table 4The number of genes showing strong signatures of positive and negative selection in conserved non-coding sites (CNCs), synonymous, and nonsynonymous sites (Nonsyn) in African Americans (AA) and European Americans (EA), and for simulated data (Sim) under (more ...) |
In order to evaluate the performance of mkprf, we performed Wright-Fisher forward simulations under the inferred demographic model in African Americans using the program SFS_CODE
[37]. We find a significant correlation between simulated and estimated mean values of γ in datasets including various combinations of loci simulated under positive and negative selection (
Figure S7). However, in a dataset including an equal number of loci simulated under positive and negative selection, the correlation is stronger for positively selected loci as compared to negatively selected loci, most likely reflecting a limited ability to distinguish between strong vs. weak negative selection. There is also a 10-fold increase in the number of genes returned with credibility intervals (CIs) on the mean estimate of γ>0 compared to CIs<0 (neut+del+pos, ), as genes subject to strong enough negative selection often have reduced levels of polymorphism and divergence. Moreover, of the 11,000 simulated loci in the neut+del+pos dataset, 96% of loci in the positively selected class (γ>0) had at least 1 informative site, whereas only 94% of loci in the negatively selected class (γ<0) had at least 1 informative site making an estimate of γ even possible in mkprf. We note that across all datasets, only a limited number of the loci simulated under positive and negative selection have CIs on the mean estimate of γ above or below 0 (), suggesting there are likely additional loci subject to selection than identified by CIs alone.
An important consideration in gene specific tests for selection is unequal sequence coverage, as genes with a smaller number of resequenced sites tend to have fewer informative sites (fixed or polymorphic sites,
Figure S8). It is unlikely that we have complete coverage of all
cis-regulatory regions of a gene due to a focus on the 5′ upstream regions of genes, which is potentially problematic for our comparisons if genes with fewer informative sites were to provide less reliable estimates of γ. As discussed in the previous paragraph this may be particularly relevant for negatively selected loci, as genes subject to strong negative selection are expected to have fewer informative sites. Our simulations confirm that a higher percentage of negatively selected loci have fewer than 4 informative sites (neut+del+pos, 52%) as compared to either positively selected (16%) or neutral loci (28%). However, we find no significant difference in the distribution of γ when we compare loci with fewer or at least 4 informative sites when simulated under neutral, positive, and negative selection (
Figure S9), suggesting that loci with fewer informative sites are unlikely to be problematic in our comparisons. The distributions of the number of resequenced sites at a locus in candidate
cis-regulatory and protein-coding regions are shown in
Figure S10.
Selection on candidate cis-regulatory regions and gene expression profiles
We downloaded the Novartis Gene Expression Atlas 2 data from 72 normal human tissues in order to examine patterns of selection in candidate
cis-regulatory regions with respect to gene expression signals
[38]. Microarray expression profiles were available for 87% of genes in our tests for natural selection. Conflicting studies have shown that expression patterns in tissue-specific genes evolve more rapidly
[39], or more slowly
[40] as compared to broadly expressed genes in comparisons of humans and mice. More recently, experimentally defined
cis-regulatory regions were found to exhibit stronger degrees of selective constraint in genes expressed in a smaller number of tissues
[18]. However, we find little correlation between the probability of negative selection and the absolute number of tissues in which a gene is expressed (AA: Kendall's
tau
=

−0.011,
p
=

0.12; EAs:
tau
=

−0.0060,
p
=

0.40), nor with the index of tissue specificity
[41], which includes additional information on the level of expression in each tissue (AA: Kendall's
tau
=

0.0086,
p
=

0.21; EAs:
tau
=

0.0024,
p
=

0.73). We also find no significant correlation between probabilities of selection and the mean and the maximum expression level of a gene across all tissues, suggesting that expression levels may not have a large impact on patterns of recent natural selection within candidate
cis-regulatory regions. However, our simulations suggest that we likely have reduced power to identify significant correlations that are driven by negative rather than positive selection.
We then assigned genes to each tissue of expression in order to examine differences across tissues in evidence for selection on candidate
cis-regulatory regions (
Table S9,
Table S10). Mann-Whitney
U tests did not identify any tissues with a significantly higher probability of negative selection in either candidate
cis-regulatory regions or nonsynonymous sites, suggesting that weak selective constraint may be a persistent factor affecting most human tissues, but likely reflects the limited power we have to detect negative vs. positive selection. On the other hand, we find that genes expressed in at least 3 tissues have a significantly higher probability of positive selection in candidate
cis-regulatory regions (FDR<10%, ), suggesting they may have an excess of positively selected loci. Notably, genes expressed in the fetal brain have a higher mean probability of positive selection as compared to genes expressed in other tissues, suggesting the importance of adaptive regulatory changes during brain development. Similarly, genes expressed in certain tissues of the adult brain, including the cerebellum peduncles and the medulla oblongata, also have a higher probability of positive selection in candidate
cis-regulatory regions. Curiously, the medulla controls a variety of autonomic functions and is considered to be the most plesiomorphic structure of the brain
[42], and might be expected to have a high level of conservation across species. However, the medulla also contains several motor nuclei important for facial expression, mastication, tongue movements, and controlling sound amplitudes that reach the inner ear, which are hypothesized to have played an adaptive role in the evolution of facial expression, feeding, and speech in humans
[43].
| Table 5Tissues from the Novartis Gene Atlas 2 data with a higher mean probability of positive selection in candidate cis-regulatory regions (CNCs) and nonsynonymous sites (Nonsyn) in African Americans (AA) and European Americans (EA) (p<0.05, Mann-Whitney (more ...) |
Microarray studies have found that differences in brain expression patterns are more pronounced between humans and other primates as compared to other tissues
[20], and that the majority of these differences are likely due to upregulation of brain-expressed genes in humans as compared to chimpanzees
[21]. However, these findings have been controversial
[22], and a more recent study has found that differences in expression patterns between humans and chimpanzees are less pronounced in the brain as compared to heart, kidney, liver and testis
[24]. Regional expression in parts of the brain is generally conserved between human and the more distantly related mouse
[44], raising the possibility that cognitive differences between species may be more likely to result from differential gene expression during development. While comparative microarray studies have generally focused on adult brain expression, our findings suggest that adaptive evolution might have had a larger impact on expression patterns in the fetal brain.
We applied the same methodology to examine differences in selection acting on coding regions with regards to gene expression. In contrast to what we observed for candidate
cis-regulatory regions, we find no evidence for a higher probability of positive selection on nonsynonymous sites in genes expressed in the fetal brain in either AAs (Mann-Whitney
U-test:
p
=

0.26) or EAs (
p
=

0.78). Our results are similar to a previous finding that positive selection in protein-coding regions is not elevated in genes expressed in the fetal brain
[45]. In light of our results, it would seem that adaptive evolution of fetal brain development is influenced more strongly by changes at the regulatory level rather than at the protein-coding level. However, we note that genes expressed in the “whole brain” and cerebellum show a trend towards higher probabilities of positive selection within nonsynonymous sites in AAs (
p
=

0.03 and 0.048 respectively, FDRs

=

34%), suggesting that adaptive changes in the human brain may be the cumulative result of positive selection on regulatory regions during early development, and possibly on protein-coding regions in the adult brain.
Previous studies have identified immune response and T cell-mediated immunity as processes that are enriched for genes showing signatures of positive selection in protein-coding regions
[46],
[47], highlight the importance of adaptive evolution in response to pathogens. We find that genes expressed in natural killer cells and T-cells both rank high among tissues with higher probabilities of positive selection in both coding and candidate
cis-regulatory regions as compared to genes expressed in other tissues. Therefore, while genes expressed in the fetal brain show a different trend in coding and candidate
cis-regulatory regions, positive selection in genes expressed in various immune cells may have occurred at both the coding and regulatory level.
Selection on candidate cis-regulatory regions and functional categories
Evolutionary patterns within candidate
cis-regulatory regions can also provide insight into the relative importance of functional categories in the evolution of modern humans. We generated a custom GOslim set containing 129 terms from the Gene Ontology database
[48], and performed Mann-Whitney
U-tests to identify functional categories that have higher than expected probabilities of selection within candidate
cis-regulatory regions (
Table S11,
Table S12). In AAs, functional categories with a higher probability of positive selection in candidate
cis-regulatory regions include regulation of cellular process (GO:0050794,
p
=

5.6×10
−4, FDR

=

4%), protein modification (GO:0006464,
p
=

4.8×10
−3, FDR

=

9%), and cell cycle (GO:0007049,
p
=

3.7×10
−3, FDR

=

9%) (). In EAs, the most significant terms include calcium ion binding (GO:0005509,
p
=

4.9×10
−3, FDR

=

18%), organelle organization and biogenesis (GO:0006996,
p
=

5.4×10
−3, FDR

=

22%), cell cycle (GO:0007049,
p
=

7.9×10
−3, FDR

=

22%), and behavior (GO:0007610,
p
=

9.3×10
−3, FDR

=

22%). Functional categories with a higher probability of negative selection in candidate
cis-regulatory regions in AAs are generally less significant than positive selection, but include cytosol (GO:0005829,
p
=

8.6×10
−3, FDR

=

16%), ribosome (GO:0005840,
p
=

0.02, FDR

=

16%), extracellular region (GO:0005576,
p
=

0.03, FDR

=

16%), and carrier activity (GO:0005386,
p
=

5.9×10
−3, FDR

=

22%), and in EAs include proteinaceous extracellular matrix (GO:0005578,
p
=

0.01, FDR

=

19%), and extracellular space (GO:0005615,
p
=

0.03, FDR

=

19%) ().
| Table 6Gene Ontology terms with higher mean probabilities of positive and negative selection in candidate cis-regulatory regions (CNCs) and nonsynonymous sites (Nonsyn) in African Americans (AA) and European Americans (EA) (FDR<25%, Mann-Whitney (more ...) |
We tested the same group of GO terms in our parallel tests for selection on nonsynonymous sites within coding regions, and find that different categories are identified as having higher probabilities of selection in nonsynonymous sites as compared to candidate
cis-regulatory regions (). Terms showing significantly higher probabilities of positive and negative selection in nonsynonymous sites correspond to previously published results involving the same dataset with AAs and EAs pooled
[35]. For example, we find that “transcription” has significantly higher probabilities of positive selection in nonsynonymous sites (GO:0006350; AA:
p
=

1.1×10
−3, FDR

=

8%; EA:
p
=

2.5×10
−3, FDR

=

6%), and that “actin binding” has higher probabilities of negative selection in nonsynonymous sites (GO:0003779:
p
=

8.0×10
−3, FDR

=

25%). Transcription factors have frequently been reported as having a high degree of positive selection in protein-coding regions in general
[35],
[45],
[49],
[50], however this is not a trend that we observe in candidate
cis-regulatory regions. Interestingly, in EAs we find that “calcium ion binding” has a significantly higher probability of negative selection at nonsynonymous sites (GO:0005509,
p
=

1.5×10
−3, FDR

=

6%), but within candidate
cis-regulatory regions it shows a higher probability of positive selection (
p
=

4.9×10
−3, FDR

=

18%). Therefore, evolutionary patterns in candidate
cis-regulatory regions tend to exhibit different functional patterns than those found in protein coding regions of genes.
Noteworthy genes with evidence for selection on candidate cis-regulatory regions
We identified several genes that are likely to have undergone adaptive regulatory changes during human evolution (95% credibility interval on mean γ above 0).
NUDT16 shows the strongest evidence for positive selection in candidate
cis-regulatory regions in both AAs (Pr[γ>0.5]

=

98.5%) and EAs (Pr[γ>0.5]

=

97.4%).
NUDT16 is a negative regulator of ribosome biogenesis, and may be involved in RNA decay in the nucleus.
ITPR1 shows the second strongest evidence for positive selection in EAs (Pr[γ>0.5]

=

97.3%), however we find much weaker evidence for positive selection in AAs due to the presence of SNPs that are exclusive to AAs (Pr[γ>0.5]

=

70.7%).
ITPR1 is essential for brain function, and is translated in response to synaptic activity in order to modulate calcium release from the endoplasmic reticulum. However
ITPR1 also modulates calcium entry in the plasma membrane of B-cells, suggesting it may also have an important role in immunity. Functional studies have shown that the 3′ UTR of
ITPR1 is required for dendritic localization in the mouse, however the majority of human-chimp fixed differences fall within conserved intronic sites rather than in the 3′ UTR region of this gene, suggesting that the target of selection is more likely to be an intronic
cis-acting element (or elements). Within nonsynonymous sites,
ITPR1 demonstrates little evidence for either negative or positive selection in both AAs and EAs, suggesting that positive selection on
ITPR1 likely involved adaptive substitutions within candidate
cis-regulatory regions rather than within protein-coding regions.
Additional genes with strong evidence for positive selection in candidate cis-regulatory regions in both populations include CACNA2D3, a calcium channel subunit protein ubiquitously expressed in fetal tissues; OR2L13, an olfactory receptor; and RNF167, a gene involved in protein degradation.
We also identified several genes that are likely to have experienced negative selection on candidate
cis-regulatory regions (95% credibility interval on mean γ below 0). The gene with the highest probability of negative selection in candidate
cis-regulatory regions in AAs is
FREM1 (Pr[γ<−0.5]

=

98.4%), a gene involved in the development of a number of epidermal structures in the mouse
[51], which also shows signatures of negative selection in EAs (Pr[γ<−0.5]

=

92.9%, but CI includes 0).
FREM1 is highly expressed in the dermis during mouse embryonic development, and truncation of the FREM1 protein results in blebbing (blistering) diseases that are similar to phenotypes observed in Fraser syndrome, and dystrophic epidermolysis bullosa in humans. The gene with the highest probability of negative selection in candidate
cis-regulatory regions in EAs in
KRT40 (Pr[γ<−0.5]

=

99.9%), a hair keratin protein that shows a similarly high probability of negative selection in AAs (Pr[γ<−0.5]

=

97.6%). However, positive values of Tajima's
D in both EAs (
D
=

1.80) and AAs (
D
=

1.32) indicate that polymorphisms at
KRT40 tend toward intermediate frequencies, suggesting that
KRT40 may potentially be subject to balancing rather than negative selection.
Other genes with strong evidence for negative selection in candidate cis-regulatory regions include SGCZ, a gene that may be important in the pathogenesis of muscular dystrophy and cardiomypathy; KIF19, a gene involved in intracellular transport and a member of a superfamily of proteins important for brain functioning; DOCK1, a gene thought to play a role in regulating phagocytosis during apoptosis; and TNNI3K, a cardiac-specific protein kinase. We also identify C14orf119 and C20orf117 as having strong evidence of negative selection in candidate cis-regulatory regions, however these genes have not been fully characterized.
Selection inferred at disease-associated genes
In order to examine patterns of natural selection on candidate
cis-regulatory regions with respect to human disease, we identified 666 Mendelian disease genes using a hand-curated list of genes from the Online Mendelian Inheritance in Man database (OMIM)
[52], and 1,072 complex disease genes using the Genetic Association Database (GAD)
[53] that were included in our scans for selection. We find that disease genes have a higher mean probability of negative selection within candidate
cis-regulatory regions as compared to non-disease genes, however this trend is only suggestive in EAs, the population where the majority of diseases have likely been characterized (Mann-Whitney
U-test; OMIM:
p
=

0.23 in AAs,
p
=

0.011 in EAs; GAD:
p
=

0.29 in AAs,
p
=

0.06 in EAs). A link between negative selection and human disease has also been observed in protein-coding regions of the genome
[35], however genetic diseases can also be regulatory in nature
[54].
There are several examples of disease-associated genes showing evidence for negative selection in candidate cis-regulatory regions in EAs. For example, LDB3, is a gene expressed in skeletal and cardiac muscle for which several nonsynonymous mutations have been associated with myofibrillar myopathy (OMIM:609452) and cardiomyopathy (OMIM:601493). Another gene, PLCE1 shows evidence for negative selection in both coding and candidate cis-regulatory regions, and homozygous mutations within the coding regions have been associated with type 3 nephrotic syndrome (OMIM:610725). Both LDB3 and PLCE1 show only moderate signatures of negative selection at candidate cis-regulatory regions in AAs.
On the other hand, there are several examples of disease genes that show strong evidence of positive selection in candidate cis-regulatory regions. For example, CACNA2D3 displays a pattern where loss of heterozygosity in intronic sequences is associated with renal cell carcinoma; CENTG3 variants confer protection against the pathogenesis of polyglutamine disease in the brain; and ALG3 variants are associated with congenital disorder of glycosylation type Id, a metabolic disease.
Contrasts of selection on protein-coding and candidate cis-regulatory regions
Direct comparisons between different classes of sites in mkprf may be confounded by differences in power to infer selection when there are varying degrees of selection on background loci in the dataset, particularly when there is no fixed variance on the prior distribution of γ. Our simulations show that as the proportion of negatively selected loci is varied, so does the number of genes showing strong signatures of positive selection due to a broader exploration of the parameter space (CI>0, ), despite the actual number of positively selected loci remaining constant. For example, by increasing the number of negatively selected loci from 0 (neut+pos) to 699 (neut+wkdel+pos), to 1,485 (neut+del+pos), while keeping the number of positively selected loci constant (1,824 loci), the number of genes with CIs>0 changes from 206 to 199, to 203 (). If we increase the strength and the number of negatively selected loci to 4,632 loci (neut+stdel+pos), we then identify 267 loci with CIs>0 (). More importantly, the distributions of γ for positively selected and neutral loci show visible differences when the number of negatively selected loci is varied, which may confound direct comparisons when different classes of sites are analyzed independently (
Figure S11). In order to control for varying levels of selection on background loci, we ran a combined analysis with nonsynonymous, synonymous, and candidate
cis-regulatory regions in a single run of mkprf. The effect of analyzing different classes of sites in a concurrent vs. independent analysis is a shift in the distribution of γ towards smaller values for both candidate
cis-regulatory regions and synonymous sites, and a shift in the distribution of γ towards larger values for nonsynonymous sites, causing the distributions of γ to be more similar, and more closely centered around 0 for different classes of sites (
Figure S12).
First we looked for a correlation between selection acting on coding and candidate
cis-regulatory regions, and observe what appears to be a complex relationship (). In AAs we find a weak yet significant rank correlation between nonsynonymous and non-coding regions for both the probability of positive (Kendall's
tau
=

0.055,
p
=

1.6×10
−11) and negative selection (
tau
=

0.056,
p
=

9.6×10
−12), and between synonymous and non-coding regions (positive selection:
tau
=

0.060,
p
=

2.1×10
−13; negative selection:
tau
=

0.060,
p
=

4.1×10
−13). The correlation between synonymous and nonsynonymous sites appears to be slightly stronger (positive selection:
tau
=

0.098,
p<10
−16; negative selection:
tau
=

0.096,
p<10
−16), as expected due to increased linkage disequilibrium from a closer proximity between sites. Results from the EA sample are similar. However, given the small values of
tau, identifying genes with evidence for selection in coding regions may be a poor predictor of whether a gene will show evidence for selection in the same direction as candidate
cis-regulatory regions. A recent study has found that genes with patterns of expression consistent with any direction of selection (either positive or negative) exhibit reduced rates of protein evolution on nonsynonymous sites
[25], which is consistent with only a weak correlation in the probability of selection between sites. The lack of a strong correlation in the mode of selection also suggests that the effect of linkage between candidate
cis-regulatory and protein coding regions may be small, increasing the probability that we are detecting signatures of selection that are specific to non-coding regions.
We next compared the overall distributions of the probability of selection at different classes of sites (). Nonsynonymous sites have a higher mean probability of negative selection as compared to both candidate
cis-regulatory regions and synonymous sites (Mann-Whitney
U-tests,
p-values<10
−16), and candidate
cis-regulatory regions have a significantly higher mean probability of negative selection as compared to synonymous sites (
p
=

8.7×10
−4). We find that candidate
cis-regulatory regions exhibit a significantly higher mean probability of positive selection as compared to nonsynonymous sites (
p<10
−16), however this is also true for synonymous sites (
p<10
−16), and synonymous sites have a marginally higher mean probability of positive selection as compared to candidate
cis-regulatory regions (
p
=

0.014). Although synonymous sites may not evolve under strict neutrality, they may better represent the distribution expected under neutrality while taking into consideration local effects of linkage, mutation rates, and variability in effective population sizes across the genome. Therefore, differences in the distributions of the probability of positive selection between nonsynonymous and candidate
cis-regulatory regions may be driven by more neutral, rather than more adaptive evolution in candidate
cis-regulatory regions.
However, it is possible we have limited power to identify differences in the degree of positive selection in protein-coding and candidate
cis-regulatory regions, as the extent of positive selection within candidate
cis-regulatory regions may be an underestimate due to their ascertainment. By restricting our analyses to human-mouse conserved sequences, it may have biased our dataset toward lower ratios of divergence/polymorphism based on neutral coalescent simulations (
Text S1,
Table S13,
Table S14,
Table S15,
Figure S13, and
Figure S14). For example, if we restrict our analysis of nonsynonymous sites to those within human-mouse conserved regions (
i.e. use the same ascertainment as non-coding sites), all genes show strong evidence for strong selective constraint and have 95% credibility intervals below 0 on the mean estimate of γ (). Moreover, it is likely that only a small proportion of sites within candidate
cis-regulatory regions are truly functional, as regulatory elements are often small. Therefore, the relative contribution of adaptive evolution at the level of gene regulation vs. changes in the actual protein remains an open question.
An important consideration is whether selection on synonymous sites has affected our estimates of selection. While we observe no discernable relationship between the probability of positive or negative selection with GC content (
Figure S15), negative selection may indeed affect a certain proportion of synonymous mutations
[55]. Wright-Fisher simulations show that if purifying selection is acting on synonymous sites, we should expect a shift in the distribution of γ from a mean of 0.044 to 0.26 for a set of neutral loci (). On the other hand, selection on closely linked nonsynonymous sites may cause a reduction in polymorphism at synonymous sites, and could explain the signatures of positive selection we observe on synonymous sites (). Nevertheless, pooled synonymous sites provide a good fit to a neutral demographic model of expansion in the African American sample
[32], and were used as a neutral standard in the same way for all classes of sites. Therefore, any bias is expected to have a similar affect on candidate
cis-regulatory, synonymous, and nonsynonymous sites.
Conclusion
Our analysis of human polymorphism and divergence in conserved non-coding sites suggests that the evolution of candidate cis-regulatory regions is often driven by both positive and negative selection. Our findings reinforce the idea that the non-coding portion of our genome has an important functional and evolutionary role, and suggest that patterns of natural selection in non-coding DNA are often distinct from that of protein-coding regions. Many of the adaptive changes in candidate cis-regulatory regions might have occurred near genes expressed in the fetal brain, supporting the hypothesis that the evolution of the developing brain may be largely attributable to changes in gene regulation. Our results add to the increasing evidence that non-coding DNA is not all selectively neutral, and that selection on candidate cis-regulatory regions has played an important role throughout hominid evolution.