|Home | About | Journals | Submit | Contact Us | Français|
Conserved noncoding elements (CNCs) are an abundant feature of vertebrate genomes. Some CNCs have been shown to act as cis-regulatory modules, but the function of most CNCs remains unclear. To study the evolution of CNCs, we have developed a statistical method called the “shared rates test” to identify CNCs that show significant variation in substitution rates across branches of a phylogenetic tree. We report an application of this method to alignments of 98,910 CNCs from the human, chimpanzee, dog, mouse, and rat genomes. We find that ~68% of CNCs evolve according to a null model where, for each CNC, a single parameter models the level of constraint acting throughout the phylogeny linking these five species. The remaining ~32% of CNCs show departures from the basic model including speed-ups and slow-downs on particular branches and occasionally multiple rate changes on different branches. We find that a subset of the significant CNCs have evolved significantly faster than the local neutral rate on a particular branch, providing strong evidence for adaptive evolution in these CNCs. The distribution of these signals on the phylogeny suggests that adaptive evolution of CNCs occurs in occasional short bursts of evolution. Our analyses suggest a large set of promising targets for future functional studies of adaptation.
Conservation of DNA sequences across evolutionary history is a highly informative signal for identifying regions with important biological functions. In particular, conserved noncoding regions have been shown to be good candidates for containing regulatory elements that have roles in gene regulation. Recent studies have found that there are many thousands of conserved noncoding elements (CNCs) in vertebrate genomes and have suggested possible functions for some of these elements, but the function of most CNCs remains unknown. To study the evolution of CNCs, we developed a statistical method to identify CNCs that show changes in evolutionary rates on particular branches of the mammalian phylogenetic tree. Those rate changes may indicate changes in the function of a CNC. We applied our method to CNCs of five mammalian genomes, and found that, indeed, many CNCs have experienced rate changes during their evolution. We also found a subset of CNCs showing accelerations in evolutionary rate that actually exceed the neutral rates, suggesting that adaptive evolution has shaped the evolution of those elements.
Phenotypic evolution proceeds both by changes in protein coding sequences and by changes in gene expression that determine when, where, and how much genes are expressed [1–3]. Although recent genome-wide studies have begun the process of identifying genes that show signals of adaptive evolution in coding sequences , much less is known about the adaptation of regulatory sequences. One avenue to studying adaptation of gene regulation is to identify regulatory elements that show rapid evolution at the DNA sequence level . However, a challenge for this approach is that at present we have only limited knowledge of the DNA sequence elements that drive gene expression and regulation.
One possible way forward is to study the evolution of conserved noncoding elements (CNCs) [5–7]. In recent years it has been shown that ~3.5% of noncoding DNA sequence is substantially conserved across diverse mammals [8–10], and that a smaller amount of noncoding sequence is also shared with more distant vertebrates, including chicken and even fish [9,11–13]. Some CNCs show extremely high levels of conservation; for example, Bejerano et al.  identified 481 segments longer than 200 bp that are absolutely conserved among the human, rat, and mouse genomes. Recent studies of CNCs, using varied definitions, have reported that most CNCs are segments of around 100–300 bp, and that they are widely distributed across the human genome [9,10,14–18]. CNCs are not preferentially located near genes . In some cases, clusters of CNCs are found in gene deserts and a subset of these CNCs have been shown to play functional roles as enhancers [19–21].
It has been shown repeatedly that screening for CNCs is an effective method for identifying cis-regulatory modules of gene expression [18–25]. CNCs that are shared among humans and distant outgroups such as Fugu are heavily overrepresented near developmental regulator genes, and many serve as highly conserved regulators of these functionally conserved genes .
That said, there is still considerable uncertainty about the function of most CNCs, and it has been suggested that some CNCs may serve other kinds of functions, perhaps including roles in chromatin structure or structural connections between chromosomes . In principle, another possibility might be that many CNCs could simply be regions of the genome with low mutation rates. However, two kinds of evidence argue convincingly that the low evolutionary rates of CNCs are indeed due to selective constraint. First, the allele frequency spectrum of human SNPs that lie within CNCs is skewed towards rare variants, consistent with the action of weak purifying selection [27,28]. Second, the rate of evolutionary change of CNCs is closer to the neutral rate in primates than in rodents [28,29]. The latter observation is probably due to reduced efficiency of weak purifying selection in primates, which have smaller effective population sizes.
Hence, in this study, in view of the likely functional importance of CNCs, we set out to describe the patterns of evolutionary sequence change in these elements. We start with a simple null model in which the evolution of each CNC is characterized by a single substitution rate parameter r that accounts for varying levels of constraint and local mutation rate across CNCs. For each CNC we compare the null model to a hierarchy of alternative models that allow the CNC to have different evolutionary rates in different parts of the phylogeny. In the simplest alternative model, the CNC evolves at a single rate across the phylogeny except for one branch, which shows a change in rate (Figure 1). More complex alternative models allow multiple changes in rate. Increases in rate can be interpreted as evidence for positive adaptation or relaxation of functional constraint for the element in question. Decreases in rate are consistent with a tightening of selective constraint.
Two recently published papers [5,7] have taken similar approaches to identify nongenic regions that show accelerated evolution specifically in the human lineage. Both studies concluded that human lineage-selection signals are enriched near neurological genes. In the study of Pollard et al. , the most dramatically accelerated region was found to be part of a novel RNA gene that is expressed during cortical development. Here, we expand this kind of approach to look more broadly at evolutionary patterns of CNCs across the mammals.
To scan for functionally interesting CNCs that are shaped by changing selection pressures, we examined regions that are conserved in up to eight vertebrates (human, chimpanzee, dog, mouse, rat, chicken, zebrafish, and fugu; see Methods). We started from a publicly available set of aligned regions that were characterized as “most conserved” by the group that maintains the University of California Santa Cruz (UCSC) genome browser [10,30]. In short, the “most conserved” regions represent 4.3% of the human genome that were identified as conserved by a phylogenetic Hidden Markov Model (HMM)  (Methods). The model used for identifying “most conserved” regions assumes that such regions are conserved across all the species with aligned sequence for the region. However, as we show below, the method was flexible enough to include many regions that show fairly dramatic variation in rates across the vertebrate phylogeny.
We performed extensive filtering of the “most conserved” regions. First, we excluded both translated and untranslated exons, repetitive sequences, and sites that are gaps or missing data in any of the five mammalian genome sequences (human, chimpanzee, mouse, rat, and dog). We then discarded regions with less than 100 bp of ungapped sequence. The remaining data consisted of 231,285 CNCs spanning ~48Mb. The alignments from UCSC make use of global alignment information across species, thus lowering the risk of incorrectly aligning paralogous CNCs as apparent orthologs. However, in order to further reduce the risk of this type of error, we filtered out 98,593 CNCs with human paralogs (see Methods). Since CNCs with different levels of conservation might show differences in their evolutionary patterns, we then subdivided the remaining CNCs into more homogeneous subsets according to conservation levels in chicken and fish (see Methods). Our study examines the properties of the two largest of these subsets, to be denoted as “mammalian” CNCs (conserved within mammals but not found in chicken or fish) and “amniotic” CNCs (conserved in mammals and chicken but not found in fish). For both the mammalian and amniotic CNCs, our analysis studied evolutionary patterns across the history of the five mammalian species only.
Our final dataset consists of 82,335 mammalian CNCs (for a total of 18.5 Mb) and 16,575 amniotic CNCs (4.6 Mb). The median sizes of CNCs in the two groups are 201 and 240 bp, respectively. We find that overall, the amniotic CNCs have a longer length distribution than the mammalian CNCs, consistent with previous results . Further details on the size distribution are in Table S1.
Assuming the Felsenstein 84 substitution model , we obtained maximum likelihood estimates of the average numbers of substitutions on each branch of the mammalian tree for each of our CNCs (see Methods; Table S3). All of our analyses assume the phylogenetic tree indicated in Figure 1 . Summing across all branches on the tree, the average number of substitutions per site is 0.16 for amniotic CNCs and 0.24 for mammalian CNCs. Notice that, as might be expected, amniotic CNCs show lower overall substitution rates than mammalian CNCs. We estimate that the average substitution rates of our amniotic CNCs and mammalian CNCs are ~20% and ~29% of the neutral rate (based on comparison to local unconserved sequences), respectively. Overall, our CNCs are more conserved on average than the original set of “most conserved” regions identified by Siepel et al. , which averaged ~33% of the unconserved rate. This difference indicates that our filtering process preferentially retains more highly conserved elements.
We also examined the location of CNCs with respect to nearby genes. For each CNC, we computed the distance to the nearest gene without considering gene orientation. Thirty-seven percent of the mammalian CNCs are in introns, and the remainder are intergenic. Among intergenic CNCs, 10% are within 10 kb of a gene, 27% are between 10 kb and 100 kb, and 26% are greater than 100 kb from any gene. The amniotic CNCs have a similar overall distribution in the genome, although they are significantly more clustered (Table S2; Figure S2). Overall, we find that CNCs are distributed across the genome approximately at random with respect to the locations of nearby genes (Table S2), as noted previously .
Analysis of the relationship of CNCs with PANTHER gene ontology (GO) categories  shows that genes related to developmental processes are significantly enriched near CNCs (1.5-fold enrichment, p < 10−21), as seen previously [9,10,13] (see Methods; Tables S4 and S5). The genes in the “signal transduction” and “nucleoside, nucleotide and nucleic acid metabolism” categories are enriched near mammalian CNCs and amniotic CNCs, respectively (1.2-fold enrichment, p < 10−11; and 1.3-fold enrichment, p < 10−7). Olfaction genes are ~15-fold underrepresented in our dataset, presumably because olfactory genes tend to be highly duplicated and our filtering process removes duplicated CNCs.
To identify CNCs that have been targets of selection, we introduce a likelihood ratio test that we call the “Shared Rates Test” (SRT). Under the null model, the divergence times of lineages are shared across CNCs, but each CNC may evolve faster or slower according to its local mutation rate and level of evolutionary constraint. For each CNC, we test whether any branches are surprisingly long or short compared to the others, indicating speed-ups or slow-downs of the substitution rate. For example, in Figure 1, the first two trees evolve at different rates, but with the same tree “shape” (i.e., the ratios of branch lengths are the same). In contrast, the third tree has a longer-than-expected branch on the human lineage, suggesting the action of natural selection.
In our model, each branch of the mammalian tree has a branch-length parameter vb, defined as the average number of substitutions per site on branch b for CNCs evolving under a constant level of constraint. (Here, vb is defined as the average number of substitutions per site on branch b across all CNCs.) In addition, under the null hypothesis, each CNC is associated with a single rate parameter r0(h) (where h indicates a particular CNC). Then the number of substitutions that occur in CNC h, on branch b has an expectation at each site of Nb,h, where
Under the null model, there are seven branch length parameters for the tree that we consider, and one additional rate parameter for each CNC. As described in the Methods and Text S1, we obtain a joint maximum likelihood estimate for all the parameters, assuming the Felsenstein 84 model of sequence evolution .
Our model is designed so that all CNCs have the same expected tree shape (i.e., the ratios of expected branch lengths are the same). However the total size of the tree is allowed to vary according to r0(h), in order to reflect variation in mutation rates and the level of selective constraint across CNCs. In addition, we place no constraints on the relative values of the vb, so that lineage-specific variation in mutation rates (such as the higher substitution rate in rodents) is reflected in longer estimates for those branch lengths (Figures 1 and S1). In summary, the null model allows mutation rates and levels of constraint to vary across CNCs, and it allows for the property that broad-scale mutation rates may vary across lineages.
In addition to the basic null model, we consider a family of alternative models that allow additional rate parameters for particular CNCs. In the simplest alternative, a single branch on the tree evolves at a rate that is different from the background rate shared by the remaining lineages (as for the third tree in Figure 1). In the extreme alternative, each of the seven branches evolves with its own rate ri(h), giving a total of seven rate parameters for the CNC in question. (For simplicity of notation, we will henceforth drop the notation h on the rate parameters.) In the extreme case, to test the hypotheses H0: r1 = r2 = ···= r7(= r0 ) versus HA: r1 ≠ r2 ≠ ···≠ r7 at a particular CNC, we compute the SRT as
where L is the likelihood of the sequence data for the five mammalian species, maximized with respect to the rate parameters, and with the fixed estimate of branch lengths parameters ( ) and the sequence evolution model. Large values of the SRT indicate a substantially better fit of the alternative than the null model. Another example of alternative model is the case in which branches 2 and 3 have distinct rates r2 and r3, while the other branches have a single “background” rate r0, −2, −3. In this case, to test the hypotheses H0: r1 = r2 = ···= r7(= r0 ) versus HA: r2 ≠ r3 ≠ r1 = r4 = ···= r7(= r0, −2, −3), we can compute the likelihood ratio statistic as
In this paper, we perform two kinds of analyses. One analysis performs model selection using the SRT, while the other tests for individual branches with rate changes. When testing for a rate change on the ith branch only, it is convenient to transform the likelihood ratio statistic as follows. In this case, we will use special notation, denoted by SRTi:
where sign(x) = 1 if x > 0 and otherwise sign(x)= −1. Rewriting the SRT in this way provides the convenient property that SRTi > 0 implies that ri is larger than the background rate r0, −i, and hence branch i shows a rate speed-up relative to the rest of the tree; conversely, SRTi < 0 implies a slow-down on branch i. As a convention, when we subscript SRT by a character or number, it will represent the signed likelihood ratio statistic testing for rate changes on the indicated branch. Otherwise, the notation SRT without subscripts will be used to indicate use of an unsigned test statistic, in the form of Equations 2 and 3.
Our SRT is a likelihood ratio test and, as such, standard theory suggests that under the null hypothesis the test statistic should asymptotically follow a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the constrained (null) and less-constrained (alternative) models. Similarly, the signed root of this statistic for a one-dimensional parameter of interest is asymptotically standard normal. Therefore, when the null hypothesis is true and the number of sites in a CNC is large enough, the unsigned SRT might be expected to follow the chi-square distribution with the degrees of freedom equal to the difference in the number of rate parameters between the two models. For example there are six degrees of freedom in the global test (Equation 2) and two degrees of freedom in the example in Equation 3. Similarly, under the null, the signed test SRTi is constructed to have a standard normal distribution as the CNC size goes to infinity. Our simulation studies show that the asymptotic theory is reasonably accurate for both versions of the test statistic, except in the cases in which the lineages tested for selection are relatively short and are expected to accumulate few substitutions (namely, the human and chimpanzee lineages; Figure S3). Hence, to reduce computational burden, we calculate p-values using the asymptotic chi-square or normal approximations, except for tests on the human and chimpanzee branches for which, except where stated, we compute p-values based on the empirical null distribution in simulated data (see Methods).
An additional consideration is that we do not want the estimated null branch lengths (vb) to be heavily influenced by outlier CNCs with evidence for selection. To mitigate the impact of such CNCs, we first identify CNCs with clear overall departures from the null model (SRT > 25 in the global six degrees of freedom test, corresponding to p < 0.00034), and then reestimate the branch lengths after dropping those nonneutral CNCs, which represent 2.8% and 3.8% of the total mammalian and amniotic CNCs, respectively.
In summary, then, our analysis performs the following steps: (1) Estimate maximum likelihood branch lengths and rates under the null; (2) identify outlier CNCs that have SRT > 25 comparing the seven- and one-parameter models; (3) drop outlier CNCs and recalculate the null branch lengths and rates; and (4) compute the shared rates test statistics for each CNC according to a range of alternative models.
For reasons discussed below, in practice these analyses were performed in a sliding window of 50 consecutive CNCs, as defined by position in the human physical map. All analyses considered the mammalian and amniotic CNCs separately.
It is well established that the extent of divergence among mammalian species varies substantially across large genomic regions [33–38]. For example, Gaffney and Keightley  showed that divergence between the mouse and rat genomes varied between and within chromosomes. While the causes and the scales of this type of variation are not completely understood, it has been shown that divergence correlates with various genomic features, including GC and CpG content, simple-repeat structures, and recombination rate, suggesting that these genomic features drive variation in mutation rates [35,37].
Variation in mutation rates or levels of CNC conservation across genomic regions should not be problematic for our method, provided that the substitution rate in any given region maintains a constant ratio to the average across the mammalian phylogeny. If a CNC is in a region with a higher, or lower, mutation rate than average, this effect should simply be absorbed into the rate parameter that we estimate for each CNC as part of our null model. However, if mutation rate variation is not stable across the phylogeny, this might produce false signals for our method.
Therefore, we looked at whether the average tree shapes are significantly variable across chromosomes (according to the human physical map) as well as within chromosomes. We found that in fact there is nontrivial variation in tree shape, both at the chromosome level, and across genomic regions within chromosomes. For example, within Chromosome 2 there is a highly significant autocorrelation in the fraction of the tree occupied by the mouse lineage (Figure 2). This result implies that local variation in large-scale mutation rates is not conserved across evolutionary time; for example, genomic regions that evolve faster than average on some lineages may evolve slower than average elsewhere on the tree.
If average tree shapes were constant across the genome, we could use CNCs from across the genome to estimate the tree shape for our null model. However, the observation that tree shape is not constant suggests that instead our model should allow for variation in tree shape across the genome. After some experimentation, we settled on using a sliding window of 50 consecutive CNCs to estimate the tree shape. That is, we test each CNC for significant departures from the tree shape in a 50-CNC window that, in the human physical map, is centered near the CNC in question (see Methods). On average, this window size corresponds to 525 kb and 1.3 Mb (median) for mammalian CNCs and amniotic CNCs, respectively.
Overall, we find that using the sliding window method produces only a modest impact on the rate of significant CNCs, but it should improve our inferences by taking into account the local variation in tree shapes (Figures 2 and S4). An obvious concern about using a sliding window based on the locations of CNCs in humans is that due to chromosomal rearrangements, CNCs that are close together in humans may not be close together in other mammals. Consequently, a sliding window based on the human map might not provide a suitable correction. Fortunately, our window size is relatively small compared to the typical size of syntenic blocks [8,39] and in Figure 3, we show that the results of tests on the human lineage are highly concordant whether we use windows based on the human or mouse physical maps and, indeed, are only modestly different from the results using all CNCs together. Consequently, all subsequent results use 50-CNC windows based on the human map.
Another plausible concern about our model stems from the prediction that selection against weakly deleterious mutations is more efficient in species with large populations than in small populations. This means that weakly constrained sites in CNCs are likely to evolve more quickly in primates than in rodents (which have larger effective population sizes). This effect has been observed in a comparison between the evolutionary rates of CNCs and putatively neutral flanking sequences . Hence—in contrast to our null model—one might expect the overall tree shape for a CNC to depend on its level of selective constraint.
To investigate this issue, we classified CNCs into four different levels of conservation, according to their substitution rates on the dog lineage. We then separately compared the average human-to-chimpanzee divergence against the average mouse-to-rat divergence, within each of the four conservation levels (Table S15). We find that that as the level of constraint increases, the divergence in rodents indeed decreases faster than divergence in hominids, consistent with the results of Keightley et al. . However, we find that the variation across CNCs is relatively small (less than 11% change across different classes of CNCs) and much less than when CNCs are compared to neutral sequences (Table S3). As shown below, we do not have the power to detect such small variations in tree shape at individual CNCs, so we conclude that it is not necessary to control for overall conservation level more carefully for the current study.
For each CNC, we calculated SRTi for each of the seven branches of the mammalian tree to identify CNCs that have experienced a speed-up or slow-down on a particular branch. Figure 4A shows the histogram of p-values on the mouse lineage (SRTm) for the mammalian CNCs. The p-values are defined as P( SRTi > srti ) where srti is the observed value. Hence, p-values near 0 indicate increased rates, and near 1 indicate decreased rates. The histogram is flat for intermediate p-values with peaks at both ends, suggesting that most CNCs fit the null distribution of SRTm, but with a substantial number of outliers. At the significance level of 0.001, 1027 (1.2%) and 503 (0.6%) mammalian CNCs show speed-ups and slow-downs, respectively. Among amniotic CNCs, 228 (1.4%) and 106 (0.6%) show speed-ups and slow-downs, respectively on the mouse lineage.
Figure 4B plots the expected and observed branch lengths on the mouse lineage for the CNCs that are significant at p < 0.001 in each tail. (Similar plots for other lineages are shown in Figure S5.) The red points above the diagonal indicate CNCs with rate speed-ups. For the central 95% of the significantly fast-evolving CNCs, the observed branch lengths are between 0.04 to 0.13 substitutions per site, and are 2–4-fold higher than the expected branch lengths. The blue points below the diagonal are CNCs with reduced branch lengths. Nearly half of these CNCs accumulated no substitutions on the mouse lineage.
The other long lineages show similar p-value histograms though with some variability in the proportion of significant CNCs. The dog lineage is the most enriched for signals, with 2.3% and 1.9% of mammalian CNCs showing speed-ups and slow-downs, respectively, at p < 0.001 (in each tail). Even after a stringent Bonferroni correction, 186 and 46 CNCs, respectively, are still significant at p = 0.001 in the dog lineage. The overall results for amniotic CNCs are similar, but the fraction of significant results is slightly higher on each branch (Table S8). For most lineages, our significance threshold (one-sided p-value < 0.001 on each end) corresponds to a genome-wide false discovery rate (FDR) between 0.05 and 0.1 (Table S9).
Since the distribution of SRTi on the human and chimpanzee lineages does not follow the standard asymptotic distribution, we simulated data under the null over a range of substitution rates that cover the observed range over all 50-CNC windows (see Methods). We account for heterogeneity in the distribution of SRTi across bins of CNCs with different numbers of expected substitutions on the tested lineage by computing p-values based on the empirical null distribution of SRTi constructed in each bin (unpublished data). At a significance level of 0.001, 256 mammalian CNCs and 59 amniotic CNCs, respectively, show rate speed-ups on the human lineage (Table S8). Note that there is little power to detect rate reductions on these very short lineages.
To better understand these SRTi results, we performed power simulations under a range of models. The simulation results, summarized in Figure S6, show considerably greater power to detect speed-ups than slow-downs on all lineages, consistent with the results of Siepel et al. . Thus, the fact that we detect more speed-ups than slow-downs does not necessarily imply that speed-ups are actually more common, and it is likely that many slow-down events are simply not detected by our analysis.
Our human results allow a comparison to the human accelerated regions (HARs) identified by Pollard et al.  using a similar type of approach, based on regions that were highly conserved (at least 96% identity) across chimpanzee, mouse, and rat. Among the top 49 HARs, which include two coding regions, 34 overlap with CNCs in our dataset; however, generally, the HARs are considerably shorter and more conserved and lie within our CNCs. Perhaps not surprisingly, since the HARs are the top genome-wide hits in their data, the signals in our overlapping CNCs tend to be weaker. Among the 34 CNCs, just five CNCs are significant in our analysis at a genome-wide FDR less than 0.05. Nonetheless, our CNCs that overlap HARs do show a strong enrichment of modest signals. Our human lineage p-values are <0.01 for 26 of the 34 CNCs overlapping HARs, and are <0.1 for 33 of the 34 (Table S10).
Within our dataset, one of the most significant CNCs on the human lineage is a 144-bp amniotic CNC located on human Chromosome 21 starting at 33481809 (q22.11, NCBI Build 35). It was not detected by Pollard et al.  because it fails their filtering threshold for similarity between chimpanzee, mouse, and rat. As illustrated in Figure 5, the posterior expected number of substitutions (see Methods for details) on the human lineage is 5.2, which is 26-fold higher than the value of 0.2 expected under the null model. The corresponding SRTh is 4.84. The p-value for this CNC is so small that it is difficult to evaluate by simulation; however, the standard normal approximation suggests that p ≈ 6 × 10−7 (our simulations indicate that this is conservative). In addition to the five nucleotide substitutions, there is also a 2-bp insertion on the human lineage that was not included in the statistical inference. Since the UCSC genome browser database was recently updated, we were able to inspect an alignment of 17 vertebrate species for this region. Manual inspection confirmed that all six of these substitutions occurred on the human lineage.
The function of this CNC is unclear but the two nearest genes are C21orf54, 17 kb upstream, and IFNAR2, 42 kb downstream of the CNC. Not much is known about C21orf54, but IFNAR2 codes for a type I membrane protein that forms one of the two chains of a receptor for interferons alpha and beta . This CNC is strongly conserved among the other mammalian species and chicken but does not appear to be present in the fugu genome. In addition to the rapid evolution on the human lineage, there is weak evidence for slower evolution of this CNC on the mouse and dog lineages (one-sided p-values = 0.011 and 0.023, respectively; see Figure 5B).
Thus far, we have focused on the simplest class of alternative models, in which a CNC changes substitution rate on a single branch only and has a constant background rate elsewhere on the tree. We now extend this approach in order to classify each CNC according to a family of more complicated models of evolutionary patterns.
Our data are connected by a tree containing seven branches. The simplest model (our “null”) has a single rate parameter, and the most complicated alternative model has seven different rate parameters. In between, there are 876 ways of partitioning the seven branches into two or more different substitution rate groups. However, considering all of these partitions does not seem biologically meaningful or necessary, and here we focus on a reduced set of 126 alternative candidate models.
The alternative models we consider can be divided into two distinct classes of models. In one class of models, each tree is assumed to have a “background” rate parameter. Then, each CNC may have between one and six “selected” lineages, and each selected lineage evolves at its own rate. In the other class of models, each tree may be split into subtrees that share a single rate, while the rest of the tree has a single background rate (for full details, see Table S11).
We use a modified Akaike Information Criterion (AIC) procedure to classify each CNC into its best model. In brief, the method attempts to account for multiplicities of alternative models as well as the number of estimable parameters in each model (see Methods). We have performed simulations to test the performance of this method, and we find that it provides suitable control over the rate of “false positives” (i.e., accepting models with more parameters than used to simulate the data). That said, our simulations show that it is often difficult to correctly classify complex models with multiple rate changes (see Methods; Figure S7).
The results of our data analysis are summarized in Figure 6 and Table S12. We estimate that ~68% (54,643/81,957) of the mammalian CNCs evolve at a single rate. The remaining nonneutral CNCs show rate changes on at least one lineage. The number of CNCs assigned to each model category decreases with increasing model complexity. Among the 32% of CNCs with more than one rate, ~75% (20,420/27,314) exhibit rate changes on a single lineage but not on the remaining lineages and ~9% (2,419/27,314) exhibit rate changes on the primate or the rodent lineage that are inherited across all branches below. For the two-parameter models, the rate change events are easily classified as speed-ups or slow-downs. Counts for both types of event are shown in Figure 6B. For most lineages, there are slightly more speed-up events than slow-downs (~55% versus ~45%). However, there are 638 and 530 CNCs that show rate speed-ups on the human and chimpanzee lineages, respectively, far more than the four and eight CNCs, respectively, showing slow-downs. Presumably, these results are due in large part to the greater power to detect speed-ups, as well as differences in power across lineages (Figure S6).
It is notable that the dog lineage shows a very large number of rate changes, which may not be fully explained by the long length of this lineage (second longest among the seven). Since there is no strong tendency towards an excess of speed-ups over slow-downs on this lineage, it is unlikely that this can be explained by occasional CNCs with low-quality dog sequence. Perhaps a hint is that we have observed greater variation in the dog-lineage substitution rates at neutral sites than on other lineages. Perhaps there is greater fine-scale variation on the dog lineage that is not well captured by our 50-CNC window method (see Methods; Figure S8).
As discussed above, we have identified many CNCs with significantly accelerated rates on one or more branches. However, it is unclear a priori whether these speed-ups reflect positive adaptation or relaxation of functional constraint. In order to address this issue, we estimated substitution rates in unconserved sequences near each CNC to estimate local neutral rates (see Methods). We then determined how many of the CNCs showing rate speed-ups have an accelerated rate that actually exceeds the corresponding lineage-specific neutral rate. If the rate in a CNC actually exceeds the local neutral rate, this is strong evidence for adaptive evolution. However, a negative result here is difficult to interpret, since adaptive evolution in an otherwise slow-evolving sequence may not necessarily bring the total rate above the neutral background rate.
Our results are summarized in Table 1. We observe that most CNCs showing accelerations on the human and chimpanzee branches indeed have rate estimates exceeding the neutral rates; of these, more than half are actually significantly faster than the neutral rate at p < 0.05. Meanwhile, the other branches of the mammalian tree all show smaller fractions of CNCs with rates that exceed the neutral rate, and very few of these are significantly faster than the neutral rate. One plausible explanation might be that if there is sufficiently rapid evolution on a long branch, this might cause an otherwise conserved element not to be classified as a “most conserved” region by the HMM . However, some simple calculations suggest that this is likely to be a modest effect in practice. Moreover, we see the same effect for both the mammalian and amniotic CNCs (Table 1), even though the HMM data for the latter include the relatively long branch to chicken, and should therefore be much less susceptible to this effect.
Instead, to explain these observations, we hypothesize that the rate speed-ups that we detect may often reflect rapid bursts of adaptation in which a CNC accumulates a series of sequence changes, thus modifying its function. A single burst of adaptation may produce enough sequence changes to exceed the neutral rate on a short branch, but not on a longer branch. In this model, we would have the most power to detect adaptive events on short branches. Our data argue strongly against a model in which a CNC adapts continuously over extended periods of evolutionary time, as such a model should also produce signals on the long branches.
We have also performed analyses of the locations of CNCs showing branch-specific speed-ups, with respect to nearby genes. A recent report by Drake et al.  found that the frequency spectrum in CNCs is most skewed towards rare variants (indicating weak purifying selection) in introns and near genes, and is less skewed in CNCs that are far from genes.
To test whether CNCs showing speed-ups on particular branches occur at higher rates near to or far from genes, we divided all our CNCs into four classes: intronic, within 10 kb of a gene, between 10 kb and 100 kb, and greater than 100 kb from any gene. We found that on the mouse and rat lineages, CNCs showing speed-ups (p < 0.001 on the branch-specific test SRTi) occur at higher rates in introns and within 10 kb of genes than among CNCs further from genes. However, this trend was not replicated on the other lineages of the tree (Table S13).
We next looked at whether CNCs showing significant rate speed-ups are more likely to be in the proximity of particular kinds of genes , using the PANTHER GO database . A significant difficulty in this sort of analysis is that even for those CNCs that act as cis-regulators, it is unknown which of the nearby genes is being regulated. However, as a rather imperfect proxy for this we simply used, for each CNC, the nearest gene (in either orientation). For each branch of the mammalian tree, we divided the CNCs into those with increased rate on that branch (by AIC) and used CNCs evolving under the null model as “neutral” controls. We looked at whether particular biological process categories were enriched among the nearest genes of the selected CNCs compared to the neutral CNCs.
For mammalian CNCs, there is significant enrichment of the process categories “amino acid activation” and “other coenzyme and prosthetic group metabolism” on the dog and the lineage leading to the common ancestor of mouse and rat (rodent lineage), respectively, at p < 0.05 after Bonferroni adjustment. We also tested whether any categories show repeated evidence for enrichment on different branches of the tree. For mammalian CNCs, the “sensory perception” category appears in the top ten enriched biological processes for three out of the seven lineages. However, in summary, we view these GO associations as rather tentative, since none of them is highly significant or highly repeatable across branches of the tree. Complete results from this analysis are presented in Tables S6 and S7.
Our paper presents a new approach to studying the evolutionary patterns of CNCs. We find that a large fraction of CNCs (~32%) do not fit a simple model of evolution with a consistent substitution pattern across the mammalian tree. Among those CNCs that do not fit our null model, ~75% show changes in evolutionary rate on a single branch of the mammalian tree, while the remainder have more complex substitution patterns. In many cases—particularly on the short branches of the phylogeny—CNCs with rate accelerations on a particular branch significantly exceed the neutral rate on that branch, suggesting that the changes are driven by adaptive evolution. The less extreme speed-ups may be due to either adaptation or a relaxation of selective constraint; however, we suggest that much of our signal on the longer branches may be due to short bursts of adaptation that do not generate enough changes to exceed the total neutral rate on a long branch.
A very recent paper by Galtier and Duret  argues that many of the recently reported HARs  are likely the result of biased gene conversion (BGC). One of the main characteristics of BGC is an excess of AT → GC transitions. In some of our CNCs showing accelerations on the human lineage, we also observe this transition bias, which seems to be larger with increased acceleration signals. However, for most fast-evolving CNCs, the numbers of AT → GC changes roughly match the distribution expected based on the overall distribution across random CNCs (Figure S9). In summary, these data suggest that some of the fast-evolving CNCs may in fact be due to BGC, however, that most fast-evolving CNCs do not show the signal expected for BGC.
Overall, our results imply that either the levels of functional constraint or the functional roles of CNCs are reasonably changeable across the timespan of mammalian evolution. Although it lies beyond the scope of this paper, it will be of interest to use experimental approaches to probe the functional significance of the many CNCs that we have identified as having had bursts of rapid evolution [5,25].
Of course, in this type of study, there are inevitably features of the real data that are not fully accounted for in the models. We believe that our results should be reasonably robust to these issues, however, as follows. One natural concern is that our CNC alignments might occasionally align paralogs. This is a serious concern in principle; however, we have aimed to aggressively filter out CNCs with related paralogs to minimize this effect, in addition to making use of global alignments. Other model departures might inflate the variance of branch-specific substitution rates. These include the possibility of fine-scale, branch-specific changes in mutation rate, as well as variation in the branch lengths of the human and chimpanzee branches due to coalescent time variation . On the whole these effects are likely to be fairly modest, since the observed rate changes are usually not significant unless they are quite dramatic (significant rate changes are usually ~2–4-fold on the mouse lineage, and larger on the shorter branches). For this reason, the analysis that uses a single global tree shape produces fairly similar overall results to the window-based analysis, despite evidence that the window-based analysis fits the data better (Figure S4). A related concern is that due to variation across lineages in effective population size, the evolutionary rates of CNCs with different levels of constraint might not scale linearly across the trees . However, our data show that this is a modest effect relative to the size of change needed to produce a significant rate change in a CNC (Table S15).
In this study, we aimed to classify CNCs according to their evolutionary patterns. To do so, we used a modified version of the AIC to find the model that best describes the pattern of evolution of each CNC. In order to reduce the space of alternative models, we restrict our alternatives in two classes of models. As we obtain genome sequences for increasingly more species, it will be worth revisiting these models, as we will be better able to distinguish among different modes of evolution . In particular, two natural models for rate changes in a CNC are (1) that the CNC has a one-time change in evolutionary pattern (for example, a burst of adaptation to acquire a new function), or (2) that the CNC changes function or evolutionary constraint in a way that is inherited across all branches below. With larger numbers of taxa, it should be possible to gain better insight into the relevance of these two possible modes of evolution. More broadly, as we obtain increasing information about the functions of CNCs, we will increasingly be able to interpret the biological relevance of the patterns of rate changes detected here.
From the UCSC genome browser , we downloaded the genome-wide multiple alignment of eight vertebrate species: human, hg17 (May 2004); chimpanzee, panTro1 (November 2003); dog, canFam1 (July 2004); mouse, mm5 (May 2004); rat, rn3 (June 2003); chicken, galGal2 (February 2004); fugu, fr1 (August 2002); and zebrafish, danRev1 (November 2003). We also downloaded the annotation of “most conserved” regions defined on the same multiple alignment by a phylogenetic HMM in June 2005 . The most conserved regions are defined without regard to whether the sequence is coding or noncoding, and cover around 4.3% of the human genome. To define CNCs, we first extracted those conserved regions from the multiple alignment and then processed them by removing coding regions (exons in the “known gene” annotation, UCSC genome browser), repetitive sequences (marked by lower case letters in the alignment), and sites that are gaps or missing data in any of the five mammalian genome sequences. Conserved regions of less than 100 bp after the processing were discarded. The remaining 231,285 regions out of the initial 1,451,896 most conserved regions comprised our raw dataset of CNCs and spanned ~48 Mb.
We used BLAT  to exclude spuriously aligned CNCs. We restricted our data to unique CNCs in which the human version of a CNC does not find any similar sequence (>50% sequence identity) elsewhere on the human genome. This resulted in discarding 24,234 CNCs (~5.4 Mb). Furthermore, we required that each nonhuman mammalian verion of a CNC find the human version as the best match when it is BLATed against the whole human genome. This resulted in discarding an additional 74,359 CNCs (~10.7 Mb).
Our statistical inferences are based on alignments of the five mammalian sequences. However, we used the aligned chicken and fugu sequences to classify CNCs into different conservation level groups, of which we analyzed the two largest, denoted as “mammalian” and “amniotic” CNCs. Roughly speaking, a CNC was classified into the mammalian group if it is conserved across the mammalian genomes but not chicken or fugu, and into the amniotic group if it is conserved among the mammals and chicken but not fugu. The classification depended on (1) the presence or absence of aligned chicken and fugu sequences, and (2) the mean identity between mammals and chicken and fugu. The details are given in Text S1.
To examine the scale of local variation in tree shape, we estimated tree shapes over a chosen set of window sizes of 10, 30, 50, or 100 consecutive CNCs (ordered according to the human genome position). Since the scale of local variation seems to vary across chromosomes, there is no clear boundary explaining the rate of decay of autocorrelations. Nonetheless, incorporating such variation into our model is important, since otherwise, regions that show a general pattern of evolution that departs from the shared pattern from all CNCs might produce clusters of spurious signals. After several trials, we decided to estimate tree shape using a sliding window of 50 CNCs with an overlap of 34 CNCs between successive windows. With this window size, we obtained enough data to stably estimate tree shapes, but were also able to capture much of the local variation. The one third of CNCs located in the center of each window use the estimated tree shape from that window. In order to reduce the effect of outliers (defined as having SRT > 25 with degrees of freedom of six), we estimate branch lengths in each window, drop outliers, and then reestimate the divergence times after dropping those nonneutral CNCs. Through this procedure, we expect that our estimates are robust in the presence of outlier CNCs. However, when rate changes are spatially clustered, our locally estimated tree shapes may absorb some of the signal of variable rates, hence potentially reducing power.
Our preliminary analysis of classifying CNCs using the modified AIC showed that the number of CNCs with signals on the chimpanzee lineage was 48% larger than on the human lineage. Closer examination indicated that often CNCs with low-quality chimpanzee sequence (PanTro1) produced a large signal of rate changes on the chimpanzee lineage, since miscalled bases would appear as mutations. Therefore, we dropped any CNC that is classified by AIC into the group showing rate changes on the chimpanzee lineage but that has low-quality chimpanzee sequence.
To identify those CNCs, the chimpanzee sequence in each CNC was BLATed to the chimpanzee genome (PanTro1). The best match position (according to the BLAT score) was found when it was available. Then, in the target region, we counted the number of sites that have low quality score (≤20). If this count was larger than 15, we considered the CNC to have low-quality chimpanzee data. A total of 378 mammalian and 89 amniotic CNCs that were significant on the chimpanzee lineage were dropped for this reason.
The impact of occasional sequence errors is likely to be much smaller for the other species. The human genome sequence has very high accuracy (the estimated error rate is one site per 100 kb, much lower than the human polymorphism rate ). Meanwhile, occasional sequence errors in the other species should have only a small effect due to the much longer branches leading to those taxa.
We estimated branch lengths for an alignment using the Felsenstein 84 sequence evolution model and using the empirical base frequencies. To make computation feasible, the “peeling” algorithm  was used with the assumption that sites evolve independently and that given their common ancestor, branches evolve independently. Details of the evolution model and the “peeling” algorithm were described by Felsenstein and Churchill . Note that there are many more general evolutionary models, but the Felsenstein 84 model, which is essentially the same as the HKY85 model , seems to be sufficient for the purposes of our study .
Under the null hypothesis, our parameters are a set of seven branch lengths shared by all CNCs, and one additional local substitution rate for each CNC. Under an alternative, our parameters are a set of lineage-specific rates that explain a specific scenario for each CNC. Rather than maximizing the likelihood directly, we developed an expectation-maximization algorithm (EM) that efficiently maximizes many parameters jointly under the null model. The details are given in Text S1, but essentially, in our EM algorithm, each branch length is updated sequentially by computing the posterior number of substitutions on each branch and updating the related parameters accordingly. We find that our EM algorithm is stable to choices of initial starting points. The estimates that we obtain for simple models match well with those computed by Phylip  and PAML .
There are many possible models of CNC evolution, ranging from the simplest case, where there is a single rate across the entire tree, to the most extreme case, where each lineage evolves with its own rate. Here, we address how to classify CNCs according to their evolutionary patterns.
Each of the possible alternative models corresponds to a partition of the seven lineages into two or more blocks of substitution rates. There are 876 ways of partitioning the seven branches into two or more different substitution rate groups. However, to reduce the space of possible models, we restrict ourselves to a subset of 127 candidate models that seem biologically most natural. Our main class of alternative models consists of the models where there are k selected branches (1 ≤ k ≤ 6), each with its own rate parameter, while the remaining branches share a single background rate parameter. Such models have k + 1 parameters, and there are such models for k = 1,…,5 and one additional model for k = 6. This accounts for 121 candidate models.
In addition, we also consider a further set of six models that seem biologically natural, that split the branches on an unrooted tree into two or three rate groups using an internal branch (connecting two internal nodes). Thus, for example, we might hypothesize a single rate-changing event in the ancestor of mouse and rat that leads to a single altered rate on both the mouse and rat branches. To reduce the model space complexity, we assume that such rate change events occur at internal nodes on the tree. These six models are summarized in Table S11.
Since there are many possible models, correct classification of the CNCs is likely to be difficult. Here, we view the classification as a multiple testing problem rather than a model selection problem, where our first goal is to control the rate of over-estimating the number of model parameters. The scheme below, though ad hoc, provides a reasonable compromise in providing fairly good model choice while not having excessive rates of “false positives.”
For each CNC, we select the model that, among the 127 candidate models produces the highest value of the penalized likelihood, which is log(L) − (k + 1) + log((7 − k − 1)!} for our main class of alternative models, where L is the maximum likelihood and k is the number of selected lineages. The first penalization term (k + 1) penalizes for the number of estimated parameters and is introduced for the same reasoning as in the standard AIC. The last term (log((7 − k − 1)!}) aims to account for the multiplicity of different models within each level. This latter term was suggested previously as a prior weight for Bayesian classification in an analogous setting . This term is motivated by thinking of each model as corresponding to a partition of the seven branches into one or more blocks of substitution rate groups; as a natural choice of partition distribution we use the Ewens sampling distribution  with concentration parameter λ of 1 (see Text S1).
To evaluate the performance of the classification, we simulated data under the full range of null and alternative models. We used these simulations to compare among three possible choices of penalty functions: (1) using only the number of parameters (AIC), (2) using only the Ewens prior (Ewens), and (3) using both the number of parameters and Ewens prior (AIC + Ewens), as detailed above. The penalty function computed for each model is summarized in Table S14 and the power simulation results are shown in Figure S7. The AIC + Ewens penalization provides the best control against over-fitting and that is what we use for our data analysis.
There are 6,037 mammalian and 1,497 amniotic CNCs that show branch-specific accelerations on at least one lineage at significance level of 0.001 (based on the asymptotic distribution of SRTi). To see if these CNCs actually exceed the neutral rate on a particular branch showing speed-ups, we estimate the local “neutral” tree near each of those CNCs. Specifically, we take the surrounding 10 kb with each such CNC at the center, then exclude “most conserved” regions as well as exons to construct putatively neutral local regions. The genome-wide average of each branch length on trees estimated from those regions (shown in Table S3) is very similar to that from CFTR nonexonic DNA region in Cooper et al. . It is notable that the variation in neutral branch lengths of the dog lineage is considerably larger than that on other lineages (Figure S8).
To test whether the accelerated rate exceeds the neutral rate on a particular branch i, we compute a likelihood ratio statistic testing the hypothesis H0: r0, −i, ri = ri, neutral versus HA:r0, −i, ri > ri, neutral only for those CNCs that have a substitution rate on the tested lineage ri that exceeds the neutral rate ri, neutral in the surrounding region. Based on the chi-square distribution with one degree of freedom, we compute p-values and reject the null hypothesis if p-value < 0.05. Our results are summarized in Table 1.
We performed simulation studies with the Felsenstein 84 sequence evolution model using evolver13, implemented in PAML , to assess the distributions of the SRT and SRTi statistics under the null model and to evaluate p-values when the distribution is not well approximated by the asymptotic theory.
We simulated two sets of 1 million CNCs under the null, one set for the mammalian and one for the amniotic CNCs, matching the distribution of base frequencies, the CNC size, the variation in the local substitution rates, and the overall tree shape. Specifically, each of the 1 million simulated CNCs was based on the characteristics of a randomly sampled CNC in our dataset (sampling with replacement). We specified the branch lengths by rescaling the global tree estimated from all CNCs by the local substitution rate of the chosen CNC, and generated an alignment of five sequences of the corresponding size simulated on the specified tree.
For each set of simulated CNCs, we obtained the empirical null distributions of statistics testing for various alternative scenarios (the simplest and most extreme cases are shown in Figure S3). As mentioned earlier, the asymptotic theory works reasonably well, except when testing for rate changes on short branches (i.e., the human and chimpanzee lineages). We also grouped sets of CNCs into bins with similar CNC sizes and local substitution rates and examined empirical distributions for each bin separately (unpublished data). For each test statistic, the empirical distributions across bins were homogeneous, except for the cases in which the asymptotic theory does not work because of the small number of accumulated substitutions. For these cases, since the inhomogeneity across bins was mainly explained by differences in the expected number of substitutions on the tested lineage, we reconstructed bins of CNCs according to the number of expected substitutions and evaluated p-values within each bin separately.
We also simulated a number of CNC sets under various alternative scenarios to examine the performance of the modified AIC method. Specifically, for each k selected lineages (k = 0,…,6), we simulated 100,000 CNCs in which each CNC has k branches that evolve with their own rates. These rates were higher or lower than the background rate with 50% probability each. To incorporate the variation in strength of signals in real data, the rate of each selected branch was simulated by multiplying or dividing the background rate by a scale factor that is drawn from 1 + Γ(α, β) distribution with a scale parameter β = 1 and a shape parameter α = 1 (weak signals) or α = 2 (stronger signals).
We downloaded a reference assembly (seq_gene.md.gz) that corresponds to the human genome build (NCBI build 35) from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.35.1/mapview. We proceeded by extracting only reference genes (release 3 ) in autosomes only, and obtained 25,249 genes. We extracted the nearest gene for each CNC without considering gene orientation. Here the distance from a CNC to a gene is the minimum of distances from the middle of the CNC to either end of the gene.
The PANTHER GO database was downloaded from http://www.pantherdb.org/panther/prowler.jsp in March 2006. For each conservation group, we examined what kinds of biological process categories are enriched for being: (1) near CNCs in general and (2) near CNCs showing lineage-specific rate increases compared to near CNCs evolving under the null model. The nearest genes of CNCs were used for this analysis.
For (1), we compiled the list of all genes in the reference assembly and the list of genes near mammalian (or amniotic) CNCs. For each biological process category, we counted the number of genes in each list and compared them with a chi-square test. Within each list, genes are counted only once. For (2), CNCs were first classified by the AIC (in order to obtain disjoint categories of CNCs showing signals of speedups on each lineage). For each category, we counted genes near CNCs under the null and near CNCs in each of the seven selection groups that show rate speed-ups on a single lineage. In this case, however, individual genes were counted repeatedly each time they were the nearest neighbor of a relevant CNC. The reason for this is that multiple CNCs often have the same nearest neighbor. This effect is more pronounced in the null CNC group than in the selection groups. Consequently, if we count genes only once, then any biological functional category that is enriched near CNCs, in general, may be underrepresented in the null but overrepresented in each selection group. Since the numbers of selected CNCs are small for many gene categories, p-values were computed using Fisher's exact test. To account for multiple testing, the p-values were multiplied by the number of biological processes that were jointly tested.
We have prepared a datafile that contains the list of all CNCs and summarizes our analysis results. It includes genomic properties and test statistics, as well as the best evolutionary pattern of each CNC. It will be downloadable from http://pritch.bsd.uchicago.edu/data.html.
(91 KB EPS)
The red dashed line on each plot indicates the mean of the plotted total tree lengths. To aid visual comparison of the plots, 1,741 mammalian CNCs (equal to the number of amniotic CNCs) were randomly selected and plotted here. Overall, amniotic CNCs are more conserved and more clustered than mammalian CNCs.
(103 KB EPS)
The empirical null distributions are obtained from the simulated data that capture the variation in size and the local substitution rates of the mammalian CNCs. Note that the empirical null distributions of SRTh (human) and SRTc (chimpanzee) are bimodal since when testing for a rate changes on a short lineage, discreteness nature of the data has a big impact on the shape of the distribution. The histogram of SRTi for each of the seven lineages is overlapped with the density function of the standard normal distribution (dashed red line). The two vertical green bars in each plot indicate the 0.1% and 99.9% sample quantiles of SRTi. Note that the 0.1% and 99.9% quantiles of the standard normal distribution are −3.09 and 3.09, respectively. The bottom right plot is the histogram of SRT testing the alternative hypothesis in which each of the seven lineages evolves with its own rate. The dashed red line shows the density function of the chi-square distribution with degrees of freedom of six. The vertical green bar indicates the 99.9% sample quantile of SRT. Note that the corresponding theoretical quantile from the chi-square distribution is 22.46.
(46 KB EPS)
(A) Comparison of the SRT tests on the mouse lineage (SRTm) between those using the branch length parameters estimated from all amniotic CNCs (global tree) and those using estimates from nearby 50-CNCs (local trees). (B) Robustness of the SRTm to the definition of windows. For further details of the caption, see Figure 3
(802 KB EPS).
The rate speed-ups and slow-downs are indicated in red and blue, respectively. CNCs that are significant on the chimpanzee lineage but have low-quality chimpanzee sequences are removed from plot.
(73 KB EPS)
The tree for each CNC is simulated by taking the global tree estimated from all mammalian CNCs and multiplying certain branch length with rate changes (each line) by a scale factor (x-axis). The power is estimated as the fraction of being rejected at significance level of 0.001 among the 500 simulated CNCs. For all seven lineages, the p-values are computed based on asymptotic approximation.
(44 KB EPS)
In each plot, each line corresponds to the true number of the selected lineages (k) that evolve with its own rate. Each selected branch is multiplied or divided by a scale factor that is drawn from 1 + Γ(α, β) distribution with a scale parameter β = 1 and the shape parameter α = 1(A, B, and C), α = 2(D, E, and F), respectively. The number of parameters found by the AIC method is shown in the x-axis, and the fraction of detected CNCs among the 100,000 simulated CNCs under each scenario is shown in the y-axis.
(56 KB EPS)
(65 KB EPS)
A random CNC of size 3 kb is created by concatenating contiguously located CNCs to obtain a sufficiently large number of human-specific substitutions. We parsimoniously obtain the number of human-specific substitutions by counting those sites for which four non-human species share the same base which is different with the human one. Among those human-specific changes, we count A/T → G/C changes to compute the proportion. The plots are the histograms of those proportions in random CNCs (A, C, and E) and fast-evolving mammalian CNCs (B, D, and F) across sets of homogeneous numbers of human-specific substitutions: A and B, seven or eight; C and D, five or six; and E and F, three or four.
(15 KB EPS)
Overall, amniotic CNCs are longer than mammalian CNCs.
(27 KB DOC)
For each CNC, the nearest gene was found without considering gene orientation. Fractions of CNCs in intron and three intergenic regions (within 10 kb, between 10 kb and 100 kb, and greater than 100 kb from any gene) were computed within each group. The “Random” group indicates the expected distribution if CNCs were distributed completely at random in noncoding regions (estimated by sampling 2 million noncoding positions at random in the human genome).
(26 KB DOC)
The ratio of each branch length on the mammalian (or amniotic) tree to that on the neutral tree is shown beneath the length estimate. A branch length is the expected number of substitutions that occurs on each branch per site. Each branch is named using the species that it leads to. (The primate and rodent lineages are the lineages leading to the common ancestors of human and chimpanzee, and mouse and rat, respectively.)
(30 KB DOC)
For each CNC, the nearest gene was found without considering gene orientation. The first column (Key) shows the nested relationship between GO categories. For example, 22_8 has a level of 2 and is nested in 22, which has a level of 1. The second column (Obs) is the observed count of genes near CNCs in each category. The third column (Exp) is the expected count of genes near CNCs that are obtained based on the distribution of all human genes. The fourth column indicates if each GO category is over- or underrepresented. One-sided p-values are computed using a chi-square test and are corrected by the Bonferroni criterion (p*-values), using the number of the GO categories that are tested in each level.
(38 KB DOC)
For further details of the caption, see Table S4.
(33 KB DOC)
The one-sided p-values are computed using the Fisher's exact test and are corrected by Bonferroni criterion (p*-values). Any category that includes less than two genes near CNCs showing a rate speed-up is removed from the list.
(111 KB DOC)
For further details of the caption, see Table S6.
(127 KB DOC)
Fractions of those CNCs within each group are shown underneath the counts. The p-values are computed based on each of the asymptotic (asym) and empirical (empi) distribution. Note that CNCs that show rate changes on the chimpanzee lineage but have low-quality chimpanzee sequence are not included.
(54 KB DOC)
The p-values are computed using the squares of SRTi statistics, and are transformed into q-values using the R package “qvalue” (http://faculty.washington.edu/jstorey/qvalue/; ) using default settings of that program. At each fixed FDR, the corresponding one-sided p-values (for both tails) are found. Note that for human and chimpanzee, the empirical p-values are used to construct the distribution of p-values correctly under the null hypothesis.
(56 KB DOC)
Among the 49 HARs, 11 and 23 HARs are overlapped with our mammalian (M) and amniotic (A) CNCs, respectively. All of the HARs are embedded in our CNCs. The difference in size between each HAR and the corresponding CNC is shown in the fifth column (Diff). The p-values correspond to our SRTh statistics, and are computed based on the empirical null distribution obtained from the simulation study. The SRT tests the alternative hypothesis in which each of the seven lineages evolves with its own rate.
(93 KB DOC)
(30 KB DOC)
Note that CNCs that show rate changes on the chimpanzee lineage but have low-quality chimpanzee sequence are not counted.
(43 KB DOC)
The table shows the fractions of amniotic CNCs either within introns or at three different distances from genes, within each selected group (neutral, rate speed-ups on a single lineage by SRTi, or by AIC). For each CNC, the nearest gene was found without considering gene orientation. Each intergenic CNC was divided into three groups that are 1) within 10 kb, 2) between 10 kb and 100 kb, and 3) greater than 100 kb from any gene.
(37 KB DOC)
Each penalty is adjusted so that the null model (the number of parameters is one) has zero penalty.
(36 KB DOC)
Class 1 has the lowest substitution rate on the dog lineage and Class 4 the highest. Within each class, the ratio of the divergence in rodents (M-R) to the divergence in hominids (H-C) is computed. Notice that in contrast to the comparison between CNCs and neutral regions, shown in Table S3, there is only very slight variation in tree shape across different classes of CNCs. To avoid being affected by outliers, each divergence is estimated using only CNCs evolving under our null model (by AIC).
(32 KB DOC)
(58 KB DOC)
The National Center for Biotechnology Information (NCBI) Entrez (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi) gene accession numbers for the genes C21orf54 and IFNAR2 are GeneID:339629 and GeneID:3455, respectively.
We thank Peter McCullaugh, Dan Nicolae, Molly Przeworski, Adam Siepel, and members of the Pritchard lab as well as an anonymous reviewer for many helpful discussions or comments.
A previous version of this article appeared as an Early Online Release on July 18, 2007 (doi:10.1371/journal.pgen.0030147.eor).
Author contributions. The project was jointly conceived and developed by both authors. Data analysis was performed by SYK. JKP and SYK wrote the paper.
Funding. SYK was supported in part by National Science Foundation grant number DMS0305009 to Peter McCullaugh, and JKP by a grant from the Packard Foundation.
Competing interests. The authors have declared that no competing interests exist.