|Home | About | Journals | Submit | Contact Us | Français|
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in ~28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in ~12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. Thus, mRNA destabilization could be an important factor driving positive selection in nonsynonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
It is well established that nonsynonymous sites in protein-coding sequences are subject to purifying selection caused by constraints operating at the level of protein structure and function and that positive selection that, at least in mammals, affects a minority of genes and/or sites is an important force of adaptive evolution (Li 1997; Vallender and Lahn 2004; Bustamante et al. 2005; Nielsen et al. 2005). Synonymous (silent) sites are often used as a proxy for neutral evolution. Under this premise, the traditional gauge of selection in nonsynonymous sites is the ratio of nonsynonymous (Ka) over synonymous (Ks) substitutions. Ka/Ks <1 is thought to indicate purifying selection, whereas Ka/Ks >1 is construed as the signature of positive selection (Li 1997; Hurst 2002). However, the neutrality of synonymous substitutions is only a rough and not necessarily valid approximation; the extent, range, and underlying causes of selection in synonymous sites remain subjects of intense debate (Chamary, Parmley, and Hurst 2006). The results of several studies suggest that efficient translation (Ikemura 1985; Akashi and Eyre-Walker 1996; Eyre-Walker and Keightley 1999) and mRNA stability (Duan and Antezana 2003; Chamary and Hurst 2005; Shabalina, Ogurtsov, and Spiridonov 2006) are substantial forces of purifying selection in synonymous sites. It has also been shown that synonymous substitutions are under purifying selection in mammalian exonic splicing enhancer motifs (ESEs) (Yeo et al. 2004; Parmley, Chamary, and Hurst 2006) and in alternatively spliced exons (Xing and Lee 2005).
By contrast, to the best of our knowledge, positive selection in synonymous sites has not been reported. However, this possibility has been brought up in the course of analysis of the SPANX family of mammalian cancer-testis genes (Kouprina et al. 2004) and the insect cyclin A inhibitor gene (Avedisov et al. 2001) that are characterized by exceptionally high and comparable rates of evolution in both synonymous and nonsynonymous sites.
We were interested in addressing the problem at a fundamental level: is positive selection in synonymous positions a common phenomenon, and if so, what could be the underlying causes of such selection? We reasoned that, to investigate selection in synonymous sites, the substitution rate in intronic sequences (Ki) was a logical choice of the proxy for neutral evolution. A method for estimating the neutral rate using Ki has recently been reported (Hoffman and Birney 2007). In principle, at least, cases of negative selection in synonymous sites were identified as Ks/Ki <1 whereas cases of positive selection were indicated by Ks/Ki >1. No part of the genome can be automatically assumed to evolve neutrally: the possibility of a hidden function that constrains evolution or an adaptive component in the evolution of a sequence always should be considered. However, apart from pseudogenes, internal regions of introns are among the best candidates for neutrally evolving sequences. The sequences of ~30 nucleotides at each end of an intron are thought to be subject to weak purifying selection that stems from the involvement of these sequences in splicing (Louie, Ott, and Majewski 2003; Yeo et al. 2004) and SAS (unpublished observations). In addition, some of the introns contain highly conserved sequences with various, often unknown functions including genes for noncoding RNAs (Washietl et al. 2005). However, in mammals, these functional regions have been estimated to comprise <5% of the intronic sequences (Waterston et al. 2002). In addition, it has been demonstrated that, even in conserved noncoding sequences such as those found in introns, the pressure of purifying selection tends to be substantially weaker than in coding regions (Kryukov, Schmidt, and Sunyaev 2005). Thus, it appears that, after discarding terminal regions, introns could serve as a reasonable approximation of neutrally evolving sequences.
We found that Ks and Ki distributions for mammalian genes have different statistical properties, which makes Ki suspect as the neutral baseline for the analysis of selection in synonymous sites. Therefore we developed a computational procedure to shuffle aligned intron sequences such that their statistical properties mimic those of nonsynonymous sites and used the corresponding substitution rate (Ki-pseudo) to assess the extent of negative (purifying) and positive (diversifying) selection in synonymous sites of mammalian genes. It is shown that both types of selection in synonymous sites are widespread and that positive selection at synonymous sites is much more common than positive selection on protein sequence. Positive selection at synonymous sites is unrelated to functional constraints at the protein level, but is linked to gene expression, probably through mRNA destabilization.
We calculated rates of divergence in coding and noncoding DNA for mouse-rat orthologs taken from the May 2004 HomoloGene database (Wheeler et al. 2006). HomoloGene orthologs are defined as bidirectional best hits using the BLAST program for sequence comparison (Altschul et al. 1997). Protein and mRNA sequences were obtained from the Entrez protein and nucleotide databases (Wheeler et al. 2006). We started with a total 8,178 mouse-rat orthologs but removed over half (see below) to eliminate the potential bias estimates of the Ks estimates that could be introduced by alternative splice variants and other alignment ambiguities (see next section). The final gene set employed for all analyses contained 1,562 mouse-rat orthologs.
Protein alignments for mouse and rat were generated using the MUSCLE alignment package (Edgar 2004). Protein alignments were then used to guide alignment of the corresponding mouse and rat coding sequences (CDS). It was required that each coding sequence contain a start and stop codon, in order to eliminate all partial sequences. All alignments that contained insertions/deletions with the total length >30 bp were removed in order to exclude potential effects of incorrect gene prediction and alternative splicing.
Intron alignments were generated using the OWEN alignment tool (Ogurtsov et al. 2002) with the following specifications: (1) an intron must be bound on 5′/3′ ends by exons that align across ≥80% of length, (2) the presence of constitutive splice sites at each intron/exon boundary was required, (3) a P value <0.001 for each intron alignment was required, (4) 30-nucleotide regions from the 5′/3′ ends of each intron were removed, and 5) the proximal, 5′-terminal introns in the compared orthologous genes were discarded, because these introns are known to be enriched for various regulatory elements and, consequently, could be subject to purifying selection (Majewski and Ott 2002). These requirements help ensure that accurate orthologous intron alignments are generated. The 30-nucleotide regions from the ends of each alignment were removed to eliminate splicing signals from the estimates of intron divergence.
The evolutionary rates for coding DNA were originally calculated using the Pamilo-Bianchi-Li method (Li 1993; Pamilo and Bianchi 1993), which takes into account transition and transversion rates. Evolutionary rates for noncoding DNA were measured using the Kimura's 2-parameter model (Kimura 1980). However, considering the difference in the statistical properties of the CDS and intron sequences (see Results and Discussion), a method was developed to shuffle the intron sequence alignments such that their statistical properties mimicked those of coding sequences. Mouse-rat pseudo-CDS alignments were generated from alignments of mouse and rat intronic sequences using the following procedure for each alignment (supplementary fig. 1S): (1) Start the pseudo-CDS from ATG for both mouse and rat sequences. (2) Take the next pentanucleotide starting from the first codon position from the mouse CDS sequence and find this pentanucleotide in the mouse intronic sequence; if there are several such pentanucleotides, 1 is chosen randomly. (3) Add the corresponding segment of the intronic alignment to the pseudo-CDS; if the length of the pseudo-CDS alignment >5 nucleotides, the overlapping 2 nucleotides are chosen randomly; if a pentanucleotide is not found in the mouse intronic sequence, the corresponding fragment of the CDS alignment is added to the pseudo-CDS alignment. 4) The procedure is repeated until the end of the CDS alignment is reached. The resulting pseudo-CDS alignment has the same length as the CDS alignment, and the base compositions of mouse CDS and pseudo-CDS are identical. The significance of the difference in the codon composition of the rat CDS and pseudo-CDS was tested using a Monte-Carlo modification of the χ2 test (Adams and Skopek 1987).
For each CDS alignment, 10,000 pseudo-CDS alignments were generated. A score of divergence at synonymous sites Ks was calculated using the Pamilo-Bianchi-Li method (Li 1993; Pamilo and Bianchi 1993) or the fraction of mismatches at 4-fold degenerate sites. This score was calculated for the mouse-rat CDS alignment (Ks) and 10,000 pseudo-CDS alignments (Ki-pseudo). The distribution of Ki-pseudo was used to calculate probabilities P(Ks ≤ Ki-pseudo) and P(Ks ≥ Ki-pseudo); the fractions of the pseudo-CDS with Ks ≤ Ki-pseudo and Ks ≥ Ki-pseudo were taken as approximations of the respective P values. The genes with P(Ks ≥ Ki-pseudo ≤ 0.05 were considered positively selected, and the genes with P(Ks ≤ Ki-pseudo) ≤ 0.05 were considered negatively selected. These calculations were performed for the complete alignments and repeated after masking CG, TG, and CA dinucleotides. For the analysis of statistical properties of distributions and correlation analysis, a pseudo-CDS alignment was randomly drawn from the total sample of 10,000, and Ki-pseudo was calculated using the PBL method.
The GNF Gene Expression Atlas2 data (Su et al. 2004) for mouse was used as the source of data on genes (rat expression data was limited for the majority of genes and therefore was not included in this analysis). The GNF Atlas2 data contain 2 replicates for each of 61 mouse tissues. The data for redundant tissue types was combined to yield a final set of 55 mouse tissues. Average expression values for each probe were calculated using raw expression data. Average probe expression values for raw data were calculated by summing the expression values across each probe set and dividing that sum by the total number of tissues (55). Tissue breadth values for each gene (the number of tissues that each probe is expressed in) were obtained using raw expression data. The raw expression value for a given tissue had to be ≥350 in order for that tissue to be counted in the tissue breadth analysis (Jordan, Marino-Ramirez, and Koonin 2005). Thus, the final tissue breadth score for a probe represents the number of all tissues with the raw expression value ≥350.
The effective number of codons (ENC) for the coding sequences in the analyzed gene sets was calculated using previously described methods (Wright 1990). The codon adaptation index (CAI) scores (Sharp and Li 1987) for the analyzed coding sequences were calculated using the EMBOSS bioinformatics suite (Rice, Longden, and Bleasby 2000). The CAI values were calculated by comparing the codon usage patterns of a given gene against the codon usage patterns of a reference set of highly expressed mouse genes. Specifically, the GNF Atlas2 mouse expression data were used to identify the top ~10% (1,479/15,007) most highly expressed mouse genes by calculating the average overall expression level of each probe from raw expression data. The average expression values were ranked, from largest to smallest, to obtain the top 10%.
In order to quantify the dissimilarities between the distribution functions of Ka, Ks, Ki, and Ki-pseudo, we have computed pairwise distances between these distributions using an information-theoretic measure known as the L-divergence (Lin 1991). This distance measure is a refined version of the widely used Kullback-Leibler distance.
In order to avoid ambiguities of alignment, especially, in intron sequences, as well as substitution saturation effects, we limited the present analysis to orthologous genes from closely related rodents, mouse and rat. It has been recently shown that Ki is particularly prone to taxon-specific variation at longer evolutionary distances (Hoffman and Birney 2007). A critical issue is whether Ks/Ki is an adequate measure of selection in synonymous sites. We generated Ks and Ki distributions for a set of 1,562 reliable (see Materials and Methods) alignments of intronic and coding sequences from orthologous mouse and rat genes in order to assess the suitability of Ki as the baseline for detecting selection in synonymous sites. First, we compared the statistical properties of the distributions of Ki and Ks. The distribution of Ki had almost precisely the same mean and median as the Ks distribution but was quite narrow compared to the latter: the standard deviation of Ks was more than twice greater than that of Ki (fig. 1 and supplementary table 1S). Furthermore, the skewness of the distribution was also much greater for the Ks distribution than for the Ki distribution (fig. 1 and supplementary table 1S). We also compared the nucleotide compositions of introns and synonymous sites and found substantial differences between these two (supplementary table 2S). These observations showed that Ks and Ki distributions had distinct statistical properties and suggested that introns and synonymous positions in exons are subject to different evolutionary forces.
Previous studies that compared Ks and Ki do not seem to arrive to a consensus. Some reports have claimed that synonymous substitution rates are approximately equal to those in introns despite differences in the patterns of substitution (Hughes and Yeager 1997; Chamary and Hurst 2004), whereas others have suggested that intron rates of divergence are greater than those in synonymous sites (Hellmann et al. 2003), or conversely, that synonymous substitution rates exceed those found in introns (Subramanian and Kumar 2003). Our present observation that Ks and Ki are (nearly) identical on average but are very differently distributed suggests that these diverging conclusions might be attributed to different evolutionary models and data sets used in the respective studies. It has been argued that the apparent increase in synonymous substitution rates of some genes over those of introns is due to the context-dependence of mutation in synonymous sites, in particular, the high mutation rate of CpG dinucleotides (Hughes and Yeager 1997; Kondrashov, Ogurtsov, and Kondrashov 2006).
Our observations, together with those in previous studies, suggest that Ki might not be a proper null model for Ks due to different nucleotide compositions of coding and non-coding DNA and distinct statistical properties of the Ks and Ki distributions. Thus, we developed a computational procedure to account for these differences between introns and synonymous sites. Under this approach, alignments of pseudo-coding sequence (pseudo-CDS) were generated by sampling alignments of intronic sequences such as to mimic the base composition of the synonymous sites for each respective gene and thus eliminate potential artifacts caused by differences in the CpG content and other compositional differences between synonymous sites and introns. The pseudo-CDS alignments were used to calculate Ki-pseudo values (see Material and Methods and Supplementary fig. 1S for details). Using an information-theoretical measure of divergence (see Materials and Methods for details), we computed distances between the distributions and found that, unlike Ki, Ki-pseudo had statistical properties highly similar to those of Ks (fig. 2 and supplementary table 3S). In particular, the distribution of Ki-pseudo was shifted to the right compared to the Ki distribution such that the right tail of the Ki-pseudo distribution behaved more similarly to that of the Ks distribution (fig. 1). Accordingly, Ki-pseudo values were used as the baseline to detect purifying and positive selection acting on synonymous sites of mammalian genes.
We examined positive and negative selection in synonymous sites, using the Ks/Ki-pseudo ratio as the criterion under 2 distinct estimation schemes, the Pamilo-Bianchi-Li (PBL) method (Li 1993; Pamilo and Bianchi 1993) and the fraction of mismatches at 4-fold degenerate sites (4F method). These calculations were performed either before or after masking CG, TG, and CA dinucleotides (the highly mutable CpG sites and the “highly CpG-prone” sites, i.e., those convertible to CpG via a single transition [Kondrashov, Ogurtsov, and Kondrashov 2006]) or, finally, after removing all CpX and XpG dinucleotides (all CpG-prone sites). Starting with a set of 1,562 reliably aligned mouse-rat orthologs (see Materials and Methods), we identified a significant excess (compared to the random expectation) of genes with both negative and positive selection in synonymous sites in all 5 tests. Masking the mutable dinucleotides did not substantially affect the results (table 1). In order to obtain conservative estimates of positively and negatively selected genes, we required agreement between the 2 evolutionary models: only genes found to be positively or negatively selected in 3 or 4 tests were included in the final sets. The results of test #3 (table 1, PBL, CpX, and XpG sites removed) were not used in this selection procedure because of the dramatic loss of sites (>50%) that was caused by the masking procedure and made the 4F method inapplicable. However, the results of the PBL test show that this masking had but a small effect on the number of genes with apparent negative and positive selection in synonymous sites (table 1).
With this approach, 185 cases of positive (diversifying) selection (positive set) and 438 cases of negative (purifying) selection (negative set) were identified. The remaining 939 genes were conservatively assigned to the neutral set. Thus, ~28% of the analyzed rodent genes were found to be subject to substantive negative selection in synonymous sites, whereas roughly half as many genes (~12%) appeared to be subject to relatively strong positive selection. A comparison of the content of CpG sites involving synonymous position in the 3 gene sets did not reveal significant differences, suggesting that the observed distinct modes of evolution are not caused by effects of mutagenic contexts (supplementary table 4S).
We further compared the Ks, Ki, Ki-pseudo, and Ka distributions for the positive, negative, and neutral datasets. The distributions of Ka, Ki, and Ki-pseudo were very similar, in both shape and parameter values in all 3 sets (figs. 3a,c,d), although there was a statistically significant difference between the distributions of all 3 of these variables (supplementary table 5S). In a sharp contrast, the Ks distributions differed much more significantly between the 3 gene sets than the Ka, Ki-pseudo, or Ki distributions (see supplementary table 5S, and compare fig. 3b to figs. 3a,c,d). Although all 3 Ks distributions were, approximately, equally broad, the negative and positive set distributions were significantly shifted to the left and to the right, respectively, compared to the neutral set distribution (fig. 3). In particular, the mean Ks in the positive set was substantially greater than the mean Ks of the negative set (mean = 0.239 for positive; mean = 0.132 for negative set) as expected of sites evolving under positive selection.
It should be emphasized that the existence of a major difference between the Ks distributions but not Ki distributions for the negative and positive sets (compare fig. 3b and fig. 3c) all but rules out a potential alternative explanation of these results, namely, that the apparent positive selection in synonymous sites is an artifact caused by an anomalously high sequence conservation, due to purifying selection, in the intronic sequences of the respective genes. In order to further ascertain that purifying selection in introns was not a significant factor accounting for the detected positive selection in synonymous sites, we applied stringent filtering to remove potential functional elements from the intronic sequences used in the Ki and Ki-pseudo calculations. For that purpose, longer exon-flanking sequences were trimmed off the intron alignment, and short introns that could be enriched for functional elements were discarded. This procedure reduced the set of orthologous gene pairs available for the analysis to 952 but did not substantially change the fractions of genes subject to positive and negative selection in synonymous sites (table 6S). These results indicate that Ki-pseudo is, indeed, an appropriate baseline for measuring selection acting at other classes of sites in orthologous genes from closely related species.
Does the relationship between Ks, Ki-pseudo, and Ka reveal anything about the evolutionary forces that affect the positive and negative sets? We addressed this question by checking whether any of the variables were correlated, and whether the strength of such correlations differed between the sets. We observed a moderate but statistically highly significant positive correlation between Ks and Ka for the negative set (table 2 and fig. 4), which could be expected, given that genes in this set are under strong purifying selection; similar observations have been reported previously in several independent studies (Lipman and Wilbur 1984; Wolfe and Sharp 1993; Mouchiroud, Gautier, and Bernardi 1995; Makalowski and Boguski 1998; Smith and Hurst 1999). A somewhat weaker but also highly significant correlation between Ka and Ks was seen in the neutral set (table 2 and fig. 4). The significant correlation between Ks and Ka in the negative and neutral sets implies that the evolutionary forces that exert purifying selection on synonymous sites in the negative set are linked to the evolution of protein structure and function. In particular, it seems likely that the negative selection acting on synonymous sites has to do with the high level of expression that is characteristic of genes encoding highly conserved proteins (Pal, Papp, and Hurst 2001; Krylov et al. 2003; Wolf, Carmel, and Koonin 2006). By contrast, in the positive set, Ks and Ka showed a much weaker and not significant correlation (r = 0.08, P = 0.28; table 2 and fig. 4). To control for the possible effect of the smaller sample size of the positive set, we generated 10,000 random samples of 185 genes each from the total dataset and found only 93 sampled sets with r ≤ 0.08 (P < 0.01). Thus, the absence of a significant correlation between Ka and Ks in the positive gene set is not a sample size artifact. This observation suggests that, in sharp contrast with both the negative and the neutral sets, the forces that affect the evolution of synonymous sites in the positive-set genes are uncoupled from any selection acting at the level of protein structure and function.
We then examined the relationship between Ks and Ki-pseudo and observed strong positive correlations for all 3 gene sets; slightly weaker but also highly significant correlations were found for Ks and Ki (table 2). At first glance, this result suggests the possibility that evolution of introns might not be neutral, and accordingly, Ki-pseudo might not be a robust null model for measuring selection at synonymous sites. However, this does not seem to be the case, because the strength of the correlation was nearly identical among the 3 gene sets. It appears most likely that the correlations between Ks and Ki (and Ki-pseudo) reflect regional mutational biases across the genome. Such biases have been reported previously (Matassi, Sharp, and Gautier 1999), and for the rodent genes analyzed here, we observed a highly significant anticorrelation between the differential of the Ki and Ks values and the distance separating the respective genes on the chromosome: closely spaced genes, typically, had similar Ks, Ki, and Ki-pseudo values; no such effect was seen for Ka (supplementary figs. 2S and 3S).
Given that Ks is correlated with Ka in the negative and neutral sets, and with Ki in all 3 sets, we performed a partial correlation analysis in an attempt to disentangle these correlations. In the negative and neutral sets, the correlations between Ka and both Ks and Ki became smaller but remained highly significant after the removal of the effect of the other variable (supplementary table 7S). Thus, in the negative and neutral sets, the correlation between Ka and Ks appears to be valid in itself and might reflect similar selective pressures at synonymous and nonsynonymous sites. The correlation between Ka and Ki, which was, in part, independent of the Ks-Ki correlation and was greater in the negative set than in the neutral set (supplementary table 7S), is harder to explain. It cannot be ruled out that there is some pressure of purifying selection on intron sequences, the nature of which remains obscure. Should such a selective component, indeed, affect evolution of introns in the negative set, this would make our estimate of genes subject to purifying selection at synonymous sites even more conservative. By contrast, in the positive set, there was no significant correlation between Ka and either Ks or Ki, indicating that, in these genes, evolution of the protein sequence is completely decoupled from the evolution of noncoding sequences. Taken together, these results indicate that there are at least 2 distinct components in the evolution of synonymous sites, a selective one and a mutational one. The nature of the mutational component is the same across all analyzed genes. By contrast, the selective component is linked to the protein evolution in the negative set but apparently is of a different nature in the positive set.
What factors contribute to positive and negative selection in synonymous sites? Perhaps even more importantly, can we identify the probable causes of the lack of correlation between Ks and Ka in the positive set? We evaluated the roles of gene function, codon bias, gene expression, and mRNA stability as potential driving forces of selection in synonymous sites. There were no significant differences in the distribution across the Gene Ontology (GO) categories between the genes of the negative, neutral, and positive sets (data not shown), hence no straightforward explanation for the observed differences in the selection regimes through the biological functions of the respective genes. We then compared the codon bias [determined either as the effective number of codons (ENC) or as the codon adaptation index (CAI)] between the 3 gene sets. Moderate but statistically significant differences in CAI were detected between the negative and positive sets, with the highest value observed for the negative set (tables (tables33 and and4).4). Thus, the pattern of codon bias exhibited by the genes in the negative set is more similar to the pattern found among highly expressed genes (the reference set) than to the pattern found within the positive set. This result is consistent with the expectation that genes that are more biased in their choice of synonymous codons tend to be more conserved. A significant difference in ENC was also observed between the negative and neutral sets (tables (tables33 and and4).4). Given that a tighter control over codon usage would be a side effect of strong purifying selection within the negative set, this result is not surprising. The nonsignificant differences between the positive set and the other 2 sets are likely to result from the fast evolution in synonymous sites of positively selected genes. A comparison of gene expression, determined either as expression level (EL) or as expression breadth (EB), revealed no significant differences between the positive, neutral, and negative sets (table (table33 and and4).4). Some reports have suggested that purifying selection on synonymous sites affects the efficiency and accuracy of translation in certain model organisms such as Escherichia coli and Drosophila melanogaster (Ikemura 1985; Eyre-Walker 1996; Akashi and Eyre-Walker 1998); however, no strong indications of such selective pressures in mammalian genomes have been detected (Smith and Hurst 1999; Duret and Mouchiroud 2000; Iida and Akashi 2000). Furthermore, these findings were in agreement with previous reports indicating that rates of synonymous divergence are not correlated with patterns in gene expression (Lercher, Chamary, and Hurst 2004).
However, examination of the correlations between the rates of evolution in synonymous and nonsynonymous sites and characteristics of expression in the 3 gene sets produced more informative and, partly, unexpected results. In the negative set, there was a relatively low but statistically significant anticorrelation between Ks and expression (both EL and EB); these anticorrelations paralleled those between Ka and expression (table 2) and were compatible with the previous observations on slow evolution of highly and broadly expressed genes (Duret and Mouchiroud 2000; Pal, Papp, and Hurst 2001; Krylov et al. 2003; Wolf, Carmel, and Koonin 2006). The positive set showed a strikingly different pattern, with Ks being positively correlated with both EL and EB, whereas for Ka the correlation was negative and of roughly the same magnitude as in the other 2 gene sets (table 2). Thus, in a pattern that is the diametrical opposite of what is seen in the negative set (and, less pronouncedly, in the neutral set), fast evolution in synonymous sites that appear to be subject to positive selection is associated with higher and broader expression of the corresponding gene.
It has been proposed that purifying selection on synonymous sites is linked to increased mRNA stability (Duan and Antezana 2003; Chamary and Hurst 2005; Shabalina, Ogurtsov, and Spiridonov 2006). Thus, we looked for differences in patterns of mRNA stability between the positive, neutral, and negative sets. Using previously published methods (Shabalina, Ogurtsov, and Spiridonov 2006), we found that the average predicted mRNA stability (kcal/mol) was significantly greater in the negative set than in the neutral or positive sets (fig. 5 and tables tables55 and and6).6). It has been shown previously that the contributions of nucleotides to mRNA stability followed a periodic pattern dictated by the structure of the genetic code, with the third, degenerate position being the primary contributor (Shabalina, Ogurtsov, and Spiridonov 2006). This pattern was, indeed, apparent in all 3 gene sets analyzed here (fig. 5). Notably, the difference in RNA stability (estimated free energy) between the negative, neutral, and positive sets was consistent and significant over all 3 codon positions (fig. 5 and tables tables55 and and6).6). This suggests that positive selection for mRNA destabilization affects all codon positions, although in nonsynonymous sites this relatively weak effect is overshadowed by selection acting at the protein level. No significant differences in nucleotide content within the codon positions were observed between the positive and negative sets (supplementary table 8S), indicating that the differences in mRNA stability are not artifacts caused by different base compositions.
We further assessed correlations between ΔG and Ks in the 3 gene sets and found the results to be consistent with selection acting to maintain or establish the optimal stability of the mRNA secondary structure. The correlation between ΔG and Ks was not significant in the neutral set, whereas the correlations of opposite signs were observed in the negative and the positive sets. In the negative set, there was a low but significant anticorrelation between ΔG and Ks, whereas the positive set showed a somewhat greater, positive correlation (table 2). In other words, in the negative set, the genes that evolve relatively slowly tend to possess less stable mRNA secondary structure than faster evolving genes; conversely, in the positive set the faster evolving genes are less stable than slowly evolving ones. Thus, it appears that purifying selection in the negative set prevents the formation of excessively stable secondary structure, whereas in the positive set diversifying selection might drive mRNA destabilization.
To summarize, the correlations between Ks, gene expression, and predicted mRNA stability were all negative in the negative set but positive in the positive set (fig. 6A). These coherent but contrasting correlation structures, certainly, do not prove a cause-and-effect relationship between mRNA stability and expression in mammalian evolution, but are compatible with the hypothesis that both purifying and positive selection in synonymous sites act at the level of mRNA secondary structure, which affects stability and, through mRNA degradation process, the expression levels measured in microarray experiments. In a sharp contrast, the correlation structures for Ka were identical for the positive and negative set (fig. 6B), suggesting similar patterns of (purifying) selection in nonsynonymous sites. In this case, acceleration of evolution, on average, seems to result in mRNA destabilization and lower expression.
We developed and applied a robust statistical test to identify purifying and positive selection acting on synonymous sites of mammalian genes using shuffled intron sequences as the proxy for neutral evolution. As expected, considering many previous reports on strong, positive correlations between Ka and Ks, we observed that a substantial fraction of the analyzed genes was subject to significant purifying selection at synonymous sites (Akashi 1994; Mouchiroud, Gautier, and Bernardi 1995; Makalowski and Boguski 1998). By contrast, the finding that ~12% of the genes seemed to experience substantial positive selection at synonymous sites was surprising. A comparison of the distributions of Ks and Ki for the negative and positive gene sets (fig. 3) seems to rule out the possibility that the apparent positive selection in synonymous sites is actually due to purifying selection affecting the respective intronic sequences.
The fraction of rodent genes with apparent positive selection in synonymous codons is considerably greater than the fraction of mammalian genes that have been reported to exhibit positive selection on the level of the amino acid sequence, as reflected in Ka/Ks > 1 for the entire coding sequence. The specific numbers of mammalian genes that are subject to strong positive selection differ between studies, but a recent careful analysis of ~8,000 orthologous genes from humans and chimpanzees revealed only 35 genes with statistically significant positive selection manifest at the level of the entire protein sequence (Nielsen et al. 2005). Of the 1,562 genes analyzed here, only 2 had Ka/Ks >1 at the statistically significant level, indicating positive selection on the level of complete protein sequences (data not shown). Thus, it seems that, in comparable tests, 1 to 2 orders of magnitude more mammalian genes exhibit positive selection in synonymous than in nonsynonymous sites. This excess of positive selection in synonymous sites seems to reflect different balances of evolutionary forces that act at positions coding for amino acids and at synonymous positions. Protein sequences are almost universally subject to purifying selection of varying strength (Ka/Ks << 1 for most genes), which is likely to obscure more subtle effects of positive selection acting at some positions; indeed, site-specific positive selection detected by multiple alignment analysis appears to be common (Yang and Bielawski 2000; Zhang, Nielsen, and Yang 2005). By contrast, as shown here, readily detectable purifying selection on synonymous sites might affect only about one-quarter of the mammalian protein-coding genes such that positive selection is readily detectable against the, largely, neutral background of synonymous sites. Additionally, Ki and, especially, Ki-pseudo appear to be better neutral baselines than Ks such that positive selection at synonymous sites could be easier to detect than positive selection at nonsynonymous sites for which Ks is used as the baseline. This being said, a comparison of the distributions of Ka and Ks in rodent genes (fig. 1) indicates that Ka/Ks remains a reasonable measure of the strength of selection in proteins, given that protein sequences are subject to much more pronounced constraints than noncoding sites.
Finding the biological basis or at least a strong functional correlate of positive selection in synonymous sites turned out to be a challenge. The lack of significant correlation between Ks and Ka in the positive set indicates that positive selection in synonymous sites is decoupled from the evolution of the respective proteins. This conclusion is compatible also with the lack of any significant excess of a particular class of biological functions among the genes in the positive set. By contrast, analysis of links between selection in synonymous sites and gene expression and mRNA stability revealed nontrivial connections. Although there were no significant differences in the overall expression level or breadth between positively selected, negatively selected and neutral genes, the dependences between evolution rate and expression was remarkably different. In particular, and in contrast to negatively selected genes, among genes with positive selection in synonymous sites, those that are highly and widely expressed appear to evolve faster (i.e., under a stronger selective pressure) than lowly expressed ones. The second clear and statistically highly significant correlation was with predicted stability of the mRNA secondary structure: the transcripts in the set of positively selected genes are predicted to be considerably less stable than those in the negative and neutral sets. The correlations between Ks, on the one hand, and expression and mRNA stability, on the other hand, were all of the same sign within the positive and negative sets but of the opposite signs between the sets (fig. 6A). This is compatible with a causal relationship between mRNA stability and expression levels measured in microarray experiments, with the link probably actualized through regulation of mRNA degradation. Therefore, although we technically cannot ascertain the direction of evolution without using a third species as an outgroup, we suspect that mRNA destabilization could be an important factor that, through its effect on mRNA stability and possibly also translation rates, drives positive selection in synonymous sites. Additionally or alternatively, it is conceivable that positive selection in synonymous positions is driven by the necessity to maintain interactions with other RNA species (Mattick and Makunin 2006).
We cannot be confident that the correlates of selection in synonymous site detected here, indeed, reflect the principal underlying selective forces. However, it is our hope that the demonstration of the wide spread of positive selection in synonymous sites in mammalian genes stimulates further theoretical and experimental studies aimed at the deeper characterization of the causes of this phenomenon.
We thank King Jordan, Alexei Kondrashov, Yuri Wolf, and John Wootton for helpful discussions. This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS.