Recent large-scale studies of evolutionary changes in gene expression among mammalian species have led to the proposal that gene expression divergence may be neutral with respect to organismic fitness. Here, we employ a comparative analysis of mammalian gene sequence divergence and gene expression divergence to test the hypothesis that the evolution of gene expression is predominantly neutral. Two models of neutral gene expression evolution are considered: (1) purely neutral evolution (i.e., no selective constraint) of gene expression levels and patterns and (2) neutral evolution accompanied by selective constraint. With respect to purely neutral evolution, levels of change in gene expression between human–mouse orthologs are correlated with levels of gene sequence divergence that are determined largely by purifying selection. In contrast, evolutionary changes of tissue-specific gene expression profiles do not show such a correlation with sequence divergence. However, divergence of both gene expression levels and profiles is significantly lower for orthologous human–mouse gene pairs than for pairs of randomly chosen human and mouse genes. These data clearly point to the action of selective constraint on gene expression divergence and are inconsistent with the purely neutral model; however, there is likely to be a neutral component in the evolution of gene expression, particularly in tissues where the expression of a given gene is low and functionally irrelevant. The model of neutral evolution with selective constraint predicts a regular, clock-like accumulation of gene expression divergence. However, relative rate tests of the divergence among human–mouse–rat orthologous gene sets reveal clock-like evolution for gene sequence divergence, and to a lesser extent for gene expression level divergence, but not for the divergence of tissue-specific gene expression profiles.
Taken together, these results indicate that gene expression divergence is subject to the effects of purifying selective constraint and suggest that it might also be substantially influenced by positive Darwinian selection.
Changes in gene expression have long been recognized as fundamental to the process of evolution (Britten and Davidson, 1969; Britten and Davidson, 1971; King and Wilson, 1975). However, most molecular evolution studies have focused on gene and protein sequence divergence (Li, 1997). This can be attributed largely to the paucity of gene expression data that existed until relatively recently. Now, thanks to the development of techniques for high throughput gene expression studies (Adams et al., 1991; Schena et al., 1995; Velculescu et al., 1995) and the creation of databases needed to store and disseminate the resulting deluge of expression data (Edgar et al., 2002; Gollub et al., 2003; Karolchik et al., 2003), enough gene expression data have accumulated to facilitate the evolutionary analysis of gene expression divergence on a large scale.
Initial attempts to study evolutionary changes in gene expression in a systematic way have led to the proposal that gene expression divergence might be predominantly neutral (Khaitovich et al., 2004; Yanai et al., 2004). The neutral model of evolution (Kimura, 1968, 1983; King and Jukes, 1969), with respect to gene expression, would hold that the vast majority of changes in expression do not affect fitness and so they accumulate rapidly and in a regular, clock-like manner (Kimura and Ohta, 1974). Here, we examine patterns of mammalian gene expression to test whether the levels of gene expression divergence are indeed consistent with a neutral model of evolution.
Before proceeding, it is necessary to delineate two distinct models of neutral evolution and clarify the implications of each model for the evolution of gene expression. In the strictest sense, neutral evolution of genes can be taken as accumulation of divergence in the absence of any selective pressure. A classic example of such purely neutral evolution is the divergence of pseudogene sequences (Li et al., 1981). The observation that gene expression levels (Khaitovich et al., 2004) diverge at comparable rates for both pseudogenes and intact protein-coding genes has been taken as evidence for purely neutral evolution of gene expression. Such rapid divergence of gene expression patterns is inconsistent with any constraints on expression imposed by natural selection.
However, it should be noted that evolution according to the neutral theory does not imply the total absence of natural selection. From the neutralist perspective, the primary role of natural selection in evolution is the removal of variants that reduce fitness. The elimination of such deleterious variants is referred to as purifying or negative selection. Of course, the neutral theory does not deny the qualitative importance of positive, Darwinian selection (i.e., the fixation of variants that increase fitness) either; quantitatively, however, positive selection is construed to be rare compared to purifying selection. According to the neutral theory, evolutionary constraints on genes are due to purifying selection and the differences that are observed between genes reflect (nearly) neutral changes that have no, or only slight (Ohta, 1973), deleterious effects on fitness. This can be taken to mean that changes in gene expression with negative functional consequences are constrained by purifying selection, while neutral expression changes are free to accumulate. This model predicts a regular, clock-like accumulation of gene expression divergence over the course of evolution and such a pattern for changes in gene expression level has been observed for both primates and rodents (Khaitovich et al., 2004). In this work, we interrogate patterns of mammalian gene expression divergence with respect to the predictions of models of (i) purely neutral evolution, and (ii) neutral evolution accompanied by selective constraint.
Protein-coding gene sequences (CDSs) do not evolve in a purely neutral manner as pseudogenes and some intergenic regions do (Li, 1997). On the contrary, purifying selection plays a critical role in constraining the levels of CDS divergence. Selective constraints on the evolution of CDSs show a broad range of variation and have been related to a number of functional characteristics of genes and the proteins that they encode (Krylov et al., 2003). These characteristics include protein dispensability (Hirsh and Fraser, 2001; Jordan et al., 2002) and interactivity (Fraser et al., 2002, 2003; Jordan et al., 2003) as well as gene expression level and breadth (Duret and Mouchiroud, 2000; Pal et al., 2001; Jordan et al., 2004; Zhang and Li, 2004). Of all of these factors, gene expression level is most strongly correlated with gene sequence divergence (Pal et al., 2003; Bloom and Adami, 2004; Rocha and Danchin, 2004), i.e., highly expressed genes tend to evolve slowly.
The correlation between gene expression levels and gene sequence divergence levels strongly suggests a connection between expression and natural selection. To more directly address whether gene expression levels change in a purely neutral manner or if they are constrained by purifying selection as CDS sequences are, the divergence of gene expression levels was compared to the level of CDS divergence between human and mouse. Gene expression data for human and mouse were taken from the mammalian gene expression atlas (Su et al., 2004) and gene sequence data were taken from the National Center for Biotechnology Information’s (NCBI) RefSeq database (Pruitt and Maglott, 2001) as described in the Methods section. Consistent with previous results, the level of gene expression is negatively correlated with the level of protein sequence divergence (Fig. 1a). Thus, more highly expressed genes are, on average, more constrained by purifying selection. More importantly, the amount of interspecific change in gene expression level between orthologs is positively correlated with the level of sequence divergence (Fig. 1b). In other words, pairs of genes with greater relative differences in their expression levels tend to encode more divergent proteins on average. This suggests that changes in both sequence and expression level are similarly constrained by purifying selection. In addition, changes in gene expression level between human–mouse orthologs are significantly lower than changes in gene expression levels between randomly chosen pairs of human–mouse genes (Fig. 1c). Changes in gene expression levels between randomly chosen pairs of human and mouse genes are used to approximate the rate of neutral expression level divergence (i.e., evolution with no selective constraint on gene expression level).
This is because, under a purely neutral model of evolution, no significant detectable similarity due to shared common ancestry of expression profiles is expected to remain after the ~100 million years of divergence between human and mouse (Gu et al., 2002), just as negligible similarity survives between neutrally evolving sequences over this time frame (Shabalina et al., 2001; Ogurtsov et al., 2004). Thus, taken together, the data shown in Fig. 1 indicate that changes in gene expression level between species do not accumulate randomly as predicted by the purely neutral model.
The relationship between the number of tissues a gene is expressed in (i.e., the breadth of expression) and its substitution rate was assessed in the same way as described for gene expression levels. Consistent with previously published results (Duret and Mouchiroud, 2000; Jordan et al., 2004), gene expression breadth showed a significant negative correlation with gene sequence divergence. More broadly expressed genes tend to be more conserved, on average, than genes with a narrow range of expression (data not shown; r= −0.90, P=0.002). As was also the case with gene expression level, the relative amount of interspecific change in expression breadth was significantly positively correlated with sequence divergence: orthologous gene pairs with more similar expression breadths tend to encode more conserved proteins (data not shown; r =0.80, P =0.017). These results indicate that changes in gene expression breadth, just like changes in the expression level, are also constrained by purifying selection.
It has been shown previously that changes in tissue-specific gene expression patterns between human and mouse are unrelated to levels of sequence divergence (Jordan et al., 2004; Yanai et al., 2004). This was taken as evidence for the purely neutral evolution of gene expression patterns (Yanai et al., 2004). Here, we reproduced this result with the new, expanded gene expression atlas (Su et al., 2004) by showing that the Euclidean distance between expression profiles of orthologous genes is not correlated with CDS divergence (Fig. 2a). However, there appears to be some non-linear trend in the relationship between gene expression profile divergence and sequence divergence (Fig. 2a). This is likely to be due to the use of Euclidean distances to compare expression profiles. Such distances may be inflated for gene pairs with high expression levels, and this is consistent with the high average Euclidean distance seen for the most conserved bin of gene sequences, which is known to contain numerous highly expressed genes (Fig. 1a). To control for this effect, Pearson correlation coefficients were also used to compare tissue-specific gene expression profiles. When this is done, there is no apparent trend (linear or otherwise) in the relationship between gene expression profile divergence and gene sequence divergence (data not shown; r = −0.31, P =0.45).
Despite the lack of correlation between gene expression profile and gene sequence divergence, and also in agreement with previous results (Jordan et al., 2004; Yanai et al., 2004), human–mouse orthologs have tissue-specific gene expression patterns that are significantly more similar (t = 10.8, P < 1e−10, Student’s t-test) than those for randomly chosen human–mouse gene pairs (Fig. 2b). Therefore, as with gene expression levels, interspecific divergence of gene expression patterns is constrained by purifying selection, too, and does not evolve in a purely neutral manner. However, there is also a notable difference in that changes in overall expression level clearly correlate with sequence divergence, whereas changes in expression profiles do not. It seems likely that expression level values are largely determined by one or a few tissues in which a given gene is highly expressed, functionally relevant, and subject to purifying selection which acts to retain the expression level, particularly for highly expressed genes with conserved sequences. In contrast, expression profiles are significantly affected by tissues with low and, perhaps, spurious expression of the gene in question; conceivably, in such tissues, the expression of a gene, indeed, evolves neutrally.
The model of purely neutral evolution of gene expression seems to represent an extreme and unrealistic view that is readily falsified by the data. This may not be particularly surprising because, after all, gene expression is surely an important aspect of gene function. A potentially more viable version of the neutral view of gene expression divergence holds that the functionally important component of gene expression is held constant by purifying selection, while the functionally irrelevant component evolves neutrally. Indeed, such a model is suggested by the differences observed between patterns of gene expression level divergence and gene expression profile divergence. Under this scenario, the vast majority of gene expression differences should reflect neutral (as opposed to adaptive) changes. As discussed previously here and elsewhere (Khaitovich et al., 2004), this model of neutral evolution with selective constraint predicts that gene expression divergence should accumulate in a clock-like manner, i.e., at a constant rate.
A relative rates test comparing the extent of gene expression divergence between human, mouse, and rat was used to test the prediction of rate constancy in expression divergence. This test is conceptually identical to a relative rates test previously employed with human, mouse, and rat gene sequences (Jordan et al., 2001). The idea is that human–mouse–rat evolutionary divergence can be partitioned into two components along the phylogenetic tree: (1) a within (W) rodent component and (2) a between (B) human–rodent component (Fig. 3). If evolutionary changes, in expression and/or sequences, accumulate linearly with time (i.e., in a clock-like manner), then there should be a constant ratio (W/B) of change. The clock-like model allows for different levels of selective constraint between genes, but holds the rate of change among lineages constant. Indeed, human–mouse–rat orthologous gene sequences do show such a (nearly) constant relative rate of change (Fig. 4a) as reported previously (Jordan et al., 2001). When differences in gene expression level for the same sets of human–mouse–rat orthologs were analyzed in this way, there was also a statistically significant, albeit substantially weaker, correlation between the two phylogenetic components of variation (Fig. 4b). Furthermore, as noted above for human–mouse gene pairs (Fig. 1c), changes in gene expression level observed between human–mouse–rat orthologs were significantly lower than changes between random human–mouse–rat gene sets (t = 5.9, P = 3.3e−9, Student’s t-test). These observations suggest that changes in expression level may be consistent with the model of neutral evolution accompanied by selective constraint.
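The partition into W and B components can be illustrated with a short sketch. It uses the standard three-point formula for branch lengths on an unrooted three-taxon tree. The specific reading of W as the sum of the mouse and rat branches and of B as the human branch is an assumption made here for illustration; the exact procedure is the one given in Jordan et al. (2001), which is not reproduced in this text. The function names are hypothetical.

```python
def three_taxon_branch_lengths(d_hm, d_hr, d_mr):
    """Branch lengths on the unrooted human-mouse-rat tree.

    Inputs are pairwise distances (human-mouse, human-rat, mouse-rat).
    With three taxa, each branch length is fixed by the three distances.
    """
    human = (d_hm + d_hr - d_mr) / 2.0
    mouse = (d_hm + d_mr - d_hr) / 2.0
    rat = (d_hr + d_mr - d_hm) / 2.0
    return human, mouse, rat


def within_between(d_hm, d_hr, d_mr):
    """Return (W, B) for one orthologous gene set.

    ASSUMPTION: W is taken as mouse branch + rat branch, and B as the
    branch leading from the rodent ancestor to human.
    """
    human, mouse, rat = three_taxon_branch_lengths(d_hm, d_hr, d_mr)
    within = mouse + rat
    between = human
    return within, between


if __name__ == "__main__":
    w, b = within_between(d_hm=0.5, d_hr=0.6, d_mr=0.3)
    print(f"W = {w:.3f}, B = {b:.3f}, W/B = {w / b:.3f}")
```

Under a clock-like model, W/B computed this way would be roughly the same across gene sets, so W and B would be correlated across genes; this is the pattern examined in Fig. 4.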
When the divergence in tissue-specific gene expression patterns for the same sets of human–mouse–rat orthologs was partitioned along the phylogenetic tree, a different picture emerged. There is no evidence of a constant relative rate of tissue-specific gene expression profile divergence (Fig. 4c). Therefore, in the case of gene expression profiles, divergence does not accumulate in a regular clock-like manner as predicted by the model of neutral evolution accompanied by selective constraint. However, the gene expression profile divergence between human–mouse–rat orthologous sets is significantly lower than the divergence between random human–mouse–rat gene sets (t = 18.1, P < 1e−10, Student’s t-test), indicating that expression profile divergence is still constrained to some extent by purifying selection.
The results of the comparison of gene expression divergence and CDS divergence in human and mouse orthologs seem to falsify the recently proposed purely neutral model of transcriptome evolution (Khaitovich et al., 2004; Yanai et al., 2004). Instead, we show that, although there may be a neutral component in the evolution of expression as evidenced by the lack of correlation between the divergence of expression profiles and gene sequence divergence (Jordan et al., 2004 and this work), much, if not most, of the change in expression is subject to purifying selection. The origin of the difference between these results and those of Khaitovich and coworkers and Yanai and coworkers remains to be investigated. One issue that could be pertinent is the use of pseudogenes as a proxy for completely neutral evolution of gene expression (Khaitovich et al., 2004). Conceivably, the expression of many pseudogenes could still be subject to purifying selection, perhaps leading to low expression levels because high-level expression of a pseudogene is likely to be deleterious. This would result in low divergence between expression levels of orthologous pseudogenes and, if these were employed as the neutral background, could lead to a false conclusion of purely neutral evolution of the expression of functional genes.
Whether or not there is a substantial correlation between gene expression divergence and sequence divergence has been an issue of much contest and contradiction. A lack of correlation between expression profile divergence and gene sequence divergence has been reported for yeast paralogs (Wagner, 2000) and, more recently, for worm paralogs (Conant and Wagner, 2004). However, another series of studies presents a significant correlation between the divergence of expression and sequence of duplicated genes for both yeast and humans (Gu et al., 2002; Makova and Li, 2003). Another recent study found that the divergence of putative promoter sequences among nematode genes was coupled to the divergence of the coding sequence for orthologs but not for paralogs, in line with the notion of rapid functional diversification of genes after duplication (Castillo-Davis et al., 2004). This lack of consensus on the relationship between evolution of coding sequences and evolution of expression might reflect the combination of neutral and selective forces affecting the latter as discussed in this paper; clearly, further analysis is required to resolve this issue.
The conclusion of this work that evolution of gene expression is, in large part, subject to purifying selection is hardly unexpected. The resulting view of gene expression evolution is not unlike the model of sequence evolution under the neutral theory which is compatible with the well-supported observation that highly expressed genes tend to evolve slowly (Pal et al., 2001; Jordan et al., 2004; Zhang and Li, 2004). However, another finding reported here does seem surprising, namely, that tissue-specific gene expression profiles, unlike gene expression levels, do not change at a constant rate (in a clock-like fashion) in the course of mammalian evolution. This observation does not seem to be compatible with any version of the neutral model for evolution of gene expression, even one that includes purifying selection. Instead, this result suggests that positive selection could be a substantial factor affecting changes in the patterns of gene expression during evolution. Clearly, much more analysis of the intragenomic and intergenomic patterns of expression divergence between homologous genes is required before a robust model of evolution of gene expression is developed.
Human, mouse, and rat CDS and protein sequences were taken from the NCBI’s RefSeq database (Pruitt and Maglott, 2001). The NCBI’s LocusLink database (Pruitt and Maglott, 2001) was used to ensure that only one sequence per locus was retained for further analysis. For loci that encode multiple transcripts, the longest form was taken. Orthologous gene sets were identified as symmetrical best hits in all-against-all, between-genome BLASTP searches as described previously (Jordan et al., 2001). The program ClustalW (Thompson et al., 1994) was used to align orthologous protein sequences. Nucleotide CDSs were aligned to correspond to protein sequence alignments. Nucleotide sequence distances were calculated using the Jukes–Cantor correction for multiple substitutions (Jukes and Cantor, 1969) and protein sequence distances were calculated using the correction based on the gamma distribution of site rate variation (Ota and Nei, 1994) with α = 2. Pairwise nucleotide distances (d) were converted to branch lengths on the human–mouse–rat phylogeny as described previously (Jordan et al., 2001).
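The two distance corrections named above can be sketched as follows. The Jukes–Cantor formula is the standard d = −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites. For proteins, the sketch uses the commonly cited gamma distance d = α[(1 − p)^(−1/α) − 1]; that this is exactly the form applied in the original analysis is an assumption. The handling of gaps and ambiguous characters (they are skipped) and the function names are also illustrative choices, not taken from the source.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # ASSUMPTION: columns with gaps or ambiguity codes are ignored.
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared
    if p >= 0.75:
        return math.inf  # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def gamma_protein_distance(p, alpha=2.0):
    """Gamma-corrected distance from the proportion p of differing residues."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
    print(gamma_protein_distance(0.19))
```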
Gene expression data, based on Affymetrix microarray experiments, for human, mouse, and rat are from the mammalian gene expression atlas (Su et al., 2004). These expression data were retrieved from the UCSC Genome Browser (Karolchik et al., 2003). Affymetrix probe identifiers (ids) were mapped to human, mouse, and rat genomic loci using UCSC Genome Browser and NCBI annotations as shown below:
Only Affymetrix probes that map to unique genomic loci were considered for further analysis. When loci were found to be covered by multiple probes, the probe yielding the highest overall expression level was used in subsequent analyses.
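A minimal sketch of this probe filtering step is given below. The input layout (a mapping from probe id to the set of loci it hits, and a mapping from probe id to its per-tissue expression values) is hypothetical, as are the function name and the example identifiers. "Highest overall expression level" is interpreted here as the largest sum across tissues, which is an assumption.

```python
def select_representative_probes(probe_loci, probe_profiles):
    """Pick one probe per locus.

    probe_loci: dict mapping probe id -> set of loci the probe maps to.
    probe_profiles: dict mapping probe id -> list of per-tissue levels.
    Returns a dict mapping locus -> chosen probe id.
    """
    best = {}  # locus -> (probe id, total expression)
    for probe_id, loci in probe_loci.items():
        # Keep only probes that map to exactly one genomic locus.
        if len(loci) != 1:
            continue
        locus = next(iter(loci))
        # ASSUMPTION: overall expression = sum over all tissues.
        total = sum(probe_profiles[probe_id])
        if locus not in best or total > best[locus][1]:
            best[locus] = (probe_id, total)
    return {locus: probe_id for locus, (probe_id, _) in best.items()}


if __name__ == "__main__":
    loci = {"p1": {"geneA"}, "p2": {"geneA"}, "p3": {"geneB", "geneC"}}
    profiles = {"p1": [10, 20], "p2": [50, 5], "p3": [900, 900]}
    print(select_representative_probes(loci, profiles))
```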
To measure changes in gene expression levels and changes in tissue-specific expression patterns across species, tissue samples common to all species being compared were identified. There were 28 common tissues with expression data for both human and mouse and 10 common tissues with expression data for human, mouse, and rat. For the comparison between protein sequence divergence and expression level, expression levels were taken as the sum of all expression levels over 28 tissues shared between the human and mouse expression data sets. Using these same 28 tissues, changes in gene expression level between species were calculated by dividing the absolute value of the difference between expression levels by the sum of the expression levels being considered:
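The relative difference described in words above, the absolute difference between two expression levels divided by their sum, can be sketched as follows. Each gene's expression level is taken as the sum over the shared tissues, as stated in the text. The function names and the choice to raise an error when both levels are zero are illustrative assumptions.

```python
def total_expression(profile):
    """Overall expression level: sum of levels over the shared tissues."""
    return sum(profile)


def relative_level_difference(profile_a, profile_b):
    """|E_a - E_b| / (E_a + E_b) for two orthologs' tissue profiles.

    The result lies between 0 (identical levels) and 1 (one gene silent),
    provided expression values are non-negative.
    """
    level_a = total_expression(profile_a)
    level_b = total_expression(profile_b)
    if level_a + level_b == 0:
        # ASSUMPTION: pairs with no expression at all are rejected.
        raise ValueError("both expression levels are zero")
    return abs(level_a - level_b) / (level_a + level_b)


if __name__ == "__main__":
    human = [100.0, 300.0]  # toy two-tissue profiles, not real data
    mouse = [200.0, 400.0]
    print(relative_level_difference(human, mouse))  # 0.2
```

Dividing by the sum makes the measure independent of absolute expression magnitude, so highly and weakly expressed gene pairs can be compared on the same scale.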
For the comparison between protein sequence divergence and expression breadth, expression breadth was taken as the number of tissues (out of 28 total possible) in which a gene showed an expression level ≥ 350 in the mammalian gene expression atlas. The relative differences in expression breadth between species were calculated in the same way as shown above (Eq. (2)) for expression level differences. For the comparisons between human, mouse, and rat expression levels, each gene’s species-specific relative expression level was determined by taking the sum of all expression levels over the 10 shared tissues and dividing by the average of those sums for the species from which the gene is derived:
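Both quantities defined in this paragraph can be sketched briefly. The threshold of 350 is taken from the text; the function names and the dictionary layout (gene id mapped to a list of per-tissue levels for one species) are hypothetical.

```python
def expression_breadth(profile, threshold=350.0):
    """Number of tissues in which the expression level is >= threshold."""
    return sum(1 for level in profile if level >= threshold)


def relative_expression_levels(profiles_by_gene):
    """Species-specific relative expression level for every gene.

    profiles_by_gene: dict mapping gene id -> per-tissue levels for ONE
    species. Each gene's summed level is divided by the mean summed level
    over all genes of that species.
    """
    totals = {gene: sum(profile) for gene, profile in profiles_by_gene.items()}
    if not totals:
        raise ValueError("no genes supplied")
    mean_total = sum(totals.values()) / len(totals)
    if mean_total == 0:
        raise ValueError("mean expression level is zero")
    return {gene: total / mean_total for gene, total in totals.items()}


if __name__ == "__main__":
    print(expression_breadth([100.0, 350.0, 900.0, 349.9]))  # 2
    toy_species = {"g1": [100.0, 100.0], "g2": [300.0, 300.0]}
    print(relative_expression_levels(toy_species))  # {'g1': 0.5, 'g2': 1.5}
```

Normalizing by the species mean places the three species on a common scale before the absolute distances between relative levels are taken.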
The absolute distances between relative expression levels were taken and converted into branch lengths as described previously (Jordan et al., 2001).
In order to compare tissue-specific gene expression profiles, each gene (probe) is represented as a vector of tissue-specific expression levels. Orthologous gene expression patterns are then compared by calculating the Euclidean distances between the expression vectors (n=28 for human–mouse and n=10 for human–mouse–rat) of each species-specific gene in the orthologous set. Between-species Euclidean distances were converted into branch lengths for the relative rates test as was done for the gene sequences (Jordan et al., 2001). Tissue-specific expression profiles between human and mouse were also compared using Pearson correlation coefficients.
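The two profile comparisons can be sketched with plain Python. Both functions assume that the two vectors list the same tissues in the same order; the function names are illustrative and no normalization beyond what the text describes is applied.

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two tissue-specific expression vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same tissues")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def pearson_correlation(profile_a, profile_b):
    """Pearson correlation coefficient between two expression vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same tissues")
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    human = [1.0, 2.0, 3.0]  # toy three-tissue profiles, not real data
    mouse = [2.0, 4.0, 6.0]
    print(euclidean_distance(human, mouse))
    print(pearson_correlation(human, mouse))
```

In the toy example the two profiles have the same shape but different magnitudes, so the Euclidean distance is nonzero while the correlation is 1. This illustrates why Euclidean distances can be inflated for highly expressed gene pairs, whereas the Pearson coefficient is insensitive to overall level.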
As a control for gene expression level and gene expression profile divergence, gene expression differences between 10,000 pairs, consisting of one randomly chosen human gene and one randomly chosen mouse gene, were compared as described above for orthologous gene pairs.
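The random-pair control can be sketched as below for the expression level measure. The default of 10,000 pairs follows the text. Sampling with replacement, the fixed seed, the function names, and the dictionary layout (gene id mapped to summed expression level) are assumptions made for illustration.

```python
import random


def relative_level_difference(level_a, level_b):
    """|E_a - E_b| / (E_a + E_b) for two summed expression levels."""
    if level_a + level_b == 0:
        raise ValueError("both expression levels are zero")
    return abs(level_a - level_b) / (level_a + level_b)


def random_pair_control(human_levels, mouse_levels, n_pairs=10_000, seed=0):
    """Expression level differences for randomly paired human/mouse genes.

    human_levels, mouse_levels: dicts mapping gene id -> summed level.
    ASSUMPTION: genes are drawn independently, with replacement.
    """
    rng = random.Random(seed)
    human_genes = list(human_levels)
    mouse_genes = list(mouse_levels)
    differences = []
    for _ in range(n_pairs):
        h = rng.choice(human_genes)
        m = rng.choice(mouse_genes)
        differences.append(
            relative_level_difference(human_levels[h], mouse_levels[m])
        )
    return differences


if __name__ == "__main__":
    human = {"h1": 100.0, "h2": 400.0, "h3": 50.0}  # toy values
    mouse = {"m1": 100.0, "m2": 900.0, "m3": 60.0}
    control = random_pair_control(human, mouse, n_pairs=1000)
    print(sum(control) / len(control))  # mean difference for random pairs
```

The resulting distribution can then be compared with the distribution for orthologous pairs, for example with the Student's t-test reported in the text.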