|Home | About | Journals | Submit | Contact Us | Français|
By comparing the patterns of evolution in the coding and upstream noncoding regions of yeast ribosomal protein (RP) genes duplicated in a genome duplication, we find that although nonsynonymous sites in the coding sequences show strong evidence for the fixation of recent gene conversion events, similar patterns are less evident among the synonymous positions and noncoding regulatory elements. This result suggests a potential explanation for the somewhat puzzling fact that duplicated RP genes are not functionally redundant despite their very high protein sequence identity. An analysis of the patterns of regulatory network evolution after genome duplication also indicates that the duplicated proteins have diverged considerably in expression despite their similar protein sequences.
With the completion of the genomic sequencing of numerous organisms, it has become evident that polyploidization (or whole-genome duplication [WGD]) events have occurred in diverse lineages, including flowering plants (Blanc et al. 2000; The Arabidopsis Genome Initiative 2000; Tuskan et al. 2006), amoeba (Aury et al. 2006), and vertebrates (Meyer and Van de Peer 2005). The first such event to be detected in a whole-genome sequence was in Saccharomyces cerevisiae (Wolfe and Shields 1997): striking confirmation of this event was found with the two-to-one mapping of chromosomal regions in S. cerevisiae to the genomes of other yeasts lacking the WGD (Dietrich et al. 2004; Dujon et al. 2004; Kellis et al. 2004).
Polyploidization events are often followed by substantial losses of duplicated genes (Sémon and Wolfe 2007). Which of the two duplicate copies is lost is generally thought to be selectively neutral: if two populations lose alternative copies such “reciprocal gene loss” can contribute to reproductive isolation and hence speciation (Werth and Windham 1991), events inferred to have occurred in yeast by Scannell et al. (2006). The nature of the duplicate genes that are retained is also of interest: functional classes of genes such as transcription factors, kinases, and ribosomal proteins (RPs) commonly remain duplicated after WGD but, surprisingly, are not generally duplicated in smaller events (Seoighe and Wolfe 1999; Blanc and Wolfe 2004; Maere et al. 2005; Aury et al. 2006; Freeling and Thomas 2006; Amoutzias et al. 2010). Wolfe (2000) has proposed the name ohnologs (in honor of Susumu Ohno) for these duplicate genes surviving from WGD. By mapping the relative genome orders of S. cerevisiae and seven related species, Byrne and Wolfe (2005) have provided an essentially complete list of S. cerevisiae ohnologs.
In S. cerevisiae, approximately 10% of surviving ohnologs are in fact genes encoding RPs (Planta and Mager 1998; Byrne and Wolfe 2005; Kim et al. 2009) despite the fact that these genes represented only 3.5% of the preduplication genome (Gordon et al. 2009). It has been suggested that the RP genes may be among those genes for which there is selection to maintain (high) relative gene dosage after WGD (i.e., the dosage balance hypothesis; Papp et al. 2003; Koszul et al. 2004; Freeling and Thomas 2006; Birchler and Veitia 2007; Edger and Pires 2009). Given this hypothesis, it is suggestive that many of the RP ohnologs are very similar in sequence; in fact, it is thought that these genes have undergone one or more gene conversion events post-WGD (Kellis et al. 2004). Gene conversion occurs when recombination causes the overwriting of variations in one gene with the corresponding bases from another allele or paralog. Initial within-genome comparisons in the S. cerevisiae genome suggested reasonably high rates of gene conversion among gene duplicates (Drouin 2002); later work showed that conversion events among those duplicated genes tended to be confined to duplicate pairs retaining high sequence identity (Benovoy et al. 2005). In yeast, conversion can occur through a number of mechanisms, both during mitosis and meiosis (Chen et al. 2007). Interestingly, one of those mechanisms is mitotic conversion mediated by an mRNA or cDNA intermediate (Derr and Strathern 1993; Storici et al. 2007), as originally suggested by Baltimore (1985); notably, such a mechanism could impart biases in the location of conversion events.
One effect of gene conversion can be to erase the historical sequence divergence between paralogs, and one can plausibly argue that any functional differences between the two genes would be erased simultaneously. Curiously however, there are examples in yeast of paralogous RPs with high sequence identity (>97%) that nonetheless differ in their functional roles (Ni and Snyder 2001; Enyenihi and Saunders 2003; Kaeberlein et al. 2005; Komili et al. 2007; Kim et al. 2009).
Here, we examined the patterns of surviving gene conversions in the yeast RP ohnologs, finding strong evidence for recent gene conversion at the nonsynonymous coding positions of these genes but little evidence of such conversion events in their upstream noncoding regions. An analysis of the RP ohnolog expression network also showed dissimilar expression patterns, consistent with regulatory divergence between the copies being responsible for the observed functional divergence.
A total of 55 previously described WGD-produced duplicate RPs (Planta and Mager 1998; Conant and Wolfe 2006) were analyzed. To this set, we added 84 pairs of enzyme genes duplicated at the WGD, identified by cross-referencing the list of metabolic genes of Kuepfer et al. (2005) to the set of S. cerevisiae ohnologs (Byrne and Wolfe 2005).
For these two lists (totaling 139 duplicate pairs), we next identified the corresponding orthologous genes in the genome of S. bayanus. Orthology inference in post-WGD species is challenging due to reciprocal gene loss, which can give rise to paired homologous genes that are paralogous rather than orthologous (fig. 1; Scannell et al. 2006, 2007). We have previously developed a maximum likelihood method that addresses this problem (Conant and Wolfe 2008). Briefly, the analysis begins with an inferred pre-WGD gene order (similar to that of Gordon et al. 2009). A model of duplicate gene loss after WGD allows us to estimate the relative speciation times of the taxa analyzed and the probability of all possible orthology assignments. Thus, in figure 1, we estimate with greater than 99.99% confidence that S. bayanus gene number 34.11 is the ortholog of S. cerevisiae gene RPL26B as opposed to the alternative possible assignment that makes gene 34.11 the ortholog of RPL26A. Importantly, these inferences rest only on the relative gene orders: gene sequences are not considered.
From our list of 55 RP gene duplicates and 77 enzyme duplicates (metabolic protein [MP] genes) for which orthology estimation was possible, we selected the 29 RP gene pairs and 76 MP pairs for which the probability of the orthology assignment between S. cerevisiae and S. bayanus was >0.98. These genes represent a set for which we have high confidence orthology information independent of the sequences themselves.
We next analyzed the sequence divergence in the coding regions of S. cerevisiae ohnolog pairs (fig. 2A). To do so, we aligned sequence triplets consisting of two ohnologs from S. cerevisiae (Scer1 and Scer2 below) and the S. bayanus gene orthologous to Scer1 (Sbay below) using T-Coffee (Notredame et al. 2000). From these alignments, we estimated the number of nonsynonymous substitutions per nonsynonymous site (Ka) for each of the three branches in figure 2A by maximum likelihood, using our previously described software (Conant and Wagner 2003). Similar calculations are possible using the program HyPhy (Kosakovsky Pond et al. 2005). The same calculations were made for the synonymous sites (Ks). Note that for most S. cerevisiae ohnolog pairs, there are actually two possible triplets because the corresponding S. bayanus genes are also ohnologs. In such cases, we performed both comparisons (meaning that the identity of Scer1 and Scer2 was switched in the second case).
To test the statistical support for an inference of gene conversion between genes Scer1 and Scer2, we employed a likelihood ratio test (Sokal and Rohlf 1995). First, we identified cases where KaB > Ka1, Ka2 (i.e., the signature of gene conversion; fig. 2C) and calculated the likelihood of the sequence alignment under this model (lnLH0). We then constrained the model such that Ka1 = KaB and calculated the likelihood under this alternative model (lnLHA). We compared 2·(lnLH0-lnLHA) to a chi-square distribution with 1 degree of freedom. An identical approach was used for the analysis of Ks.
To calculate pairwise noncoding region divergence, we first extracted the complete upstream intergenic region between each gene and the coding region of its 5′ neighbor. We then used the pairwise local alignment algorithm of Smith and Waterman (1981) to compute alignments between (Scer1 and Scer2), (Scer1 and Sbay), and (Scer2 and Sbay). Noncoding DNA tends to evolve rapidly (Lavoie et al. 2010), so to be sure that these alignments represent evolutionarily conserved regions and not simply statistical noise, we compared their local alignment scores against an expected distribution drawn from the genome at large. Briefly, using the complete collection of genes in the S. cerevisiae genome, we randomly selected 1,000 pairs and performed the same procedure of upstream region local alignment. We then compared the alignment scores obtained from these alignments with those seen for our orthologous and paralogous gene pairs. Scores in the upper 5% of this randomized distribution were inferred to show evolutionary conservation.
To determine if the RP gene expression data showed fewer crossing edges than would be expected by chance, we randomized the networks and recalculated the optimal partitioning. Randomization was performed by selecting every possible quartet of two pairs of duplicates. These four node subgraphs were replaced at random by another four node subgraphs with the same number of edges (Conant and Wolfe 2006). The probability of each such subgraph was calculated based on the inherent asymmetry in interaction degree between paralogs. Thus, we calculated the average fraction p of the total number of interactions for a paralog pair that belonged to the interaction-rich paralog. The probability of an interaction joining two interaction-rich genes is thus p2, whereas the probability of an interaction joining an interaction-rich and interaction-poor gene is then 2p (1 – p). Subgraph probabilities are calculated accordingly. The number of crossing edges in the original network was then compared with the distribution of number of crossing edges seen in 1,000 randomized networks.
Using our previously described maximum likelihood model of gene loss following WGD (Conant and Wolfe 2008), we inferred orthologous chromosomal regions between two species sharing the WGD (S. cerevisiae and Saccharomyces bayanus) using only gene order information (fig. 1). As an aside, we note that we selected S. bayanus as an outgroup because it represents the most closely related yeast species for which the duplicated chromosomal segments created by the WGD have been mapped to their orthologous segments in S. cerevisiae (Byrne and Wolfe 2005; Conant and Wolfe 2008; Gordon et al. 2009). Our gene order-based approach allowed us to conclude with high confidence that all the duplicate gene loci we considered evolved according to the species tree in figure 2B (see Materials and Methods). Despite this fact, it is not necessarily the case that the sequences themselves will follow this set of relationships. In particular, a gene conversion event between Scer1 and Scer2 that occurred after the speciation of S. cerevisiae and S. bayanus would overwrite the historical signal in the sequences of the two genes and give rise to a gene tree of the form of figure 2C.
Using estimates of Ka for triplets of RP genes (see Materials and Methods), we asked whether the pattern of nonsynonymous divergence in each triplet was most compatible with divergence after WGD (i.e., Ka2 > Ka1, KaB, fig. 2B) or with a recent gene conversion event (KaB > Ka1, Ka2, fig. 2C). Of the 29 pairs of duplicated RP genes in S. cerevisiae, two follow the pattern expected under WGD, two pairs present conflicting patterns of Ka values depending on the S. bayanus ortholog used, and the remaining 25 pairs have nonsynonymous divergences consistent with gene conversion. Intriguingly, when we use the same approach with the synonymous substitutions in the RP genes, we find many fewer cases where gene conversion needs to be invoked: only 10 of 29 duplicate pairs show any evidence of gene conversion at the synonymous sites. This difference is statistically significant (P = 0.0001, Fisher’s exact test).
We next assessed whether the signature of gene conversion in the sequence data was strong enough to statistically reject the possibility that phylogenetic relationships within the triplet were simply ambiguous. Thus, we compared a model allowing gene conversion (KaB > Ka1, Ka2) with an alterative model where KaB was constrained to be equal to Ka1. Of the 25 RP duplicate pairs with signatures of gene conversion in Ka, 17 showed statistically significant improvement when a model allowing gene conversion was used (P < 0.05, likelihood ratio test). For the synonymous sites, only 6 of the 29 pairs showed significant evidence for gene conversion (P < 0.05).
Using a GenomeHistory homology search (Conant and Wagner 2002), we identified two pairs of duplicated RP genes in the S. cerevisiae genome that do not appear to derive from WGD: RPL9A/B and RPS22A/B. Using synteny data from the Yeast Genome Order Browser (Byrne and Wolfe 2005) to assign orthology between S. cerevisiae and S. bayanus, we find that RPL9A and B show significant evidence of gene conversion in Ka (but not Ks) following the split with S. bayanus (P < 0.003), but RPL22A and B show no signature of recent conversion.
We applied the above approach to a similar set of WGD-duplicated MP genes. Among the 76 pairs considered, only three show any signs of gene conversion in Ka and only two of those have significant improvement when the gene conversion model is used (P < 10−6, likelihood ratio test). The corresponding number of pairs showing signs of gene conversion for Ks are 6 and 4, (P < 0.008). These differences in the proportion of observed gene conversion events in Ka and in Ks between the RP and MP groups are highly significant (P < 10−10 and P < 0.002, respectively; Fisher’s exact test).
For each of the RP gene pairs considered above, we measured the sequence identity in their upstream noncoding regions. Among the pairs considered, 15 RP ohnolog pairs had local alignment scores significantly larger than would be expected for unrelated regions. For these pairs, we compared the alignment score S1,2 of the ohnolog pair (Scer1, Scer2) to the scores from the comparison of each paralog to Scer1's ortholog in S. bayanus (i.e., S1,B for Scer1, Sbay and S2,B for Scer2, Sbay). Cases where S1,2 > S1,B, S2,B were interpreted as evidence of upstream gene conversion. We found that only 1/15 (6%) of the triplets of pairwise noncoding alignments showed evidence of gene conservation compared with the 12/15 (80%) for Ka in the coding regions (from the analysis above; table 1). This difference in the prevalence of conversion events between the two locations is highly significant (P = 0.0002, Fisher’s exact test; table 1). Interestingly, when this same approach is applied to 39 MP genes, we find very few instances of gene conversion in either region (<10%) and no significant difference in the proportion of conversion events between the noncoding and coding regions (table 1).
To further test for gene conversion in the RP gene upstream regions, we asked whether the RP ohnologs share more common upstream regulatory motifs than do the S. cerevisiae–S. bayanus orthologs. Using the motifs defined in Kellis et al. (2003), we counted the number of shared motifs between genes Scer1 and Sbay and between Scer1 and Scer2. Contrary to the prediction of gene conversion, the average number of shared motifs was higher for the orthologs than for the ohnologs (0.51 vs. 0.33), although this difference was not statistically significant (P = 0.28, likelihood ratio test). Similar results were seen for the MP genes (data not shown).
We have previously described an algorithm for detecting network partitioning among WGD-produced duplicate genes (Conant and Wolfe 2006). As is illustrated in figure 3, paralogs are divided into two columns with ohnologs opposite each other. Gene expression data for 51 pairs of RP ohnologs were obtained from the expression compendia of Hughes et al. (2000) and overlaid as graph edges. We defined an edge between any two genes if they shared a correlation (Pearson’s r) in gene expression of 0.8 or greater across the set of more than 300 experiments (a threshold of 0.75 produced similar results; data not shown). We divide these edges into internal edges, connecting nodes in the same column (arcs or vertical lines in fig. 3), and crossing edges, joining nodes in opposite columns (diagonal lines in fig. 3). Note that the initial assignment of a particular paralog to the first or second column is arbitrary, meaning that there are 2n –1 possible unique partitioning of the duplicates into columns. We thus calculated the network partitioning that resulted in the fewest number of crossing edges (see Materials and Methods). For these data, the optimal partitioning had 102 crossing edges, which, although it appears to be a large number, is significantly smaller than the minimum number of crossing edges seen in the randomized networks (P < 0.001). It is also relevant to note the extreme degree of asymmetry evident in this figure: the paralogous RPs, despite their sequence similarity, do not have identical expression patterns.
We have found that although duplicated RP genes created by WGD show strong evidence of surviving recent gene conversion in their coding regions, the same is not true of the upstream noncoding regions. Moreover, even in the coding regions, there is a bias toward detecting conversion events at nonsynonymous positions. One role of these results is to shed light on previous analyses. For instance, Gao and Innan (2004) argued that high rates of gene conversion had resulted in overestimates of the rate of yeast gene duplication. However, the data set used by these authors was heavily biased toward RP genes: of the 68 duplicate pairs considered, fully 50 of them fall into the set of the 55 WGD-produced RP duplicate genes. Our results imply that RPs are likely to be somewhat unique in both their patterns of duplication and of gene conversion and therefore probably should not be used as a proxy for the genome at large.
The conservation in the sequences of the encoded RPs induced by gene conversion is not unexpected as these proteins are highly conserved across a wide range of taxa (Alksne et al. 1993; Manuell et al. 2005; Ross et al. 2007). RP genes are also somewhat unusual in their response to genome duplication: they have survived in excess after WGDs in a variety of genomes including yeast (Seoighe and Wolfe 1999; Blanc and Wolfe 2004; Maere et al. 2005; Aury et al. 2006).
Given the existence of cDNA or mRNA-based gene conversion (Derr and Strathern 1993; Storici et al. 2007), events that would be naturally limited to transcribed regions, one might ask if the patterns observed here might simply represent a mutational bias in the conversion events themselves. This model is attractive because RP genes are highly expressed and recombination and gene conversion are associated with high levels of transcription (Aguilera and Gómez-González 2008). However, a purely mutational bias explanation for these repeated gene conversion events is unsatisfying for three reasons. First, no bias against upstream conversion is evident for meiotic recombination. Given that the per-cell-cycle rate of conversion is roughly four orders of magnitude greater for meiosis than for mitosis (Barbera and Petes 2006), even the roughly 1,000-fold excess of mitotic to meiotic cell divisions seen in wild relatives of S. cerevisiae (Tsai et al. 2008) is insufficient to give an overwhelming signature of coding region gene conversion. Secondly, a mutational bias does not explain the absence of gene conversion among the metabolic genes: of the nine MP gene pairs with expression levels (Holstege et al. 1998) that are as high as those of the RP gene pairs, only one shows evidence of gene conversion (data not shown). Finally, such a bias does not explain the fact that synonymous sites in the RP genes show much less evidence for conversion events than do nonsynonymous sites. Thus, we argue that in addition to any mutational biases, some intervening process of selection is helping to fix nonsynonymous conversion events.
Another partial explanation for the similarity in yeast RP coding sequences is selection for high dosages of these proteins. There is evidence for dosage benefits from RP gene duplication (Koszul et al. 2004). However, this explanation does not wholly explain the biology of these ohnologs, as we will discuss below. Moreover, such conservation does not appear to be a general response to dosage selection: metabolic genes also likely survived in duplicate partly due to dosage selection (Blank et al. 2005; Piškur et al. 2006; Conant and Wolfe 2007; Merico et al. 2007; van Hoek and Hogeweg 2009) and yet do not have strong signatures of conversion.
Instead, these results provide evidence that even if the coding regions of the duplicated RP genes are still being homogenized by gene conversion, their expression patterns have diverged considerably since the WGD. In fact, a number of recent analyses that have demonstrated that the duplicated yeast RP genes are not, in fact, functionally interchangeable. Thus, several RP genes, but not their paralogs, have been shown to be essential for determining bud location in S. cerevisiae (Ni and Snyder 2001) and for localizing proteins to that bud (Komili et al. 2007). Similar patterns have been seen in Brassica napus, where RP genes show tissue-specific expression despite high amino acid sequence identity (Whittle and Krochko 2009).
An equally intriguing case is the difference in protein localization between the RP paralogs Rpl7a and Rpl7b. Rpl7a is much more highly expressed than is Rpl7b (Ghaemmaghami et al. 2003) but while Rpl7a is only found in the cytoplasm, Rpl7b, despite its lower abundance, is found both in the cytoplasm and in the nucleolus (Kim et al. 2009). This difference does not appear to be caused by variations in the coding sequences of the two genes: replacing the RPL7B sequence with that from RPL7A does not alter the cellular localization of the protein encoded at that locus (Kim et al. 2009). These authors propose that the localization difference is instead driven by preferential incorporation of Rpl7a into ribosomal subunits, meaning that the free protein is rarely present at the site of ribosome subunit assembly in the nucleolus. However, the origins of this difference in incorporation rate remain unclear given the apparent equivalence of the two protein sequences seen after sequence replacement.
What combination of phenomena might give rise to expression divergence coupled to strong protein sequence homogenization? One possibility is expression subfunctionalization. This hypothesis is partly supported by our network analysis, which indicates partial expression isolation between two groups of RP duplicate genes. Because the ribosome represents a tightly integrated functional module, expression subfunctionalization might still require very high degrees of protein identity between the RP paralogs, such that the proteins encoded by these paralogs are able to substitute for each other under the different expression conditions. This hypothesis of strong purifying selection acting on RP coding sequences is supported by the observation that these genes show fewer single-nucleotide polymorphisms per base pair than do the MP genes (0.002 vs. 0.007; polymorphism data taken from Schacherer et al. 2009). Likewise, an increased frequency of gene-expression coupled gene conversion events, as discussed above, could very well improve the ability of natural selection to maintain such coding sequence conservation. On the expression front, the subfunctionalization itself might be either quantitative (only expression of both paralogs gives sufficient protein product) or qualitative (the expression of the two paralogs varies with respect to each other temporally). Similarly, the process might follow either the purely neutral DDC model originally proposed by several authors (Force et al. 1999; Stoltzfus 1999) or involve other selective forces, including adaptive ones (Des Marais and Rausher 2008): for review, see Innan and Kondrashov (2010). Neofunctionalization is another possible explanation for the expression divergence among the RP genes, but we are skeptical that it would occur on this scale (more than 50 genes).
Finally, we note that a broader perspective argues that gene dosage and subfunctionalization are not mutually exclusive explanations for the fate of a duplicate gene pair: He and Zhang (2005) propose that a duplication might be initially preserved by subfunctionalization and then might later undergo neofunctionalization. In the case of the RPs, we suspect that the initial preservation of the duplicated RP genes was for reasons of gene dosage; this process was likely followed by subfunctionalization (He and Zhang 2005; Innan and Kondrashov 2010).
The key prediction of our model is that gene conversion among the RPs helps to maintain a coadapted functional module: the ribosome. In future, it would be very interesting to test if this prediction of coadaptation holds.
We would like to thank K. Byrne for assistance with the Yeast Gene Order Browser (http://wolfe.gen.tcd.ie/ygob/) and P. Edger, R. Miller, D. Natvig, and M. Werner-Washburne for helpful discussions. This work was partly supported by the Reproductive Biology Group of the Food for the 21st Century program at the University of Missouri.