There is now considerable evidence supporting the view that codon usage is frequently under selection for translational accuracy. There are, however, multiple forms of inaccuracy (missense, premature termination, and frameshifting errors) and pinpointing a particular error process behind apparently adaptive mRNA anatomy is rarely straightforward. Understanding differences in the fitness costs associated with different types of translational error can help us devise critical tests that can implicate one error process to the exclusion of others. To this end, we present a model that captures distinct features of frameshifting cost and apply this to 641 prokaryotic genomes. We demonstrate that, although it is commonly assumed that the ribosome encounters an off-frame stop codon soon after the frameshift and costs of mis-elongation are therefore limited, genomes with high GC content typically incur much larger per-error costs. We go on to derive the prediction, unique to frameshifting errors, that differences in translational robustness between the 5′ and 3′ ends of genes should be less pronounced in genomes with higher GC content. This prediction we show to be correct. Surprisingly, this does not mean that GC-rich organisms necessarily carry a greater fitness burden as a consequence of accidental frameshifting. Indeed, increased per-error costs are often more than counterbalanced by lower predicted error rates owing to more diverse anticodon repertoires in GC-rich genomes. We therefore propose that selection on tRNA repertoires may operate to reduce frameshifting errors.
A growing body of evidence supports the idea that codon usage patterns partially reflect selection to avoid errors during translation (reviewed in Drummond and Wilke 2009). But what types of error are being selected against and why? Misincorporation errors have arguably received the lion’s share of recent attention, but inserting the wrong amino acid is by no means the only, and perhaps not even the most common or costly, mishap that can occur during translation. For instance, the ribosome can also abandon the nascent polypeptide before completion (drop-off, premature termination error) or leave the correct reading frame and elongate the peptide chain based on nucleotide triplets never meant to serve as a template for protein synthesis (frameshifting error) (Parker 1989).
A failure to accurately decode the underlying codon lies at the heart of all of these translational errors. Consequently, detecting biased usage of more efficiently decoded (“translationally optimal”) synonymous codons is not in itself sufficient to implicate a particular error process. To go beyond diagnosing translational selection and attribute adaptive features of gene anatomy to specific error processes, we need to develop critical tests that can implicate one type of error to the exclusion of others. In this context, it is interesting to note that different translational errors exhibit variation with regard to the fitness costs involved (detailed below). Understanding and exploiting divergent cost dynamics might therefore hold the key to devising critical tests and assessing the relative evolutionary importance of different error processes.
Mistranslation events can be costly for a variety of reasons. Some cost models are focused on the erroneous “product” and propose that errors are deleterious because they abrogate function or because the mistranslated product elicits dominant negative effects downstream of translation (Drummond and Wilke 2009). For example, mistranslated proteins might misfold and, consequently, disrupt a variety of cellular processes, by interacting promiscuously with other proteins and forming toxic aggregates (Drummond and Wilke 2008) or by occupying quality control capacity (chaperones, proteases, etc.), thereby interfering with normal protein homeostasis.
Other cost models are centered on the notion that the act of generating an erroneous product can be costly in itself (“process cost”). Fitness costs here may arise through nonproductive occupation of ribosomal capacity, which can be rate limiting for growth (Shachrai et al. 2010) or through sequestration of other translational resources (amino acids, tRNAs, etc.), which may prevent other proteins from being made in a timely fashion (Stoebel et al. 2008). In addition, it has been suggested that the energy wasted in futile synthesis and degradation may constitute a relevant evolutionary cost (Wagner 2005, 2007).
One key prediction of error models that focus on process costs is that such costs should strongly covary with the length of the erroneous product because residency time at the ribosome, the level of resource sequestration and the amount of energy wasted in protein synthesis and degradation should all increase with length. In line with this prediction, Stoebel et al. (2008) found, when they induced lac genes in a lactose-free environment (i.e., expressing a protein without any functional benefit to the cell), that longer genes were associated with greater costs.
The strong theoretical link between product length and process-related fitness cost can inform strategies to pinpoint particular error processes behind adaptive codon usage patterns because different translational errors have stereotypically different effects on the length of the erroneous product. Misincorporation errors do not alter the length of the polypeptide relative to the wild-type protein. In contrast, premature termination errors lead to truncation of variable severity depending on where along the mRNA the error occurs. This has led to the prediction that nonsense errors should become increasingly more costly toward the 3′ end of the mRNA and that, concomitantly, selection should be more powerful in promoting accurate decoding toward the 3′ end (Eyre-Walker 1996). Consistent with this prediction, optimal codon usage increases toward the 3′ end of coding sequences in Escherichia coli (Qin et al. 2004; Stoletzki and Eyre-Walker 2007). Importantly, this constitutes a critical test for translational selection against errors other than missense errors because—unless missense errors also promote drop-off—misincorporation errors do not predict a gradient in the leverage of selection increasing toward the 3′ end of the mRNA.
In this study, we ask whether frameshifting errors show process cost dynamics that discriminate them from other types of translational error and can thus help us gain a better understanding of the role of frameshifting avoidance in shaping gene anatomy.
Building on previous work (Huang et al. 2009), we present a simple quantitative model of frameshifting cost centered on genome-specific tRNA concentrations and relative binding affinities. Comparing process cost estimates across 641 prokaryotic genomes, we demonstrate that frameshifting errors exhibit process cost dynamics that are different from both missense and premature termination errors and can be exploited to establish support for the hypothesis that selection against frameshifting at least in part explains differential codon adaptation at the 5′ and 3′ termini of mRNAs. Furthermore, our study highlights that comparative genomic estimates of the costs of translational error can be highly misleading when mRNA sequences are considered in isolation or with disregard to species-specific biology. This is principally because there are strong interactions between process cost, GC content, tRNA repertoire, and error rates that generate considerable variability in average expected frameshifting costs across prokaryotic genomes.
We downloaded protein-coding sequences for 1,035 complete prokaryotic genomes from the National Center for Biotechnology Information (NCBI) (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) in February 2010. Applying custom scripts, we filtered the data to limit analysis to genes with a multiple of three nucleotides (n = 4 genes excluded based on this criterion), without ambiguous nucleotides or internal in-frame stop codons, and with a proper stop according to the relevant NCBI translation table, either table 11 (TGA, TAG, TAA) or table 4 (TAG, TAA) where TGA is decoded as tryptophan.
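The filtering criteria listed above can be sketched as a single predicate. This is a minimal illustration, not the custom scripts used in the study; the function name `passes_filter` and the `STOP_CODONS` mapping are hypothetical, and only the two translation tables named in the text are covered.

```python
# Stop codons per NCBI translation table, as listed in the text.
STOP_CODONS = {
    11: {"TGA", "TAG", "TAA"},
    4: {"TAG", "TAA"},  # TGA is decoded as tryptophan in table 4
}


def passes_filter(cds: str, table: int = 11) -> bool:
    """Return True if a coding sequence meets the four stated criteria."""
    seq = cds.upper()
    stops = STOP_CODONS[table]
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False  # length must be a multiple of three
    if set(seq) - set("ACGT"):
        return False  # ambiguous nucleotides present
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[-1] not in stops:
        return False  # no proper terminal stop for this table
    if any(codon in stops for codon in codons[:-1]):
        return False  # internal in-frame stop codon
    return True
```

Note that the same sequence can pass under one table and fail under the other: an internal TGA is an in-frame stop under table 11 but an ordinary tryptophan codon under table 4.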
For 756 of these genomes, we could obtain information on the copy number and diversity of different tRNA isoacceptors from tRNADB-CE (Abe et al. 2009). For the purpose of this study, we make the reasonable assumption (Dong et al. 1996; Kanaya et al. 1999) that tRNA copy numbers are an adequate proxy for cellular tRNA concentrations in unicellular organisms. For reasons described in the supplementary methods (Supplementary Material online), we discarded an additional 115 genomes to yield a final set of 641 genomes (supplementary table 1, Supplementary Material online).
We suggest that the genomic process cost of accidental frameshifting (CG) is approximated by

$$C_G = \sum_{g=1}^{G} \sum_{i=1}^{L} p_i \,(n_{pre} + n_{post})\, t_i$$

where pi is the probability that a frameshift occurs at codon i (detailed below); npre and npost are the numbers of peptide bonds made before and after an error occurs at codon i (fig. 1), respectively; and ti is the number of times codon i is translated.
The model is nested so that we can obtain a per-gene estimate through summing across the entirety of codons (L) in a given mRNA, and a per-genome estimate through summing per-gene estimates across the entirety of mRNAs (G). Below, we focus on average cost per site or per gene because summing across all genes to determine genomic cost is likely to be misleading in the absence of information on translation levels. Note that, for the purpose of this analysis, we assume that every frameshifting error will yield a completely nonfunctional product. Although there may be an argument that functionality is more likely to be preserved when frameshifting occurs at the 3′ end of the mRNA, it is difficult to see how to systematically discount costs in a biologically relevant manner without detailed, gene-specific information on the impact of truncation and mis-elongation on functionality. In addition, there is evidence that translational selection operates even at the very 3′ end of mRNAs (Tuller et al. 2010), strongly suggesting that these regions are typically not functionally dispensable.
We define npost as the number of codons translated before the ribosome encounters the first off-frame stop codon or the coding sequence ends. Note that in the latter case npost represents a conservative estimate because the translation unit may not end at the 3′ end of the coding sequence. This is particularly true for bacteria, where mRNAs are often polycistronic (Sorek and Cossart 2010).
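The definition of npost can be made concrete with a short sketch. It assumes that, after a slip at 0-based codon index `i`, reading resumes at nucleotide `3*(i+1)+shift`, that the off-frame stop itself is not counted, and that an incomplete trailing triplet is ignored; these conventions, and the function name `n_post`, are illustrative assumptions rather than details taken from the study.

```python
STOPS = frozenset({"TGA", "TAG", "TAA"})  # table 11; table 4 would omit TGA


def n_post(cds: str, i: int, shift: int, stops=STOPS) -> int:
    """Count codons read off-frame after a slip at 0-based codon index i.

    shift is +1 (downstream slip) or -1 (upstream slip). Counting ends at
    the first off-frame stop codon or when the coding sequence runs out.
    """
    if shift not in (1, -1):
        raise ValueError("shift must be +1 or -1")
    seq = cds.upper()
    start = 3 * (i + 1) + shift  # assumed position where reading resumes
    count = 0
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in stops:
            break  # first off-frame stop ends mis-elongation
        count += 1
    return count
```

For the toy sequence `ATGGCTAAGCCGTAA`, a +1 slip after the first codon meets no off-frame stop and runs to the end of the sequence, whereas a -1 slip reads one triplet and then hits an off-frame TAA.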
Error propensity can differ considerably across sites and also depends on the state of the translational machinery. For example, homomeric nucleotide runs appear much more liable to frameshifting (Farabaugh 1996) than other sequence contexts, and there is ample evidence that the composition of the cellular tRNA pool is a critical determinant of decoding accuracy and, consequently, the propensity for frameshifting. Increasing the concentration of a particular tRNA results in reduced frameshifting frequencies at the corresponding codons (Atkins et al. 1979; Curran and Yarus 1989; Sipley and Goldman 1993). Conversely, codons matched by rare tRNAs are particularly liable to frameshifting (Sipley and Goldman 1993; Farabaugh and Björk 1999) and amino acid starvation can substantially increase the likelihood of frameshifting at codons read by the affected tRNA (Gallant and Lindsley 1992, 1993; Kolor et al. 1993).
Farabaugh and Björk (1999) have suggested that tRNA-mRNA interactions at the ribosome can, in fact, provide a unifying model to understand accidental frameshifts, where frameshifting probability is principally a function of relative tRNA concentrations and binding affinities. Briefly, the authors proposed that frameshifting can occur when a near-cognate tRNA erroneously binds to the codon in the ribosomal A site—more likely when there is a relative shortage of cognate tRNAs—and, after translocation to the P site, the weak anticodon:codon interaction permits downstream (+1) or upstream (–1) slippage by one nucleotide if a sufficiently stable interaction can be formed in the new reading frame. Huang et al. (2009) recently presented a quantitative formulation of the Farabaugh and Björk model, where the probability pi of (+1) frameshifting at any one codon i is determined as
where Vi and Ri are the sets of near-cognate tRNAs able and unable to slip one nucleotide downstream, respectively; nt represents tRNA gene copy number; ntci the number of cognate tRNA genes of codon ci, and b a positive constant <1, denoted “weak binding coefficient” by Huang et al. (2009), which models the fact that binding of near-cognate tRNAs is less stable than binding of cognate tRNAs. For each genome, we derived Vi and Ri for all codon contexts based on a set of parsimonious anticodon:codon matching strategies proposed by Grosjean et al. (2010) (for details, see supplementary methods, Supplementary Material online).
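One way to make the roles of these symbols concrete is a simple competition ratio in which slip-capable near-cognate tRNAs, down-weighted by b, compete with cognate tRNAs and with near-cognate tRNAs that cannot slip. The functional form below is an assumption chosen for illustration and is not claimed to be the exact expression of Huang et al. (2009); the function and argument names are hypothetical.

```python
def frameshift_probability(n_cognate, slip_copies, stay_copies, b):
    """Illustrative (assumed) competition form for p_i.

    n_cognate   : copy number of cognate tRNA genes for the codon (n_tci)
    slip_copies : copy numbers n_t of near-cognate tRNAs able to slip (V_i)
    stay_copies : copy numbers n_t of near-cognate tRNAs unable to slip (R_i)
    b           : weak binding coefficient, 0 < b < 1
    """
    if not 0 < b < 1:
        raise ValueError("b must lie strictly between 0 and 1")
    slip = b * sum(slip_copies)  # weighted supply of slip-capable tRNAs
    stay = b * sum(stay_copies)  # weighted supply of non-slipping tRNAs
    denom = n_cognate + slip + stay
    if denom == 0:
        raise ValueError("no tRNA can read this codon")
    return slip / denom
```

Under this assumed form the probability depends only on relative copy numbers: doubling the cognate copy number lowers it, and a codon context with an empty set Vi has probability zero.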
The parameter pi captures an important aspect of decoding accuracy, namely that error rate is intrinsically dependent on the relative (rather than absolute) concentration of cognate, near-cognate, and noncognate tRNAs, so that it is critical to consider the diversity and relative abundance of tRNAs to assess tRNA-dependent translation parameters (Fluitt et al. 2007).
Different types of translational error are associated with different stereotypical process costs. Although premature termination errors incur costs approximately proportional to the number of residues translated before the error occurred (npre, fig. 1), frameshifting errors incur an additional cost (npost) because the ribosome carries on translating until it encounters an off-frame stop codon or the mRNA ends.
It is widely assumed that npost is typically small, courtesy of a high chance of encountering an off-frame stop codon in the immediate downstream neighborhood (Parker 1989; Farabaugh 1996; Farabaugh and Björk 1999; Itzkovitz and Alon 2007). Itzkovitz and Alon (2007) reported that, for an “average” genome (uniform codon usage and amino acid frequencies averaged over 134 genomes from all three kingdoms), the ribosome encounters a fortuitous off-frame stop on average only 15 codons downstream of the frameshifting error.
Figure 2A demonstrates that this estimate can be profoundly misleading when genomic GC content is high. Standard stop codons (TGA, TAA, TAG) are AT-rich and the probability of encountering AT-rich in-frame codons, required to specify the off-frame stop, decreases with increasing GC content. This is all the more pronounced for –1 frameshifts where a T at the 3rd codon position is required to yield an off-frame stop (fig. 1). In contrast to the first two codon positions, where A/T nucleotides may be required to specify amino acid identity, GC variability is much more extreme at 3rd sites (Muto and Osawa 1987) so that encountering a 3rd site T in a high-GC genome is comparatively rare.
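The direction of this effect can be illustrated with a deliberately crude null model. The sketch assumes that off-frame nucleotides are independent, with A and T each at frequency (1 − GC)/2 and G and C each at GC/2, which ignores codon structure, amino acid usage, and position-specific GC content; the function names are hypothetical and the numbers are not estimates for any real genome.

```python
def off_frame_stop_probability(gc: float, stops=("TGA", "TAG", "TAA")) -> float:
    """Probability that a random off-frame triplet is a stop (i.i.d. model)."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    freq = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}
    return sum(freq[s[0]] * freq[s[1]] * freq[s[2]] for s in stops)


def expected_codons_before_stop(gc: float) -> float:
    """Mean number of non-stop triplets read before the first stop.

    Treats successive triplets as independent trials (geometric waiting
    time) on an unbounded message, so it overstates n_post for short genes.
    """
    q = off_frame_stop_probability(gc)
    if q == 0.0:
        return float("inf")  # no A/T at all: a stop is never encountered
    return (1 - q) / q
```

Under these assumptions the expected run is about 20 triplets at 50% GC and about 110 at 80% GC, so the waiting time grows steeply as A and T become scarce.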
npost, however, only represents part of the process cost of an individual frameshifting error because it ignores the number of amino acids translated before the error occurred (npre). To approach a more realistic estimate of average genome-specific frameshifting cost, we computed npre + npost for every codon in every gene. Results suggest that GC-dependent differences in average cost between genomes might not be as pronounced as suggested by npost considered in isolation (fig. 2B). This is principally because npost typically contributes less than 20% (40%) of the total process cost (npre + npost) of +1 (−1) frameshifts even at extremely high GC (supplementary fig. 1, Supplementary Material online). At the same time, average npre varies across GC content only in as far as proteins tend to be slightly longer on average in genomes with higher GC content (supplementary fig. 2, Supplementary Material online).
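A rough calculation indicates why npost is usually the smaller component. Assume, purely for illustration, that an error at codon i is preceded by about i − 1 peptide bonds and that every codon of a gene of L codons is weighted equally. Then

$$\bar{n}_{pre} \approx \frac{1}{L}\sum_{i=1}^{L}(i-1) = \frac{L-1}{2}$$

For L = 256, this gives an average npre of about 127.5. Taking an npost of 15 codons as a reference value, the post-error share would be

$$\frac{n_{post}}{\bar{n}_{pre} + n_{post}} \approx \frac{15}{127.5 + 15} \approx 0.105$$

so npost would have to be several times larger before it dominated the total. The weighting by position and the two input values are simplifying assumptions, not quantities estimated in this way in the analysis.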
This reduction in between-genome variability notwithstanding, the average process cost still appears to be higher in genomes with high GC content. But do GC-rich genomes really shoulder a greater fitness burden in relation to frameshifting? Clearly, that depends on whether any one particular error actually occurs and, if so, how frequently. This is a function of the probability pi that the error occurs at the focal codon i, and the number of times that site is translated (ti). Although by-gene estimates of ti are not available for the vast majority of genomes, we can derive relative frameshifting probabilities for every possible codon context with reference to genome-specific tRNA competition at the ribosome (see Materials and Methods). Incorporating genome- and context-specific frameshifting probabilities into our model of process cost, we unexpectedly find the positive correlation between GC content and average +1 frameshifting cost reversed (rho = −0.18, P = 5.81 × 10^−6, fig. 2C). This is despite protein length increasing slightly with GC content (linear regression estimate of average protein length in genomes with 20% GC3: 256 amino acids [90% GC3: 278], supplementary fig. 2, Supplementary Material online). The average cost of −1 frameshifts, however, remains highest for high-GC genomes (rho = 0.35, P = 2.21 × 10^−20). Considering only one prokaryotic species per genus name to reduce phylogenetic nonindependence does not affect overall trends (data not shown).
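How weighting by pi can reverse a ranking based on npre + npost alone can be sketched as follows. The function name `mean_expected_cost` is hypothetical, ti defaults to 1 for every site because translation levels are unavailable, and the input values in the usage note are invented for illustration only.

```python
def mean_expected_cost(p, n_pre, n_post, t=None):
    """Average of p_i * (n_pre_i + n_post_i) * t_i across sites.

    t defaults to 1 for every site, i.e. translation levels are ignored.
    """
    if t is None:
        t = [1.0] * len(p)
    if not (len(p) == len(n_pre) == len(n_post) == len(t)):
        raise ValueError("all inputs must have the same length")
    if len(p) == 0:
        raise ValueError("at least one site is required")
    total = sum(pi * (a + b) * ti for pi, a, b, ti in zip(p, n_pre, n_post, t))
    return total / len(p)


# Invented two-site examples: the "high-GC" case has larger per-error costs
# (n_pre + n_post) but half the per-site frameshifting probability.
low_gc = mean_expected_cost([0.02, 0.02], [10, 50], [5, 5])
high_gc = mean_expected_cost([0.01, 0.01], [10, 50], [40, 10])
```

In this toy case the unweighted per-error cost is higher for the second input (55 versus 35 on average), yet its probability-weighted cost is lower, which is the kind of reversal described above.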
Why does incorporation of genome-specific frameshifting probabilities transform the relationship between GC content and estimates of frameshifting cost?
Comparing pi for each minimal shifting context (NNN|N for +1 shifts, N|NNN for –1 shifts) across genomes, we find that the majority of contexts exhibits a lower propensity for frameshifting with increasing GC content (negative tau in fig. 3). The altered relationship between GC content and cost is therefore not simply a function of different codon or dicodon usage, that is, less shifting-prone motifs being used more frequently at high GC content; systematic GC-linked changes in tRNA profiles must be a contributing factor. Conspicuously, GC-rich genomes typically sport a more diverse repertoire of anticodons (fig. 4, Kanaya et al. 1999; Rocha 2004; Higgs and Ran 2008; Ran and Higgs 2010). In particular, tRNAs with C or G in the first anticodon position, which we would expect to bind most stably to G- and C-ending codons, respectively, are typically present in high-GC genomes where G/C-ending codons are common but frequently spared in medium- or low-GC genomes (fig. 5) where these codons are read via wobble pairing with U in the first anticodon position. This is in line with theoretical expectations about the diversity of tRNAs required for efficient translation (Higgs and Ran 2008). We suggest that, in addition, larger anticodon repertoires in high-GC genomes will be selectively favorable as they reduce the burden of frameshifting error in genomes vulnerable to incurring large per-error costs.
What these results highlight, above all, is that comparing translational cost estimates between genomes will be misleading when sequence features are considered in isolation because other critical parameters (pi) can and do differ between genomes. In this context, we realize that our empirical evaluation falls short of giving a comprehensive comparative costing because we cannot at present incorporate translation levels (ti). We are keenly aware that, especially in fast-growing organisms, a large proportion of realized cost might be incurred by a relatively small number of highly expressed genes so that taking average cost across all sites (or even genes) might not adequately reflect genomic fitness burden. Once comprehensive quantitative transcriptome data becomes available for an extremely high-GC genome, it will be interesting to incorporate this information into our model to derive genuinely comparative genome-wide cost estimates.
Above we hypothesized that, in addition to selection on translational efficiency (Higgs and Ran 2008), increased richness of the tRNA repertoire in GC-rich genomes might be at least in part an adaptation to the comparatively larger per-error cost of frameshifting in these genomes. Is there, however, any evidence consistent with frameshifting as an important force in molecular evolution? Selection against premature termination errors predicts a gradient in codon adaptation toward greater decoding accuracy at the 3′ end of mRNAs, predicated on npre as the principal process cost. But npre also represents an important component of the process cost of frameshifting errors. Does selection against frameshifting errors contribute to intragenic gradients in codon adaptation?
The unique process cost dynamic of frameshifting errors, namely the existence of a post-error cost (npost), allows us to test for frameshifting involvement as follows: Consider an mRNA with very high GC content. At the extreme, even slipping up right at the start of the message leads to exactly the same cost as slipping up at the 3′ end because the ribosome will never encounter an off-frame stop and will therefore keep translating until the mRNA terminates. By implication, GC-rich genomes should benefit relatively less from greater robustness (1 − pi) against frameshifting errors toward the 3′ end of genes. We therefore predict that, if frameshifting avoidance is a relevant force determining heterogeneity in codon composition along the mRNA, the difference in frameshifting robustness between 5′ and 3′ ends will decline with increasing GC content. In contrast, selection against premature termination errors does not predict that the 5′-3′ differential in robustness should diminish with rising GC content.
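The limiting case can be written out explicitly. As a sketch, suppose an error at codon i of an mRNA with L codons is preceded by about i − 1 peptide bonds (the exact offsets depend on the counting convention). If no off-frame stop is ever encountered, the ribosome reads the remaining L − i codons, so that

$$n_{pre}(i) + n_{post}(i) \approx (i - 1) + (L - i) = L - 1$$

which does not depend on i. If instead an off-frame stop is reliably met after a short distance d (a hypothetical constant introduced here for illustration), then

$$n_{pre}(i) + n_{post}(i) \approx (i - 1) + d$$

which increases with i. The positional gradient in per-error cost, and with it the selective advantage of concentrating robustness at the 3′ end, therefore shrinks as off-frame stops become rarer.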
Replicating Huang et al.’s (2009) approach, we computed pairwise differentials in average frameshifting robustness across the terminal 5′ and 3′ 100 codons. Note that this analysis is internally controlled so that we do not expect differences in expression across genes and genomes to affect results. We observe a clear-cut tendency toward less pronounced 5′-3′ differences with increasing GC content (fig. 6), supporting a role for selection against frameshifting errors. Results are virtually identical when we exclude the first and last 30 codons, which are likely under selection for translational regulation (Tuller et al. 2010; 30-codon cutoff conservatively estimated from prokaryotic data in their fig. 2E and F).
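The terminal comparison can be sketched as below. The sign convention (3′ minus 5′), the requirement that the two windows do not overlap, and the function name are assumptions made for illustration; the study's own implementation may differ on each point.

```python
from statistics import fmean


def terminal_robustness_differential(p, window=100, trim=0):
    """Mean robustness (1 - p_i) of the 3' window minus that of the 5' window.

    p      : per-codon frameshifting probabilities along one gene
    window : number of codons compared at each terminus
    trim   : codons skipped at each end before the window starts
    Returns None for genes too short to hold two non-overlapping windows.
    """
    needed = 2 * (window + trim)
    if len(p) < needed:
        return None
    five_prime = p[trim:trim + window]
    three_prime = p[len(p) - trim - window:len(p) - trim]
    rob_5 = fmean([1 - x for x in five_prime])
    rob_3 = fmean([1 - x for x in three_prime])
    return rob_3 - rob_5
```

Calling the function with `trim=30` corresponds to excluding the first and last 30 codons before comparing the 100-codon windows. Because both windows come from the same gene, expression level enters both terms equally, which is the sense in which the comparison is internally controlled.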
Might this trend simply be a consequence of codon choice becoming less flexible at more extreme GC content? This would predict that differences in terminal robustness should also decline toward the AT-biased end of the spectrum. This we do not observe: We split the data into genomes with >50% GC and <50% GC and confined analysis to genomes where the most AT-biased genome was as far away from the 50% threshold as the most GC-biased genome (range 11–89% GC). We found significant positive relationships between GC3 and differential robustness for genomes with >50% GC (+1: rho = 0.32, P = 1.91 × 10^−8; −1: rho = 0.19, P = 0.0008, N = 302), yet no significant negative trends for genomes with <50% GC (+1: rho = −0.067, P = 0.24; −1: rho = 0.014, P = 0.81, N = 312). Moreover, an exponential fit outperforms a quadratic fit (Akaike's Information Criterion: −8,635 vs. −8,995), suggesting that a model that lacks an increase toward the AT-biased end of the spectrum provides a better description of the data.
The simple model of frameshifting process cost presented above illustrates a number of key issues relevant to assessing the role of frameshifting errors in shaping gene anatomy.
First, the notion that npost is typically short is misleading for genomes with high GC content.
Second, depending on the evolutionary question under consideration, arguments concerning the likely costliness of frameshifting have been focused on either npre or npost. But it is important to acknowledge that frameshifting incurs a compound cost (npre + npost), which distinguishes this particular translational error from, for example, premature termination or drop-off errors, which only incur npre. Such differences in cost dynamics can be exploited to attribute signatures of selection for translational accuracy to specific error classes. We explore these differences in the context of process costs because the fitness cost should here be linearly proportional to the length of the erroneous polypeptide. This does not imply, however, that product cost is unrelated to length. In fact, it seems likely that longer frameshifted tracts will on average also be less likely to be soluble and, consequently, have a greater potential to be disruptive, although, in contrast to process costs, the specific amino acid context will be critically important in this regard. Thus, high-GC genomes are likely faced with higher per-error product costs as well as process costs.
Third, comparative genomic analysis of frameshifting costs reveals that considering mRNA sequences in isolation and ignoring vital differences in translational machineries between genomes will produce a deceptive guide to fitness burden. In order to arrive at a genuine comparative estimate of the selective leverage of translational error, it will be imperative to incorporate differences in translation levels between genes and genomes, but our results already highlight the importance of differences in tRNA repertoire for relative susceptibility to translational error. It is intriguing that systematic changes in tRNA repertoire with GC content correlate with a reduction in the expected fitness burden related to frameshifting. But does this imply that differences in tRNA repertoires represent selected adaptations to reduce frameshifting costs or is anticodon diversity under selection for other reasons, for example translational efficiency (Higgs and Ran 2008), and reduction in error rates constitutes a fortuitous side effect? These two explanations are by no means mutually exclusive and might assume different relative importance depending on the lifestyle of the organism under consideration. For example, one would expect translational efficiency to be relatively more important in r-selected species where fast growth is critical for fitness. Fundamentally, the answer to this question will hinge on accurate quantitative determination of fitness costs of erroneous versus slow protein production.
Although our results clearly demonstrate that the link between process costs and GC content is readily transformed by differences in the translational apparatus, more concrete quantitative aspects of the current model should be interpreted with caution. For example, is the higher cost of –1 frameshifting in high-GC genomes real or rather an indication that the model does not incorporate an important determinant of frameshifting dynamics? Although it is conceivable that GC genomes find it intrinsically hard to reduce the cost of frameshifting errors and therefore genuinely shoulder a greater fitness burden in relation to frameshifting, it remains a distinct possibility that this cost is not actually incurred because high-GC genomes exhibit certain (adaptive) features in cis or trans which our model fails to capture. Notably, we adhere to prokaryotic consensus rules for anticodon:codon interactions proposed by Grosjean et al. (2010) to model binding stabilities and therefore propensities for frameshifting (see supplementary methods, Supplementary Material online), principally because this allows us to compare cost estimates across genomes. These rules are inevitably generalizations because decoding capacities cannot be perfectly predicted from sequence information alone. Importantly, anticodon residues themselves as well as tRNA nucleotides outside the anticodon loop can be posttranscriptionally modified in a variety of ways, with marked effects on decoding capacity (Cochella and Green 2005; Daviter et al. 2006; Grosjean et al. 2010) and/or translational fidelity (reviewed in Saks and Conery 2007), explicitly including reading frame maintenance (Qian and Björk 1997; Björk et al. 1999; Herr et al. 1999; Urbonavicius et al. 2003). Decoding accuracy is further affected by variation in other components of the translation machinery. 
This includes nucleotide substitutions or modifications in ribosomal RNA, which can cause more or less accurate decoding (Rodnina and Wintermeyer 2001; Baxter-Roshek et al. 2007). In addition, differences in cellular environment, notably Mg2+ ion concentrations (Gromadski and Rodnina 2004), can affect translation kinetics with implications for accuracy, proofreading behavior, and anticodon:codon affinities. Finally, we characterize accidental frameshifting as a local error, solely dependent on interactions at the focal codon and its immediate upstream or downstream neighbor. However, it is apparent from the analysis of programmed frameshifts that downstream secondary structure (hairpins, pseudoknots, etc.) in particular can dramatically affect the rates of shifting, probably at least in part by affecting ribosomal progression and thus residency at a given site (Farabaugh 1996).
Despite these various simplifications and uncertainties, however, the results presented here reinforce the notion that translational errors have been an important force in shaping mRNA anatomy and further suggest that selection might have shaped tRNA repertoires to reduce frameshifting errors.
The authors would like to thank Ed Feil and Eduardo Rocha for useful discussions and two anonymous reviewers for their constructive comments on the manuscript. This work was supported by a Medical Research Council Capacity Building Studentship to T.W. and National Library of Medicine/National Institutes of Health intramural research program to Y.H. and T.M.P.