|Home | About | Journals | Submit | Contact Us | Français|
The mistranslation-induced protein misfolding hypothesis predicts that selection should prefer high-fidelity codons at sites at which translation errors are structurally disruptive and lead to protein misfolding and aggregation. To test this hypothesis, we analyzed the relationship between codon usage bias and protein structure in the genomes of four model organisms, Escherichia coli, yeast, fly, and mouse. Using both the Mantel–Haenszel procedure, which applies to categorical data, and a newly developed association test for continuous variables, we find that translationally optimal codons associate with buried residues and also with residues at sites where mutations lead to large changes in free energy (ΔΔG). In each species, only a subset of all amino acids show this signal, but most amino acids show the signal in at least one species. By repeating the analysis on a reduced data set that excludes interdomain linkers, we show that our results are not caused by an association of rare codons with solvent-accessible linker regions. Finally, we find that our results depend weakly on expression level; the association between optimal codons and buried sites exists at all expression levels, but increases in strength as expression level increases.
Synonymous mutations are usually referred to as “silent,” but increasing evidence shows that they experience significant selection pressures in a wide range of organisms. For example, selection on synonymous sites has been linked to transcription, splicing, DNA secondary structure, and messenger RNA secondary structure and stability (Xia 1996, Vinogradov 2003, Chamary and Hurst 2005a, 2005b; Hoede et al. 2006, Parmley et al. 2006, Warnecke and Hurst 2007, Stoletzki 2008). The strongest selection pressure on synonymous sites, at least in microbes, seems to be selection for translational efficiency. This pressure causes highly expressed genes to be encoded predominantly by codons corresponding to highly abundant transfer RNAs (tRNAs). It has been observed in bacteria, plants, yeast, fly, worm, and even mammals (Ikemura 1981, Ikemura1985, Sharpetal1986, AkashiEyre-Walker1998, Duret2002, UrrutiaHurst2003, Comeron2004, Wrightetal2004, LavnerKotlar2005).
Selection for translational efficiency may reflect selection for rapid translation (speed selection), selection for translation with high fidelity (accuracy selection), or both. Akashi (1994) argued that selection for translational accuracy should lead to inhomogeneous codon usage within genes. More important sites (i.e., sites that are less robust to translation errors) should be more frequently encoded by codons with high fidelity than other sites. Akashi (1994) found such a signal in Drosophila. Later, others found similar results in Escherichia coli, yeast, worm, and mammals (Stoletzki and Eyre-Walker 2007, Drummond and Wilke 2008).
Akashi (1994) suggested to classify sites as important or not depending on whether they are conserved at the amino acid level in an orthologous sequence. But although there is good evidence that evolutionary conservation reflects functional (Lichtarge et al. 1996) and structural (Koshi and Goldstein 1995, Mirny and Shakhnovich 1999, Mirny and Shakhnovich 2001a, 2001b; Schueler-Furman and Baker 2003) constraints, multiple other factors influence evolutionary conservation as well. Among them are the divergence time between orthologous sequences, the background rate of amino acid substitutions (which can vary over several orders of magnitude among genes within the same organism), mutational biases, and random chance. Moreover, the relative frequency of mutations at sites with different biochemical properties (e.g., solvent accessibility) changes with sequence divergence (Sasidharan and Chothia 2007). Ultimately, instead of linking codon usage bias to conserved or variable sites, we would like to link codon usage bias to sites with specific biochemical properties. In line with this reasoning, Akashi (1994) carried out a second analysis in which he showed that preferred codon usage was increased in putative zink finger and homeodomain regions of transcription factors.
Motivated by Drummond and Wilke's (2008) hypothesis that translational accuracy selection minimizes the misfolding of mistranslated proteins, we test here whether translationally optimal codons are associated with structurally sensitive sites, that is, sites at which translation errors are particularly likely to cause misfolding. We focus on residues’ solvent accessibility because substitutions in the solvent-shielded core of proteins tend to be particularly disruptive (Matthews 1993, Tokuriki et al. 2007). As a second measure of structural sensitivity, we use computationally predicted changes in free energy upon mutation (ΔΔG values).
We consider four model organisms, E. coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Mus musculus. We define translationally optimal codons as those that are overrepresented in highly expressed genes and address the following questions: 1) Are optimal codons more likely to encode residues in the core of proteins or on the surface? 2) Is such an association a general characteristic for all amino acids or does it depend on the type of amino acid encoded? 3) Are the results affected by gene expression level? 4) Are the results affected by an excess of rare codons in interdomain linkers? 5) Are optimal codons more likely to occur at sites for which computational modeling predicts that amino acid substitutions are particularly disruptive?
To match gene sequences to protein structures, we used the GTOP (Genomes TO Protein structures and functions) database (Kawabata et al. 2002). We saved a match in the database for further calculation if its region of similarity was longer than 80% of the protein length and its sequence identity was larger than 40% of the sequence in the Protein Data Bank (PDB). This process yielded 822, 403, 947, and 1464 matches in E. coli, S. cerevisiae, D. melanogaster, and M. musculus, respectively.
For each protein with a match, we obtained the corresponding 3D crystal structure from the PDB. For every match, we retained only the specific matching peptide chain. After aligning the gene sequence and the sequence from the crystal structure with MUSCLE (Edgar 2004), we calculated the percent solvent-accessible surface area for each aligned residue with the DSSP program (Kabsch and Sander 1983). We normalized these results by the reference surface areas of an extended Gly-X-Gly peptide (Creighton 1992). We considered residues with less than 25% relative solvent accessibility as buried. Because we discarded all but the matching peptide chain, residues involved in protein–protein interfaces were considered as exposed for the purpose of our study.
We used the Rosetta ΔΔG module (Kortemme and Baker 2002, Kortemmeetal2004) to estimate the change in the free energy gap, ΔΔG, for all 19 possible single-point amino acid substitutions at each site. Although most mutations are destabilizing (ΔΔG > 0 kcal/mol) (Tokuriki et al. 2007), proteins are often quite mutationally robust (Markiewicz et al. 1994, Guo et al. 2004, Bloom et al. 2005, Bloom, Drummond, et al. 2006) and only approximately 20% of all mutations are significantly destabilizing (ΔΔG > 2.0 kcal/mol) (Tokuriki et al. 2007). We considered mutations with ΔΔG>3.0 kcal/mol as strongly destabilizing mutations and calculated the “structural importance” of a site as its fraction of strongly destabilizing mutations.
Because the protein sequences in our data set are not always 100% identical to the sequences of the structure homologs in PDB (see fig. S1), we tested the effect of sequence dissimilarity on our structural data (solvent accessibility and structural importance). We collected 58 groups of structure homologs from the PDB-REPRDB database (Noguchi et al. 2001, NoguchiAkiyama2003), which is a database of representative protein chains selected from the PDB. Each group of structure homologs contained one representative protein and three homologs with sequence identity to the representative protein between 80% and 100%, between 60% and 80%, and between 40% and 60%, respectively. We used the following criteria to collect groups of structure homologs: We only included structures for which we found at least one homolog in each sequence-identity interval. If there was more than one homolog falling into one sequence-identity bin, we retained only the homolog with the lowest sequence identity to the representative protein. This choice made our test more conservative. For each group, we calculated the correlation between the solvent accessibility of the representative protein and its homologs and the correlation between the structural importance of the representative protein and its homologs, respectively. We found that both solvent accessibility and structural importance tended to correlate strongly among homologous proteins, but the correlation coefficient decreased with decreasing sequence identity (see fig. S2). For solvent accessibility, the median Spearman correlation coefficient ranged from 0.964 (80%–100% identity) to 0.871 (40%–60% identity), and for structural importance, it ranged from 0.797 (80%–100% identity) to 0.626 (40%–60% identity).
We obtained protein domain boundary information from the CATH domain structure database (http://www. cathdb.info/) (Orengo et al. 1997, Greene et al. 2007). We defined domain linkers to be regions that are centered at CATH domain boundaries and extend at least 10 residues in both directions. Loops (continuous peptides composed of residues with DSSP classes S, T, and “-”), which neighbor on or overlap this region were also considered as parts of domain linkers. Finally, we considered both termini of the protein as domain boundaries. These criteria are strict and yield a conservative analysis of nonlinker regions.
We obtained genomic sequences from the following sources: the Comprehensive Microbial Resource (http://cmr.tigr.org/cmr.tigr.org/) for E. coli, the Saccharomyces Genome Database (ftp://genome-ftp.stanford.edu/) for S. cerevisiae, the Eisen Lab (http://rana.lbl.gov/drosophila/) for D. melanogaster, and Ensembl (http://www.ensembl.org/) for M. musculus.
We designated as evolutionarily conserved all sites at which the amino acid was unchanged compared with the orthologous gene in a closely related species. We used the following orthologs: For E. coli, we obtained orthologs between E. coli and Salmonella typhimurium from the Comprehensive Microbial Resource. For yeast, we obtained orthologs between S. cerevisiae and Saccharomyces bayanus from the Saccharomyces Genome Database. For fly, we obtained orthologs between D. melanogaster and Drosophila yakuba from the Drosophila 12-genome project AAAWiki at http://rana.lbl.gov/drosophila/. For mouse, we obtained orthologs between M. musculus and Rattus norvegicus from Biomart through the Ensembl Homology track. We aligned each pair of orthologs at the peptide level using MUSCLE (Edgar 2004).
We used previously published expression data for each species: For E. coli, we obtained gene expression levels measured in messenger RNAs per cell from Covert et al. (2004); for S. cerevisiae, we used expression data from Holstege et al. (1998); for D. melanogaster, we used as expression level the geometric mean of expression data from different tissues obtained by Stolc et al. (2004); and for M. musculus, we measured expression level as the breadth of expression among different tissues (Su et al. 2004). After combining the expression data with the structural data, we ended up with 698, 384, 123, and 569 genes for E. coli, S. cerevisiae, D. melanogaster, and M. musculus, respectively.
To identify which codons are translationally optimal in each species, we compared the codon usage pattern between the gene groups showing the lowest 5% and highest 5% expression level in each species. We defined codons as “optimal” if they showed a statistically significant increase in frequency in the highly expressed group, as determined by a chi-square test. We defined codon optimality (Copt) as the odds ratio of codon usage between highly and lowly expressed groups, calculated separately for each codon:
Here, nhigh and nlow are the observed numbers of the codon in the highly and lowly expressed groups, respectively, and Nhigh and Nlow are the observed numbers of the corresponding amino acid in the highly and lowly expressed groups, respectively.
We used two different methods to test for association, one using categorical variables and one using continuous variables. For pairs of categorical variables (e.g., optimal vs. nonoptimal codons and buried vs. exposed sites), we stratified the data by gene and synonymous codon family within each gene and constructed a separate 2×2 contingency table for each stratum. We then combined either the tables for all genes and a given codon family or the tables for all genes and all codon families into an overall analysis using the Mantel–Haenszel procedure (Mantel and Haenszel 1959, Mantel1963). The null hypothesis in this analysis assumes that the status of the site (e.g., buried or exposed) is independent of the codon type in any given stratum. Because the Mantel–Haenszel procedure yields undefined results on contingency tables whose sum of all four entries is less than 2 (i.e., 0 or 1), we excluded all such tables from the analyses.
For pairs of continuous variables (e.g., codon optimality and solvent accessibility), we also stratified the data by gene and synonymous codon family within each gene. Then, for each stratum, we separately calculated the Pearson correlation coefficient between the two variables. As test statistic, we used the mean of the correlation coefficients over all strata. We calculated the sampling distribution by randomly reshuffling, separately for each gene, synonymous codons among sites with identical amino acid, and recalculating all correlation coefficients. We generated 1,000 resampled sequences for each gene.
We carried out all statistical analyses using the software R (R Development Core Team 2008). In the analyses of individual amino acids, we corrected for multiple testing using the false discovery rate method of Benjamini and Hochberg (1995), as implemented in the R function p.adjust().
We first assessed whether there was any relationship between a codon's tendency to be preferentially used in highly expressed genes and the same codon's tendency to be preferentially used at buried sites. We calculated 2 odds ratios each for 59 codons (excluding ATG for Met, TGG for Trp, and three stop codons). The first odds ratio, which we refer to as codon optimality (Copt, see Materials and Methods), measures whether the codon is preferred in highly expressed genes compared with all other codons encoding the same amino acid. The second odds ratio, which we denote by Oburied, measures whether the codon is preferred at buried sites compared with all other codons encoding the same amino acid. To control for confounding effects of differing amino acid usage among genes, we calculated Oburied by first constructing 2×2 contingency tables of codon usage within each gene (see table 1 for an example) and then using the Mantel–Haenszel procedure (Mantel and Haenszel 1959, Mantel 1963) to combine the odds ratios for each individual contingency table into an overall odds ratio Oburied. We list the values of Copt and Oburied for each codon in table S1. In all species except fly, we found a significant positive correlation between Copt and Oburied (Spearman's ρ = 0.38,P = 0.003 for E. coli; ρ = 0.54,P<0.001 for S. cerevisiae; ρ = 0.24,P = 0.069 for D. melanogaster; and ρ = 0.78,P0.001 for M. musculus; see also fig. 1).
The correlation between Copt and Oburied reveals that there is an association between codon usage and protein structure. To determine whether this correlation is consistent across all amino acids or if different amino acids have different trends, we carried out a similar statistical test on each amino acid separately. We inferred a set of optimal codons for each species (see Materials and Methods and table S2). For each gene, we then constructed separate 2×2 contingency tables for the 18 amino acids encoded by at least two codons (see table 2 for an example). For each of these 18 amino acids, we calculated a joint odds ratio (Ojoint) of optimal codon usage between buried and exposed sites using the Mantel–Haenszel procedure. A value of Ojoint greater than 1 signifies a preference for optimal codons at buried sites (and nonoptimal codons at exposed sites).
We found that 13 of 18 amino acids showed, in at least one species, a significant preference for optimal codons at buried residues (table 3). Unexpectedly, in E. coli, 2 of these 13 amino acids (Ala and Val) showed a significant preference for optimal codons at exposed residues. The remaining codons (Cys, Glu, His, Pro, and Tyr) showed no significant preference for optimal codons to be buried or exposed in any of the four species tested. Of a total of 72 association tests, 23 showed a significant preference for buried optimal codons, whereas only 2 showed a significant preference for exposed optimal codons.
Why did some amino acids show a signal and others did not? We found no clear pattern related to amino acid biochemistry, such as polarity or volume. Most amino acids showed a signal in at least one species. Instead, we found that amino acid frequency was the best predictor of a significant association between codon optimality and solvent exposure. We observed a negative correlation between the significance level (P value after correction for multiple testing) from the Mantel–Haenszel test and the relative amino acid frequency (Spearman's ρ = − 0.33,P = 0.007, data of all four species pooled; see also fig. 2). This finding suggests that the absence of a significant association for some amino acids is likely caused by lack of statistical power rather than by a specific biochemical mechanism.
For each species, we also used the Mantel–Haenszel procedure to combine all 2 × 2 contingency tables for all genes and all amino acids into a single overall odds ratio. This analysis corresponds to Drummond and Wilke (2008)‘s analysis but using buried sites instead of evolutionary conserved sites. We found a statistically significant association between optimal codons and buried sites in all species (table 3). These results were not strongly dependent on solvent-accessibility cutoff used to identify buried residues (table S3).
To control for evolutionary conservation, we tested for an association between optimal codons and buried sites considering only evolutionarily conserved residues in each species. The results remained largely unchanged from the results including all residues (table S4).
Because many ribosomal proteins contain natively unstructured regions, which assume their structure only upon binding other parts of the ribosome, we also repeated the association test on a data set excluding all ribosomal proteins. Our full data set contained 40 ribosomal proteins in E. coli, 29 in yeast, 22 in fly, and 14 in mouse. The results remained largely unchanged after excluding these genes (table S5).
To determine if the association between optimal codons and buried sites was affected by expression level, we calculated the overall odds ratio separately for the highest (top 25%) and lowest (bottom 25%) expressed genes (see fig. 3). In all species except mouse, the overall odds ratio for the highest expressed genes tended to be higher than the one for the lowest expressed genes. However, with the exception of fly, which had the smallest sample size, the overall odds ratio was significantly larger than 1.0 even for genes with low expression level. Thus, highly expressed genes seem to show a stronger association between optimal codons and buried sites than genes with low expression level, but even the latter genes do show a significant association. This result mirrors the general observation that evolutionary constraints appear to increase with gene expression level (Duret and Mouchiroud 2000, Pal et al. 2001, Lemos et al. 2005, Drummond et al. 2006, Wolf et al. 2006, Eames and Kortemme 2007, Drummond and Wilke 2008).
The Mantel–Haenszel procedure properly controls for potentially confounding effects of differing codon or amino acid usage frequencies among genes. Its main drawback is that it requires categorical data, such as a classification of all residues into buried or exposed, or of all codons into optimal or nonoptimal. Solvent accessibility and codon optimality are continuous quantities, and by forcing them into dichotomous categories, we may be losing statistical power.
We devised a new statistical test in the spirit of the Mantel–Haenszel procedure but that made use of the specific values of codon optimality and solvent accessibility for each residue. For each amino acid in each gene, we separately calculated the Pearson correlation coefficient between the codon optimality of all codons encoding this amino acid and the solvent accessibility of the amino acid in the 3D protein structure. As test statistic, we used the mean of all these correlation coefficients. We calculated the sampling distribution of this statistic by randomly permuting synonymous codons within each gene (see Materials and Methods). We then carried out one-tailed tests. Our alternative hypothesis was that the mean correlation coefficient should be more negative than expected by chance if optimal codons associate with low solvent accessibility.
When we combined the correlation coefficients for all amino acids, we found that, for all organisms, we could reject the null hypothesis of no significant association between codon optimality and solvent accessibility. We found P<0.001 for E. coli, P < 0.001 for yeast, P = 0.005 for fly, and P<0.001 for mouse (fig. 4). The analysis for individual amino acids largely mirrored our Mantel–Haenszel results (fig. S3). However, the statistical power of the association test on continuous variables seemed to be generally lower than that of the Mantel–Haenszel procedure. In general, if we found a significant result with the association test on continuous variables, we also found it in the Mantel–Haenszel procedure, but the reverse was not true in all cases.
The advantage of the association test on continuous variables is that we could employ it also for amino acids for which we could not clearly identify optimal codons. These amino acids include Lys for E. coli and Cys, Ala, and Tyr for mouse. Of these four, Cys in E. coli showed a marginally significant result and Ala in mouse showed a strongly significant result (fig. S3).
So far, we tested for an association between optimal codons and buried sites. Our reasoning was that mutations at buried sites are more disruptive than mutations at exposed sites and that therefore buried sites should be more sensitive to translation errors. An alternative, and more direct, way of assessing whether a site is sensitive to mutation is to determine the distribution of free energy changes (ΔΔG values) that substitutions at this site effect. Mutations with negative ΔΔG stabilize the protein fold and should typically not have deleterious effects. Mutations with a small positive ΔΔG are mildly destabilizing. Whether such mutations will be disruptive or not is hard to predict. But once ΔΔG exceeds 2–3 kcal/mol, the mutation is almost certainly disruptive. Thus, as a measure of a site's sensitivity to substitutions, we calculated ΔΔG values for all 19 possible amino acid substitutions at that site and then determined the fraction of these substitutions with ΔΔG>3.0 kcal/mol. We refer to this fraction as the “structural importance” of the site because it assesses the likelihood that a mutation at this site is structurally disruptive.
Our hypothesis was that if selection for translational accuracy acts to minimize mistranslation-induced protein misfolding then sites with higher structural importance should associate with more optimal codons and vice versa. By classifying sites at which at least two mutations had ΔΔG>3.0 kcal/mol as important sites and all other sites as unimportant sites, we could employ the Mantel–Haenszel procedure to determine whether optimal codons associated with structurally important sites. Our results were similar to our previous results considering buried and exposed sites. In all organisms, the overall joint odds ratio was significantly larger than 1.0 (table S6). Our results for individual amino acids were also largely consistent with those obtained for buried/exposed sites, but there were some differences. For example, Glu in E. coli and Cys in yeast showed a significant association between optimal codons and structurally important sites but not between optimal codons and buried sites. By contrast, in fly and mouse, we mostly found the opposite result that several amino acids showed a significant association between optimal codons and buried sites but not between optimal codons and structurally important sites.
We also carried out the test of association between continuous variables. Because we expected codon optimality to increase with structural importance, we calculated one-tailed P values for the right tail of the sampling distribution of the mean correlation coefficient. We found a significant association between optimal codons and structurally important sites in yeast (P < 0.001) and mouse (P<0.001) but not in E. coli (P = 0.230) or fly (P = 0.230) (fig. S4). When considering the results for individual amino acids (fig. S5), we found again that they largely agreed with those of the Mantel–Haenszel procedure on the same data (table S6) but that the association test on continuous variables seemed to be overall less powerful. However, there were a few notable exceptions. Gln in mouse showed a highly significant association when considering continuous variables but no association under the Mantel–Haenszel procedure. For Ser in fly, optimal codons seemed to associate with structurally unimportant sites when considering continuous variables (a right-tailed P value of 1 corresponds to a left-tailed P value of <0.001), but we found no such signal under the Mantel–Haenszel procedure.
Thanaraj and Argos (1996b) reported that nonoptimal codons are overrepresented in linker regions between domains. Because linker regions are generally solvent exposed and less sensitive to amino acid substitutions, an excess of nonoptimal codons in these regions could lead to an apparent excess of optimal codons at buried or structurally important sites. To exclude the possibility that our results are caused by abundant nonoptimal codons in interdomain linkers, we repeated all our analyses but excluded linker regions (see Materials and Methods). The new results were largely unchanged from the previous ones. Tables 3 and S6 and figures 4 and S4 show a side-by-side comparisons of the same results for entire proteins and proteins with interdomain linkers excluded. Figures S6 and S7 show the same analyses as figures S3 and S5, respectively, but with interdomain linkers excluded. There were some cases in which results disappeared, for example, in table 3 for Leu and Ala in E. coli and for Phe in fly. Yet by and large, we could confirm that even when we excluded interdomain linker regions, we found a significant association between optimal codons and either buried sites or structurally important sites.
We examined the relationship between codon usage bias and protein structure in the genomes of four model organisms. We found that optimal codons tend to be associated with buried sites. In E. coli, yeast, and fly, the association was stronger in highly expressed genes than in genes with low expression level. The effect was present in most codon families in at least one organism. We found no clear relationship between the biophysical properties of the encoded amino acids and the presence or absence of an association between optimal codons and buried sites. Instead, the best predictor for a significant association was amino acid frequency. We also found that optimal codons tend to be associated with sites for which computational modeling predicts that substitutions are destabilizing.
This study is not the first to provide evidence that codon usage bias is affected by protein structure; previous studies include Thanaraj and Argos (1996a), Orešič and Shalloway (1998), Xie and Ding (1998), Orešič et al. (2003), and Gu et al. (2004). However, many of the previous studies suffer from statistical limitations such as not correcting for multiple testing or not controling for confounding effects of amino acid usage frequencies. Most previous studies suggest that the relationship between protein structure and codon usage bias is rather limited. For example, in a study of E. coli and human, Orešič and Shalloway (1998) found only a single codon each that associated with structural features of the encoded proteins. In human, this codon is GAU, and it is preferred at the N termini of α-helices. Most previous studies focused on protein secondary structure, whereas the structural features we considered here were solvent accessibility and structural importance (i.e., ΔΔG values). We believe that we found a comparatively strong and pervasive signal in part because of this choice. The extent to which mutations tend to disrupt protein folding and function depends only weakly on secondary structure and much more strongly on solvent accessibility (Goldman et al. 1998, Bustamante et al. 2000, Dean et al. 2002, Bloom, Labthavikul, et al. 2006).
What is the biophysical mechanism that links codon usage bias to protein structure? Experimental work has focused on the hypothesis that ribosome kinetics may affect protein folding (Komar et al. 1999, Cortazzoetal2002, Goymer 2007, Komar2007, Kimchi-Sarfatyetal2007). A cluster of suboptimal codons might stall the ribosome and as a consequence either facilitate or disrupt cotranslational folding. Thanaraj and Argos (1996b) argued that rare codons associate with interdomain linkers because the slowdown of the ribosome when translating these linker regions would allow the individual domains to fold independently. Yet more recently, Widmann et al. (2008) showed that evolutionarily conserved (and thus presumably important) rare codons are not restricted to interdomain linker regions. To verify that our results were not caused by an excess of rare codons in solvent-exposed linker regions, we redid our analysis on a restricted data set that excluded all interdomain linker regions. We again found an association between optimal codons and structurally sensitive sites. Thus, although nonoptimal codons may or may not be preferred in interdomain linker regions, our results are not caused by the codon usage in these linkers.
Most previous work has focused on the interplay between translation and folding kinetics, but translation fidelity may be as important or more so in shaping synonymous codon usage. According to the mistranslation-induced misfolding hypothesis (Drummond and Wilke 2008), selection for translational accuracy should lead to an excess of high-fidelity codons at sites at which translation errors would be particularly destabilizing. Our results are broadly consistent with this hypothesis, even though our evidence is only indirect. Few studies have attempted the direct measurement of codon fidelity under translation or of the propensity of mutations at different sites to cause misfolding. Thus, we had no comprehensive data set for either of these quantities and had to use proxies for both.
We equated high-fidelity codons with optimal codons and identified as optimal those codons that were significantly more frequent in highly expressed genes than in genes with low expression level. Because the sets of optimal codons we determined in this way were largely consistent with sets of optimal codons determined by counts of tRNA genes (Ikemura 1985, Moriyama and Powell 1997, Man and Pilpel 2007, Drummond and Wilke 2008), we believe that our optimal codons are by and large the high-fidelity codons of the respective organism. However, we cannot be certain that we correctly identified high-fidelity codons in all cases. In fact, for the two amino acids for which optimal codons were associated with exposed sites in our analysis (Ala and Val in E. coli), the codons we determined to be optimal may be of low fidelity. One issue that may arise in this context is that of speed-accuracy tradeoffs. The most rapidly translated codon may not be the most accurately translated one or vice versa because speed should be determined primarily by the absolute number of tRNA copies in a cell, whereas accuracy should depend on the relative abundance of the cognate tRNA compared with competing tRNAs. If an organism experiences both selection for translation speed and translational accuracy then it is possible that the most rapidly translated codon is the most abundant one in highly expressed genes but that the most accurately translated codon is preferred at sites at which translation errors need to be avoided.
Our analysis also assumes that all genes in an organism have the same optimal codons. This assumption may be violated if tRNA pools vary substantially over time or among tissues in multicellular organisms. Such differences have indeed been reported (Dong et al. 1996, Dittmar et al. 2006), but they do not appear to be large enough to invalidate our approach. Nevertheless, a future study could try to obtain more accurate, gene-specific codon optimality values.
As a measure of the extent to which translation errors at a site may lead to misfolding, we used two quantities, the solvent-accessible surface area and the site's structural importance as measured by the computationally predicted stability effects (ΔΔG values) of mutations at that site. Both of these quantities can be calculated only if an accurate 3D crystal structure is available. Because the number of crystal structures for proteins of a specific organism remains limited, we augmented our data sets with crystal structures from homologous proteins. Consequently, for many genes in our data sets, solvent accessible surface areas and ΔΔG values are only estimates. To assess how reliable these estimates are, we carried out a controlled analysis of how these quantities change with decreasing sequence identity among homologs. We found that both remain highly predictive even if the homologs have diverged substantially (fig. S2). We also found that solvent-accessible surface area is generally more conserved among homologs than ΔΔG values are.
We predicted ΔΔG values using the default energy function of the Rosetta ΔΔG module. This energy function was not necessarily optimized for the wide range of different protein structures to which we applied it. However, because we were using the ΔΔG predictions only to find sites at which substitutions have large disruptive effects, such as would be caused by steric clashes of the side chains, the ΔΔG predictions should be reasonably accurate nevertheless.
Previous works (Akashi 1994, Stoletzki and Eyre-Walker 2007, Drummond and Wilke 2008) studied selection for translational accuracy using the categorical Mantel–Haenszel procedure and we followed the same strategy here. But because the quantities we studied are not inherently categorical, we also devised an association test on continuous variables based on stratified correlation coefficients. The two statistical procedures yielded overall similar results, but the test on continuous variables seemed to be less powerful. Yet, this test has several advantages that make it worthy to pursue. First, the results of the categorical test depend on the arbitrary choice by which continuous variables are classified into categories. For example, we designated sites as buried if their relative solvent accessibility fell below a 25% cutoff. A different choice of cutoff led to somewhat (if only slightly) different results (table S3). The test on continuous variables does not suffer from this problem. Second, according to the criterion we chose to identify optimal codons, we could not select optimal codons for a few amino acids in some species. Thus, we had to exclude these amino acids from the categorical analysis but could include them in the continuous analysis.
The overall odds ratios we obtained from the Mantel–Haenszel procedure are of comparable magnitude, but generally somewhat smaller, than the odds ratios measuring the association between optimal codons and evolutionarily conserved sites (table 1 in Drummond and Wilke 2008). The biggest deviation arises in fly where we found an odds ratio of 1.04, whereas Drummond and Wilke (2008) found 1.36. Some of these differences are likely caused by the limitations on quality and quantity of structural data. Fly in particular had the lowest fraction of closely matching crystal structures (fig. S1). An alternative explanation for the smaller odds ratios in our study could be that a significant proportion of sites under translational accuracy selection are functionally important rather than structurally important and that the criterion of evolutionary conservation accurately identifies these sites. Future work could try to disentangle structural and functional constraints by assembling a data set of proteins for which structurally important sites are known and then testing whether optimal codons associate more strongly with structurally important sites, functionally important sites, or possibly the joint set of both structurally and functionally important sites.
We would like to thank D. Allan Drummond for stimulating discussions. This work was supported in part by NIH grant R01 AI065960. The University of Texas provided computing resources.