We examined the relationship between codon usage bias and protein structure in the genomes of four model organisms. We found that optimal codons tend to be associated with buried sites. In E. coli, yeast, and fly, the association was stronger in highly expressed genes than in genes with low expression level. The effect was present in most codon families in at least one organism. We found no clear relationship between the biophysical properties of the encoded amino acids and the presence or absence of an association between optimal codons and buried sites. Instead, the best predictor for a significant association was amino acid frequency. We also found that optimal codons tend to be associated with sites for which computational modeling predicts that substitutions are destabilizing.
This study is not the first to provide evidence that codon usage bias is affected by protein structure; previous studies include Thanaraj and Argos (1996a)
, Orešič and Shalloway (1998)
, Xie and Ding (1998)
, Orešič et al. (2003)
, and Gu et al. (2004)
. However, many of the previous studies suffer from statistical limitations such as not correcting for multiple testing or not controling for confounding effects of amino acid usage frequencies. Most previous studies suggest that the relationship between protein structure and codon usage bias is rather limited. For example, in a study of E. coli
and human, Orešič and Shalloway (1998)
found only a single codon each that associated with structural features of the encoded proteins. In human, this codon is GAU, and it is preferred at the N termini of α
-helices. Most previous studies focused on protein secondary structure, whereas the structural features we considered here were solvent accessibility and structural importance (i.e., ΔΔG
values). We believe that we found a comparatively strong and pervasive signal in part because of this choice. The extent to which mutations tend to disrupt protein folding and function depends only weakly on secondary structure and much more strongly on solvent accessibility (Goldman et al. 1998
, Bustamante et al. 2000
, Dean et al. 2002
, Bloom, Labthavikul, et al. 2006
What is the biophysical mechanism that links codon usage bias to protein structure? Experimental work has focused on the hypothesis that ribosome kinetics may affect protein folding (Komar et al. 1999
, Cortazzoetal2002, Goymer 2007
, Komar2007, Kimchi-Sarfatyetal2007). A cluster of suboptimal codons might stall the ribosome and as a consequence either facilitate or disrupt cotranslational folding. Thanaraj and Argos (1996b)
argued that rare codons associate with interdomain linkers because the slowdown of the ribosome when translating these linker regions would allow the individual domains to fold independently. Yet more recently, Widmann et al. (2008)
showed that evolutionarily conserved (and thus presumably important) rare codons are not restricted to interdomain linker regions. To verify that our results were not caused by an excess of rare codons in solvent-exposed linker regions, we redid our analysis on a restricted data set that excluded all interdomain linker regions. We again found an association between optimal codons and structurally sensitive sites. Thus, although nonoptimal codons may or may not be preferred in interdomain linker regions, our results are not caused by the codon usage in these linkers.
Most previous work has focused on the interplay between translation and folding kinetics, but translation fidelity may be as important or more so in shaping synonymous codon usage. According to the mistranslation-induced misfolding hypothesis (Drummond and Wilke 2008
), selection for translational accuracy should lead to an excess of high-fidelity codons at sites at which translation errors would be particularly destabilizing. Our results are broadly consistent with this hypothesis, even though our evidence is only indirect. Few studies have attempted the direct measurement of codon fidelity under translation or of the propensity of mutations at different sites to cause misfolding. Thus, we had no comprehensive data set for either of these quantities and had to use proxies for both.
We equated high-fidelity codons with optimal codons and identified as optimal those codons that were significantly more frequent in highly expressed genes than in genes with low expression level. Because the sets of optimal codons we determined in this way were largely consistent with sets of optimal codons determined by counts of tRNA genes (Ikemura 1985
, Moriyama and Powell 1997
, Man and Pilpel 2007
, Drummond and Wilke 2008
), we believe that our optimal codons are by and large the high-fidelity codons of the respective organism. However, we cannot be certain that we correctly identified high-fidelity codons in all cases. In fact, for the two amino acids for which optimal codons were associated with exposed sites in our analysis (Ala and Val in E. coli
), the codons we determined to be optimal may be of low fidelity. One issue that may arise in this context is that of speed-accuracy tradeoffs. The most rapidly translated codon may not be the most accurately translated one or vice versa because speed should be determined primarily by the absolute number of tRNA copies in a cell, whereas accuracy should depend on the relative abundance of the cognate tRNA compared with competing tRNAs. If an organism experiences both selection for translation speed and translational accuracy then it is possible that the most rapidly translated codon is the most abundant one in highly expressed genes but that the most accurately translated codon is preferred at sites at which translation errors need to be avoided.
Our analysis also assumes that all genes in an organism have the same optimal codons. This assumption may be violated if tRNA pools vary substantially over time or among tissues in multicellular organisms. Such differences have indeed been reported (Dong et al. 1996
, Dittmar et al. 2006
), but they do not appear to be large enough to invalidate our approach. Nevertheless, a future study could try to obtain more accurate, gene-specific codon optimality values.
As a measure of the extent to which translation errors at a site may lead to misfolding, we used two quantities, the solvent-accessible surface area and the site's structural importance as measured by the computationally predicted stability effects (ΔΔG
values) of mutations at that site. Both of these quantities can be calculated only if an accurate 3D crystal structure is available. Because the number of crystal structures for proteins of a specific organism remains limited, we augmented our data sets with crystal structures from homologous proteins. Consequently, for many genes in our data sets, solvent accessible surface areas and ΔΔG
values are only estimates. To assess how reliable these estimates are, we carried out a controlled analysis of how these quantities change with decreasing sequence identity among homologs. We found that both remain highly predictive even if the homologs have diverged substantially (fig. S2
). We also found that solvent-accessible surface area is generally more conserved among homologs than ΔΔG
We predicted ΔΔG values using the default energy function of the Rosetta ΔΔG module. This energy function was not necessarily optimized for the wide range of different protein structures to which we applied it. However, because we were using the ΔΔG predictions only to find sites at which substitutions have large disruptive effects, such as would be caused by steric clashes of the side chains, the ΔΔG predictions should be reasonably accurate nevertheless.
Previous works (Akashi 1994
, Stoletzki and Eyre-Walker 2007
, Drummond and Wilke 2008
) studied selection for translational accuracy using the categorical Mantel–Haenszel procedure and we followed the same strategy here. But because the quantities we studied are not inherently categorical, we also devised an association test on continuous variables based on stratified correlation coefficients. The two statistical procedures yielded overall similar results, but the test on continuous variables seemed to be less powerful. Yet, this test has several advantages that make it worthy to pursue. First, the results of the categorical test depend on the arbitrary choice by which continuous variables are classified into categories. For example, we designated sites as buried if their relative solvent accessibility fell below a 25% cutoff. A different choice of cutoff led to somewhat (if only slightly) different results (table S3
). The test on continuous variables does not suffer from this problem. Second, according to the criterion we chose to identify optimal codons, we could not select optimal codons for a few amino acids in some species. Thus, we had to exclude these amino acids from the categorical analysis but could include them in the continuous analysis.
The overall odds ratios we obtained from the Mantel–Haenszel procedure are of comparable magnitude, but generally somewhat smaller, than the odds ratios measuring the association between optimal codons and evolutionarily conserved sites ( in Drummond and Wilke 2008
). The biggest deviation arises in fly where we found an odds ratio of 1.04, whereas Drummond and Wilke (2008)
found 1.36. Some of these differences are likely caused by the limitations on quality and quantity of structural data. Fly in particular had the lowest fraction of closely matching crystal structures (fig. S1
). An alternative explanation for the smaller odds ratios in our study could be that a significant proportion of sites under translational accuracy selection are functionally important rather than structurally important and that the criterion of evolutionary conservation accurately identifies these sites. Future work could try to disentangle structural and functional constraints by assembling a data set of proteins for which structurally important sites are known and then testing whether optimal codons associate more strongly with structurally important sites, functionally important sites, or possibly the joint set of both structurally and functionally important sites.