By designing and synthesizing 81 individual genes encoding two different proteins, we have found that sequence differences entirely confined to synonymous changes within the open reading frame, which leave the encoded amino acid sequence unaltered, caused at least 40-fold differences in protein expression. We were able to create predictive sequence-expression models based on a strong correlation between expression and the codon bias of a subset of amino acids. The model correctly predicted the expression of variants not included in the model-building, and of new variants designed using improved codon bias tables.
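The codon bias referred to above can be made concrete as the fraction of each amino acid's codons that a gene assigns to every synonymous alternative. The following is a minimal sketch of that calculation, restricted to the three amino acids discussed later in this section. The function name `codon_frequencies` and the example sequence are hypothetical and are not taken from the study.

```python
from collections import Counter

# Synonymous codon families (standard genetic code, DNA alphabet) for the
# three amino acids highlighted in this section. Restricting the table to
# these three is an assumption made to keep the sketch short.
SYNONYMOUS = {
    "Ser": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Thr": ["ACT", "ACC", "ACA", "ACG"],
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}


def codon_frequencies(orf):
    """Return {codon: fraction of that amino acid's codons} for one ORF.

    The ORF is read in frame from its first base; RNA input (U) is accepted.
    Amino acids that do not occur in the ORF get a frequency of 0.0.
    """
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    counts = Counter(orf[i:i + 3] for i in range(0, usable, 3))
    freqs = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[codon] for codon in family)
        for codon in family:
            freqs[codon] = counts[codon] / total if total else 0.0
    return freqs


if __name__ == "__main__":
    # Hypothetical four-codon fragment: Ser(AGC) Ser(AGC) Ser(TCT) Thr(ACG)
    bias = codon_frequencies("AGCAGCTCTACG")
    print(bias["AGC"], bias["TCT"], bias["ACG"])
```

A vector of such frequencies per gene variant is the kind of input a sequence-expression model can be fitted to.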
Most of the codons identified as influential for expression encode amino acids that are highly represented in one or both of the proteins studied. However, the most favorable biases for expression clearly do not correspond to those found in highly expressed native E. coli genes. This contradicts a widespread gene design principle: that mimicking the codon bias of the host, or of a selected group of host genes, will ensure protein expression. The rationale for this approach has been that tRNA availability could limit translational elongation. However, translation is limited not directly by tRNA levels but by the availability of aminoacylated (charged) tRNA.
In 2003, Elf et al. predicted that charging of some tRNA isoacceptors would be much more sensitive than others to perturbations of the recharging rate. The most sensitive are the tRNAs used at high frequency relative to their level in the cell. These predictions were subsequently confirmed experimentally for a subset of tRNAs. Furthermore, heterologous overexpression is predicted to deplete intracellular amino acid and charged tRNA concentrations, depending on the amino acid composition of the overexpressed protein. This may have a direct impact on translation rate and may also induce metabolic responses deleterious for expression yield.
PLS modeling suggests that most of the variation in our dataset can be explained by codons for serine (AGC favored and UCU disfavored), threonine (ACG favored), and leucine (UUG favored). These results fit well with the predicted sensitivities to amino acid starvation of the isoacceptor tRNAs that recognize these codons. The tRNA pools for all three favored codons (AGC, ACG and UUG) are the least sensitive to starvation for their respective amino acids. The relative tRNA charging levels during starvation have been measured for threonine and leucine. From these data and from the tRNA abundance we can estimate the number of copies of each charged and uncharged tRNA per cell. Considering either absolute numbers of charged tRNAs or the ratio of charged to uncharged tRNAs, UUG becomes a more attractive codon for encoding leucine relative to CUG as recharging is limited by starvation. Likewise, ACG improves greatly relative to ACC for encoding threonine. Both trends are consistent with the codon preferences identified by our PLS model.
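As a rough illustration of how a PLS model relates codon frequencies to expression, the sketch below fits a partial least squares regression with a single latent component (PLS1). It is not the model used in this study: the function names `fit_pls1` and `predict_pls1`, the restriction to one component, and any numbers passed to them are assumptions made for illustration only.

```python
from math import sqrt


def fit_pls1(X, y):
    """Fit a one-component PLS1 regression.

    X is a list of feature rows (e.g. codon frequencies per gene variant),
    y is a list of responses (e.g. measured expression levels).
    Returns (x_mean, y_mean, weights, slope).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center features and response.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]
    # Weight vector: direction in feature space with maximal covariance to y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("features carry no covariance with the response")
    w = [v / norm for v in w]
    # Scores: projection of each centered sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centered response on the scores.
    slope = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, slope


def predict_pls1(model, x):
    """Predict the response for one feature row using a fitted model."""
    x_mean, y_mean, w, slope = model
    score = sum((xj - mj) * wj for xj, mj, wj in zip(x, x_mean, w))
    return y_mean + slope * score
```

The weights returned by `fit_pls1` indicate which features move with the response: a positive weight corresponds to a favored codon and a negative weight to a disfavored one. A practical analysis would typically use several components and cross-validation, for instance through an established statistics library.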
Charging of leucine and threonine codons under starvation conditions.
From these data it is tempting to speculate that much of the variation we see in expression is influenced by charged tRNA depletion and/or induction of a metabolic response in the host organism. High translation rates deplete the translational machinery. As aminoacylation of tRNA becomes limiting, only those tRNAs that can maintain their charge can support high translation levels. The optimal codon bias for a gene probably depends both on maintaining high levels of charged tRNAs and on minimizing the levels of uncharged tRNAs, which may inhibit translation and/or cause a deleterious metabolic response.
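The bookkeeping behind this argument, charged and uncharged copies per cell derived from a tRNA's total abundance and its charged fraction, can be sketched as follows. The function name `charged_uncharged` and the example inputs are hypothetical placeholders and are not values measured in this or any cited study.

```python
def charged_uncharged(total_copies, charged_fraction):
    """Split a tRNA pool into charged and uncharged copies per cell.

    total_copies:     total copies of the isoacceptor per cell
    charged_fraction: fraction of that pool that is aminoacylated (0 to 1)
    Returns (charged, uncharged, charged_to_uncharged_ratio).
    """
    if not 0.0 <= charged_fraction <= 1.0:
        raise ValueError("charged_fraction must lie between 0 and 1")
    charged = total_copies * charged_fraction
    uncharged = total_copies - charged
    # A fully charged pool has no uncharged copies; report an infinite ratio.
    ratio = charged / uncharged if uncharged > 0 else float("inf")
    return charged, uncharged, ratio


if __name__ == "__main__":
    # Placeholder inputs only: 1000 copies per cell, 25% charged.
    print(charged_uncharged(1000, 0.25))
```

Comparing two isoacceptors for the same amino acid under the same starvation condition then amounts to comparing either the first element (absolute charged copies) or the third (charged to uncharged ratio) of the returned tuples.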
In contrast with a recent study of GFP variants, we saw relatively little influence of mRNA structure near the initiation site. In three scFv genes, weak expression that was poorly predicted by the model correlated with stronger than average mRNA structure in this region. Replacing the first 15 codons with a less-structured synonymous equivalent restored expression to the levels predicted by the model, suggesting that mRNA structure may limit expression of these genes. In reconciling our results with those of Kudla et al., we note that the predicted 5′ mRNA structures of almost all of our genes are significantly weaker than those found to have a significant effect in the GFP study: only one of our gene variants had a free energy less than −9 kcal/mol in this region (Table S1). Indeed, little correlation was observed in the GFP study between 5′ mRNA structure and expression for genes with folding free energies above −9 kcal/mol (that is, weaker structure), despite greater than 20-fold variation in expression among these genes. Inhibition of initiation by especially strong mRNA structure would obscure effects resulting from factors that influence elongation, such as codon usage, which dominates our results.
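The screening step implied by this comparison, separating variants whose predicted 5′ structure is stronger than −9 kcal/mol from the rest, can be sketched as a simple filter. The sketch assumes the folding free energies have already been computed with an RNA folding tool; the function name `flag_strong_structure` and the variant names and values in the example are hypothetical.

```python
# Cutoff discussed in the text: free energies below this value (more
# negative) indicate 5' mRNA structure strong enough to matter.
STRONG_STRUCTURE_THRESHOLD = -9.0  # kcal/mol


def flag_strong_structure(free_energies, threshold=STRONG_STRUCTURE_THRESHOLD):
    """Return the sorted names of variants with free energy below threshold.

    free_energies maps a variant name to its predicted folding free energy
    (kcal/mol) for the region near the initiation site.
    """
    return sorted(name for name, dg in free_energies.items() if dg < threshold)


if __name__ == "__main__":
    # Hypothetical values for illustration; not data from Table S1.
    example = {"variant_A": -4.2, "variant_B": -10.5, "variant_C": -9.0}
    print(flag_strong_structure(example))
```

Variants flagged in this way are candidates for having initiation, rather than elongation, as the limiting step, and so may be poorly described by a model built on codon usage alone.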
Although we were unable to find any predictive correlations between expression and any parameter other than codon frequency, other sequence elements may contribute to some of the observed variation and could be important in optimal gene design. Differences in mRNA stability could also cause at least some of the observed expression variation. The translation rate itself can influence the mRNA degradation rate, making cause and effect difficult to disentangle in this case.
As direct synthesis replaces classic cloning as the preferred path for constructing functional genetic elements, it is critical to develop gene design algorithms for reliable heterologous expression. Here we have shown that sequences beyond the translational initiation region are critical and that codon usage is a key determinant of expression yield. Regardless of the mechanism by which codon bias affects expression, systematic analysis of the relationship between gene sequences and expression will be a powerful tool to refine our design algorithms, both for E. coli and other expression hosts.