Multiple factors contribute to the efficiency of gene translation, including the secondary structure of the mRNA, the Shine-Dalgarno sequence, the drop-off rates of peptidyl-tRNAs, and the sequence of the mRNA proximal to the initiation site. In general, early studies favored nucleotide-specific determinants of protein expression, namely codon bias. For example, early studies concluded that the AAA triplet leads to higher protein expression and that proteins with adenine-containing codons close to the initiation site express more efficiently [8, 9]. Others have shown that efficient expression can be obtained when the second codon is AGA or when codons have G at the first position [10, 11]. The second codon may also affect protein expression negatively, as has been found with the NGG triplets [12]. Through mutational analysis, we systematically investigated the influence of individual amino acids at the +2 position on the expression of recombinant Igα. The results showed that the amino acid at the +2 position significantly affected recombinant Igα expression, with expression levels varying by more than 10-fold depending on the amino acid. Small amino acids, such as Ala, Ser, Cys, Pro, Thr, and Lys, at the +2 position resulted in higher expression of Igα than bulky ones (Met, His, Trp, Glu, Leu, Arg). Met, His and Glu at this position produced the lowest amounts of the protein. While codon bias is often found to be important for gene expression, the variation in recombinant Igα expression at the +2 position appears to be determined by the type of amino acid rather than by its nucleotide usage: no base-pair preference was observed among the high-expression Igα mutants, and the isocodon mutations of serine and alanine resulted in similar expression levels of Igα. Furthermore, when isocodon usage of Ser, Ala, and Lys was examined, Ser and Ala showed no significant codon preference between the high expressing and randomly chosen groups (). The AAA codon of lysine appears to be preferred over the AAG codon among the highly expressed genes compared to the control group. Thus, while codon preference may influence gene expression, there are clearly amino acid related factors that are important for expression.
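The comparison of amino acid identity versus isocodon usage at the +2 position can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study; the function names and the assumption that each input is a coding sequence beginning at its start codon are ours, and any sequences passed to it in an example would be illustrative rather than real data.

```python
from collections import Counter

# Standard genetic code, built from the canonical TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def second_codon(cds):
    """Return the codon at the +2 position (nucleotides 4-6) of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    # Assumption: the sequence starts at a bacterial start codon.
    if len(cds) < 6 or cds[:3] not in ("ATG", "GTG", "TTG"):
        raise ValueError("expected a coding sequence beginning with a start codon")
    return cds[3:6]

def plus2_usage(cds_list):
    """Count amino acids and codons at the +2 position across coding sequences."""
    aa_counts, codon_counts = Counter(), Counter()
    for cds in cds_list:
        codon = second_codon(cds)
        aa_counts[CODON_TABLE[codon]] += 1
        codon_counts[codon] += 1
    return aa_counts, codon_counts

def synonymous_fractions(codon_counts, amino_acid):
    """Fraction of each synonymous codon among +2 codons encoding amino_acid."""
    syn = {c: n for c, n in codon_counts.items() if CODON_TABLE[c] == amino_acid}
    total = sum(syn.values())
    return {c: n / total for c, n in syn.items()} if total else {}
```

Running `plus2_usage` separately on a highly expressed gene set and on a randomly chosen set, and then comparing `synonymous_fractions(..., "K")` between the two, would be one way to ask whether AAA is preferred over AAG independently of how often lysine itself occurs at +2.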
The statistical survey of amino acid composition in highly expressed E. coli genes supports the preference for alanine and serine at the second position, correlating well with the high expression levels observed for the corresponding mutants (). Similarly, the lowest expressing mutants of Igα, His and Met, are found much less frequently at the +2 position than their overall occurrence. Further, this observed disparity in amino acid preference is unique to the +2 position, as the 3rd, 4th, 5th and 40th amino acid positions showed progressively less deviation of their frequencies from the mean in both the 200 highly expressed and the randomly picked genes. This correlation between the expression of Igα mutants and the frequency of the amino acid occurring at the +2 position suggests that the amino acid at this position generally influences gene expression. We did not observe low expressing mutations among the amino acids that occur with high frequency at the +2 position; that is, all mutations to high-frequency +2 amino acids expressed well. We did, however, observe some well expressed mutations, such as Cys and Pro, that are not enriched at the +2 position among the well expressed genes. This suggests that other adverse factors may disfavor their incorporation at the +2 position. For example, Cys and Pro can impose structural restrictions owing to their tendency to form disulfides and to disrupt regular peptide conformations, respectively. As to the contribution of the AAA codon (lysine) to gene expression, our results showed that Lys at the +2 position yielded an above average level of Igα expression. Lys was the most common amino acid at the second position in the randomly chosen group of E. coli genes, consistent with the earlier results of Looman et al. and Stenstrom et al. [8, 9]. However, Lys was not enriched in the highly expressed genes relative to the randomly selected E. coli genes. Ser and Ala, instead, were the most frequent +2 position residues among highly expressed genes, in agreement with Tats et al. [18].
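The position-specific comparison described above (residue frequency at a given position versus overall occurrence) can be sketched as follows. This is only an illustration under our own assumptions: the summed absolute difference is one possible deviation measure, not necessarily the statistic used in the survey, and the function names are hypothetical.

```python
from collections import Counter

def position_frequencies(proteins, position):
    """Residue frequencies at a 1-based position (initiator Met is position 1)."""
    residues = [p[position - 1] for p in proteins if len(p) >= position]
    n = len(residues)
    return {aa: c / n for aa, c in Counter(residues).items()} if n else {}

def overall_frequencies(proteins):
    """Residue frequencies pooled over every position of every protein."""
    counts = Counter("".join(proteins))
    n = sum(counts.values())
    return {aa: c / n for aa, c in counts.items()}

def total_deviation(proteins, position):
    """Sum of absolute differences between position-specific and overall frequencies."""
    pos = position_frequencies(proteins, position)
    background = overall_frequencies(proteins)
    return sum(abs(pos.get(aa, 0.0) - freq) for aa, freq in background.items())
```

Evaluating `total_deviation` at positions 2, 3, 4, 5 and 40 for a gene set would show whether the departure from the background composition shrinks with distance from the initiation site.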
To investigate whether the variation in protein expression with the +2 amino acid is determined by mRNA structure and stability, we calculated the local RNA structures at the initiation site using RNAfold. Overall, the mutant RNAs showed significant structural variation, but their stability did not correlate with their expression levels, suggesting that factors other than mRNA structure may be important for gene expression. Charge does not appear to be a critical factor, as very different expression levels of Igα were observed between the Asp and Glu mutants, and between the Lys and His mutants.
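As a rough illustration of how such a folding comparison could be set up, the sketch below builds mRNA windows that differ only in the +2 codon and formats them as FASTA text. The flanking sequences, the record naming scheme, and the function names are assumptions made for the example; the window actually folded in this study is not specified here.

```python
def _to_rna(seq):
    """Upper-case a nucleotide sequence and convert T to U."""
    seq = seq.upper().replace("T", "U")
    if set(seq) - set("ACGU"):
        raise ValueError(f"unexpected characters in sequence: {seq!r}")
    return seq

def plus2_variants(upstream, downstream, codons):
    """Yield (name, sequence) pairs for windows differing only in the +2 codon."""
    up, down = _to_rna(upstream), _to_rna(downstream)
    for codon in codons:
        codon = _to_rna(codon)
        if len(codon) != 3:
            raise ValueError("each +2 codon must be exactly three nucleotides")
        # Window layout: upstream region, AUG start codon, +2 codon, downstream region.
        yield f"plus2_{codon}", up + "AUG" + codon + down

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The resulting FASTA text can be supplied to a folding program such as RNAfold, which reads sequences from standard input, and the predicted free energies can then be compared with the measured expression of each mutant.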
Interestingly, the second amino acid has been found to be crucial in determining a protein's half-life in vivo [19, 20]. In particular, the so-called N-end rule stipulates that proteins with Phe, Leu, Trp, and Tyr at the +2 position tend to be less stable in vivo, whereas those with Ser, Ala, and Thr at the +2 position have longer half-lives [20]. Our current findings bear a striking resemblance to the N-end rule effect. For example, Leu, Trp, and Tyr were found in the bottom half of the overall expression ranking, while Ser, Ala, and Thr were among the highest ranked mutations. Since both Igα and CXCL10 were expressed in E. coli in a non-functional, aggregated inclusion body form that is relatively resistant to proteolytic degradation, their expression levels are unlikely to be related to their functional stability. On the other hand, the measured half-life of an intracellular protein may be affected by its expression level. It is tempting to speculate that the increased expression associated with serine and alanine at the +2 position may contribute to the long half-lives of such proteins in vivo. Thus, it is conceivable that 1) the mechanism of the N-end rule may be directly related to the translational efficiency conferred by the +2 amino acid, and 2) our current conclusion regarding the contribution of the +2 amino acid to recombinant protein expression also applies to soluble proteins.
In summary, our experimental data, together with the statistical analysis, demonstrate the advantage of having serine or alanine inserted as the second amino acid for recombinant protein production.