|Home | About | Journals | Submit | Contact Us | Français|
Deducing generic causal relations between RNA transcript features and protein expression profiles from endogenous gene expression data remains a major unsolved problem in biology. The analysis of gene expression from heterologous genes contributes significantly to solving this problem, but has been heavily biased toward the study of the effect of 5′ transcript regions and to prokaryotes. Here, we employ a synthetic biology driven approach that systematically differentiates the effect of different regions of the transcript on gene expression up to 240 nucleotides into the ORF. This enabled us to discover new causal effects between features in previously unexplored regions of transcripts, and gene expression in natural regimes. We rationally designed, constructed, and analyzed 383 gene variants of the viral HRSVgp04 gene ORF, with multiple synonymous mutations at key positions along the transcript in the eukaryote S. cerevisiae. Our results show that a few silent mutations at the 5′UTR can have a dramatic effect of up to 15 fold change on protein levels, and that even synonymous mutations in positions more than 120 nucleotides downstream from the ORF 5′end can modulate protein levels up to 160%–300%. We demonstrate that the correlation between protein levels and folding energy increases with the significance of the level of selection of the latter in endogenous genes, reinforcing the notion that selection for folding strength in different parts of the ORF is related to translation regulation. Our measured protein abundance correlates notably(correlation up to r = 0.62 (p=0.0013)) with mean relative codon decoding times, based on ribosomal densities (Ribo-Seq) in endogenous genes, supporting the conjecture that translation elongation and adaptation to the tRNA pool can modify protein levels in a causal/direct manner. This report provides an improved understanding of transcript evolution, design principles of gene expression regulation, and suggests simple rules for engineering synthetic gene expression in eukaryotes.
Arguably, the major challenge of functional genomics is to decipher how information encoded in an RNA transcript (e.g. the 5′UTR, ORF and 3′UTR) affects various aspects of its expression. Most previous studies aimed at understanding such signals are based on the analysis of endogenous gene expression.1-10
However, endogenous transcripts are subject to different forms of evolutionary selection, not necessary acting on their expression level. Thus, for example, if a highly expressed gene has a certain feature it can be difficult to discern if it is related to its function or rather to the corresponding protein abundance. In addition, endogenous transcripts vary widely in their features (for example, they have different length, promoters, UTRs, amino acid content, etc.); therefore, the effect of any particular feature on expression levels is significantly masked. For example, if a gene x with certain codons has higher protein levels than gene y with different codons we cannot be sure if these differences in protein levels are not due to the fact that they have different promoters or different UTRs, etc. Eventually, it is extremely difficult to determine causality based on the correlation between endogenous features and their expression levels measurements.
Previous large scale studies with heterologous genes shed considerable light on the subject, but were all performed on E. coli and either focused on the very start of the ORF or included synthetic libraries that did not resemble the sequence properties of endogenous transcripts.11-17 As a result, many questions regarding the causal relations between transcript features and protein levels in endogenous genes and specifically in eukaryotes remain poorly understood.
For example, it was suggested that in E. coli the folding strength of the mRNA near the START codon affects the translation initiation rate (and thus the protein levels), as it is related to the efficiency with which the pre-initiation complex recognizes the start codon11,13; however, most of the open questions in the field remain unanswered due to the fact that almost all previous studies focused on the analysis of endogenous genes and on indirect measures of translation; and due to the fact that the answers to these questions may be condition- and organism-dependent.16 The ribosome elongation speed and its association with expression regulation is also not fully understood: some studies suggested that it is constant,18,19 while others suggested that different codons have different decoding times, for example due to different tRNA levels.11,17,20-22 Other important questions relate to the effect of codon and nucleotide composition in different parts of the transcript on protein levels: while some studies proposed that the codon distribution in all parts of the ORF, and specifically in those related to translation elongation, can affect the protein abundance,1,17,23 others claimed that only the nucleotide composition near the beginning of the ORF impacts protein levels.11 In addition the exact cause and effect of codon usage bias on organismal fitness has not been identified yet: does it evolve due to neutral (or near neutral) evolution, or does it significantly affect fitness? Do highly expressed genes have stronger codon bias to improve their protein levels, or rather due to other reasons (e.g., ribosome allocation)?24-29
The aim of this study is to tackle these problems in a quantitative manner using a combined computational-synthetic biology approach in S. cerevisiae. Specifically, we aimed at deciphering how protein abundance is encoded in several regions of the transcript, some of them previously unexplored, by addressing the following questions: 1) What is the magnitude of the effect of synonymous changes in different parts of the transcript on protein abundance? 2) How does folding in different parts of the transcript affect protein abundance? 3) How do ribosome codon decoding rates affect protein abundance? 4) Do transcript features that are selected for in S. cerevisiae endogenous genes determine protein abundance?
We analyzed the gene HRSVgp04 and generated 3 libraries to understand the distinct effect of the nucleotide composition in different regions of the transcript on protein levels. The structure of all 3 libraries was identical: all had the same promoter followed by the 5′UTR (14nt) of the TEF gene, and the HRSVgp04 gene fused with a YFP reporter (Fig. 1A). The aim of the first library was to study the effect of the nucleotide composition of the 5′UTR (i.e. silent mutations in the 5′UTR) on protein levels; to this end, we randomized the 14 nt composing the 5′UTR, maintaining the nucleotide composition of the rest of the transcript (Fig. 1A, B). The aim of the second library was to study how the protein levels are affected by synonymous nucleotide substitutions in the first 40 codons of the ORF. To this end, we modified only the third nt of codons 2–41 of the HRSVgp04 ORF, maintaining the encoded protein and the nucleotide composition outside this region (Fig. 1A, C). Finally, the third library aimed at studying the effect of synonymous nucleotide substitutions in codons 42–81 of the ORF. To this end, we modified only the third nt of each of the codons 42–81 of the HRSVgp04 coding sequence, again maintaining the encoded protein and the nucleotide composition outside this region (Fig. 1A, C). As explained in the Methods section, we designed the library variants such that their features (adaptation to the tRNA pool and mRNA folding) resemble endogenous genes. Thus, the relations reported here are expected to represent the effect of mutations on S. cerevisiae endogenous genes (see, for example, 16). The three libraries include a total of 207/151/25 variants respectively, and were named L5UTR, L2-41C, L42-81C (more details in the Methods section).
We measured the YFP and Optical Density (OD) of each variant over time and calculated their estimated protein levels (See Fig. 2A-F and Methods section). Interestingly, though the number of randomized nucleotides is relatively very low (only a few dozen), and all the changes are strictly synonymous or silent, the changes in protein levels are between dozens to hundreds of percentages in all 3 cases. Specifically, the ratio (maximal estimated protein level)/( minimal estimated protein level) in the L5UTR, L2-41C and L42-81C libraries was 15.3, 3, and 1.6, respectively (after correcting for the different number of points in the 3 libraries it was 9.1, 2, and 1.6; see details in Fig. 2 and its caption), while the (STD/mean) (i.e., the Coefficient of Variance) was 0.42, 0.12 and 0.12, respectively (see Fig. 2G, H). These results suggest that while synonymous or silent mutations in all parts of the ORF/UTR can significantly affect protein levels, mutations at the 5′UTR end tend to have a higher effect compared to adjacent synonymous mutations at the beginning of the ORF. In contrast, synonymous mutations in different parts of the ORF have a relatively comparable effect with an a bit higher effect at the region closer to the 5′end of the ORF. Our findings also show (Fig. 2D-F) that the protein level values related to the variant that achieves the highest protein level in each of the 3 libraries are relatively similar (the differences between the maximal protein levels values in the 3 libraries are less than ~8.6%).
However, the lower protein level values among the libraries have more significant differences: the lowest protein level value in L2-41C was found to be 5.54 times higher than the lowest protein level in L5UTR; the lowest protein level value in L2-41C was 10.5 times higher than the lowest protein level in L5UTR. This result supports the conjecture that silent mutations (in terms of the encoded protein) near the beginning of the ORF may have strong negative effect on protein levels, probably due to their effect on translation initiation efficiency.11,23,30-32 Here we provide a novel quantification of this effect in eukaryotes.
To study the effect of various characteristics of the 5′UTR on protein levels, we generated 96 different features (see Table S1, and the Methods section) related to this region. Among others, the features include: the folding energy in different parts of the UTR (with respect to the beginning of the ORF); distance from the Kozak sequence (‘ACCATGG’), suggested to be the optimal sequence for START codon recognition by the pre-initiation complex33; START codon context score,34 which is a score aiming at estimating the optimality of the affinity of the nucleotide content surrounding the START codon to the pre-initiation complex; and similarity to binding sites of different RNA binding proteins.35 Additional features are the frequency of each nucleotide in different positions of the 5′UTR; specifically, according to the Kozak rule30,31,34 and analysis of endogenous genes the nucleotide A at distance 3nt before the beginning of the ORF is the most favorable and nucleotide T is the least favorable nucleotide. It was suggested that this nucleotide interacts with the pre-initiation complex and the existence of ‘A’ in this position improves the recognition of the START codon by the pre-initiation complex and thus the initiation efficiency.
All the correlations throughout this study were based on Spearman's rank correlation coefficient. We found that the features with the top correlation with estimated protein abundance (PA) were (see Table S1 which includes the correlation and also correction for FDR): the START codon context score (r = −0.46; p = 2.3·10−12), the frequency of the nucleotide A at distance 3nt before the beginning of the ORF (r = −0.45; p = 6.6·10−12), average folding energy (FE) over all the UTR windows (r = 0.36, p = 10−7), the frequency of the nucleotide T at distance 3nt before the beginning of the ORF (r = −0.34; p = 3.3·10−7), and the FE of the window starting 7 nt before the beginning of the ORF (r = 0.34, p = 3.3·10−7). We also checked other features such as the affinity to 22 different RNA binding proteins (RBP); however, all the correlations were found to be either not significant, or (in one case) borderline significant with a very low correlation (r = 0.14, p = 0.044 which didn't not pass FDR filtering). Since the single nucleotide features mentioned above are related to the Kozak and context scores,30,31,33,34 we conclude that the PA variability of the L5UTR can be significantly explained by features related to the translation initiation efficiency, the affinity to the pre-initiation complex and/or the folding of the mRNA in the region surrounding the start codon.
Ribosome profiling (Ribo-Seq) is a new approach that enables measuring ribosomal densities over the entire transcriptome at a single nucleotide resolution36,37 (see Fig. 3A). The method includes the following steps: cells are treated with cyclohexamide to arrest translation, ribosomes are fixed and ribosome-protected RNA fragments are recovered36,37 (see Fig. 3A). After processing and reverse-transcription, these are sequenced, mapped, and used to derive ribosomal density profiles. In order to estimate the typical nominal relative codon decoding times in endogenous S. cerevisiae genes under natural conditions, we implemented a novel statistical approach that filters biases and considers ribosomal traffic jams20 (Fig. 3A and Methods). Subsequently, we aimed at estimating the effect of codon usage bias on protein levels via the effect of codons on ribosomal elongation rates in natural conditions. To this end, we computed the correlation between the estimated Mean of the Typical codon Decoding Rates (MTDR20; based on data from37, details in the Methods section) and PA in the L2-41C and L42-81C libraries. Interestingly, the correlation in the first library was relatively low but significant when controlling for the folding energy (FE) of the mRNA in this region, specifically: r (MTDR, PA) = 0.14 (p = 0.08); r (MTDR, PA | FE) = 0.24 (p = 0.0034). However, in the case of the second library, it was found to be both relatively high and significant: r (MTDR, PA) = 0.56 (p = 0.004); r (MTDR, PA | FE) = 0.6175 (p = 0.0013) (details in the Methods section).
It is important to mention that results obtained in a similar analysis based on the tRNA adaptation index (tAI) (Methods38) were quantitatively similar to the ones obtained for MTDR, albeit weaker, demonstrating the advantage of the MTDR measure: in the case of L2-41C, r (tAI, PA | FE) = 0.24 (p = 0.0029); in the case of L42-81C, r (tAI, PA | FE) = 0.41 (p = 0.0466).
These findings may suggest that in S. cerevisiae the effect of codon usage at the very beginning of the ORF is strongly related to mRNA folding via its effect on initiation (as suggested in prokaryotes11,13,14), but it is also significantly related to elongation rates. Moreover, the results may also suggest that the decoding times of codons is different at the beginning of the ORF than afterwards.39
Furthermore, our results also suggest that after the beginning of the ORF the frequency of codons can affect protein abundance in a causal/direct way due to the fact that different codons have different decoding rates. Therefore, elongation rates can directly affect expression levels; this relation can be partially explained via the fact that codons that are recognized by tRNA species (and other translation factors) with higher abundance tend to have higher decoding rates.
Finally, the result demonstrates that it is possible to deduce codon decoding times for heterologous genes from the analysis of ribosome profiling in endogenous genes, potentially enabling new design tools for engineering gene expression in synthetic systems.
To demonstrate that the reported signals and changes in the estimated protein levels are not due to effects on mRNA levels, but are directly related to translation, we measured the mRNA levels of some of the variants for each of the libraries (Methods). For example, in the case of the L42-81C library, we found that mRNA levels do not correlate with protein levels (r = −0.371; p = 0.497; correlation in the “wrong” direction). Fot these measurements, the correlation between MTDR and protein levels was found to be significant (r = 0.886, p = 0.033), and remained significant even after controlling for mRNA levels ( r(PA, MTDR | mRNA) = 0.89, p = 0.04). In contrast, no significant correlation was found between MTDR and mRNA levels (r = 0.371, p = 0.497). These results support the conjecture that the correlation between protein levels and MTDR is indeed related to ribosome elongation speed and not to mRNA levels. Similar conclusions were obtained for the other libraries: non-significant correlation between mRNA levels and protein levels (r = 0.19 p = 0.66 for L5UTR; r = −0.22, p = 0.4 for L2-41C ) were observed.
At the next step, we aimed at understanding the effect of synonymous codons in different parts of the transcript on protein levels via their effect on mRNA folding. To this end, we computed the local mRNA folding energy in different windows for all the variants in all the libraries (Methods). The local mRNA folding energy is predicted for overlapping/sliding windows of length 40nt with a slide of 1nt (Methods); the folding energy of a nucleotide sequence is related to the energy needed for its unfolding after folding to its strongest structure: a more negative number is related to stronger folding.
For each nt position (which induces a window of length 40nt as described above), we computed the correlation between folding energy and protein levels. As can be seen in Figure 4A, B, the correlation between local mRNA folding energy and protein levels is significant and positive at the 5′UTR and the first ~10–13 positions/windows inside the ORF (stronger folding at the 5′ end of the ORF tends to have a negative effect on protein levels). This result supports previous conclusions based on the experimental analysis of the prokaryote E. coli11; however, this is the first time such a causal result is reported for a eukaryote. In addition, it supports previous evolutionary analyses of transcripts that suggested that in many organisms there is selection for weak local folding of the mRNA sequence at the beginning of the ORF, probably to improve initiation efficiency and the recognition of the START site by the pre-initiation complex.23,32 It may also explain the difference in the ‘linearity’ of the profiles that appear in Figure 2D, F, relative to the profile in E. (that is less ‘linear’), which may be related to the central effect of mRNA folding near the 5′end of the ORF on protein levels in L2-41C. It is probable that this affect is distributed in a bi-modal manner in the database, so that it is relatively easy to randomly generate variants with either strong folding / many base-pairs / relatively low protein levels, or weak folding / few base-pairs / relatively high protein levels.
Interestingly, as can be seen in Figure 4B, after the first 20 positions/windows both in the endogenous genes and in the heterologous genes there is a region where the correlation between folding energy (FE) and protein levels (PA) is reversed (i.e. negative, strong folding tends to increase protein levels).
In the case of the endogenous genes, the region is relatively long (dozens of codons afterwards) and significant. In the case of the heterologous genes, there are several such short regions which are not significant (probably due to the lower number of points).
This result supports a previous study that demonstrated based on a comparison to randomized versions of the genome that there is a selection for strong mRNA folding in after the first ˜14 positions/windows of the ORF40 (Fig. 4B); this region appears downstream of region with a selection for weak mRNA folding at the very beginning of the ORF. A possible explanation for this result is that strong mRNA folding further downstream improves the fidelity of translation initiation by blocking the pre-initiation complex from scanning it, and increasing the probability that it will remain in the vicinity of the START codon and recognize it correctly.34,41-43 It is also possible the strong structure downstream may help prevent strong folding at the START codon. In order to further demonstrate that the reported relation between mRNA folding and protein levels in our heterologous system can explain folding evolution in S. cerevisiae endogenous genes, we observed that the effect of mRNA folding on protein levels is similar in endogenous genes and in our heterologous libraries. To this end, we computed the correlation between 1) the vector of correlations between the mean local FE and PA in S. cerevisiae endogenous genes and 2) the vector of correlations between the local FE and PA in the first 50 windows heterologous gene; we found it to be highly significant (r = 0.7; p = 6.28·10−8; Fig. 4B, C). Similarly, we showed that the observed correlation between protein levels and folding energy in the heterologous library increases with the significance of the level of selection of the latter in endogenous genes. To this end, we computed the correlation between 1) the vector of the local FE Z-score in the first 50 windows of the wild-type S. cerevisiae endogenous genes, with respect to the corresponding randomized variants maintaining various genomic properties (Methods) and 2) the vector of correlations between the local FE and PA in the first 50 windows in the heterologous gene; we found it to be highly significant (r = 0.7; p = 2.4·10−8; Fig. 4B-D).
At the next step, we aimed at understanding how the effect of different codons on protein levels changes along the coding sequence. Each dot in Figure 4E correspond to one codon, and includes the correlation between its frequency and protein levels in the library L2-41C (x-axis) vs. the correlation between its frequency and protein levels in the library L42-81C (y-axis). Interestingly, while in general an agreement between the 2 libraries (r = 0.33; p = 0.07) can be observed, there are cases where the effect of codon frequency on protein levels in the first 40 codons is different from its effect in the second 40 codons, supporting the conjecture that the effect of codon frequency on protein levels is context dependent and changes along the ORF. For example, the codon GGA has a positive effect on protein levels when it appears in codons 2–42 (r = 0.24, p = 0.0035 ), but it has negative effect on protein levels when it appears in codons 42–81 (r = −0.47, p = 0.018 ) (Fig. 4E).
Our synthetic biology driven approach to study the effect of synonymous or silent mutations in different parts of the transcript on protein abundance, enabled us to gain an improved understanding of the relation between transcript features and their corresponding protein levels. Previous studies based on evolutionary systems biology of endogenous genes could not infer the causality of the relations between transcript features and protein levels. Our systematic study of rationally designed heterologous genes bypasses many of the pitfalls characterizing the investigation of endogenous gene expression. This approach is becoming increasingly available due to rapid advances in DNA library construction methodologies. The results reported in here emphasize the utility of applying synthetic biology for deciphering how transcripts modulate their expression and enables us to provide quantitative estimations of the relations between various features of the transcript and translation/protein levels in a eukaryote.
Our analyses support the hypothesis that in eukaryotes weak mRNA folding near the beginning/end of the ORF/5′UTR respectively, improves translation initiation and increases protein levels.11,13,23,32,40-42,44,45 Specifically, the results emphasize the negative effect of strong mRNA folding at the beginning/end of the ORF/5′UTR on translation initiation and protein levels.11,13,23,32,41,44,45 The analysis also quantifies the effect of the affinity of the nucleotide context surrounding the START codon to the pre-initiation complex on protein levels, demonstrating that this is one of the major determinants that can explain the effect of silent/synonymous mutations in this region on translation and protein levels. Previous studies addressing this problem either analyzed small numbers of heterologous or mutated variants,30,31 or analyzed endogenous genes.34,46
Our analysis provides the first comparative estimation of the possible effect of silent/synonymous mutations in different parts of the transcript on protein levels. Specifically, it suggests that the most deleterious mutations (the ones resulting in the lowest protein levels) at the initiation region (the 5′UTR) have one order of magnitude higher effect on protein levels abundance than the most deleterious mutations in the elongation region (the coding region): the lowest protein level obtained for 5′UTR (L5UTR) mutations was around 10 times lower than the lowest protein level obtained for coding region mutations (codons 42–81; L42-81C). It was suggested that the beginning of the ORF (codons 2–41; L2-41C) is related to both initiation and elongation16; indeed the lowest protein level obtained for a mutation in this region was around 5 times higher than the lowest protein level obtained for 5′UTR mutations (i.e., between the 2 other regions).
Finally, we show that codon decoding rates, inferred via ribosome profiling measurements, affect protein levels in a direct and casual way. This result supports previous studies performed on endogenous genes that suggested, based on various proxies of elongation rates, that protein levels can be affected by codon distributions in the ORF, possibly due to their effect on translation elongation.1,17,23,47 Our study demonstrates for the first time the strong relation between codon decoding times and protein levels in a direct causal manner. It is important to emphasize that this result does not contradict previous findings suggesting that the correlation between measures of codon frequencies and protein levels in endogenous genes is partially due to non-direct reasons such as global ribosomal allocation, protein folding, and translation fidelity, etc.16,24,26,48 Thus, predictors based on ribosomal profiling data may be used for inferring protein levels of heterologous genes. Such predictors may be helpful specifically in cases of genes that do not enable reliable measurements of protein levels (e.g. very short ones49), and can be used for engineering heterologous genes for tailored gene expression. Furthermore, this result demonstrates that elongation speed is not constant and that both this speed and the associated protein levels can be affected by synonymous features of the transcript (as was suggested for example in17,20-22), and not only by initiation rates and/or the amino acid content encoded in the coding sequence (as was suggested in18,19,50).
The current accepted model is that translation initiation is the rate limiting step of the translation process, and thus synonymous mutations near the beginning of the ORF modulate protein levels, while synonymous mutations downstream of this region do not (see, for example,11,51). The analyses reported here demonstrate that elongation and codon distribution downstream of the ORF 5′ end do significantly modulate protein levels (specifically, when the region near the START codon is ‘optimal’). This does not contradict the fact that the nucleotide composition near the beginning of the ORF, as well as translation initiation, can have stronger effects on protein levels compared to codon frequencies downstream of the ORF 5′ end and translation elongation.
Our study differentiates the nature and strength of the effect of synonymous mutations in different parts of the transcript/ORF on protein levels, and may be used to guide the design of synthetic genes.52 The reported results support the notion that the term ‘optimal codons’ (see, for example,27,53), which describes the preferred codon for each amino acid in a certain organism, should be fine-tuned; optimal codons are context dependent and may vary among different parts of the ORF. Specifically, we analyzed a viral gene (HRSVgp04) and demonstrated that silent and synonymous mutations in different parts of its transcript can significantly affect its protein levels. Thus, these results may serve as a proof of concept for the use of accurate design of such mutations to generate rationally tailored expression of genes.
Construction of all the DNA variants of the HRSV genes fused to the reporter gene was performed according to the methods described in.54,55 A cloned and sequenced wild type version of the HRSV gene was constructed as a Minigene (IDT DNA). HRSV variants were generated by fully or partially randomizing specific nucleotide positions within the HRSV gene. Randomized nucleotide positions were ordered as machine mixed synthetic nucleotides (IDT DNA) within DNA Ultramers (IDT DNA), that were used to edit the wild type HRSV fragment and fuse it to the YFP gene, following the methods described in54,55 with slight modifications, as follows:
The HRSV wild type Minigene was edited to generate variants by using it as a template in extension PCR reactions using different randomized Ultramers as the reverse primer. The Ultramers also contained at their 5′ a segment for homologous recombination into a promoter-less YFP in the yeast genome. The forward primer of the HRSV Minigene PCR amplification contained at its 5′ an overlap segment for its fusion with a second PCR fragment, that contained a promoter for the HRSV-YFP fusion and a URA3 selection cassette (amplified from a pre-made template).
300 pmol of single stranded DNA in a 50 μl reaction containing 70 mM Tris–HCl, 10 mM MgCl2, 7 mMdithiothreitol, pH 7.6 at 37 °C, 1 mM ATP and 10 units T4 Polynucleotide Kinase (NEB) was used. Reaction is incubated at 37 °C for 30 min, then at 42 °C for 10 min and inactivated at 65 °C for 20 min.
1 pmol of single stranded DNA of each progenitor in a 25 μl reaction containing 2.5 μl of Hot Start DNA Polymerase (Novagen, 71086-3) reaction according to manufacturer's guidelines was used. Three cycles of annealing were executed for each elongation to ensure full yield of elongation.
All PCR reactions were performed in 96 well PCR microplates, using KOD Hot Start DNA Polymerase (Novagen, 71086-3) according to its protocol.
1–5 pmol of 5′ phosphorylated DNA termini in a 30 μl reaction containing 67 mM Glycine–KOH, 2.5 mM MgCl2, 0.01% Triton X-100, 5 mM 1,4-dithiothreitol, 5.5 units Lambda exonuclease (Epicentre) and SYBR Green diluted 1:50,000. Thermal Cycler program is 37 °C for 15 min, 42 °C for 10 min, enzyme inactivation at 65 °C for 10 min.
Standard PCR primers for all experiments were ordered from IDT with standard desalting.
DNA Ultramers™ (IDT DNA) harboring modified HRSV sequences were used as PCR primers in order to insert the variable segments of the HRSV variants. Specifically, we integrated precise mixes of degenerate (N, K and others) bases at predefined positions that effectively recoded the genes according to a predefined DNA sequence specification.
DNA purifications required in the process of DNA construction were performed with the ZR-96 DNA Clean & Concentrator (Zymo research) kit using standard protocols.
The master strain was created by integrating into the yeast genome a cassette containing a promoter-less YFP, followed by a NAT (Nourseothricin) resistance marker under its own promoter. The entire sequence was inserted into the his3Δ1 locus.
All HRSV variants were transformed into the master strain using the LiAc/SS carrier DNA/PEG method following the procedures described in.56 Cells were plated on solid agar SD-URA selective media and incubated at 30 °C for 3–4 days. Transformant colonies were handpicked and patched on SD-URA + NAT (Werner BioAgents) agar plates in 384 format. Correct transformation was verified for all variants by PCR amplification from the yeast's genome, gel electrophoresis and DNA Sequencing. The constructs were transformed into the master strain which contained a promoter-less YFP coding sequence at the his3Δ1 locus. Each synthetic construct contained a URA3 selection marker under its own promoter followed by a TEF promoter, the relevant HRSV gene ORF, and the beginning of the YFP ORF (for recombination purposes).
Colonies were picked manually from the plates of each variant, the specific integration locus was PCR amplified from each clone. Correct size amplifications were verified by gel electrophoresis. Amplicons were sequenced in house using Sanger sequencing. Colonies with the correct sequences were chosen for analysis.
The variant strains of all genes were maintained in 384 well format on SD-URA + NAT solid medium using the Singer colony arrayer (RoToR, Singer instruments). In order to measure growth and fluorescent protein expression, the Singer colony arrayer was used to inoculate all colonies of the library from solid medium into 100 μl of SD-URA media in a 384 well growth microplate (Greiner bio-one, 781162). Following 24 h of pre-incubation, 5 μl of the yeast cultures was diluted into 80 μl of SD complete media in a 384 well microplate, to reach a starting O.D600 of ~0.1–0.2.
A microplate reader (Neotec Infinite M200 monochromator) was then set to measure the 384 well plates following parameters in cycles of 10 min: Cell growth (as extracted from absorbance at 600 nm) and YFP expression (Ex. 500 Em. 540). Each cycle contained 4 min of orbital shaking at amplitude of 3 mm. The number of cycles was set to 100 (16h) and the temperature to 30 °C. We performed triplicates of the expression measurements.
mRNA level measurements were performed using quantitative real-time PCR (qPCR). Yeast strains were grown to mid-log and RNA purification was performed using the MasterPure™ yeast RNA purification kit (Epicentre) according to manufacturer's protocol. First strand cDNA synthesis was performed using the SuperScript® III First Strand Synthesis kit (Invitrogen) according to manufacturer's protocol, and qPCRs on yeast cDNA's were performed in a LightCycler 480 Real-Time PCR system (Roche) in 384 well-format using SYBR®. Green detection mode and relative quantification analysis was performed using default parameters of the ΔΔCT method. Normalization of mRNA levels was performed according to mRNA levels of the highly expressed Actin gene, and several negative controls were performed to validate our results including (1) no reverse transcription, (2) no PCR template, (3) no YFP (Yeast strain without the HRSV fusion), as well as a positive control for RNA extraction with a known RNA extract. The estimated mRNA levels were based on the average of all replicates.
Sequences of primers for RT-PCR are as follows:
Actin Fwd: ‘CTGGGACGATATGGAAAAGAT’;
Actin Rev: ‘GTTCACTCAAGATCTTCAT’;
YFP Fusion Fwd: ‘ATTCACTTGGTGTTGTCCCAATTTTGG’;
YPF Fusion Rev: ‘GATCTGGGTATCTAGCAAAACACATC’.
As describe in the main text, we analyzed the gene HRSVgp04 and generated 3 libraries (L5UTR, L2-41C, L42-81C) to understand the distinct effect of the nucleotide compositions in different regions of the transcript on protein levels. The structure of all 3 libraries was identical: All had the same promoter, followed by the 5′UTR (14nt) of the TEF gene, and the HRSVgp04 gene fused with a YFP reporter (Fig. 1A).23,32
To make sure that the effect of the mutations of protein levels will resemble its effect in natural conditions,16,57 we chose for each amino acid of the HRSVgp04 protein the codon with the highest tRNA adaptation index (tAI) in S. cerevisiae coding sequences38_ENREF_23; then we choose the first 39 nucleotides of the coding sequence such the strength of the folding there will be minimal (Folding energy close to zero)23,32; in addition, the context/Kozak sequence related to the 6 last nucleotides of the 5′UTR was optimized based on the optimal sequence of S. cerevisiae endogenous genes.30,34 All the variant sequences appear in Table S2.
By randomizing this basic transcript we generated the 3 libraries (L5UTR, L2-41C, L42-81C).
Estimated protein levels were based on the mean YFP/OD over all cycles. Note that the reported results are robust to various definitions of outlier filtering.
The method for estimating codon decoding times, MTDR, was published in.20 For clarity we briefly describe the method here: S. cerevisiae ribosomal profiles were reconstructed using the data published in the GEO database, accession number GSE13750 (GSM346111, GSM346114).57
Ribosomal profiles were normalized (to get normalized footprint counts, NFCs) as in previous studies, by dividing each profile by its mean read count; this enables to control for variation in initiation rates and mRNA levels of different genes, and analyzing/comparing all the genes/profiles in a unified manner. Next, for each codon type we generated a vector consisting of NFC values originating from all analyzed genes. These vectors were used to generate, for each codon type, a histogram reflecting the probability of observing each NFC value in the expressed genes (the number of times each NFC value occurs in the data normalized by the total number of times the codon appears in the data), that was named the ‘NFC distribution’ of the codon.
Based on the characteristics of the NFC distributions (see some explanations below and in20) we suggest https://www.google.com/search?q=we+hypothesizedandspell=1andsa=Xandei=U6sEUtG6NcXmOaPVgfgMandved=0CCoQvwUoAA that their topology could result from a superposition of 2 distributions/components: the first one describes the ‘typical’ decoding time of the ribosomes, which was modeled by a normal distribution characterized by its mean and variance with a probability density function (for a random variable ) of 20:
The second component describes relatively rare translational pauses and ribosomal interactions, such as traffic jams due to the codons' different translation efficiency, and was modeled by a random variable with an exponential distribution characterized by one parameter , with a probability density function (for a random variable ) of:
The mean of the exponential distribution is 1/ λ and can reflect the average NFC of a non-typical/non-nominal phenomenon such as traffic jams, pauses, and biases.
It is known that the distribution of a random variable w(t,) which is the sum of 2 independent random variables and (i.e. w(t) = + ), is calculated as a convolution between the 2 distributions58:
Thus, the summation of 2 independent normal and exponential random variables corresponding to the distributions mentioned above results in a distribution which is named ‘exponentially modified Gaussian’ (EMG), which is a convolution of a normal and exponential distribution. Formally, the EMG distribution function , of a random variable (where )59 is:
The parameters were estimated by fitting the measured NFC distributions to the EMG distribution, under the log-likelihood criterion.
Intuitively, the model above is supported by the following points (see all the details in20): 1) A simulation of ribosome profiling when there are no traffic jams, biases, or pauses (the simulation was based on the S. cerevisiae genome and ribosomal profiling measurements). In this case we found the NFC of the codon to be normal/Gaussian. Thus, we determined that the component of typical decoding time to be normal/Gaussian. 2) When traffic jams (various codon decoding rates) and extreme NFC values are added to the model/simulation the distribution is skewed and it looks log normal or like an EMG. 3) We expect that extreme pauses or traffic jams (due to slower codons for example) will increase the NFC values of a codon, resulting in a right tail. A natural and simple way to describe such a right tail is via an exponential distribution. This is the reason we added the second component of the EMG distribution, the exponential distribution. 4) We performed a simulation of ribosomal profiling where we know the translation rate/time of each codon. We run the EMG filter on the simulation and show that it accurately estimates the true translation times (r = 0.99). 5) We used Akaike information criterion (AIC) to show that the NFC distributions are better described by an EMG distribution than by either exponential or normal distributions. 6) We found that the estimated typical decoding rate correlates with measurements related to the translation decoding rate (such as tRNA levels) and the mean typical decoding rate correlates with protein levels and proteins per mRNA levels of endogenous genes.
The typical decoding time of a codon is its μ; the mean typical decoding time/rate of a gene is the geometric mean of μ or 1/ μ respectively.
All the per-codon inferred values appear in Supplemental Table 3.
Note that the normalized footprint count (Z) is dimensionless since it is obtained via normalizing/dividing the vector of read count (RC) related to each coding region by the mean RC of the coding region. This means that in our case μ, λ, and σ are dimensionless parameters of this distribution(s) (normal distribution, exponential distribution, and EMG). Therefore (by definition) the normalized read count and also μ are dimensionless values that describe the ratio between 1) the read count of the codon and 2) the mean read count of the codons.
We expect that (given a certain initiation rate and mRNA levels) the read count related to a codon be proportional to the time the ribosome spends on it (or to the probability to see the ribosome on the codon relatively to other codons), thus, the normalized read count and also μ (and λ, and σ) are dimensionless values that describe the ratio between 1) the decoding time of the codon and 2) the mean decoding time of codons.
Hence we multiply the μ for a certain codon c by a constant related to the mean decoding time of codons in a genome, we should get an estimation of the decoding time of the codon c.
The local pre-mRNA folding profiles were computed based on the ViennaRNA Package60 with default parameters (e.g., temperature is 37 °C). We used a 40 nt length sliding window (with 1 nt step), corresponding to the approximated ribosome size in fungi.
In the case of the correlations reported in Figure 3, we considered the mean folding energy of all the windows intersecting with the variable part of the library. In the case of Figure 4, we considered the folding in windows of size 40 nt (details in the main text).
For the correlation reported in Figure 4E we considered only the 31 codons with non-constant frequency distribution both in the L2-41C and the L42-81C library.
The coding sequences and UTRs of S. cerevisiae were downloaded from.61 The coding sequences sequences were randomized as follows: for each amino acid in each gene, we sampled a codon from the distribution of genomic codon frequencies/codon-bias in the S. cerevisiae (i.e., more frequent codons in the genome have a higher probability of being sampled). Thus, the randomized variants maintain both the amino acid content of each coding sequence, and the codon frequencies of the original genome. 20 randomized versions of the genome were generated in this manner; local folding vectors were computed for each gene in the randomized genome, and were used to generate the z-score profiles that appear in Figure 4.
For S. cerevisiae endogenous genes we considered 4 quantitative large scale measurements of Protein Abundance (PA).6,62,63 We averaged across the 4 datasets (after normalizing each data set by its mean) to reduce experimental noise (resulting with 1,448 genes with measurements in all datasets).
Statistical analyses were performed using Matlab. All the reported correlations (including partial correlations) are Spearman. In this study we computed, among others, partial correlations between 2 variables (x and y) when controlling for the third variable (z), which is denoted by r(x,y|z). Partial correlation is a standard way to measure the degree of association between 2 random variables, when the effect of a set of controlling random variables is removed. False Discovery Rate (FDR see Table S1) was performed based on Benjamini and Hochberg64 and Storey65 methods (we used FDR cutoff of 5%).
The tAI index38 uses the adaptiveness of the codons of a gene to the tRNA pool. Denote the adaptiveness value of codon of type with . Let be the copy number of the -th anti-codon that recognizes the -th codon, and let be the selective constraint of the codon-anti codon coupling efficiency. The vector38,66 , , , , , , , , ] was defined for eukaryotes as [0, 0, 0, 0, 0.561, 0.28, 0.9999, 0.68, 0.89]. Then, the absolute adaptiveness value of a codon is defined by
Let us mark the relative adaptiveness value of codon with , by normalizing each with the maximal value among the 61 values. The tAI index for a gene is an average over the of its codons.
In addition to local mRNA folding calculated across 40nt windows, we analyzed the following features:
ATG context score was computed as in34:
Specifically, we calculate the context score (corresponding to 6 nt upstream of the START codon and 3 nt downstream of it) according to the following steps: 1. Select percentage of highly expressed/translated endogenous genes (in our case we used 2% of highly expressed genes, according to the ribosomal load). 2. Calculate a position specific scoring matrix (PSSM) based on the nucleotide context surrounding the start codon of the selected highly expressed genes. Let 3. Calculate the context score for a START codon according to the PSSM:
, where j is the variant index, i the nucleotide position, the probability that the ith nucleotide of the j-th gene appears in the i-th position (based on the PSSM).
The Kozak score was computed as the hamming distance (i.e. number of mutations) from the Kozak consensus sequence: ‘ACCATGG’.33
The similarity to binding sites of different RNA binding proteins was based on consensus sequences of 22 RBP taken from35. The score of each variant was based on the hamming distance of a window in the variant (we checked all sliding windows with a length identical to the consensus sequence) with the minimal number of mutations relative to the consensus sequence.
In addition, for each position in the UTR, and for each of the 4 nucleotides, we defined a binary variable (i.e., 14·4= 56 variables) as follows: its value is ‘1’ if the nucleotide appears in the position; otherwise the value is 0. Correlation of the different UTR features with protein levels appear in Table S1.
No potential conflicts of interest were disclosed.
This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. The study was partially supported by the Israel Cancer Research Fund (ICRF) and German-Israeli Foundation (GIF).
Supplemental data for this article can be accessed on the publisher's website.