Microbial engineering is a time-consuming procedure that often requires multiple rounds of trial-and-error genetic mutation. As it becomes possible to construct larger pieces of synthetic DNA 14
, including whole genomes 15
, automated methods for genetic circuit assembly and metabolic pathway optimization will be critically important. As genetic systems grow in size and complexity, the application of a trial-and-error approach to optimizing these systems is more difficult.
A genetic system's function is optimized by varying the sequences of its regulatory elements to control the expression levels of its protein coding sequences. Each rate-limiting step in gene expression offers the opportunity for rationally modulating the protein expression level. In bacteria, ribosome binding sites (RBSs) and other regulatory RNA sequences are effective control elements for translation initiation 16-19
. As a consequence, they are commonly mutated to optimize genetic circuits, metabolic pathways, and the expression of recombinant proteins.
Previous studies have generated libraries of RBS sequences with the goal of optimizing the function of a genetic system 1, 7, 18
. Generation and selection of a sequence library can become impractical as the number of participating proteins increases, especially if measuring the function requires a low-throughput assay or screen 6
. For example, randomly mutating 4 nucleotides of an RBS generates a library of 256 sequences. The library size increases combinatorially with the number of proteins in the engineered system (16.7 million sequences for 3 proteins, 2.8×1014
sequences for 6 proteins).
A biophysical model of translation initiation would aid the optimization process by enabling the design of an RBS sequence to obtain a desired translation initiation rate. Using thermodynamics, the free energies of key molecular interactions involved in translation initiation have been characterized 20, 21
. Thermodynamic models are made possible by measuring the sequence-dependent energetic changes during RNA folding and hybridization 22-26
. These methods have enumerated and characterized the attributes of a RBS sequence that affect its translation initiation rate, but a predictive model that combines all of the interactions together has not been created and tested.
Bacterial translation consists of four phases: initiation, elongation, termination, and ribosome turnover () 27
. In most cases, translation initiation is the rate-limiting step. The translation initiation rate is determined by the summary effect of multiple molecular interactions, including the hybridization of the 16S rRNA to the RBS sequence, the binding of tRNAfMET
to the start codon, the distance between the 16S rRNA binding site and the start codon, and the presence of RNA secondary structures that occlude either the 16S rRNA binding site or the standby site 20, 21, 28-31
Figure 1 A thermodynamic model of bacterial translation initiation. (A) The ribosome translates an mRNA transcript and produces a protein in a four step process: the rate-limiting assembly of the 30S pre-initiation complex, translation initiation, translation (more ...)
We have developed an equilibrium statistical thermodynamic model to quantify the strengths of the molecular interactions between the 30S complex and an mRNA transcript and to predict the resulting translation initiation rate. The thermodynamic model describes the system as having two states separated by a reversible transition (). The initial state is the folded mRNA transcript and the free 30S complex. The final state is the assembled 30S pre-initiation complex on an mRNA transcript. The difference in Gibbs free energy between these two states is quantified by the Gibbs free energy change ΔGtot. The ΔGtot depends on the mRNA sequence surrounding a specified start codon and will become more negative when attractive interactions are present and more positive when mutually exclusive secondary structures are present.
The translation initiation rate r is related to the ΔGtot according to
is the Boltzmann factor for the system. The derivation of Equation 1
is presented in the Supplementary Methods
. Importantly, Equation 1
describes the differences in translation initiation rate that result from differences in mRNA sequence. The amount of expressed protein E
is proportional to the translation initiation rate where the proportionality factor K
accounts for any ribosome-mRNA molecular interactions that are independent of mRNA sequence and any translation-independent parameters, such as the DNA copy number, the promoter's transcription rate, the mRNA stability, and the protein dilution rate (Supplementary Figure 1
Given a specific mRNA sequence surrounding a start codon, called the subsequence, the ΔGtot is predicted according to the energy model:
where the reference state is a fully unfolded subsequence with ΔGref = 0.
The ΔGmRNA:rRNA term is the energy released when the last 9 nucleotides (nt) of the E. coli 16S rRNA – 3′-AUUCCUCCA-5′ – hybridizes and co-folds to the mRNA subsequence (ΔGmRNA:rRNA < 0). Intra-molecular folding within the mRNA is allowed. All possible hybridizations between the mRNA and 16S rRNA are considered to find the highest affinity 16S rRNA binding site. The binding site minimizes the sum of the hybridization free energy ΔGmRNA:rRNA and the penalty for non-optimal spacing ΔGspacing. Thus, the algorithm can identify the 16S rRNA binding site regardless of its similarity to the consensus Shine-Dalgarno sequence.
term is the energy released when the start codon and the initiating tRNA anti-codon loop – 3′-UAC-5′ – hybridize together. The ΔGspacing
is the free energy penalty caused by a non-optimal physical distance between the 16S rRNA binding site and the start codon (ΔGspacing
> 0). When this distance is increased or decreased from an optimum of 5 nt (or ~17 Å) 29
, the 30S complex becomes distorted, resulting in a decreased translation initiation rate.
is the work required to unfold the mRNA subsequence when it folds to its most stable secondary structure, called the minimum free energy structure (ΔGmRNA
< 0). The ΔGstandby
is the work required to unfold any secondary structures sequestering the standby site (ΔGstandby
< 0) after 30S complex assembly. We define the standby site as the 4 nucleotides upstream of the 16S rRNA-binding site, which is its location in a previously studied mRNA 28
To calculate the ΔGmRNA:rRNA
, and ΔGstandby
free energies, we use the NUPACK suite of algorithms, developed by Pierce and coworkers 32
, with the Mfold 3.0 RNA energy parameters 22, 23
. These free energy calculations do not have any additional fitting or training parameters and explicitly depend on the mRNA sequence. In addition, the free energy terms are not orthogonal; changing a single nucleotide can potentially affect multiple energy terms.
We designed a series of experiments to quantify the relationship between the aligned spacing s
and the free energy penalty ΔGspacing
. Thirteen synthetic RBSs are created where the aligned spacing is varied from 0 to 15 nucleotides while verifying that the ΔGmRNA:rRNA
, and ΔGstandby
free energies remain constant (Supplementary Table I
). The translation initiation rates of RBS sequences are measured using a fluorescent protein measurement system (Methods
). Steady-state fluorescence measurements are performed on E. coli
cultures over a 24 hour period. Under these conditions, the average fluorescence measurement is expected to be proportional to the translation initiation rate r
The quantitative relationship between the aligned spacing and ΔGspacing
is obtained from the fluorescence measurements (Methods
). According to the data, it is conceptually useful to treat the 30S complex as a model barbell connected by a rigid spring, where either stretching or compressive forces cause a reduction in entropy and an increase in the ΔGspacing
penalty. We empirically fit these measured ΔGspacing
values to either a quadratic (s > 5 nt) or a sigmoidal function (s < 5 nt). Following this parameterization, we tested the accuracy of these equations on an additional set of synthetic RBS sequences (Supplementary Figure 2
For an arbitrary mRNA transcript, the thermodynamic model (Equation 2
) is evaluated for each AUG or GUG start codon. The algorithm considers only a subsequence of the mRNA transcript, consisting of 35 nucleotides before and after the start codon. This subsequence includes the RBS and part of the protein coding sequence. The model predictions do not improve when longer subsequences are considered (Supplementary Figure 3
The development of the thermodynamic model makes certain assumptions. Contributions related to the ribosomal S1 protein's potential preference for pyrimidine-rich sequences are omitted from the free energy model33
. The model also assumes that the reversible transition between the initial and final state of 30S complex assembly reaches chemical equilibrium on a physiologically relevant timescale and without any long-lived intermediate states. The presence of overlapping or neighboring start codons, overlapping RBS and protein coding sequences, regulatory RNA binding sites, or RNAse binding sites also pose a challenge to the predictive accuracy of the thermodynamic model. The presence of multiple in-frame start codons, each with significant translation initiation, may distort its predictive accuracy. A genetic system can be designed to avoid many of these complications.
The thermodynamic model can be used in two ways. First, it can predict the relative translation initiation rate of an existing RBS sequence for a particular protein coding sequence on an mRNA transcript. We refer to this as “reverse engineering” because the RBS sequence already exists. Second, it can be used in conjunction with an optimization algorithm to identify a synthetic RBS sequence that is predicted to translate a given protein coding sequence at a user-selected rate. We refer to this mode as “forward engineering” because it generates a de novo sequence according to a user's specifications.
We use the thermodynamic model to predict the translation initiation rates of 28 existing RBS sequences () that were obtained from a natural genome or taken from a list of commonly used sequences (Supplementary Table I
). The lengths of these sequences, as measured by the distance from the transcriptional start site to the fluorescent protein's start codon, vary from 24 to 42 nucleotides. The steady-state protein fluorescences from the sequences are then assayed in the measurement system (Methods
). The growth rates of the cell cultures did not correlate with protein fluorescence (Supplementary Figure 4
). According to the theory (Equation 1
), we expect a linear relationship between the predicted ΔGtot
and the log protein fluorescence. Using linear regression, the squared correlation coefficient R2
is 0.54 with Boltzmann factor β
= 0.45 ± 0.05 mol/kcal (). The average error is
= 2.1 kcal/mol ().
Figure 2 The design method has two modes of operation: (A) The method can predict the relative translation initiation rate of an existing RBS when placed in front of a protein coding sequence. The method calculates the ΔGtot from the input sequence. According (more ...)
While these commonly used RBS sequences vary the protein expression by 1500 fold, the thermodynamic model predicts that both stronger and weaker RBSs are possible. For example, one of these RBS sequences contains a strong 16S rRNA binding site (ΔGmRNA:rRNA = −15.2 kcal/mol), but did not yield a high protein expression level due to a strong mRNA secondary structure and non-optimal spacing (ΔGmRNA = −11.4, ΔGspacing = 1.73 kcal/mol). By optimizing the RBS sequence towards a selected ΔGtot, we gain the ability to rationally control the translation initiation rate over a wide range with a proportional effect on the protein expression level.
Using the thermodynamic model, we developed an optimization algorithm that automatically designs an RBS sequence to obtain a desired relative protein expression level. The user inputs a specific protein coding sequence and a desired translation initiation rate. The rate can be varied over five orders of magnitude on a proportional scale. Equation 1
and the experimentally measured β
= 0.45 mol/kcal is used to convert the user-selected translation initiation rate into the target ΔGtot
. The method then generates a synthetic RBS sequence according to the desired specifications.
The design method combines the thermodynamic model of translation initiation with a simulated annealing optimization algorithm to design an RBS sequence that is predicted to have a target ΔGtot
(). The RBS sequence is initialized as a random mRNA sequence upstream of the protein coding sequence. The method then creates new mRNA sequences by inserting, deleting, or replacing random nucleotides. For each new sequence, the ΔGtot
is calculated and compared to the target ΔGtot
. The sequences are then accepted or rejected according to the Metropolis criteria and three additional sequence constraints that are based on the model's assumptions (Methods
). The procedure continues until the synthetic sequence has a predicted ΔGtot
to within 0.25 kcal/mol of the target. For a given target ΔGtot
, multiple solutions are possible, creating an ensemble of degenerate RBS sequences. The characterization of these ensembles is described in the Supplementary Discussion
The forward design method is tested by generating 29 synthetic RBS sequences (Supplementary Table I
) and comparing their predicted ΔGtot
values to the measured protein fluorescences. The coding sequence for a red fluorescent protein is specified and the ΔGtot
target is varied from −7.1 to 16.0 kcal/mol. The design method then generates a synthetic RBS sequence for each target ΔGtot
. These RBS sequences vary in length from 16 to 35 nucleotides and were highly dissimilar. The steady-state protein fluorescence for each sequence is measured (Methods
). The growth rates of the cell cultures did not significantly vary across sequences (Supplementary Figure 4
). As expected from the theory (Equation 1
), we obtain a linear relationship between the log protein fluorescence and the predicted ΔGtot
with β = 0.45 ± 0.01 (R2
= 0.84) (). The average error is
= 1.82 kcal/mol, corresponding to a 2.3-fold error in the protein expression level. The probability distribution of the ΔΔG for a synthetic RBS is well fit by a Gaussian distribution ().
We next tested the ability of the design method to control the translation initiation rates of different proteins. Two chimeric proteins are constructed that fused the first 27 nucleotides from commonly used transcription factors to a red fluorescent protein (TetR27
-RFP and AraC27
-RFP). The design method is then used to generate 23 synthetic RBSs with ΔGtot
targets ranging from −8.5 to 10.5 kcal/mol (Supplementary Table I
). The thermodynamic model correctly predicts the translation initiation rates of the TetR27
= 0.54) and AraC27
= 0.95) chimeric protein coding sequences (). Notably, the linear relationship between the predicted ΔGtot
and the log protein fluorescence yields a similar slope β
= 0.45 ± 0.05 mol/kcal.
Figure 3 The design method can control the expression level of different proteins by predicting the impact of changing the protein coding sequence. (A) The fluorescence levels from 23 synthetic RBSs in front of two different protein coding sequences are measured (more ...)
A common practice is to reuse the same well-characterized RBS sequence for the expression of different proteins. Interestingly, the thermodynamic model predicts that this can yield dramatically different translation initiation rates. This absence of modularity will occur when the RNA sequence, containing the RBS, forms strong secondary structures with one protein coding sequence, but not another 30
We designed experiments to test the model's ability to predict the impact of changing the protein coding sequence on the translation initiation rate. We use the design method to generate 14 synthetic RBS sequences; these sequences are then placed upstream of two different protein coding sequences: the fluorescent protein (RFP) and a chimeric fluorescent protein (TF-RFP: LacI27-RFP, TetR27-RFP, or AraC27-RFP). The optimization procedure for these synthetic RBSs was modified to maximize the objective function |ΔGRFP − ΔGTF-RFP|, where ΔGRFP and ΔGTF-RFP are the predicted ΔGtot's when the RBS sequence is placed upstream of either the RFP or TF-RFP protein coding sequences, respectively. As predicted by the model, the translation initiation rates of these synthetic RBS sequences greatly change when they are reused with different protein coding sequences (); for example, replacing the fluorescent protein with the TetR27-RFP chimera resulted in a 530-fold increase in expression level.
The thermodynamic model can accurately predict these differences in translation initiation rate when the correct protein coding sequence is specified (R2 = 0.62 and 0.51, ). When the incorrect protein coding sequence is used, the translation initiation rate is not accurately predicted (R2 = 0.04, 0.02). Consequently, when designing a RBS sequence, the beginning of the protein coding sequence must be included in the thermodynamic calculations.
Altogether, 119 predictions of the design method were tested, revealing that the translation initiation rate can be controlled over at least a 100,000-fold range. The thermodynamic model is most accurate when all free energy terms are included in the ΔGtot
calculation (Supplementary Figure 5
). By themselves, each free energy term is a poor predictor of the translation initiation rate (Supplementary Figure 6
) and excluding one free energy term from the ΔGtot
calculation results in a poorer prediction (Supplementary Figure 7
). According to the distribution of the method's error (), an optimized RBS sequence has a 47% probability of expressing a protein to within 2-fold of the target. The probability increases to 72%, 85%, or 92% by generating two, three, or four optimized RBS sequences with identical target translation initiation rates (Supplementary Discussion
We now demonstrate how combining the design method with a quantitative model of a genetic system enables the efficient optimization of its RBS sequences towards a targeted system behavior. Here, our objective is to optimize the connection between the arabinose-sensing PBAD
promoter and an AND gate genetic circuit7
. The AND gate genetic circuit is regulated by the expression levels of two input promoters (PBAD
) and controls the expression level of an output gene, which is selected to be a gfp
reporter (). The desired AND logic requires that the output gene is only expressed when both input promoters are active. The digital accuracy of the AND logic is highest when the maximum expression level from the PBAD
promoter is an optimal value between underexpression and overexpression. When the promoter is underexpressed, the gfp
expression is never turned on; when overexpressed, transcriptional leakiness causes gfp
expression to turn on even in the input's absence.
Figure 4 Optimal connection of a sensor input to an AND gate genetic circuit. (A) A functional AND gate genetic circuit will only turn on the gfp reporter output when both the PBAD and Psal promoter inputs are sufficiently induced by arabinose and salicylate, (more ...)
The quantitative model relates the RBS sequence downstream of the PBAD
promoter to the accuracy of the AND gate genetic circuit's function (). We use previously characterized transfer functions7
to relate the arabinose and salicylate concentrations to the expression levels of the PBAD
) (Supplementary Figure 8
). The PBAD
promoter has a maximum protein expression level of gref
= 590 au at full induction (x
= 1.3 mM arabinose) and when using an RBS sequence with a predicted ΔGtot
value of ΔGref
= −1.05 kcal/mol. We then substitute I1
into the AND gate genetic circuit's transfer function to determine the output gene's expression level, which is in turn substituted into the fitness function F
that quantifies the ability of the genetic system to carry out AND logic (Supplementary Methods
We can interconvert between the maximum protein expression level of the PBAD promoter and the predicted ΔGtot of its RBS sequence according to the equation,
where g is called the gain. The experimentally measured β = 0.45 mol/kcal is utilized. Consequently, we create a quantitative curve F(ΔGtot) that relates the predicted ΔGtot of the PBAD promoter's RBS sequence to the fitness of the genetic system. The fitness curve identifies an optimal region at ΔGtot = −1.17 ± 2 kcal/mol where the genetic system will exhibit the best AND logic with respect to the PBAD promoter's RBS sequence ().
Using the forward engineering mode of the design method, we then generate 2 synthetic RBS sequences targeted to the optimum region of the genetic system's function (predicted ΔGtot = −1.48 and −1.15 kcal/mol). We also design 7 additional synthetic RBSs to test the accuracy of the F(ΔGtot) fitness curve, where the ΔGtot ranged from 0.60 to 17.2 kcal/mol. Each RBS sequence (32 to 35 nt) is inserted downstream of the PBAD promoter and the resulting genetic circuit's response to varying inducer concentrations is assayed ( and Methods). The fitness values of these rationally mutagenized genetic systems are then compared to the predictions of the model and design method (). The insertion of two stronger RBS sequences (ΔGtot = −2.5 and −3.0 kcal/mol) cause the genetic system to exhibit a fatal growth defect.
Both optimally designed synthetic RBS sequences result in a successful connection between the arabinose-sensing PBAD
promoter and the AND gate genetic circuit (mean fitness > 0.85, ). The experimentally determined optimum in the F
) curve is nearby ΔGtot
= 0.60 kcal/mol, which is only a 1.8 kcal/mol deviation from the model's prediction. The quantitative model and design method also correctly predict how the fitness of the genetic system deteriorates with an increasing ΔGtot
. Thus, our approach enabled us to rationally connect two synthetic genetic circuits together to obtain a target behavior while performing only a few mutations and assays (additional design calculations are located in the Supplementary Discussion
A central goal of synthetic biology is to program cells to carry out valuable functions. As we construct larger and more complicated genetic systems, models and optimization techniques will be required to efficiently combine genetic parts to achieve a target behavior. To accomplish this, biophysical models that link the DNA sequence of a part to its function will be necessary. As engineered genetic systems scale to the size of genomes, the integration of multiple design methods will enable the design of synthetic genomes on a computer to control cellular behavior.