Escherichia coli strains are widely used in academic research and biotechnology. New technologies for quantifying strain-specific differences and their underlying contributing factors promise greater understanding of how these differences significantly impact physiology, synthetic biology, metabolic engineering, and process design. Here, we quantified strain-specific differences in seven widely used strains of E. coli (BL21, C, Crooks, DH5a, K-12 MG1655, K-12 W3110, and W) using genomics, phenomics, transcriptomics, and genome-scale modelling. Metabolic physiology and gene expression varied widely with downstream implications for productivity, product yield, and titre. These differences could be linked to differential regulatory structure. Analysing high-flux reactions and expression of encoding genes resulted in a correlated and quantitative link between these sets, with strain-specific caveats. Integrated modelling revealed that certain strains are better suited to produce given compounds or express desired constructs considering native expression states of pathways that enable high-production phenotypes. This study yields a framework for quantitatively comparing strains in a species with implications for strain selection.
Escherichia coli are widely used as a model prokaryote for physiology studies. Some strains are important pathogens and others are key host strains for metabolic engineering and synthetic biology. This diversity in lifestyle and application reflects the high level of genetic diversity within the species. Thanks to the genomics revolution in microbiology that has enabled sequencing of diverse strains for any species, it is now known that the genomes of different strains of E. coli range in size from 4.5 to over 5.5 Mbp, and the species has a pan-genome composed of more than 15,000 unique proteins (Lukjancenko et al., 2010, Gordienko et al., 2013). Part of this large pan-genome consists of unique metabolic capabilities that have been shown to have important implications for infectious disease studies and pathogenic niches (Monk et al., 2013, Baumler et al., 2011, Vieira et al., 2011). This metabolic diversity is likely to be equally impactful on synthetic biology applications (Lee and Kim, 2015). The massive genomic diversity of the E. coli species provides a deep pool of strains to use for basic research and for metabolic engineering and synthetic biology applications. It also raises an important question: what range of phenotypic behaviours exist and how can these be leveraged to further exploit E. coli as a model organism and host strain?
A review of industrial biotechnology publications and patents that use E. coli as a host strain yielded seven representative E. coli strains that are used often and are good candidates for detailed study: the K-12 strains MG1655, W3110, and DH5a, as well as strains BL21, C, Crooks, and W (Figure 1A). These strains all have genetic tools available – a required factor when choosing a strain for metabolic engineering. The selection of both closely related strains (K-12 strains) and more distantly related strains also allowed an examination of whether close genetic relatedness is a useful predictor of physiological relatedness and production potential. The existing body of work evaluating different E. coli strains in metabolic engineering and synthetic biology (Archer et al., 2011, Arifin et al., 2014, Yoon et al., 2012, Vijayendran et al., 2007, Marisch et al., 2013, Chae et al., 2010) demonstrated a need for the comprehensive analysis of strain-specific differences. Despite significant success in engineering E. coli for industrial production of chemicals and proteins (Lee et al., 2012b, Kim et al., 2015), there is no unified fundamental basis for selection of one strain over another for a given metabolic engineering project or expression of a given construct. Previous studies have shown that the choice of host strain for production of a given compound has a significant impact on results (Na et al., 2013, Kim et al., 2014) and up until now represented a major brute force screening effort. Thus, an important question remains to be addressed: what strain of E. coli is best suited for production of a desired product?
Here, a comprehensive comparison incorporating transcriptomics, genomics, and phenomics with genome-scale modelling of seven common E. coli production strains is presented and a mechanistic basis for the selection of a given E. coli strain for production of a particular compound is established. The data and models are further used to develop a general strategy for synthetic biology host strain selection that can be applied to any production organism with sufficient genetic diversity. The work presented here establishes a workflow (Figure S1) and represents a resource for similar efforts with other organisms and/or additional omics data types.
Seven strains of E. coli were sequenced to comprehensively compare and examine their strain-specific genetic differences (Accession numbers and identified differences are available in Table S1, Data S1). Accurate genome sequences were determined to be essential due to recent studies that demonstrate several differences between the reference sequence of E. coli K-12 MG1655 and the stock strains of laboratory E. coli available from culture collections (Freddolino et al., 2012). These differences were shown to have substantial physiological effects that could confound experimental results and have downstream impacts on bioprocess design (Nahku et al., 2011). One of the widely used E. coli strains, C, had no public genome sequence available, thus whole genome sequencing was performed to establish the genetic parts list for this strain (STAR Methods). The E. coli C draft genome was predicted to be 4.54 Mbp in size and has 4,424 open reading frames.
The whole genome sequences of the seven strains were then used to classify the strains based on their genetic content. First, a classical MLST scheme (Jaureguy et al., 2008) was used to assign the E. coli strains to phylogroups (Figure S2). All strains were assigned to group A except for E. coli W that was assigned to group B1. All seven strains are generally regarded as safe and non-pathogenic. A full genome alignment and comparison of conserved proteins was also performed (STAR Methods). A total of 6,626 unique protein-coding sequences were discovered across all seven genomes. Of these, 3,316 genes were shared between all seven strains, forming a “core” genome. Of the non-core genes, 1,493 were present in 2 to 6 of the strains and 1,817 of the genes were unique to a single strain alone (Figure 1C, Data S1). A full-genome DNA alignment showed that the E. coli K-12 strains, MG1655, W3110, and DH5a were all part of the same clade. E. coli BL21 and C were also part of a similar clade, and E. coli Crooks and W strains were separate from the others with E. coli W being the most distantly related strain (Figure 1B). A full analysis of amino acid differences in shared coding sequences between E. coli K-12 MG1655 and each of the other strains was performed (Data S1). Such differences may have effects on protein activities including catalytic activity, protein folding, and translation efficiency.
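The core, shared, and strain-unique partition described above can be illustrated with a minimal sketch. It assumes each strain has already been reduced to a set of gene-family identifiers; the function name and the toy identifiers are hypothetical and are not the study's data or pipeline.

```python
from collections import Counter

def classify_pan_genome(strain_genes):
    """Split gene families into core, accessory, and unique sets.

    strain_genes maps a strain name to the set of gene-family IDs it carries.
    """
    n_strains = len(strain_genes)
    # Count how many strains carry each gene family.
    counts = Counter(g for genes in strain_genes.values() for g in genes)
    core = {g for g, n in counts.items() if n == n_strains}
    unique = {g for g, n in counts.items() if n == 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique

# Toy input with hypothetical gene-family IDs (not the study's data).
toy = {
    "MG1655": {"g1", "g2", "g3"},
    "BL21": {"g1", "g2", "g4"},
    "W": {"g1", "g5"},
}
core, accessory, unique = classify_pan_genome(toy)
print(len(core), len(accessory), len(unique))
```

Applied to the seven genomes, the same partition corresponds to the reported 3,316 core genes, 1,493 genes present in 2 to 6 strains, and 1,817 strain-unique genes, which together sum to the 6,626 unique protein-coding sequences.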
To assess growth dynamics and by-product secretion rates, phenotypic characterizations were performed in aerobic and anaerobic M9 minimal media (STAR Methods). Major differences were observed between the strains during exponential growth phase. Aerobically, the growth rates ranged from 0.61 h−1 (W3110) to 0.97 h−1 (W), with a mean growth rate of 0.80 ± 0.12 h−1, see Table 1. Anaerobically, DH5a grew slowest (0.18 h−1) and W grew fastest (0.90 h−1), with a mean growth rate of 0.53 ± 0.25 h−1. These differences are stark given that the strains share more than 95% of genes in central metabolism at greater than 95% amino acid identity (STAR Methods) indicating vastly different utilization of similar central metabolic genetic content. It is also worth highlighting that some strains, such as W, could grow nearly as fast anaerobically as they did aerobically through a major increase (2.7×) in glucose uptake rate.
While the overall biomass and by-product yields between strains were similar, the strains exhibited different organic acid secretion profiles. In aerobic conditions, four of the strains, C, DH5a, MG1655, and W3110 exhibited acetate overflow metabolism in this well-aerated experiment (Figure 1D), in agreement with past studies (Archer et al., 2011, Marisch et al., 2013). Anaerobically, all strains exhibited common mixed acid fermentation with production of acetate, formate, ethanol, and succinate. Only two strains, the slowest growers, BL21(DE3) and DH5a, produced lactate anaerobically (Figure 1E). This physiological characterization clearly shows that strains differ in their propensity to make certain molecules, e.g., lactate, an industrially relevant, biologically produced chemical, when growing in their wild-type state (Jang et al., 2012). The rate of substrate consumption in the different strains (Table 1, Figure S3) also exhibited significant variation (a 1.9 and 3.6 fold difference aerobically and anaerobically, respectively), a fact that has important implications for productivity and bioprocessing costs.
The large physiological differences across the selected E. coli strains motivated the construction of seven strain-specific genome-scale models (GEMs) (Data S1 and S2) that were used to integrate, model, and contextualize the measured physiological data. The models were first validated by demonstrating that they could recapitulate a functional flux state when constrained with the measured physiological data (i.e., inputs and outputs – glucose uptake rate, growth rate and by-product production rates – see Figure S9). All models passed this test, indicating consistency between the models and physiological data. Next, each model’s metabolic content was compared to classify reactions as part of “core” or “pan” metabolic capabilities. The core content (reactions present in all seven strains) consisted of 1,265 genes, catalysing 2,315 reactions that utilize 1,776 different metabolites. The total content (reactions present in at least one strain) consisted of 2,526 reactions, indicating that 211 reactions were variably present in different strains. The average model had 2,425 ± 17 reactions. In a recent study of 55 strains of E. coli (Monk et al., 2013) including pathogens and environmental isolates, the average model had 2,337 ± 52 reactions, indicating that there was more diverse metabolic content among the 55 strains than exists between the seven industrially useful strains examined here. However, several of the differences between the seven strains are present in subsystems important for metabolic engineering, including the pentose phosphate pathway and amino acid biosynthesis. For this reason, strain-specific GEMs of metabolism were used to examine maximum theoretical yield of growth precursors and industrial chemicals to explore the functional differences and metabolic capabilities of each strain.
The theoretical yields of industrially relevant native and non-native compounds were examined by utilizing strain-specific models. A total of 245 heterologous pathways for the production of non-native compounds (Campodonico et al., 2014) were integrated with each strain-specific model to compare theoretical yields. The yields were calculated using glucose as the sole carbon source in both aerobic and anaerobic conditions (Data S4). Overall, the majority of the maximum theoretical yields were similar across strains. However, several differences were identified between the seven strains. For example, the model of E. coli BL21 is unable to produce acrylic acid from heterologous pathways 23 and 24 (Data S4, Heterologous aerobic and anaerobic tabs) because it lacks N-acetylglucosamine kinase. Likewise, DH5a cannot make 3-hydroxypentanoic acid via a predicted heterologous route (pathway 223) due to the lack of homocysteine S-methyltransferase encoded by mmuM (Song et al., 2015). A histogram of differential yield by pathway in each strain is given in Figure S4.
All seven strains have theoretical yields greater than 95% of the highest yield predicted for any of the strains in most of the 245 pathways. However, 582 (~17%) of the 3,430 combinations of 7 strains, 245 pathways, and 2 conditions have theoretical yields less than 95% of the highest yield predicted for any strain in a pathway. Strain W was alone or tied for the highest predicted yield in the most aerobic (218) and anaerobic (194) pathways; BL21 (76 and 41) and C (92 and 40) equalled the highest yield in the fewest pathways. A histogram of the 341 combinations of strain and condition that have predicted yields of 45–95% of the highest yield in the 245 pathways is given in Figure S4 and Data S4; another 240 combinations are predicted to yield no product. Strain BL21 had minor reductions in production yields of all compounds in aerobic conditions due to the lack of 6-phosphogluconolactonase (PGL) reaction activity (Meier et al., 2012) in the oxidative pentose phosphate pathway (PPP), encoded by the gene pgl. This requires an alternate pathway for production of ribulose-5-phosphate that does not generate NADPH, one of the primary purposes of the oxidative PPP (Fan et al., 2014) (Figure S5).
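The 95% comparison can be made concrete with a small sketch: for each pathway, find the best yield across strains and flag every strain whose yield falls below a chosen fraction of it. The function name, data layout, and toy yields are illustrative assumptions and are not taken from Data S4.

```python
def below_threshold(yields, fraction=0.95):
    """Return (pathway, strain) pairs whose yield is under `fraction` of the
    best yield predicted for that pathway.

    yields maps pathway -> {strain: theoretical yield}.
    """
    flagged = []
    for pathway, by_strain in yields.items():
        best = max(by_strain.values())
        for strain, y in by_strain.items():
            if y < fraction * best:
                flagged.append((pathway, strain))
    return flagged

# Toy yields (hypothetical values, arbitrary units).
toy_yields = {
    "pathway_A": {"W": 1.00, "BL21": 0.90, "C": 0.96},
    "pathway_B": {"W": 0.50, "BL21": 0.00, "C": 0.50},
}
flagged = below_threshold(toy_yields)

# Size of the full grid: strains x pathways x conditions.
n_combinations = 7 * 245 * 2
print(flagged, n_combinations)
```

In the study, the grid of 7 strains, 245 pathways, and 2 conditions has 7 × 245 × 2 = 3,430 entries, so 582 flagged entries correspond to roughly 17%.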
Analysis using strain-specific models revealed several increased maximum theoretical yield advantages. E. coli Crooks and W had a 4–12% greater yield of 2-oxobutanoate on five of the different heterologous pathways in anaerobic conditions because of an alternate isoleucine biosynthesis pathway (STAR Methods). Furthermore, models of BL21 and Crooks had 21% higher yield of 1,4-butanediol in anaerobic conditions for two of the heterologous pathways (i.e., pathways 176 and 177) due to the ornithine aminotransferase reaction (STAR Methods). These differences in maximum theoretical yields demonstrate that major differences in strain behaviour exist based solely on internal reaction content and the unique metabolic network structure of each strain. Next, to gain a deeper understanding of strain-specific behaviour, the measured physiological data were integrated with each strain-specific model.
The analysis of theoretical yields presented above represents the maximum (i.e., ideal) capabilities of each strain. In vivo wild-type strain-specific behaviour can be analysed by integrating the measured strain-specific physiological data with its corresponding model. The constraint-based modelling techniques of flux variability analysis (FVA) (Mahadevan and Schilling, 2003) and Markov chain Monte Carlo (MCMC) sampling (Schellenberger and Palsson, 2009) were performed to determine minimum, maximum, and likely flux through each reaction in each strain based on the imposed physiological constraints (for example E. coli C, Figure 2A, Figure S9). The resulting probable flux distributions were used to classify reactions that must carry high flux (STAR Methods) to achieve the measured physiological secretion and growth rates, and were compared in both aerobic (Figure 2B) and anaerobic (Figure 2C) conditions.
High flux reactions were compared across the different strains (Figure 2D). Aerobically, there were 62 reactions classified as high flux in at least one strain. Of these, 37 were shared among all seven strains. Most of the shared reactions were involved in glycolysis, the TCA cycle, and the PPP (Data S5). In addition, reactions involved in glutamate metabolism were classified as high flux across all seven strains. The remaining 25 reactions were classified as high flux in at least one strain, but not shared by all. Some of these differences were obvious on a genetic level – for instance, five reactions in the oxidative PPP were classified as high flux in all strains except BL21, because, as discussed above, BL21 lacks the pgl gene, disabling flux through the oxidative PPP in this strain. Other differences in high flux reactions were related to differences in physiological behaviour. For example, acetaldehyde dehydrogenase was only a high flux reaction in two strains (DH5a and MG1655 – two of the strains that exhibited acetate overflow metabolism). Acetate secretion negatively correlated with flux through TCA cycle reactions, including citrate synthase (CS), aconitase (ACONTa/b), and isocitrate dehydrogenase (ICDHyr) (Data S5). Under anaerobic conditions, there were a total of 64 high flux reactions classified in at least one strain. Of these, 29 reactions shared high flux across all seven strains. These included predominantly glycolysis reactions and pentose phosphate pathway reactions as well as pyruvate formate lyase (PFL).
To delve deeper into strain-specific behaviour and the observed genetic and physiological differences, RNA-seq was used to collect genome-wide transcriptomic profiles of each strain at exponential phase in aerobic and anaerobic conditions (Data S6). Pairwise differential expression was compared between each of the seven strains and correlation coefficients were calculated to quantify the level of similarity between full expression profiles of shared genes for the different strains (Figure 3A and B). A Principal Component Analysis (PCA) was also performed that focused on metabolic genes (Figure 3C and D). The analysis highlights major differences in expression states. For example, BL21 displayed significantly different expression profiles in anaerobic conditions due to high expression of TCA cycle genes. This difference is most likely due to a nonsense mutation in the gene encoding the global oxygen-responsive transcriptional regulator FNR (Pinske et al., 2011) making this strain’s gene expression behave more similarly to an aerobic state. Further differences are discussed in the STAR Methods Section: Transcriptome analysis classifies shared and strain-specific gene expression profiles.
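As an illustration of the pairwise comparison of expression profiles, the sketch below correlates two strains over their shared genes only. The text does not state which coefficient or transformation was used, so Pearson correlation on log2(x + 1) values is an assumption here, and the strain labels and expression values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_similarity(profiles):
    """Correlate every pair of strains using only the genes they share.

    profiles maps strain -> {gene: expression value}.
    """
    result = {}
    for a, b in combinations(sorted(profiles), 2):
        shared = sorted(set(profiles[a]) & set(profiles[b]))
        # Assumption: log2(x + 1) transform before correlating.
        xa = [math.log2(profiles[a][g] + 1) for g in shared]
        xb = [math.log2(profiles[b][g] + 1) for g in shared]
        result[(a, b)] = pearson(xa, xb)
    return result

# Toy profiles (hypothetical values); g4 is not shared and is ignored.
profiles = {
    "strain_A": {"g1": 1, "g2": 3, "g3": 7},
    "strain_B": {"g1": 1, "g2": 3, "g3": 7, "g4": 100},
    "strain_C": {"g1": 7, "g2": 3, "g3": 1},
}
similarity = pairwise_similarity(profiles)
```

Restricting each comparison to shared genes keeps strain-specific genes, which have no counterpart in the other strain, from distorting the coefficient.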
As with reaction flux, gene expression values were analysed for each growth condition and classified into highly expressed gene sets (STAR Methods). This analysis identified a group of genes that were highly expressed species-wide. In aerobic conditions, 199 metabolic genes were classified as highly expressed in at least one of the seven strains (Figure S6, Data S5), but only 16 of these genes were significantly highly expressed across all strains. Three of these were involved in glycolysis: enolase (eno), fructose-bisphosphate aldolase (fbaA), and glyceraldehyde-3-phosphate dehydrogenase (gapA). In anaerobic conditions, 174 metabolic genes were classified as highly expressed in at least one of the strains, and 23 of the genes were highly expressed in all seven strains including eno and fbaA as well as acetaldehyde dehydrogenase (adhE) and methionine adenosyltransferase (metK).
The major differences observed in transcription profiles demonstrate unique regulatory mechanisms between strains. Knowledge of transcriptional control is directly applicable to bioprocessing and synthetic biology applications for tuning gene expression levels. Most transcription factors (TFs) have been characterized in E. coli K-12 MG1655, thus gene expression profiles between this strain and the other six were compared in both aerobic and anaerobic conditions. An enrichment analysis of TFs known to regulate gene expression was performed (STAR Methods). There are 196 TFs with known regulons available in RegulonDB (Huerta et al., 1998). For each strain, an average of 28±3 TFs were enriched for differential control of expressed genes in aerobic conditions and 29±6 TFs were enriched in anaerobic conditions (Data S6). An informative example is that of the galactitol regulon which includes gatYZABCD and is repressed by the gatR TF (Nobelmann and Lengeler, 1995). The gatR TF is highly enriched for differential expression in all of the strains except W3110. In MG1655 and W3110, the gatR gene is inactivated by an IS3E insertion leading to constitutive expression of these genes (Nobelmann and Lengeler, 1996). This aberrant regulation leads to expression and translation of gat genes that are ultimately responsible for nearly 1% of the wild-type E. coli K-12 MG1655 proteome (Li et al., 2014). In the other strains, gat gene expression is low, in part due to repression by gatR.
Other TFs that were significantly enriched for differential expression include, in aerobic conditions: arcA (anoxic redox control), cra (the catabolite repressor activator), and gadE (glutamic acid decarboxylase involved in maintenance of pH homeostasis), and anaerobically: fnr (mediates aerobic to anaerobic transition), IHF (integration host factor, responsible for maintaining DNA architecture), and purR (controls purine nucleotide biosynthesis). Transcription factors known to control genes in a shift from aerobic to anaerobic states were also examined (Table S2). Examining TF enrichment between strains identifies unique, strain-specific control mechanisms for different genes, even those that are conserved between strains. Further analysis will aid in determining differential regulatory mechanisms between strains of E. coli with the ultimate goal of manipulating gene expression to enhance metabolic engineering strategies as well as combating additional pathogenic members of the species.
A quantified correlation between high flux reactions and gene expression is key to understanding overall cell physiology and is of great interest to industrial biotechnology as overexpression of genes desired to carry high flux is a widely adopted approach to increase production of a target molecule (Lee et al., 2012a). In this study, 50±8% of model-determined high flux reactions also had encoding genes that were highly expressed. This overlap occurred significantly more often than random (empirical p-value < 0.001, permutation test, STAR Methods, Figure S7). Several genes, such as eno, fbaA, and gapA, were consistently high flux and highly expressed in all seven strains (Figure 4A, Data S7). Other gene/reaction pairs were less conserved, including those involved in amino acid metabolism such as ilvD, serC, and aspC, perhaps indicating large differences in amino acid use and biosynthesis between each of the strains. While a correlation between high flux reactions and gene expression is observed, it is unsurprising that several genes/reactions do not correlate as it has been demonstrated that gene expression can be a poor indicator of enzymatic activity (Machado and Herrgard, 2014).
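A permutation test of the kind cited above can be sketched as follows. The exact procedure is described in the STAR Methods and is not reproduced here; this version assumes the null model draws random gene sets of the same size as the highly expressed set from a common gene universe, and all names and toy sets are hypothetical.

```python
import random

def overlap_permutation_test(high_flux, high_expr, universe, n_perm=2000, seed=0):
    """Empirical p-value for the overlap between two gene sets.

    Null model (an assumption): the highly expressed set is replaced by a
    random draw of the same size from `universe`.
    """
    rng = random.Random(seed)
    observed = len(high_flux & high_expr)
    pool = sorted(universe)
    k = len(high_expr)
    hits = 0
    for _ in range(n_perm):
        sample = set(rng.sample(pool, k))
        if len(high_flux & sample) >= observed:
            hits += 1
    # Add-one correction keeps the empirical p-value above zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value

# Toy gene sets (hypothetical identifiers).
universe = {f"g{i}" for i in range(200)}
high_flux = {f"g{i}" for i in range(20)}
high_expr = {f"g{i}" for i in range(10)} | {f"g{i}" for i in range(100, 110)}
observed, p_value = overlap_permutation_test(high_flux, high_expr, universe)
```

With the add-one correction, the smallest attainable p-value is 1/(n_perm + 1), so reporting p < 0.001 requires at least 1,000 permutations.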
Prior to determining which strain might be best suited to produce a given target compound, an analysis was performed to answer the question of whether GEMs can be used to a priori predict changes in gene expression from one state to another. Using physiological data in aerobic and anaerobic conditions, fluxes were predicted for a shift from aerobic to anaerobic conditions. Overlap between model-predicted changes in reaction fluxes and experimentally observed changes in gene expression was analysed. On average, the metabolic models correctly predicted major changes in flux during a shift from aerobic to anaerobic conditions for 82±8% of the major reaction flux changes (30±12 genes per strain, see Table S3 and S4, Data S7). The results of this analysis indicated a level of predictability suitable for de novo strain-specific prediction in production strains (examples are given in Figure 4B–C).
An analysis was performed to determine the strain best suited for the production of a given compound as well as expression of a given construct from the set of E. coli strains examined in this study. A common metabolic engineering approach is to increase expression of the genes in a pathway of interest that lead to a product (Lee et al., 2007, Lee et al., 2012a, Huo et al., 2011). Based on this approach, it was reasoned that strains with natively high expression in a pathway of interest are likely better poised to produce a given product, as they would require fewer interventions to achieve a production goal. Therefore, genome-scale modelling was integrated with expression data to determine strains that are inherently best poised for production of a given product. Strain-specific models were used to predict the optimal flux distribution for production of two different sets of compounds in aerobic and anaerobic conditions: 1) all 20 amino acids using native E. coli pathways and 2) 20 non-native compounds using 245 heterologous pathways (Campodonico et al., 2014) (Data S8). Combining predicted fluxes with gene expression values allowed for the generation of a relative production potential score (‘R-score’, STAR Methods) that gauges a strain’s suitability for producing a given compound (e.g., Figure 5C and 5D).
An integrated analysis using transcriptomic data and genome-scale modelling revealed that each of the seven strains may be preferentially suited for production of different target metabolites. Strains that most often had an R-score >1 for amino acid production were MG1655 and DH5a for aerobic conditions (12/20 and 5/20, respectively) and MG1655 and W for anaerobic conditions (7/20 and 3/20, respectively). The targeted product also highlighted strain-specific differences. For example, in aerobic amino acid over-production (Figure 5A), it was found that E. coli W was predicted to be better at production of pyruvate-derived amino acids leucine and valine due to a more than two-fold greater expression of leuC, leuD, and ilvE compared to the other six strains. Variations in production potential were also prevalent across the 245 heterologous pathways examined (each corresponding to one of 20 different industrial compounds; some targeted products originated from multiple native precursors in the cell). Similarly, R-scores >1 were distributed across all seven strains examined. K-12 MG1655 had the highest number of R-scores >1 aerobically (94 pathways), and W and C had the highest numbers under anaerobic conditions (42 and 41 pathways, respectively) (Figure 5B).
Grouping the 20 different targeted heterologous products leads to a further characterization based on which strains were best suited for production of a particular class of compound. For example, strain W was best suited for production of 5/20 compounds (2-methyl-1-butanol, 1-butanol, 3-methyl-1-butanol, 2-keto-isovaleric acid, and 2,3-butanediol) independent of the heterologous pathway used (Table S5). In contrast, the best production strain for 1,4-butanediol varied based on the heterologous pathway used. For example, strain K-12 MG1655 had high expression of 2-oxoglutarate dehydrogenase encoding genes sucA, sucB, and lpd (2-fold greater than expression for strains C, Crooks, DH5a, and W3110) that produce succinyl-CoA, a branch point for several of the pathways leading to 1,4-butanediol production. However, other heterologous pathways leading to production of 1,4-butanediol start from 4-aminobutanal and DH5a was predicted to be best suited for these pathways.
Extending the model-driven analysis to selection of host strains (i.e., chassis) for synthetic biology applications revealed strain preferences based on amino acid requirements of a given construct. Coding sequences of synthetic biology constructs were obtained from the Registry of Standard Biological Parts (2015) and their amino acid composition was calculated. Further, the overall amino acid makeup of the E. coli proteome is stable (Li et al., 2014) and this trend holds true for amino acid frequencies across bacteria (Gilis et al., 2001, Hormoz, 2013, Latif et al., 2015). Thus, constructs in which particular amino acids are significantly over-represented may place a higher demand on the supply of those amino acids if the goal is to produce the construct as a large part of the host strain’s proteome. To test this concept, the R-score analysis for amino acid production capability was applied to each construct by comparing the overlap of a strain’s highly expressed amino acid biosynthetic pathways (found to be 1–4 amino acid pathways per strain based on the R-score) with those overrepresented in each construct. This approach led to a prediction of which strains may be best at expressing a certain synthetic biology construct considering both the amino acids required by the construct and the total amino acid pathways enriched in a strain (Figure S8). Under aerobic conditions, strain DH5a was predicted to be the best producer for the most constructs (568/3,983 or 14% of constructs) due to its inherent high expression of the biosynthetic pathways for tyrosine (Y) and phenylalanine (F) (amino acids that are often small fractions of the proteome) followed by BL21 (473/3,983 or 12% of constructs) for similar reasons. This result aligns well with the fact that DH5a is often preferred and used in cloning applications (Taylor et al., 1993, Song et al., 2015) and BL21 is popular for expression of recombinant proteins (Robichon et al., 2011, Marisch et al., 2013).
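The construct-matching idea can be sketched in a few lines: compute a construct's amino acid frequencies, flag residues that exceed a background frequency by some fold, and rank strains by how many of those residues overlap with their highly expressed biosynthetic pathways. The fold threshold, background frequencies, toy sequence, and strain-to-pathway mapping below are illustrative assumptions, not the study's parameters or its R-score calculation.

```python
from collections import Counter

def amino_acid_frequencies(protein):
    """Fraction of each residue in a one-letter protein sequence."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def overrepresented(protein, background, fold=2.0):
    """Residues whose frequency is at least `fold` times the background."""
    freqs = amino_acid_frequencies(protein)
    return sorted(
        aa for aa, f in freqs.items()
        if aa in background and f >= fold * background[aa]
    )

def best_hosts(over, strain_pathways):
    """Strains whose highly expressed pathways overlap most with `over`."""
    scores = {s: len(set(over) & p) for s, p in strain_pathways.items()}
    top = max(scores.values())
    return sorted(s for s, v in scores.items() if v == top)

# Hypothetical background frequencies and a toy construct sequence.
background = {"Y": 0.03, "F": 0.04, "A": 0.09, "L": 0.10, "M": 0.03}
construct = "MYYFFYAL"
over = overrepresented(construct, background)

# Illustrative mapping of strains to highly expressed amino acid pathways.
strain_pathways = {"DH5a": {"Y", "F"}, "W": {"L", "V"}}
hosts = best_hosts(over, strain_pathways)
print(over, hosts)
```

A real analysis would use the measured proteome composition as background and the R-score results in place of the toy mapping; very short sequences, like the one here, exaggerate residue frequencies.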
In summary, this approach emphasized the importance of strain-specific advantages in terms of network structure and native expression states that should be considered when choosing a host strain or chassis. Full results are provided in Data S8 and S9.
This study establishes a workflow to quantitatively compare strains in a species. This workflow was used to guide selection of the best host for applied biotechnology and, in general, presents a multi-omic resource for the important bacterial species E. coli. The omics data generated here address a gap in knowledge of this well-known species by providing comparable strain-specific information for seven industrially important strains grown in two well-defined conditions. This unified multi-omics dataset was integrated with GEMs to characterize strain-specific and species-wide properties of E. coli by comparing metabolic fluxes, gene expression, and differential regulation across the strains. New, quantified relationships between these datasets were drawn, along with an evaluation of the production potential of the strains based on maximum theoretical production yields and strain-specific native expression states. The compendium of data, GEMs, and production pathway analyses presented here provides the basis for analysing the overall diversity and production capabilities of the seven E. coli strains studied and could further be leveraged for additional applications such as antimicrobial strategies in healthcare. Key findings are available in Table S6.
A number of important strain-specific and species-wide properties for E. coli were identified. The K-12 strains are genetically very similar considering the overall genetic diversity of the 7 strains, yet their expression profiles under aerobic conditions showed significant variability (Figure 3). Previous studies have shown that W3110 has an amber mutation (stop codon) at position 33 in rpoS which is not found in MG1655 (Vijayendran et al., 2007). This mutation has been shown to reduce RpoS activity (Subbarayan and Sarkar, 2004). RpoS is one of the primary global regulators of E. coli’s complex regulatory network. Thus, a small change can have a large effect on cellular expression patterns. This highlights the need to better understand and elucidate transcription factor network architecture in even closely related strains of E. coli; the data presented here enables such a study.
The phenotypic differences observed between the strains, despite the fact that they have largely similar genomes and metabolic reaction networks compared to other sequenced E. coli strains (Monk et al., 2013, Baumler et al., 2011, Vieira et al., 2011), were among the most striking results from this study. The glucose uptake rates measured for the different strains were observed to vary more than 3-fold in anaerobic conditions. If the measured wild-type uptake rates can be even partially conserved when generating a bioprocessing strain, selection on this criterion alone could have major implications for strain productivity and bioprocess titres (Arifin et al., 2014). Also, there are a number of cases where some strains have additional or are lacking certain metabolic enzymes. The maximum theoretical production analysis presented here (Figure S4) demonstrates that these details are crucial to consider when selecting strains for a metabolic engineering project. Further, the pan-genome of this set is relatively small compared to all E. coli strains which have been sequenced thus far (Gordienko et al., 2013), implying that other strains may have pathways and enzymes available to mine for production purposes. Another key result was the identification of a 50±8% overlap of high-flux reactions with highly expressed genes that is in line with other studies (Holm et al., 2010, Ishii et al., 2007). This significant overlap defines an expected outcome for such data sets. Genes that fall outside it, for example those that are highly expressed although their reactions do not carry high flux, may be unnecessarily expressed for a given bioprocess and are therefore targets for expression reduction.
Maximum theoretical production and the native expression state of the cell are important considerations when choosing a strain. The case studies presented here show that specific strains have unique flux and gene expression patterns that, in turn, may affect the production capacity of a compound or construct. The native expression of genes within a pathway of interest is not the only factor influencing the generation of a successful production strain. For example, E. coli strain DH5a is often used in cloning applications due to an endA1 mutation that inactivates an intracellular endonuclease (Taylor et al., 1993) and BL21 is well established in recombinant protein production due to a lack of the Lon and OmpT proteases (Ratelade et al., 2009). Thus, aspects such as transformation efficiency (Liu et al., 2014), phage resistance (Furukawa and Mizushima, 1982), product tolerance (Lennen and Herrgard, 2014), and other traits must also be considered. Furthermore, maximizing theoretical yield does not necessarily lead to increases in titre or productivity. However, the workflow presented here, combining GEMs and omics data, could result in significant time and cost savings by reducing the number of genetic modifications necessary to develop high-level production strains or find a host to produce a construct of interest in a sufficiently high amount.
The new multi-omics data set provided in this study was generated using consistent and defined conditions for multiple strains of a species. Combined with the integrated analysis performed here, it will be of great use for industrial, basic biology, and human health applications. For example, these data and the R-score method could be applied to examine the production of reactive oxygen species across different strains to determine the impact on antimicrobial treatment (Brynildsen et al., 2013, Adolfsen and Brynildsen, 2015). This unified and normalized data set allows one to quantitatively compare strains and represents a comprehensive compendium of unique strain characteristics. The generation of similar data sets integrated with genome-scale modelling will enable rational strain selection and design for metabolic engineering and synthetic biology projects in other common production host organisms.
The work was funded by the Novo Nordisk Foundation and by grant 1R01GM057089 from the NIH/NIGMS.
Author Contributions
Conceptualization: JMM, AMF; Methodology: JMM, AK, MC, MH, AMF; Investigation: JMM, AK, MC, DM, JMS, BOP, AMF; Writing: JMM, AK, AMF, MJH, BOP; Funding Acquisition: AMF, MH, BOP; Resources: MJH, AMF, BOP; Supervision: MJH, BOP, AMF.