Genome-wide gene essentiality data sets are becoming available for Escherichia coli, but these data sets have yet to be analyzed in the context of a genome scale model. Here, we present an integrative model-driven analysis of the Keio E. coli mutant collection screened in this study on glycerol-supplemented minimal medium. Out of 3,888 single-deletion mutants tested, 119 mutants were unable to grow on glycerol minimal medium. These conditionally essential genes were then evaluated using a genome scale metabolic and transcriptional-regulatory model of E. coli, and it was found that the model made the correct prediction in ~91% of the cases. The discrepancies between model predictions and experimental results were analyzed in detail to indicate where model improvements could be made or where the current literature lacks an explanation for the observed phenotypes. The identified set of essential genes and their model-based analysis indicates that our current understanding of the roles these essential genes play is relatively clear and complete. Furthermore, by analyzing the data set in terms of metabolic subsystems across multiple genomes, we can project which metabolic pathways are likely to play equally important roles in other organisms. Overall, this work establishes a paradigm that will drive model enhancement while simultaneously generating hypotheses that will ultimately lead to a better understanding of the organism.
The advent of whole-genome sequencing and other high-throughput experimental technologies provides system level measurements that are driving efforts to develop computational models of the cell. The constraint-based reconstruction and analysis (COBRA) approach (36) has emerged in recent years as a successful approach to modeling systems on a genome scale. The COBRA approach begins with developing a metabolic network reconstruction based on the annotated genome sequence, known biochemistry, and other physiological data (38). Known constraints, such as enzymatic-reaction reversibility and maximum flux capacity, are then imposed on the network reconstruction to generate a model that defines all attainable network states (36). A current metabolic and regulatory model of Escherichia coli contains 932 unique metabolic reactions and Boolean logic statements for how 104 transcription factors regulate the expression of 479 out of the 906 metabolic genes (6). COBRA methods are available to predict which metabolic and regulatory genes are required for growth under given environmental conditions (7, 11, 43, 44).
Knowledge of which genes in an organism are essential and under what conditions they are essential is of fundamental and practical importance. This knowledge provides us with a unique tool to refine the interpretation of cellular networks and to map critical points in these networks. Examples of applications in which this information may be useful include engineering industrial microbial strains, as well as developing novel anti-infective agents. The importance of this emerging field devoted to investigations of gene essentiality is widely accepted, as witnessed by the rapid accumulation of genomewide essentiality data, which are now available for several model and pathogenic microbial species (1, 3, 16, 17, 19, 25, 27, 30, 42, 45, 48).
From a modeling perspective, a major limitation of the previous gene essentiality studies of E. coli was that they were performed using only partial (18, 24, 52) (i.e., not all mutants were evaluated) or heterogeneous (“historical” single-gene studies of a variety of strains and conditions compiled in the Profiling of E. coli Chromosome database [http://www.shigen.nig.ac.jp/ecoli/pec/]) data. Data provided by the first published genome scale genetic-footprinting study of E. coli (16) are generally not amenable to immediate model-based interpretation, as they (i) captured a rather complex phenotype (fitness within a competitive growth environment) and (ii) were obtained in undefined rich medium.
The recent release of the first complete collection of viable single-gene knockout E. coli strains (1) has opened an opportunity for systemic, genome scale gene essentiality studies in minimal and defined growth media. The group responsible for generating this valuable resource also reported the first genome scale conditional-essentiality screen on rich medium and glucose-supplemented minimal medium (1). In this study, we used this strain collection to integrate high-throughput experimental data and computational modeling to assess E. coli gene essentiality for growth on glycerol-supplemented minimal medium. The results of this conditional-essentiality screen were analyzed in the context of the most current genome scale metabolic and transcriptional regulatory model (6).
A systematic cross-validation of genome scale gene essentiality data with in silico predictions would play a critical role in refining the current metabolic reconstruction and the underlying model. At the same time, such an integrative analysis would assist in data analysis and interpretation in the structured-network context. For example, a recent study utilized the previously described integrated E. coli transcriptional-regulatory and metabolic model to validate its predictive capability against 13,750 growth phenotypes corresponding to 110 gene knockout strains grown under 125 different defined conditions (6). Discrepancies between the model predictions and experimental results pointed to poorly understood metabolic or regulatory events requiring further experimental investigation. The gene deletions evaluated in this previous study, however, covered less than 11% of the genes included in the current model.
Here, we identify the set of genes needed for growth on glycerol-supplemented minimal medium and analyze the results using a genome scale metabolic and regulatory model. We show this approach to be useful for a rigorous global evaluation of the genome scale modeling predictive power while simultaneously identifying directions for model improvement. The gene essentiality data obtained in this study were generally in good agreement with the model predictions, as well as with the results of the previously reported screen on glucose-supplemented minimal medium (1). This work represents the most thorough assessment on a gene-by-gene basis of the E. coli constraint-based metabolic model and is the first model-based evaluation of a truly genomewide gene essentiality screen on a single defined minimal medium for E. coli.
A recently described collection of 3,888 E. coli single-gene deletion mutants was constructed (1; http://ecoli.naist.jp/) by the method of Datsenko and Wanner (9). To determine the phenotypes of deletion mutants in M9 minimal medium containing glycerol as the carbon source, the mutants were inoculated in LB medium in the presence of kanamycin (30 mg/liter) using a 96-pin tool and were grown overnight at 37°C. The overnight cultures were washed twice with phosphate-buffered saline and then inoculated in glycerol-supplemented M9 liquid medium with kanamycin. The liquid culture was grown at 37°C with agitation for about 24 h, and the optical density (OD) was measured at 600 nm. The ODs from all wells of a plate were averaged, and the mutants in the wells with less than one-third of the average OD were considered nongrowers or slow growers. The experiment was done in triplicate, and mutants that were below the one-third average OD cutoff in at least two of three experiments were selected. This initial screen yielded about 230 deletion mutants that had slow or no growth on M9-glycerol medium. A secondary screen using the same procedure and the same cutoff (one-third of the average OD) was performed on this subset of mutants and yielded a final set of 119 E. coli deletion mutants that represented the conditionally essential complement of genes required for growth on glycerol. This second round of screening confirmed the genuine hits and eliminated false and nonreproducible hits. Each liter of M9 medium (Sigma catalog no. 6030) contained Na2HPO4·7H2O (6.8 g), KH2PO4 (3 g), NaCl (0.5 g), NH4Cl (1 g), MgSO4 (2 mM), CaCl2 (0.1 mM), glycerol (1%), and kanamycin (10 mg).
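The hit-calling rule described above (a well is flagged when its OD falls below one-third of the plate average, and a mutant is selected when it is flagged in at least two of three replicates) can be sketched in a few lines. This is a minimal illustration only: the function names, well labels, and OD values are hypothetical and are not taken from the screen itself.

```python
from statistics import mean


def low_growth_wells(plate_od, fraction=1 / 3):
    """Return wells whose OD is below `fraction` of the plate-average OD."""
    cutoff = fraction * mean(plate_od.values())
    return {well for well, od in plate_od.items() if od < cutoff}


def screen_hits(replicate_plates, min_replicates=2, fraction=1 / 3):
    """Return wells flagged as low growers in at least `min_replicates` plates."""
    counts = {}
    for plate in replicate_plates:
        for well in low_growth_wells(plate, fraction):
            counts[well] = counts.get(well, 0) + 1
    return {well for well, n in counts.items() if n >= min_replicates}


# Hypothetical OD600 readings for four wells on three replicate plates.
replicates = [
    {"A1": 0.60, "A2": 0.05, "A3": 0.55, "A4": 0.10},
    {"A1": 0.62, "A2": 0.04, "A3": 0.58, "A4": 0.30},
    {"A1": 0.58, "A2": 0.06, "A3": 0.50, "A4": 0.09},
]
hits = screen_hits(replicates)
```

In this toy data set, well A2 is below the cutoff on all three plates and well A4 on two of them, so both are selected. Because the cutoff is relative to the plate average, it shifts with the overall growth on each plate.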
To compare our results with the recently published data for growth on glucose-supplemented minimal medium (1), we selected the 119 slowest growers from that data set based on the observed OD at 24 h. This set coincidentally included nearly all of the strains with less than one-third of the average OD at 24 h for all strains tested.
A previously developed metabolic model of E. coli (6, 39) was used to predict the metabolic genes and reactions essential for growth on glycerol minimal medium. The model was modified to take into account genetic differences between MG1655 and BW25113 and recent changes in the genome annotation (40). Five metabolic reactions were removed (l-arabinose isomerase, l-ribulokinase, rhamnulokinase, l-rhamnose isomerase, and rhamnulose 1-phosphate aldolase), since the associated genes (araBAD, rhaBAD, and lacZ) are absent in the BW25113 strain that was the parental background for the genetic manipulations. Based on recent updates to the E. coli genome annotation (40), two additional metabolic genes (dfp and coaE) were also included in the metabolic model by associating them with three reactions involved in coenzyme A (CoA) biosynthesis that previously had no genes associated with them. Furthermore, atpI was removed from the model, since evidence suggested it did not participate in the ATP synthase complex (14). Additional changes in the genome annotation (40) also have merged (tdcG, araH, and ytfR) and split (dgoAD and glcEF) some genes included in the model. As a result, 899 metabolic genes are accounted for in the metabolic model and an additional 104 transcription factors are used in the combined metabolic and regulatory model.
Growth on glycerol minimal medium was simulated by maximizing flux through a defined biomass objective function and allowing the uptake of glycerol, NH4, SO4, O2, and Pi and the free exchange of H+, H2O, and CO2 (see reference 39 for further details). The biomass objective function is specified to define the weighted consumption of metabolites required to generate the cellular biomass. Simulations conducted in this manner represent approximations of the maximum attainable growth rate under the given environmental conditions and model specifications.
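For readers unfamiliar with this type of simulation, the optimization can be written schematically as a linear program. The notation below is an assumption introduced here for illustration (S for the stoichiometric matrix, v for the vector of reaction fluxes, and v^min and v^max for the flux bounds); it is the generic constraint-based formulation, not a transcription of the specific model files used in this study.

```latex
\begin{aligned}
\max_{v}\quad & v_{\mathrm{biomass}} \\
\text{subject to}\quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max} \quad \text{for every reaction } j .
\end{aligned}
```

In this formulation, the medium is defined through the bounds on exchange fluxes: uptake is permitted only for the listed nutrients, and all other uptake fluxes are constrained to zero.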
The maximum growth rates of gene knockout strains were calculated with each gene independently removed from the network. When simulating the deletion of a gene, all associated reactions were removed from the network except for those reactions with isozymes. Gene deletions where the predicted maximum growth rate was zero were categorized as essential. To evaluate the effects of transcription factor mutants, a combined metabolic and regulatory model was used to evaluate whether the deletion of a transcription factor is lethal for growth on glycerol minimal medium (6, 39). The regulatory model contains Boolean logic statements describing the transcription factors and environmental conditions needed for metabolic genes to be expressed (7, 8). All calculations with only the metabolic model were done using SimPheny (Genomatica, San Diego, CA), and LINDO (Lindo Systems, Inc., Chicago, IL) was used to calculate growth rates for the combined metabolic and regulatory model.
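The rule for translating a gene deletion into removed reactions can be illustrated with a small sketch. The representation is an assumption made here for clarity: each reaction maps to a list of alternative enzymes (isozymes), and each enzyme is the set of genes it requires. The gene and reaction names are hypothetical. The sketch covers only the gene-to-reaction step; the growth rate would then be recomputed by optimization with the removed reactions constrained to zero flux.

```python
def surviving_reactions(gpr, deleted_gene):
    """Return the reactions that remain after deleting a single gene.

    `gpr` maps each reaction to a list of alternative enzymes (isozymes);
    each enzyme is the set of genes it needs. A reaction survives if at
    least one alternative enzyme does not involve the deleted gene.
    Reactions with no gene association always remain.
    """
    surviving = set()
    for reaction, enzymes in gpr.items():
        no_gene_association = not enzymes
        has_intact_enzyme = any(deleted_gene not in genes for genes in enzymes)
        if no_gene_association or has_intact_enzyme:
            surviving.add(reaction)
    return surviving


# Hypothetical toy network.
toy_gpr = {
    "R1": [{"geneA"}],             # single enzyme
    "R2": [{"geneA"}, {"geneB"}],  # two isozymes
    "R3": [{"geneC", "geneD"}],    # one enzyme needing two genes
    "R4": [],                      # no gene association
}

removed_by_geneA = set(toy_gpr) - surviving_reactions(toy_gpr, "geneA")
removed_by_geneB = set(toy_gpr) - surviving_reactions(toy_gpr, "geneB")
removed_by_geneC = set(toy_gpr) - surviving_reactions(toy_gpr, "geneC")
```

Deleting geneA removes R1 but not R2, because the geneB isozyme still catalyzes R2. This is why genes with isozymes are generally predicted to be nonessential.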
We used The SEED genomic platform (http://theseed.uchicago.edu/FIG/index.cgi) for a cross-genome comparison of metabolic subsystems implicated by the set of conditionally essential E. coli genes identified in this study. A subsystem is defined in The SEED environment as a collection of functional roles (enzymes, transporters, or regulators) known to be involved in a well-defined biological process, such as a subnetwork (a cluster of pathways) associated with a particular aspect of metabolism (e.g., glycolysis) (34). A populated subsystem is defined as a table of tentative role-to-gene connections asserted by curators for a broad range of species containing a functional variant of this subsystem (51). In this study, we used The SEED tools to generalize the data from the described essentiality screen in a broader phylogenetic context. This approach circumvents certain limitations of traditional gene-by-gene comparisons, as there are reported cases where the same reaction or functional role can be implemented by nonorthologous enzymes in different species (28).
Briefly, a table was constructed that relates conditionally essential genes (both identified by the experiment and predicted by computational modeling) to The SEED collection of metabolic subsystems. For further analysis, this table was simplified to a set of binary associations (one gene to one “primary” subsystem) and limited to the approximately 20 key subsystems that contained more than two experimentally defined essential genes. We then examined operational variants of these subsystems (as defined by a subsystem curator) over a diagnostic set of 31 species with available completely sequenced genomes spanning much of the known bacterial phylogeny. For illustrative purposes, we used the same set of genomes as in the previous analysis of genetic-footprinting data (16) (see supplementary Table 4 [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]). For this simplified analysis, we monitored only the presence or absence of at least a minimal functional variant for each subsystem and each genome in the set. The results were hierarchically clustered for visualization and analysis purposes (see Fig. 6) using the Hamming distance metric and average linkage.
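The clustering step can be illustrated with a small pure-Python sketch of average-linkage agglomerative clustering on binary presence/absence profiles. The genome labels and profiles below are hypothetical, and the Hamming distance is used here as a raw count of mismatches (some software reports it as a fraction instead). The software actually used to produce the published figure is not specified in the text.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    `profiles` maps a label to a tuple of 0/1 values (subsystem absent or
    present). Returns the merges in order as (cluster_1, cluster_2, distance),
    where each cluster is a tuple of labels.
    """
    def dist(c1, c2):
        # Average of all pairwise distances between members of two clusters.
        total = sum(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical presence (1) / absence (0) of five subsystems in four genomes.
profiles = {
    "genome_A": (1, 1, 1, 1, 1),
    "genome_B": (1, 1, 1, 1, 0),
    "genome_C": (0, 0, 1, 1, 0),
    "genome_D": (0, 0, 0, 0, 0),
}
merges = average_linkage(profiles)
```

In this toy example, the two genomes with nearly complete subsystem sets merge first, the two reduced genomes merge second, and the two groups join last at the largest distance.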
Real-time RT-PCR was used to quantify gene expression levels for genes related to glycerol metabolism (glpK, glpD, glpB, gpsA, gldA, and dhaM). Total RNA was extracted from cells harvested from mid-log-phase cultures of E. coli strain BW25113 (9) grown on glucose-supplemented (A600 ≈ 0.5) and glycerol-supplemented (A600 ≈ 0.3) M9 minimal medium (2 g/liter). Triplicate RNA samples (biological replicates) were stabilized using RNAProtect Bacterial Reagent (QIAGEN) and isolated using the RNeasy mini kit (QIAGEN). cDNA was synthesized using SuperScript III (Invitrogen) and purified using the QIAquick PCR Purification kit (QIAGEN).
The resulting cDNA samples were used in subsequent real-time reverse transcription (RT)-PCR assays using the QuantiTect SYBR Green PCR kit (QIAGEN) and iCycler iQ system (Bio-Rad). Nine replicate measurements (three technical replicates for each biological replicate) were performed for each assayed gene under both growth conditions. The acyl carrier protein (ACP)-encoding gene acpP was used as a reference for each assay. A standard curve was generated by varying amounts of genomic DNA with fixed primer concentrations and was used to calculate primer efficiencies. The reported relative expression levels for each gene were determined by normalizing the amount of cDNA product to acpP cDNA quantified from the same cDNA sample.
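One common way to carry out this type of quantification is sketched below: a line is fitted to threshold cycle (Ct) versus log10 template amount for each primer pair, the slope gives the primer efficiency, and the fitted curve converts a sample Ct into a template quantity that is then divided by the acpP quantity from the same sample. The exact calculation used in this study is not detailed in the text, so the formulas, function names, and all numbers here are illustrative assumptions.

```python
def fit_standard_curve(log10_amounts, ct_values):
    """Least-squares line Ct = slope * log10(amount) + intercept."""
    n = len(log10_amounts)
    mean_x = sum(log10_amounts) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_amounts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_amounts, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def amplification_efficiency(slope):
    """Efficiency from the slope; 1.0 corresponds to perfect doubling per cycle."""
    return 10 ** (-1 / slope) - 1


def quantity_from_ct(ct, slope, intercept):
    """Template amount (in standard-curve units) for an observed Ct."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical standard curves (log10 of genomic DNA amount versus Ct).
dilutions = [0, 1, 2, 3]
target_curve = fit_standard_curve(dilutions, [30.0, 26.7, 23.4, 20.1])
acpP_curve = fit_standard_curve(dilutions, [28.0, 24.6, 21.2, 17.8])

# Hypothetical Ct values measured for one cDNA sample.
target_quantity = quantity_from_ct(23.4, *target_curve)  # about 100 units
acpP_quantity = quantity_from_ct(17.8, *acpP_curve)      # about 1000 units
relative_expression = target_quantity / acpP_quantity
```

Using a separate standard curve for each primer pair means that differences in primer efficiency are absorbed into the quantities before the ratio to acpP is taken.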
By evaluating single-gene deletion strains for growth on glycerol-supplemented minimal medium, we identified genes essential for growth in a minimal-medium environment that are not essential in a rich-medium environment. A genome scale metabolic and regulatory model was used to evaluate the data and to identify any discrepancies between the model and the experimental data. In addition, the essential genes identified in this study were compared to gene essentiality data for growth on glucose-supplemented minimal medium (1), and their phylogenetic distribution across multiple genomes was evaluated.
Of the 3,888 single-gene deletion E. coli mutants viable on rich medium and screened in this study, 119 were reproducibly incapable of growth on glycerol minimal medium (Table 1; for complete results, see supplementary Tables 1 and 2 [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]). Most of these conditionally essential genes are involved in core metabolic processes: amino acid metabolism (59 genes), nucleotide metabolism (19 genes), cofactor metabolism (15 genes), and transport (5 genes). Seventeen genes are involved in other miscellaneous processes, and four regulatory genes were also found to be conditionally essential.
Only seven (cysQ, fes, leuL, prfB, rpsU, yhhK, and yjhS) of the 119 identified essential genes are not accounted for in the current metabolic and regulatory model, since the genes do not encode metabolic enzymes or transcription factors with known functions. While the specific role of cysQ in sulfate assimilation is unknown (33), it is an important component of cysteine biosynthesis. fes is important for iron transport and utilization in low-iron environments, such as the minimal medium used in this study (10). Alteration of the transcriptional attenuation (32)-mediated regulation of the leuLABCD operon (50), which encodes the proteins critical for leucine biosynthesis, likely explains the essentiality of the leader peptide encoded by leuL. The conditional essentiality of several other nonmodel genes, including those encoding PrfB (a peptide chain release factor) and RpsU (30S ribosomal subunit S21) and the uncharacterized genes yhhK (a putative acyltransferase) and yjhS, cannot be readily interpreted without further experimental investigation. The remaining 112 essential genes, together with the nonessential genes, can be compared to predictions made with the current metabolic and regulatory model.
Given that most of the essential genes involve metabolic genes and metabolic regulators, we conducted a detailed comparison of the experimentally observed and computationally predicted essential genes (Fig. 1 and Table 1). Computational analysis of single-gene deletion events predicted 182 genes (177 metabolic and 5 regulatory genes) to be lethal and thus required for growth in glycerol minimal medium. Nearly half of these genes were still predicted to be essential by the model even if all transportable metabolites were allowed to be taken up by the cell simultaneously, so they are likely to be essential for growth on rich medium, as well. Among the 182 model-predicted lethal mutants, 63 were not present in the analyzed collection. Although a fraction of these missing mutants may reflect technical failures, most of them are associated with genes expected to be essential under any environmental conditions. Such genes are typically responsible for producing essential metabolites that cannot be salvaged even from rich medium.
As shown in Fig. 1, ~69% of experimentally identified conditionally essential genes covered by the model (77 of 112) were predicted to be essential by evaluating in silico single-gene deletions. An additional 8% of experimentally essential genes (9 of 112) would be correctly predicted by the model to be essential if additional isozymes were not present, possibly indicating that the expression of alternative isozyme-encoding genes is not sufficient to compensate for growth on glycerol minimal medium. Alternatively, these nine cases may point to incorrect functional assignment of some paralogs.
This leaves 26 essential genes unexplained by the model, in which the experimentally observed essential genes are associated with predicted nonessential model genes (Table 2 and supplementary Table 3 [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]). Six genes out of these 26 discrepancies (atpA, atpB, atpC, atpF, atpG, and atpH) are part of the ATP synthase complex. According to the model, the deletion of the ATP synthase reaction should not be lethal but it should reduce the maximum growth rate by ~75%, which may be close to the viability threshold used in this study. Interestingly, two other components of the ATP synthase complex (atpD and atpE) were deemed nonessential in our experimental screen.
An additional large subset of these discrepancies (9 of 26) appear to be caused by the existence of alternative pathways available within the metabolic model but whose genes are probably not expressed in vivo under the conditions of this screen. For example, proA and proB can be functionally replaced in the model by the combined action of argA, argB, argC, and argE gene products in proline biosynthesis, since both result in the production of glutamate-5-semialdehyde (Fig. 2). However, this alternate pathway is observed experimentally only in double-deletion strains, where an argD deletion leads to increased levels of N-acetylglutamic γ-semialdehyde, which is then converted into glutamate 5-semialdehyde by argE, thereby allowing compensation for the second deletion, either proA or proB (23).
Another subset of these discrepancies (7 of 26) are associated with the biosynthesis of vitamins and cofactors: pyridoxal 5-phosphate (pdxABHJ), thiamine (iscS), and ubiquinone (ubiGH) (Fig. 3), largely reflecting the fact that the need to produce these cofactors was not duly accounted for in the biomass objective function. The ubiG and ubiH gene products are essential for growth on glycerol minimal medium, while other gene products involved in the ubiquinone biosynthesis pathway are essential during growth on rich medium (1) (ubiA, ubiB, and ubiD) and still others are not essential under either condition (ubiC, ubiX, and ubiF).
Several discrepancies related to phosphoenolpyruvate (PEP) metabolism and the PEP-carbohydrate phosphotransferase systems (PTS) likely resulted from posttranscriptional regulation of GlpK (glycerol kinase) that is not accounted for in the metabolic or regulatory model. It is known that deletion of ppc (encoding the enzyme PEP carboxylase) leads to the accumulation of PEP, which allosterically inhibits glycolytic enzymes, such as Pgi and Pfk (12). This inhibition would lead to an increase in Pgi and Pfk metabolic intermediates, including fructose 1,6-bisphosphate, a potent allosteric inhibitor of GlpK (22) (Fig. 4).
Two PTS genes, ptsI and crr, were also detected as discrepancies in this study, in which the model predicts an observed essential gene to be nonessential. PTS enzyme I, encoded by ptsI, is phosphorylated in a reaction with PEP in the first step of the PTS, and crr encodes PTS glucose-specific enzyme IIA (EIIAGlc), which is another intermediate that transfers the PTS phosphate to glucose. EIIAGlc is also a central regulatory molecule in E. coli metabolism (35), and in its unphosphorylated form, EIIAGlc binds and allosterically inhibits GlpK, thus ultimately impeding glycerol uptake and metabolism (21, 22). Phosphorylation of EIIAGlc releases GlpK, however, and facilitates normal glycerol uptake and metabolism. Therefore, a ptsI deletion would interfere with the transfer of a phosphate to EIIAGlc and block the release of GlpK inhibition. The deletion of crr is more difficult to explain in this context, as one might expect that the resultant constitutive relief of EIIAGlc inhibition would lead to enhanced glycerol uptake and metabolism. The observed essentiality of crr likely stems from the general disruption of its other critical cellular roles. For example, phosphorylation of EIIAGlc activates adenylate cyclase, and accordingly, the crr mutant has reduced cyclic AMP levels (29), likely resulting in potentially harmful pleiotropic effects due to improper global gene regulation by crp. Despite these readily explained results, we do not yet have a rationale for the observed nonessentiality of ptsH and cyaA, which encode the PTS protein HPr and adenylate cyclase, respectively.
In addition to the strong correlation between predicted and observed conditionally essential genes, there is also good agreement between the predicted and observed nonessential genes. Of the 3,769 observed nonessential genes, 784 are represented in the model, and ~95% (742 of 784) of these are correctly predicted to be nonessential by the model (Fig. 1). This leaves 42 discrepancies (listed in Table 2 and supplementary Table 3 [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]) where the model incorrectly predicts genes to be essential. Some of these 42 predicted essential genes not identified in the experimental screen are involved in the biosynthesis of biomass components, such as lipopolysaccharide (LPS), spermidine, and glycogen, which in fact may not be essential biomass components. For example, it is known that a complete LPS is not required for growth (37).
For other biomass components like arginine and lysine, a rationale for the observed discrepancies may be related to the existence of alternative reactions and/or isozymes that are unaccounted for in the model. For example, argD encodes an enzyme with dual activity as both acetylornithine aminotransferase (EC 2.6.1.11; required for arginine biosynthesis) and N-succinyl-l,l-diaminopimelate aminotransferase (EC 2.6.1.17; required for lysine biosynthesis) and is predicted to be essential by the model. The astC gene (also known as argM) encodes an enzyme with succinyl- and acetylornithine aminotransferase activities and has been speculated to have N-succinyl-l,l-diaminopimelate aminotransferase activity, as well (16). As another example, both coaA and coaE gene products are required to produce CoA; however, neither gene was essential in rich medium or glycerol minimal medium, while the remaining genes involved in the pathway were essential (Fig. 3). Other enzymes may be present which can carry out these essential reactions, although it is likely that the apparent viability of at least one of these strains (coaE) was due to a yet-unknown artifact, since the coaE gene (formerly yacE) was shown to be essential in a number of mutant studies (16, 20).
Two transporters were also computationally predicted to be essential, glpF and amtB. Although in the model these are the only transporters for glycerol and ammonia, respectively, both compounds freely diffuse through membrane vesicles (13, 26) and their transporters are likely essential only at very low solute concentrations. We subsequently tested the growth capabilities of the ΔglpF mutant (after removal of the kan gene as previously described) on different concentrations of glycerol to confirm this hypothesis. As the glycerol concentrations were reduced (from 2 g/liter to 0.25 g/liter), the ΔglpF mutant strain had increasingly lower growth rates than the BW25113 parental strain (see the supplementary figure [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]). At a glycerol concentration of 0.125 g/liter, the parental strain was able to grow at a lower rate, whereas growth for the ΔglpF mutant strain was abolished. Similar observations have been made in previous ammonium-limited growth experiments for amtB mutants, and it was speculated that 10 μM NH4+ concentrations would be needed to see growth defects in ΔamtB strains (46).
Combined analysis of both essential and nonessential genes indicated a total of 68 discrepancies (only ~8% of total predictions) between experimental and computational essentiality assignments (Table 2). These discrepancies can be grouped into three types, pointing to possible model improvements with respect to boundary conditions (a formula for essential biomass components), gene-reaction associations (annotations), and quantitative constraints for the passive uptake of nutrients (nonspecific transport).
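The percentages quoted in this comparison follow directly from the counts reported above, as the short calculation below shows. The variable names are introduced here for illustration; every number is taken from the text.

```python
# Observed essential genes that are represented in the model (112 in total).
essential_predicted = 77         # predicted essential by the model
essential_isozyme_cases = 9      # would be predicted essential without isozymes
essential_unexplained = 26       # predicted nonessential (discrepancies)

# Observed nonessential genes that are represented in the model (784 in total).
nonessential_predicted = 742     # predicted nonessential by the model
nonessential_discrepant = 42     # predicted essential (discrepancies)

essential_in_model = (essential_predicted + essential_isozyme_cases
                      + essential_unexplained)
nonessential_in_model = nonessential_predicted + nonessential_discrepant
total_predictions = essential_in_model + nonessential_in_model
discrepancies = essential_unexplained + nonessential_discrepant

pct_essential_predicted = 100 * essential_predicted / essential_in_model
pct_nonessential_predicted = 100 * nonessential_predicted / nonessential_in_model
pct_discrepant = 100 * discrepancies / total_predictions

print(f"{pct_essential_predicted:.1f}% of observed essential genes predicted essential")
print(f"{pct_nonessential_predicted:.1f}% of observed nonessential genes predicted nonessential")
print(f"{discrepancies} discrepancies out of {total_predictions} ({pct_discrepant:.1f}%)")
```

Note that the total of 68 discrepancies counts only the 26 unexplained essential genes and the 42 incorrectly predicted nonessential genes; the nine isozyme-related cases are not included in that total.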
In the recently published description of the “Keio collection” (1), the authors described the conditional essentiality of the single-gene knockout strains when grown on glucose-supplemented minimal 3-N-(morpholino) propane sulfonate (MOPS) medium. Using this data set, we identified the 119 slowest growers on glucose-supplemented minimal medium by ranking the ODs measured at 24 h. For the purposes of this analysis, this subset represents the conditionally essential genes required for growth on glucose minimal medium. The conditionally essential gene sets for glucose and glycerol largely overlap (Fig. 5). The genes found in this overlapping group primarily include those required to form biomass components in the absence of rich medium, such as nucleotides and amino acids, as well as those needed to generate required cofactors, such as NAD(P), CoA, folates, and pyridoxal 5-phosphate. Accordingly, these genes represent a conserved conditionally essential core that is required for E. coli to grow under minimally supplemented growth conditions and is not required for growth under rich (i.e., LB medium) conditions.
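The ranking and comparison described here amount to taking the n lowest-OD strains from one data set and intersecting the result with the other set. A minimal sketch follows; the gene labels and OD values are placeholders, not data from either screen.

```python
def slowest_growers(od_by_strain, n):
    """Return the n strains with the lowest OD."""
    ranked = sorted(od_by_strain, key=od_by_strain.get)
    return set(ranked[:n])


def compare_sets(glucose_set, glycerol_set):
    """Split two essential-gene sets into shared and condition-specific genes."""
    return {
        "shared": glucose_set & glycerol_set,
        "glucose_only": glucose_set - glycerol_set,
        "glycerol_only": glycerol_set - glucose_set,
    }


# Placeholder OD readings at 24 h for deletion strains grown on glucose.
glucose_od = {"gene1": 0.02, "gene2": 0.03, "gene3": 0.04,
              "gene4": 0.45, "gene5": 0.50, "gene6": 0.55}
glucose_essential = slowest_growers(glucose_od, n=3)

# Placeholder set of genes found essential on glycerol.
glycerol_essential = {"gene1", "gene2", "gene5"}

comparison = compare_sets(glucose_essential, glycerol_essential)
```

Because both sets are fixed at the same size (119 genes in the analysis above), the numbers of glucose-specific and glycerol-specific genes are necessarily equal.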
Relatively few genes are conditionally essential for growth on glucose relative to growth on glycerol (Fig. 5). Among the glucose-specific conditionally essential genes are 10 that may simply be slow growers, as their ODs after 48 h were substantially increased. Furthermore, three (argB, argC, and metE) were likely false positives, given their nonessentiality in independent phenotype microarray screens (18, 24), while one (argG) agrees with prior studies (24). Perhaps more interesting are the six biotin biosynthesis-related genes that are essential in glucose- but not glycerol-supplemented growth on minimal medium. This discrepancy involving all biotin biosynthesis genes may indicate an unidentified source of biotin in the glycerol essentiality screens. Five additional genes (ilvE, cysG, ubiE, exoX, and hflD) are also glucose-specific essential genes, although the rationale for their conditional essentiality remains unclear.
An equal number of genes were observed to be essential for growth on glycerol but not on glucose. Four genes in this set of glycerol-specific conditionally essential genes are directly related to glycerol metabolism or its regulation. As previously described, glpK and glpD are involved in the initial steps of glycerol catabolism, while crr and cra (also known as fruR) are key components of the PTS and mediators of catabolite repression. The differential essentiality of ubiG and ubiH can be explained by the requirement for an electron acceptor for growth on glycerol and the utilization of ubiquinone in oxygen respiration (15). This suggests that ubiC, ubiE, and ubiF should also be essential for aerobic growth on glycerol; however, this conflicts with the observed experimental results.
Another six genes in this glycerol-specific set are involved in sulfate transport and assimilation (cysADKPQU). This result likely stems from the fact that the medium used in the glucose essentiality screen contains MOPS, which can be utilized as a sulfur source under sulfate-limited conditions (4), whereas the M9 minimal medium used in this glycerol-specific screen does not contain an alternative sulfur source besides sulfate. M9 minimal medium does not include iron, whereas MOPS minimal medium contains 10 μM iron; this difference in medium formulations accounts for the fact that fes (encoding an iron-scavenging protein) is essential in glycerol-supplemented M9 medium and not in glucose-supplemented MOPS medium. A glmM deletion has previously been reported to be essential (31), which agrees with the essentiality of glmM reported in this glycerol lethal data set and may represent a false-negative result in the glucose conditional-essentiality data set. ATP synthase components were also found to have different essentiality results, with atpABCFGH being essential for growth on glycerol and only atpBC being essential for growth on glucose. For both minimal-medium conditions, another ATP synthase component, atpD, was not essential. Finally, seven additional genes conditionally essential for growth on only glycerol-supplemented medium remain difficult to explain.
The analysis of conditionally essential genes in the context of metabolic subsystems described in The SEED projected over a diagnostic set of 31 diverse bacterial genomes is illustrated in Fig. 6. Only those subsystems that contained more than two experimentally defined genes conditionally essential for growth on glycerol minimal medium are shown. Overall, 103 out of 119 experimentally essential genes (as well as 11 additional genes predicted by the model to be essential) are covered by a rather small set of 18 subsystems (a complete list of gene-to-subsystem correspondences is provided in supplementary Table 4 [http://systemsbiology.ucsd.edu/publications/supplemental_material/JBact2006/]).
Although this deliberately simplified analysis masks substantial differences between the specific variants of subsystems (or pathways) implemented in different species, it reveals some important trends. First, the majority of organisms possess an operational variant of most of these conditionally essential subsystems. Not surprisingly, the group of organisms that lack functional versions of many of these essential subsystems, albeit phylogenetically quite diverse, are all obligate pathogens or symbionts, many of them intracellular. In particular, five species (Borrelia burgdorferi, Chlamydia trachomatis, Mycoplasma pneumoniae, Rickettsia prowazekii, and Treponema pallidum) lack functional variants in all but two to four subsystems. Moreover, the most conserved subsystem across all organisms examined (glycine, serine, and threonine synthesis) is represented in these species by only a single-enzyme pathway (serine hydroxymethyltransferase [EC 2.1.2.1]). In stark contrast, 15 organisms share each of the 18 identified conditionally essential subsystems with E. coli. This observed dichotomy reflects two drastically different lifestyles, as these 15 organisms are able to thrive outside of a host. This analysis confirms that nearly all subsystems implicated by this conditional-essentiality study in a single model organism are universally important for a broad range of phylogenetically distant free-living bacteria.
The screening of single-gene deletion mutants on glycerol minimal medium provides a meaningful addition to the collection of data regarding essential genes for E. coli. With the combination of other such genome scale gene essentiality studies, we continue to refine our notion of what genes are required for growth on rich and minimal media. From a comparison of genes required for growth under rich- and minimal-medium conditions, a toolkit of genes enabling growth in limiting environments can be identified. By studying the genes required for growth on glycerol minimal medium, we showed that (i) our understanding of the roles that these essential genes play in this toolkit is clear and relatively complete, as only two putative genes of unknown function (yjhS and yhhK) were identified as essential in this phenotyping screen; (ii) the current metabolic and regulatory model is highly accurate in its essentiality predictions; and (iii) comparisons of model predictions and high-throughput phenotyping data represent a powerful approach to rapidly generate model refinements and hypotheses likely to lead to an enhanced understanding of the organism.
Remarkably, 112 of the identified 119 conditionally essential genes are included in the current metabolic model. This observation suggests that the applied experimental approach has a very low rate of incorrectly identifying essential genes. Otherwise, nonmetabolic and uncharacterized genes (at least 40% of E. coli genes) would comprise a substantially larger fraction of the identified set. At the same time, it indicates that an inventory of E. coli metabolic genes captured in the current model (1,003 out of ~4,400 genes in the E. coli genome) is rather comprehensive, at least with respect to the pathways required to support growth on minimal medium. The fact that the identified conditionally essential gene set contained only two genes of unknown function is notable but not surprising, since our screening protocol is conceptually equivalent to the identification of auxotrophs, a historical standard in the study of E. coli genetics.
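The enrichment argument above can be made concrete with simple arithmetic. The following is a minimal sketch that uses only the counts quoted in the text; the null assumption that a nonspecific screen would draw genes uniformly from the genome is introduced here purely for illustration and is not part of the original analysis.

```python
# Counts quoted in the text.
essential_total = 119      # conditionally essential genes found in the screen
essential_in_model = 112   # of those, genes present in the metabolic model
model_genes = 1003         # genes captured in the current model
genome_genes = 4400        # approximate number of E. coli genes (~4,400)

# Fraction of the genome represented in the model.
model_fraction = model_genes / genome_genes

# Illustrative assumption: a screen dominated by spurious hits would sample
# genes roughly uniformly, so this many hits would fall in the model by chance.
expected_by_chance = essential_total * model_fraction

# What was actually observed, and how it compares with the chance expectation.
observed_fraction = essential_in_model / essential_total
fold_enrichment = observed_fraction / model_fraction

print(f"model covers {model_fraction:.1%} of the genome")
print(f"expected in model by chance: {expected_by_chance:.1f} of {essential_total}")
print(f"observed in model: {observed_fraction:.1%} ({fold_enrichment:.1f}-fold)")
```

Under that uniform-sampling assumption, only about 27 of the 119 genes would be expected to fall within the model, whereas 112 were observed, roughly a fourfold enrichment.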
These experimentally essential genes can be mapped to metabolic subsystems, which allows a level of generalization enabling us to detect tendencies across multiple organisms that may be obscured by details of functional variants. This type of analysis readily facilitates the identification of metabolic functions that are required by different organisms without the potentially complicating details regarding how the molecules are synthesized. For example, Bacillus subtilis, E. coli, and Corynebacteria use three different chemistries in the lysine biosynthesis DAP (meso-diaminopimelate) pathway, but their purposes remain the same. It should be noted that these subsystem projections were made only for conditionally essential genes and not for genes that are essential for growth on rich medium (and likely essential in minimal-medium environments, as well). For example, only the portions of the pathways that are required for NAD and CoA biosynthesis on minimal and not rich medium are represented. Otherwise, these fundamentally essential subsystems would be present in all analyzed genomes.
The set of conditionally essential subsystems (and genes therein) identified in this study may also be used to assess the metabolic potentials of organisms present in environmental samples as captured by emerging metagenomics data (49). Researchers will be able to rapidly assess the pathways present within an environmental sample and use the essentiality information to develop potential laboratory medium formulations to facilitate further controlled study in the laboratory (47). Furthermore, the presence of certain pathways and the absence of others may provide insights into the microenvironment from which the sample was taken and also indicate local intracommunity relationships between species that are present in the sample. This subsystem-based essentiality analysis approach could be a useful tool to add to the growing compendium of methods (5, 41) being developed to analyze and interpret these complex data.
Further analysis of the generated gene essentiality data set was carried out using a metabolic and regulatory model, which allowed the data to be placed readily into biological context. Discrepancies between model and experiment can be used to improve the predictive capabilities of the model by indicating regions that the model does not capture accurately or, more importantly, can point to areas in metabolism or regulation that require further experimental interrogation. For example, a number of independent gene deletion studies have shown that some genes involved in arginine biosynthesis are not essential (18, 24), but without these enzymes, the current literature cannot explain how this essential amino acid is synthesized. Therefore, further experiments need to be conducted to either identify novel arginine biosynthetic genes or determine which multifunctional enzymes can compensate for any perturbation of the genes.
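The comparison of model predictions with experimental phenotypes can be sketched as a simple per-gene classification. The code below is an illustrative outline, not the procedure used in this study; the function names and the gene labels (gene_a to gene_e) are hypothetical placeholders. A false negative is taken to be a lethal phenotype with a nonlethal model prediction, and a false positive the reverse.

```python
from collections import Counter


def classify(model_essential: bool, observed_essential: bool) -> str:
    """Label one gene by comparing predicted and observed essentiality."""
    if model_essential and observed_essential:
        return "agree_essential"
    if not model_essential and not observed_essential:
        return "agree_nonessential"
    if observed_essential:
        # Lethal phenotype, nonlethal model prediction.
        return "false_negative"
    # Viable mutant, lethal model prediction.
    return "false_positive"


def summarize(predictions: dict, observations: dict):
    """Count each category and report the fraction of agreeing genes."""
    shared = [g for g in predictions if g in observations]
    counts = Counter(classify(predictions[g], observations[g]) for g in shared)
    total = sum(counts.values())
    agree = counts["agree_essential"] + counts["agree_nonessential"]
    return counts, (agree / total if total else 0.0)


# Hypothetical toy data (True = essential); not results from this study.
predictions = {"gene_a": True, "gene_b": True, "gene_c": False,
               "gene_d": False, "gene_e": False}
observations = {"gene_a": True, "gene_b": False, "gene_c": True,
                "gene_d": False, "gene_e": False}

counts, agreement = summarize(predictions, observations)
print(dict(counts), agreement)
```

Genes falling in the two discrepancy categories are the ones worth examining individually, since each may indicate either a gap in the model or a gap in the literature.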
Additionally, based on the experimental results, several model improvements are suggested. Since a number of experimentally essential genes are involved in cofactor biosynthesis, a number of cofactors should be included in the biomass objective function used to conduct the growth prediction simulation. These cofactors include pyridoxal-5-phosphate, isoprenoids, hemes, ACP, and ubiquinone. These will help correct for the false negatives (lethal phenotypes with nonlethal model predictions) that account for a large number of discrepancies in both minimal- and rich-medium phenotypes (data not shown for rich medium). A wild-type biomass composition does not always correlate with an essential biomass composition; for example, only a core and not a complete LPS is required for cell survival (37). Accordingly, the essentiality of these and other biomass components can be refined or relaxed based on the nonessentiality of the corresponding biosynthetic-pathway genes. These issues are being addressed in a forthcoming updated metabolic reconstruction of E. coli (A. Feist and B. O. Palsson, personal communication) and represent a significant advance.
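The effect of adding a cofactor to the biomass requirement can be illustrated with a toy example. The sketch below replaces flux balance analysis with a much simpler reachability check (a metabolite is considered producible if some chain of active reactions leads to it from the medium), and the three-reaction network, gene names, and metabolite names are all hypothetical. It is intended only to show why a cofactor biosynthesis gene appears nonessential when its product is absent from the biomass definition.

```python
def producible(reactions, medium, knocked_out=frozenset()):
    """Return every metabolite reachable from the medium with active reactions."""
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for gene, substrates, products in reactions:
            if gene in knocked_out:
                continue  # reaction is lost with its gene
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def predicted_essential(gene, reactions, medium, biomass):
    """A gene is predicted essential if deleting it blocks a biomass component."""
    return not biomass <= producible(reactions, medium, {gene})


# Hypothetical toy network: (gene, substrates, products).
reactions = [
    ("g_transport", {"nutrient_ext"}, {"nutrient"}),
    ("g_core", {"nutrient"}, {"precursor"}),
    ("g_cof", {"precursor"}, {"cofactor"}),
]
medium = {"nutrient_ext"}

biomass_without_cofactor = {"precursor"}
biomass_with_cofactor = {"precursor", "cofactor"}

# Without the cofactor in biomass, deleting g_cof looks harmless.
print(predicted_essential("g_cof", reactions, medium, biomass_without_cofactor))
# With the cofactor required, the same deletion is predicted lethal.
print(predicted_essential("g_cof", reactions, medium, biomass_with_cofactor))
```

In this toy setting, the gene whose only role is cofactor synthesis switches from a nonlethal to a lethal prediction once the cofactor is required for growth, which is the kind of false-negative correction described above.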
Model improvements are also suggested with regard to the first steps of glycerol metabolism (Fig. 4). As previously noted, analysis of the false positives suggests that glycerol import can occur by passive transport across the cell membrane in the absence of the glpF-encoded transporter. Additionally, the initial enzymatic steps required to convert glycerol to dihydroxyacetone phosphate appear to be exclusively mediated by GlpK and GlpD rather than by GldA and the DhaKLM-PtsHI complex. This pathway bias is likely due to transcriptional regulatory effects. Indeed, the elevated expression of glpK and glpD during growth on glycerol revealed by quantitative RT-PCR (Fig. 4) further supports the notion that the GlpK-GlpD branch is dominant under these conditions. Furthermore, a recent study showed that the DhaR transcriptional regulator specifically upregulates the genes encoding DhaKLM in the presence of dihydroxyacetone, but not glycerol (2). Under the conditions utilized in this study, quantitative RT-PCR of dhaM (Fig. 4) showed that the dhaKLM genes are only minimally expressed, leaving the alternative glycerol metabolic pathway dormant. Including the recently characterized DhaR regulatory interaction (2) in the integrated regulatory-metabolic model will readily correct this discrepancy.
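How a single Boolean regulatory rule could change an essentiality prediction is sketched below. This is a deliberately reduced illustration under several assumptions: the two routes from glycerol to dihydroxyacetone phosphate are collapsed into two branches, glycerol uptake and the PtsHI components are ignored, and the rule that dhaKLM is expressed only when dihydroxyacetone is present is an assumed Boolean reading of the DhaR interaction described in the text. The function names are hypothetical.

```python
def dha_klm_expressed(environment):
    """Assumed Boolean rule: dhaKLM is induced by dihydroxyacetone, not glycerol."""
    return "dihydroxyacetone" in environment


def can_make_dhap(environment, deleted, use_dhar_rule):
    """Check whether glycerol can be converted to dihydroxyacetone phosphate."""
    if "glycerol" not in environment:
        return False
    # Branch 1: GlpK followed by GlpD.
    glp_branch = "glpK" not in deleted and "glpD" not in deleted
    # Branch 2: GldA followed by DhaKLM, optionally gated by the DhaR rule.
    dha_active = dha_klm_expressed(environment) if use_dhar_rule else True
    gld_branch = ("gldA" not in deleted and "dhaKLM" not in deleted
                  and dha_active)
    return glp_branch or gld_branch


glycerol_medium = {"glycerol"}

# Without the regulatory rule, the alternative branch rescues a glpK deletion.
print(can_make_dhap(glycerol_medium, {"glpK"}, use_dhar_rule=False))
# With the rule, dhaKLM is off on glycerol, so the glpK deletion is lethal.
print(can_make_dhap(glycerol_medium, {"glpK"}, use_dhar_rule=True))
```

In this reduced picture, gating the DhaKLM branch on dihydroxyacetone is enough to turn glpK (and likewise glpD) from predicted nonessential to predicted essential on glycerol, in line with the experimental observation.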
In summary, this high-throughput phenotyping screen provides a significantly enhanced view of the conditionally essential gene set required for growth under minimally supplemented growth conditions and additionally represents the most comprehensive assessment of the constraint-based metabolic model of E. coli conducted to date. Moreover, this study further highlights the utility of using genome scale models as a context for content in interpreting and analyzing complex high-throughput data sets. This powerful synergistic approach of not only using models as data analysis tools, but also using high-throughput data as feedback for model improvement, is becoming a paradigm that will continue to drive systems biology research forward.
We thank Adam Feist for his critical reading of the manuscript; Trina Patel, Vasiliy Portnoy, and Eric Knight for technical assistance; and Christian Barrett and other members of the Palsson laboratory for insightful discussions and suggestions.
We gratefully acknowledge the support of the NIH Protein Structure Initiative, grant numbers P50 GM62411 and U54 GM074898, and also grant no. NIH R01 GM5708.
Bernhard Palsson has a financial interest in Genomatica, Inc. Although the NIH R01 GM5708 grant has been identified for conflict of interest management based on the overall scope of the project and its potential to benefit Genomatica, Inc., the research findings included in this publication do not necessarily directly relate to the interests of Genomatica, Inc.
Published ahead of print on 29 September 2006.