Network reconstructions are a common denominator in systems biology. Bottom-up metabolic network reconstructions have been developed over the past 10 years. These reconstructions represent structured knowledge-bases that abstract pertinent information on the biochemical transformations taking place within specific target organisms. The conversion of a reconstruction into a mathematical format facilitates a myriad of computational biological studies, including evaluation of network content, hypothesis testing and generation, analysis of phenotypic characteristics, and metabolic engineering. To date, genome-scale metabolic reconstructions for more than 30 organisms have been published and this number is expected to increase rapidly. However, these reconstructions differ in quality and coverage, which may limit their predictive potential and use as knowledge-bases. Here, we present a comprehensive protocol describing each step necessary to build a high-quality genome-scale metabolic reconstruction, as well as common trials and tribulations. Therefore, this protocol provides a helpful manual for all stages of the reconstruction process.
Metabolic network reconstructions have become an indispensable tool for studying the systems biology of metabolism1–7. The number of organisms for which metabolic reconstructions have been created is increasing at a pace similar to whole genome sequencing. However, the quality of metabolic reconstructions differs considerably, which is partially caused by varying amounts of available data for the target organisms, but also partially by a missing standard operating procedure that describes the reconstruction process in detail. This protocol details a procedure by which a quality-controlled, quality-assured (QC/QA) reconstruction can be built to ensure high quality and comparability between reconstructions. In particular, the protocol points out data that are necessary for the reconstruction process and that should accompany reconstructions. Moreover, standard tests are presented, which are necessary to verify functionality and applicability of reconstruction-derived metabolic models. Finally, this protocol presents strategies to debug nonfunctional or malfunctioning models. While the reconstruction process has been reviewed conceptually by numerous groups8–11 and a good general overview of the necessary data and steps is available, no detailed description of the reconstruction, debugging, and iterative validation process has been published. This protocol seeks to make this process explicit and generally available.
The presented protocol describes the procedure necessary to reconstruct metabolic networks intended to be used for computational modeling, including the constraint-based reconstruction and analysis (COBRA) approach11, 12. These network reconstructions, and the in silico models derived from them, are created in a bottom-up fashion based on genomic and bibliomic data, and thus represent a biochemical, genetic, and genomic (BiGG) knowledge-base for the target organism9. These BiGG reconstructions can be converted into mathematical models and their systems and physiological properties can be determined. For example, they can be used to simulate maximal growth of a cell in a given environmental condition using flux balance analysis (FBA)13, 14. In contrast, the generation of networks derived from top-down approaches (high-throughput data-based inference of component interactions) is not discussed here, as they do not generally result in functional, mathematical models.
The metabolic reconstruction process described herein is usually very labor- and time-intensive, spanning from six months for well-studied, medium-genome-sized bacteria, to two years (and six people) for the metabolic reconstruction of human metabolism15. Often, the reconstruction process is iterative, as demonstrated by the metabolic network of Escherichia coli, whose reconstruction has been expanded and refined over the last 19 years7. As the number of reconstructed organisms increases, the need to find automated, or at least semi-automated, ways to reconstruct metabolic networks straight from the genome annotation is growing. Despite growing experience and knowledge, to date, we are still not able to reconstruct high-quality metabolic networks that can be used as predictive models in a fully automatic fashion. Recent reviews highlight current problems with genome annotations and databases, which make automated reconstructions challenging and thus require manual evaluation8, 9. Organism-specific features such as substrate and cofactor utilization of enzymes, intracellular pH, and reaction directionality remain problematic and thus also require manual evaluation. However, some organism-specific databases and approaches exist, which can be used for automation. We describe here the manual reconstruction process in detail.
A limited number of software tools and packages are available (freely and commercially), which aim to assist and facilitate the reconstruction process (Table 1). The protocol presented can, in principle, be combined with those reconstruction tools. For generality, we present the entire procedure using a spreadsheet, namely an Excel workbook (Microsoft Corp.), and a numeric computation and visualization software package, namely Matlab (The MathWorks, Inc.). Free spreadsheets (e.g., OpenOffice and Google Docs) could be used instead of the listed spreadsheet. Alternatively, MySQL databases may be used, as they are very helpful to structure and track data. Matlab was also used to encode the COBRA Toolbox, which is a suite of COBRA functions commonly used for simulation16. This Toolbox was extended to facilitate the reconstruction, debugging, and manual curation process described herein.
The protocol describes in detail the process to generate metabolic reconstructions applicable for representatives of all domains of life. The process of reconstructing prokaryotic and eukaryotic metabolic networks is, in principle, identical, although eukaryotic reconstructions are more challenging due to the size of the genomes, the coverage of knowledge, and the multitude of cellular compartments. Specific properties and pitfalls are highlighted.
The described reconstruction and debugging process requires organism-specific information. The minimum information includes the genome sequence, from which key metabolic functions can be obtained, and physiological data, such as growth conditions, which allow the comparison of model predictions with experimental observations to refine the network’s content. In general, the more information about physiology, biochemistry, and genetics is available for the target organism, the better the predictive capacity of the models. This property becomes obvious considering that the network evaluation and validation process relies on comparing predicted phenotypes (e.g., growth rate) with experimental observations. Additional cellular objectives (other than maximal growth rate) may be compared with experimental data but they are not detailed in this protocol15, 17–20.
Although this protocol presents the reconstruction process in terms of metabolic networks, the same approach can, and has been, applied for reconstructing signaling21, 22 and transcription/translation networks23. Regulatory networks have not been constructed in a fully stoichiometric manner yet, although a pseudo-stoichiometric approach has been proposed24, 25. The reconstruction process for these networks is not as well established as for metabolic networks, and is thus still subject to active research.
Lastly, myriad data sources are used during the reconstruction process rendering metabolic network reconstructions as knowledge-bases, which summarize and structure the available BiGG knowledge about the target organism. Frequently used organism-unspecific, and some of the organism-specific, resources are listed in Table 1. Note that the quality and wealth of organism-specific information will directly affect the quality and coverage of the metabolic reconstruction. Great resources are organism-specific books that have been published for a growing number of organisms26–29. In cases where organism-specific information is scarce, data from phylogenetic neighbors may be of great help. It is important to ensure that, in cases where the reconstruction relies extensively on information from related organisms, the overall behavior of the model matches the target organism. This assurance can be achieved by carefully comparing the predictions with experimental and physiological data, such as growth conditions, secretion products, and knock-out phenotypes.
The resulting knowledge-bases can be queried, used for mapping experimental data (e.g., gene expression, proteomic, fluxomic, and metabolomic data), and converted into a mathematical format to investigate metabolic capabilities and generate new biological hypotheses. The multitude of possible applications of BiGG knowledge-bases distinguishes them from other, automated efforts. By introducing standards in content and format with this protocol it will soon be possible to compare metabolic reconstructions between different organisms, which will further enhance our understanding of the evolutionary processes and may provide a complementary approach to comparative genomics.
The metabolic network reconstruction process described herein consists of four major stages followed by its prospective use in stage 5 (Figure 1). The order of steps in the different stages is a recommendation and may be altered within each stage, and with some limitations between stages, as long as they are completed. The quality of the reconstruction is generally ensured by performing all steps.
Note that the creation of a draft reconstruction and the manual reconstruction refinement (next stage) may be combined for bacterial reconstructions with main emphasis on reconstruction refinement.
The first stage consists of the generation of a draft reconstruction based on the genome annotation of the target organism and biochemical databases. This draft reconstruction, or automated reconstruction, is thus a collection of genome-encoded metabolic functions, some of which may be falsely included while others may be missing (e.g., due to missing, wrong, or incomplete annotations). Software tools such as Pathway tools30 or metaSHARK31 can be used for the generation of the draft reconstruction but they do not replace the manual curation.
Genomic information is important to unambiguously define the gene properties with respect to the organism’s genome as well as to allow data mapping (e.g., gene expression) in subsequent studies. Since the draft reconstruction, and to some extent the curated reconstruction, relies mainly on the genome annotation, it is important to download the most recent version available to ensure that updates and corrections since the genome’s original publication are accounted for. Thus, the quality and reliability of the genome annotation is crucial to the reconstruction quality. Note that the manual reconstruction refinement tries to identify those low confidence gene annotations by retrieving further, experimental evidence for the presence of the gene product and its metabolic function. The reconstruction assembly and refinement may also require re-annotation of genes but the procedure is not further discussed here. Please refer to available work and reviews32–36. Furthermore, in some cases, the genome-sequencing groups have created organism-specific databases (e.g., for Helicobacter pylori37 and E. coli38), which are very valuable during the reconstruction process. Table 1 lists some of the commonly used databases for annotations.
To obtain the draft reconstruction, one can automatically retrieve metabolic genes from the genome annotation by using, for example, key words or gene ontology (GO) categories39 (see Supplemental methods 1, Figure S1). Metabolic reactions catalyzed by the identified gene products can be connected with the draft reconstruction by using the enzyme commission (E.C.) numbers40 and biochemical reaction databases, e.g., KEGG41 and Brenda42. Note that this first stage aims to obtain a list of candidates that will not necessarily be complete or comprehensive. Many false-positives may be present in the list. For example, proteins involved in DNA methylation or rRNA modification also have E.C. numbers, but their functions are normally not considered in metabolic reconstructions. Another example involves kinases that may be involved in signal transfer reactions or annotated as ‘histidine kinase-like’, such that no specific function can be derived from the annotation. A more targeted query for metabolic annotations could be designed to reduce the number of false-positives, but it does not replace manual curation.
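The keyword/E.C.-number query described above can be sketched as a short script. Python is used here purely for illustration (the protocol itself works with spreadsheets and Matlab); the locus tags, descriptions, and exclusion keywords are hypothetical examples, not a fixed standard:

```python
import re

# Sketch: pull candidate metabolic genes from a genome annotation and flag
# likely false-positives. All identifiers below are illustrative assumptions.
EC_PATTERN = re.compile(r"\b\d+\.\d+\.\d+\.(?:\d+|-)")
# Annotations that carry an E.C. number but are usually excluded from
# metabolic reconstructions (DNA/RNA modification, vague kinase hits).
EXCLUDE_KEYWORDS = ("dna methyl", "rrna", "histidine kinase-like")

def draft_candidates(annotation):
    """Split {locus_tag: description} into (candidates, flagged_for_review)."""
    candidates, flagged = [], []
    for locus, desc in annotation.items():
        if not EC_PATTERN.search(desc):
            continue  # no E.C. number: not picked up by this simple query
        if any(k in desc.lower() for k in EXCLUDE_KEYWORDS):
            flagged.append(locus)  # likely false-positive; keep for manual review
        else:
            candidates.append(locus)
    return candidates, flagged

annotation = {
    "b2097": "fructose-bisphosphate aldolase, EC 4.1.2.13",
    "b0001": "DNA methyltransferase, EC 2.1.1.72",
    "b0002": "histidine kinase-like protein, EC 2.7.13.3",
}
cands, flags = draft_candidates(annotation)
# cands == ["b2097"]; flags == ["b0001", "b0002"]
```

Even the flagged entries are retained rather than discarded, since stage 2 curation, not the query, makes the final inclusion decision.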
In this stage, the entire draft reconstruction will be re-evaluated and refined. For each gene and reaction entry, two questions will be asked: 1) Should this entry be here? 2) Is there an entry missing to connect the entry with the remainder of the network?
The second stage of the reconstruction process concentrates on curation and refinement of the network content. We highlight in this protocol parts that need special attention. In particular, the metabolic functions and reactions collected in the draft reconstruction are individually evaluated against organism-specific literature (and expert opinion). This manual evaluation is important since 1) not all annotations have a high confidence score (e.g., a low e-value), and 2) biochemical databases are mostly organism-unspecific, listing enzyme activities found in various organisms, not all of which may be present in the target organism (Figure 2). Including organism-unspecific reactions can affect the predictive behavior of the resulting models. Furthermore, information about biomass composition, maintenance parameters, and growth conditions is collected in this stage, which will provide a basis for the simulations in stages 3 and 4.
It is generally recommended to refine and assemble the curated reconstruction in a pathway-by-pathway manner, starting with the canonical pathways. Peripheral pathways and reactions/gene products without clear pathway assignment are added in a later step. This approach has the advantage that reactions are evaluated within their metabolic context and missing gene annotations can be readily identified, facilitating gap analysis and debugging in stage 4. However, this approach will also result in identification of, and/or additional information for, reactions that are not in the pathway currently under investigation. One can choose to only include the main reaction(s) associated with the pathway that is currently considered. The remaining reactions may be noted somewhere so that they can be readily retrieved if necessary.
The draft reconstruction identified a set of metabolic genes and functions that are thought to be present in the target organism. Due to potential errors in, or incompleteness of, the genome annotation, the presence of the annotated gene and its function should be supported using experimental data or literature.
If no organism-specific information can be found in the literature, information for phylogenetically close organisms can be used and should be marked as such. If enzyme-associated reactions are included purely based on gene annotation, they should receive the lowest confidence score (Table 2). In the case of problems during subsequent simulations, these low confidence reactions can be easily identified.
In some cases, it is appropriate to exclude certain reactions from the reconstruction. Reactions containing generic terms, such as protein, DNA, electron acceptor, etc., should not be included, as they are not specific enough and normally serve in databases as placeholders until more knowledge and biochemical evidence becomes available.
Substrate and cofactor specificity of enzymes may differ between organisms. Organism-unspecific databases, such as KEGG41 and Brenda42, list all possible transformations of an enzyme that have been identified in any organism. Additionally, Brenda lists organism-specific information along with relevant references and kinetic parameters. As a rule of thumb, one can assume that enzymes, which have only one reaction associated in, for example, KEGG41, do not require organism-specific refinement. However, enzymes that are associated with multiple reactions, with varying substrates and/or cofactors, require manual refinement. Information about substrate and cofactor utilization can be obtained from organism-specific biochemical studies and may also be listed in organism-specific databases (e.g., Ecocyc43). This part of the curation process can be very time consuming and laborious as it may be difficult to find the necessary information. Often, this requires intensive literature search. It is important to pay great attention as false inclusion of substrates or cofactors can greatly change the in silico behavior (i.e., predictive potential) of the reconstruction.
In databases, metabolites are generally listed with their uncharged formula. In contrast, in medium and in cells, many metabolites are protonated or deprotonated. The protonation state, and thus the charged formula, depends on the pH of interest. Often, metabolic networks are reconstructed assuming an intracellular pH of 7.2. However, the intracellular pH of bacterial cells may vary depending on environmental conditions and the species. Also, the pH of organelles may differ, e.g., in the peroxisome and lysosome. The protonated formula is calculated based on the pKa value of the functional groups (Figure 3). Software packages, such as Pipeline Pilot and pKa DB, can predict pKa values for a given compound (Table 1). Figure 2 shows some examples of charged molecules and their pKa values.
Once the charged formula is obtained for each metabolite, the reaction stoichiometry can be determined by counting the different elements on the left- and right-hand side of the reaction. Protons and water may need to be added to the reaction in this step as some databases and many biochemical textbooks omit these molecules. Therefore, every element and the charge need to balance on both sides of the reaction. This step is easy for many central metabolic reactions but may become challenging for more complex reactions. Note that unbalanced reactions may lead to synthesis of protons or energy (ATP) out of nothing (see also Figure 4 for examples).
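As a worked example of this balancing step, the following sketch checks elemental and charge balance for the ATP hydrolysis reaction. The charged formulas assume an intracellular pH of about 7.2 (e.g., ATP carries a −4 charge); Python and the metabolite abbreviations are used here purely for illustration:

```python
from collections import Counter

def balance(reaction, formulas, charges):
    """reaction: {metabolite: coefficient}, negative for substrates.
    Returns (element_imbalance, charge_imbalance); {} and 0 if balanced."""
    elements = Counter()
    charge = 0
    for met, coeff in reaction.items():
        for elem, n in formulas[met].items():
            elements[elem] += coeff * n
        charge += coeff * charges[met]
    return {e: n for e, n in elements.items() if n != 0}, charge

# Charged formulas at pH ~7.2 (illustrative values).
formulas = {
    "atp": {"C": 10, "H": 12, "N": 5, "O": 13, "P": 3},
    "h2o": {"H": 2, "O": 1},
    "adp": {"C": 10, "H": 12, "N": 5, "O": 10, "P": 2},
    "pi":  {"H": 1, "O": 4, "P": 1},
    "h":   {"H": 1},
}
charges = {"atp": -4, "h2o": 0, "adp": -3, "pi": -2, "h": 1}

# ATP hydrolysis: atp + h2o -> adp + pi + h
rxn = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
elem_gap, charge_gap = balance(rxn, formulas, charges)
# balanced: elem_gap == {} and charge_gap == 0
```

Note that omitting the proton on the right-hand side, as many textbooks do, would leave both an H and a charge imbalance, exactly the kind of error that can lead to spurious proton or ATP generation.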
Biochemical data for the target organism are very important for the determination of reaction directionality but may not be available. New approaches are available, such as the estimation of the standard Gibbs free energy of formation (ΔfGo) and of reaction (ΔrGo) in a biochemical system44, 45. The standard Gibbs free energy of formation (ΔfGo) and of reaction (ΔrGo) can be obtained for most KEGG41 reactions from Web GCM44. Another approach combines thermodynamic information with network topology and heuristic rules to assign reaction directionality46. Biochemical textbooks may also report reaction directionalities. Additionally, one can use the following rules of thumb: 1) all reactions involving transfer of phosphate from ATP to an acceptor molecule should be irreversible (with the exception of the ATP synthase, which is known to operate in the reverse direction); 2) reactions involving quinones are generally irreversible.
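The rules of thumb above can be expressed as a simple heuristic, to be overridden whenever thermodynamic or biochemical evidence is available. This is a sketch only; the reaction identifiers and metabolite abbreviations are illustrative assumptions:

```python
def assign_directionality(rxn_id, substrates, products):
    """Return 'irreversible' or 'reversible' from the rules of thumb.
    Thermodynamic data (e.g., estimated ΔrG°) should take precedence."""
    mets = substrates | products
    if rxn_id == "ATPS4r":           # ATP synthase: known exception
        return "reversible"
    if "atp" in substrates and "adp" in products:
        return "irreversible"        # phosphate transfer from ATP
    if any(m.startswith("q8") for m in mets):
        return "irreversible"        # quinone-involving reaction
    return "reversible"              # default when no information exists

# Hexokinase: glc + atp -> g6p + adp + h  => irreversible
d = assign_directionality("HEX1", {"glc", "atp"}, {"g6p", "adp", "h"})
```

The final fallback reflects the recommendation below: leave a reaction reversible when no information is available.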
Note that assigning the wrong direction to a reaction may have significant impact on the model’s performance. In general, one should leave a reaction reversible if no information is available and the aforementioned rules of thumb do not apply. However, models with too many reversible reactions (too loose constraints) may contain so-called futile cycles, which circumvent the proton gradient by freely exchanging metabolites and protons across compartments. Therefore, assigning the correct reversibility to transport reactions is especially important (see below).
This information may also be difficult to obtain. The compartments that have been considered in various metabolic reconstructions are listed in Supplemental methods 1, Table S1. Algorithms such as PSORT47 and PASUB48 can be used to predict the cellular localization of proteins based on nucleotide or amino acid sequences. A recently published protocol describes the use of internet-accessible tools to predict the subcellular location of eukaryotic and prokaryotic proteins49. High-throughput experimental approaches are available to locate individual proteins, including immunofluorescence50 and GFP tagging of individual proteins51. In the absence of appropriate data, proteins should be assumed to reside in the cytosol. Incorrect assignment of the location of a reaction can lead to additional gaps in the metabolic network and misrepresentation of the network properties, especially, if intracellular transport reactions need to be added for which no evidence is available either.
The genome annotation often provides information about the GPR association, i.e., it indicates which gene has what function (Figure 5). The verification and refinement necessary in this step includes determining: i) if the functional protein is a heteromeric enzyme complex; ii) if the enzyme (complex) can carry out more than one reaction and iii) if more than one protein can carry out the same functions (i.e., isozymes exist). For the first case (i), the genome annotation often has refined information, e.g.: ‘protein X, catalytic subunit’ - which indicates that there is at least one more subunit needed for the function of the protein complex. Furthermore, KEGG41 lists subunits in some cases. Often, a more comprehensive database and/or literature search is required. Also, the protein complex composition may differ between organisms. The second case can also be identified from biochemical databases and/or literature. Multitasking of enzymes may also differ between organisms. Note that mistakes or mis-assignments in the GPR associations will change results of in silico gene deletion studies. However, discrepancies between in silico and in vivo results can be used to refine knowledge and reconstructions (see Step 79 and 80).
Linear pathways, such as fatty acid oxidation, have often been combined into a few lumped reactions. The genes associated with these reactions are all required, with the exception of isozymes. Subsequently, the GPR association should reflect the requirement for all genes within the lumped reaction by using the Boolean operator AND.
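A minimal sketch of how Boolean GPR rules (OR for isozymes, AND for complex subunits and lumped reactions) translate into in silico gene deletion outcomes; Python and the gene identifiers are used purely for illustration:

```python
import re

def reaction_active(gpr, deleted):
    """True if the reaction retains a functional enzyme after deleting
    the genes in `deleted` (a set of gene identifiers)."""
    if not gpr:
        return True  # spontaneous/orphan reaction: no gene dependency
    # Replace each gene token with True/False, keeping and/or/parentheses.
    expr = re.sub(r"\b(?!and\b|or\b)(\w+)\b",
                  lambda m: str(m.group(1) not in deleted), gpr)
    return eval(expr)  # expr now contains only True/False, and, or, ()

# Isozymes (OR): either gene product suffices.
active = reaction_active("b0001 or b0002", {"b0001"})   # True
# Enzyme complex or lumped reaction (AND): every gene is required.
lost = reaction_active("b0003 and b0004", {"b0003"})    # False
```

The `eval` is safe here only because the substituted expression contains nothing but Boolean literals and operators; a production implementation (e.g., in the COBRA Toolbox) would parse the rule properly.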
Metabolite identifiers are necessary to enable the use of reconstructions for high-throughput data mapping (e.g., metabolomic or fluxomic data) and for comparison of network content with other metabolic reconstructions. Therefore, metabolites and reactions need to be recognizable by other scientists and by software tools. Each metabolite should be associated with at least one of the following identifiers: ChEBI52, KEGG41, and PubChem53. In many cases, having one of the identifiers is sufficient to automatically obtain the other two identifiers. Furthermore, database-independent representations of metabolites such as SMILES54 and InChI strings55, 56 are also helpful when associated with each metabolite. These strings encode the exact chemical structure of compounds. Additionally, collecting Molfiles (MDL file format, http://www.symyx.com/), which hold information about the atoms, bonds, connectivity and coordinates of a molecule, will be very useful, e.g., if you are using online software for pKa determination (see Step 10 for details).
The confidence score provides a fast way of assessing the amount of information available for a metabolic function, pathway, or the entire reconstruction15, 57. Every network reaction is associated with a confidence score reflecting the information and evidence currently available. The confidence score ranges from 0 to 4, where 0 is the lowest and 4 is the highest evidence score (Table 2). Note that multiple information types result in a cumulative confidence score. For example, a confidence score of 4 may represent physiological and sequence evidence.
An excerpt of typical spontaneous reactions included in metabolic reconstructions is listed in Supplemental methods 1, Table S2. Note that only those spontaneous reactions should be added that have at least one metabolite connecting them to the rest of the reconstruction. This is to avoid too many dead-end metabolites caused by spontaneous reactions. In more recent reconstructions, spontaneous reactions have been associated with an artificial gene (s0001) and protein (S0001). By doing so, reaction and gene essentiality studies are easier to analyze. Furthermore, this artificial GPR association makes it easy to distinguish between spontaneous and orphan reactions, i.e., reactions without known gene.
When multi-compartment networks are constructed, intracellular transport reactions need to be added for all metabolites that are supposed to “move” between compartments. Intracellular transport systems are not very well studied and many of them are not annotated in the genome. Finding experimental data is often not easy. A general approach should be to minimize the number of intracellular transport reactions to those for which there is evidence or clear necessity. If too many transport reactions are added to a reconstruction, they can cause cycles (futile cycles or Type III pathways). This is a common problem in reconstructions with multiple compartments. For the directionality of intracellular transport reactions, one should consider the nature of the pathway in the compartment. For instance, if the pathway is biosynthetic, it is very likely that i) the precursor(s) is only imported, ii) the product(s) of the pathway is only exported from the compartment, and iii) intermediates are not transported at all. Another issue is the transport mechanism. Many transport reactions occur in symport or antiport with protons, cations, or other metabolites. However, not much information is available for intracellular transporters, yet the mechanism used in the model may affect the predictive potential. To minimize error and increase consistency, one can adopt the intracellular transport mechanism from a corresponding transport reaction from extracellular/periplasmic space to cytoplasm when it is known (and is not an ABC transport reaction); otherwise, (facilitated) diffusion may be assumed as the mechanism. In any case, these reactions should receive a low confidence score (1, for modeling purposes) to enable easy identification (Table 2), as well as a note and references describing where the mechanism was taken from.
The refinement stage of the reconstruction process is also an ideal point to identify missing functions in the draft reconstruction. Using KEGG41 maps, for example, one can analyze the metabolic “environment” of the reaction(s) under inspection. If the genome annotation of the target organism is present in KEGG41, one can highlight the genes on the map. This gives an estimate of the “connectivity” of the reaction with its metabolic surrounding (Supplemental methods 1, Figure S2). Missing reactions/functions may thus become evident, and experimental/annotation evidence for them should then be collected (see also gap analysis). Creating organism-specific maps, using specific drawing software, is of great use for identifying missing functions as well as for network evaluation and debugging.
The biomass reaction accounts for all known biomass constituents and their fractional contributions to the overall cellular biomass (Table 3). The detailed biomass composition of the target organism needs to be experimentally determined for cells growing in log phase58–60. However, it may not be possible to obtain a detailed biomass composition for the target organism. In this case, one can estimate the relative fraction of the precursors from the genome (e.g., by using the Comprehensive Microbial Resource (CMR) database, Table 1). Note that we suggest estimating the RNA composition using organism-specific genome data rather than taking it from E. coli. One reason is that the number of rRNA operons, which contain rRNA and tRNA molecules, can differ significantly between organisms. For instance, E. coli has 7 rRNA operons per genome61, while Mycoplasma capricolum has two62 and Halobacterium cutirubrum has only one rRNA operon63.
In comparison to other biomass precursors, it is slightly more difficult to determine the lipid composition of the cell. The contribution of fatty acids and phospholipids needs to be determined from experiments and/or published data. Note that compounds, such as phospholipids, can consist of many different fatty acids (different chain lengths, saturated and unsaturated). Available data often report the average composition of these compounds, listing the fraction of the fatty acids with different chain length and saturation status. Thus, the model compounds will not represent all possible combinations but only average compounds, consistent with the experimental data.
The composition of the biomass reaction plays an important role for in silico gene deletion experiments. If a biomass precursor is not accounted for in the biomass reaction, its synthesis reactions may not be required for growth (i.e., they are non-essential). Therefore, the associated genes may not be essential either. Consequently, the presence or absence of metabolites in the biomass reaction may affect the in silico essentiality of reactions and their associated gene(s). In contrast, the fractional contribution of each precursor plays a minor role for gene and reaction essentiality studies. When one wishes to predict the optimal growth rate accurately, however, the fractional distribution of each compound plays an important role. The unit of the biomass reaction is 1/h since all biomass precursor fractions are converted to mmol/gDW. Therefore, the biomass reaction sums the mole fractions of each precursor necessary to produce 1 g dry weight of cells.
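The conversion of measured weight fractions into the mmol/gDW coefficients of the biomass reaction is simple arithmetic; in this sketch the weight fraction and molecular weight are illustrative values, not measurements:

```python
def biomass_coefficient(weight_fraction, mol_weight_g_per_mmol):
    """mmol of precursor needed per gram dry weight of biomass:
    (g precursor / gDW) divided by (g / mmol) gives mmol/gDW."""
    return weight_fraction / mol_weight_g_per_mmol

# Example: glycine assumed at 5% of dry weight; MW 75.07 g/mol = 0.07507 g/mmol
coeff = biomass_coefficient(0.05, 75.07 / 1000.0)
# coeff ≈ 0.666 mmol glycine per gDW
```

Summing coefficient × molecular weight over all precursors should return approximately 1 g/gDW, which is a useful consistency check on the assembled biomass reaction.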
The GAM accounts for the energy (in the form of ATP) necessary to replicate a cell, including macromolecular synthesis (e.g., of proteins, DNA, and RNA). The GAM is best determined in chemostat growth experiments (see also Figure 6). Alternatively, if experimental data are not available, the GAM can be estimated by determining the energy required for macromolecular synthesis. For this purpose, the total amount of each macromolecule (protein, DNA, and RNA) is determined from databases or other resources. Neidhardt et al.64 list the number of phosphate bonds necessary to synthesize each macromolecule, which is then multiplied by the total amount of the respective macromolecule to obtain the total number of required phosphate bonds. These phosphate bonds are accounted for by adding ATP hydrolysis to the biomass reaction (x ATP + x H2O → x ADP + x Pi + x H+, where x is the number of required phosphate bonds). Note that this estimate will be too low, as other growth-associated cellular processes also require ATP.
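The GAM estimate described above amounts to a weighted sum over the macromolecular composition. In this sketch both the composition and the ATP costs per gram are assumed, illustrative numbers, not the values tabulated by Neidhardt et al.64:

```python
# Assumed ATP cost (~phosphate bonds, mmol ATP per g polymer) — illustrative.
ATP_PER_GRAM = {
    "protein": 23.2,
    "dna": 14.3,
    "rna": 4.3,
}
# Assumed macromolecular composition (g per gDW) — illustrative.
composition = {
    "protein": 0.55,
    "dna": 0.03,
    "rna": 0.20,
}

def estimate_gam(composition, costs):
    """Lower-bound GAM in mmol ATP/gDW; other cellular processes also
    consume ATP, so the true GAM is larger."""
    return sum(composition[m] * costs[m] for m in composition)

gam = estimate_gam(composition, ATP_PER_GRAM)
# gam is then added to the biomass reaction as:
#   gam ATP + gam H2O -> gam ADP + gam Pi + gam H+
```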
More recent reconstructions include an ATP hydrolysis reaction (1 ATP + 1 H2O → 1 ADP + 1 Pi + 1 H+), which represents non-growth associated ATP requirements of the cell to maintain, for example, turgor pressure65. The value for the reaction rate can be estimated from growth experiments. For example, based on such measurements, the reaction flux rate was constrained to 8.39 mmol/gDW/h in the E. coli metabolic model65 (Figure 6).
Demand reactions are unbalanced network reactions that allow the accumulation of a compound, which otherwise is not allowed in steady-state models due to the mass-balancing requirement (i.e., in steady state the sum of influx equals the sum of efflux for each metabolite) (Figure 7). Most of the demand reactions will be added in the gap filling process (Steps 46 to 48). At this stage, demand functions should only be added for compounds that are known to be produced by the organism, e.g., certain cofactors, lipopolysaccharide, and antigens, but for which i) no information is available about their fractional distribution to the biomass or ii) which may only be produced in some environmental conditions. By including a demand reaction for a particular metabolite one can turn otherwise blocked reactions (cannot carry flux) into active reactions (can carry flux). In general, most reconstructions contain only a few demand reactions. However, during the debugging and network evaluation process (Stage 4), demand reactions may be temporarily added to the model to test or verify certain metabolic functions. They will be removed from the model before versioning.
Sink reactions are similar to demand reactions but are defined to be reversible and thus provide the network with metabolites (see Figure 7 for examples). These sink reactions are of great use for compounds that are produced by non-metabolic cellular processes but need to be metabolized. Adding too many sink reactions may enable the model to grow without any resources in the medium. Therefore, sink reactions have to be added with care. As for demand reactions, sink reactions are mostly used during the debugging process. They help to identify the origin of a problem (e.g., why a metabolite cannot be produced). These sink reactions are functionally replaced by filling the identified gap.
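The distinction between the two reaction types can be made concrete with stoichiometric dictionaries (metabolite → coefficient). The reaction and metabolite names below are illustrative, and the dict-based model is a sketch, not COBRA Toolbox API:

```python
# Sketch: demand vs. sink reactions as stoichiometric dictionaries
# (metabolite -> coefficient). Names and metabolites are illustrative.
reactions = {
    # demand reaction: irreversible drain, lets "lps[c]" accumulate
    "DM_lps(c)": {"stoich": {"lps[c]": -1.0}, "lb": 0.0, "ub": 1000.0},
    # sink reaction: reversible, so it can also *supply* "apoprotein[c]"
    "SK_apoprotein(c)": {"stoich": {"apoprotein[c]": -1.0},
                         "lb": -1000.0, "ub": 1000.0},
}

# a demand reaction can only remove its metabolite; a sink may also supply it
assert reactions["DM_lps(c)"]["lb"] == 0.0
assert reactions["SK_apoprotein(c)"]["lb"] < 0.0
```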
Information about growth-enabling media is of great help in the following two stages. Thus, if possible, it should be collected prior to the conversion and debugging stages. The following information should be collected: 1) Which metabolites are present? 2) Are there any auxotrophies? 3) Define the composition of a base medium, e.g., water, protons, ions, etc. 4) Obtain information about rich medium composition. These data will be crucial for simulations and network evaluations. If uptake or secretion rates are available, they should also be documented and collected. While this step is easy for the experimentalist, researchers who cannot grow the target organism have to identify growth requirements from the literature (or the genome annotation). In some cases, research studies describe minimal, defined, or rich medium compositions. In other cases, the culturing conditions reported in an experimental study must suffice.
In the third stage, the reconstruction is converted into a mathematical format and condition-specific models are defined. This stage can be mostly automated. Moreover, system boundaries are defined, converting the general reconstruction into a condition-specific model. Note that the initial model may differ in scope and boundaries from the final model, which is obtained after multiple iterations of validation and refinement, and which is used to simulate phenotypic behavior in a prospective manner. Figure 7 illustrates the conversion of a reconstruction into a mathematical format.
Using the functions in the COBRA Toolbox, it is very easy to change reaction constraints, but it is sometimes difficult to keep track of all the changes. In fact, one of the most common sources of error in simulation is that reaction constraints are not set correctly (Table 4). Therefore, it is important to have an expectation of the results before running a simulation to avoid erroneous conclusions. We recommend checking the constraints by copying the model reaction abbreviations as well as the lower and upper bounds into a spreadsheet. For most models, this is the easiest way to spot problems with the constraints. Similarly, copying calculated solution(s) into a spreadsheet is very helpful.
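The bookkeeping step can be scripted. The sketch below writes reaction abbreviations and bounds from a toy dict-based model (illustrative field names, not COBRA Toolbox API) to a CSV file that any spreadsheet program can open:

```python
import csv

# Sketch: exporting reaction bounds for manual inspection in a spreadsheet.
# The model dict and its contents are illustrative placeholders.
model = {
    "rxns": ["EX_glc(e)", "EX_o2(e)", "ATPM"],
    "lb":   [-10.0, -18.5, 8.39],
    "ub":   [1000.0, 1000.0, 8.39],
}

with open("constraints.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["reaction", "lower_bound", "upper_bound"])
    for rxn, lb, ub in zip(model["rxns"], model["lb"], model["ub"]):
        writer.writerow([rxn, lb, ub])
```

The same pattern works for dumping a calculated flux solution next to the bounds, which makes mis-set constraints easy to spot.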
The fourth stage in the reconstruction process consists of network verification, evaluation, and validation. Common error modes in metabolic reconstructions are listed in Table 4. The metabolic model created in the third stage is tested, among other things, for its ability to synthesize biomass precursors (such as amino acids, nucleotide triphosphates, and lipids). This evaluation generally leads to the identification of missing metabolic functions in the reconstruction, so-called network gaps, which are filled by partially repeating stages 2 and 3. This illustrates that the reconstruction process is an iterative procedure. An important issue is deciding when to stop the iterative process and call a reconstruction “finished”. This decision is normally based on the defined scope and purpose of the reconstruction.
At this point, the first iteration of the manually curated reconstruction is finished. It is expected that the network contains a significant number of gaps, i.e., missing reactions and functions. We recommend performing a first gap analysis at this stage of the reconstruction process, as it will ease the subsequent computation and reduce the number of “bugs” in the model. Comparing the dead-end metabolites identified in this step with the list generated in Stage 2 will accelerate the debugging process.
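A gap analysis starts from the dead-end metabolites. A minimal detection sketch is shown below for a toy network, with reactions given as {metabolite: coefficient} dictionaries; reversibility is handled crudely (a reversible reaction is treated as both producing and consuming all of its metabolites):

```python
# Sketch: finding dead-end metabolites, i.e., metabolites that are only
# produced or only consumed in the network. Toy network, illustrative names.
def dead_end_metabolites(stoich, reversible):
    produced, consumed = set(), set()
    for rxn, mets in stoich.items():
        for met, coeff in mets.items():
            if reversible.get(rxn, False):
                produced.add(met)
                consumed.add(met)
            elif coeff > 0:
                produced.add(met)
            else:
                consumed.add(met)
    all_mets = produced | consumed
    return {m for m in all_mets if m not in produced or m not in consumed}

stoich = {
    "R1": {"A": -1.0, "B": 1.0},   # A -> B
    "R2": {"B": -1.0, "C": 1.0},   # B -> C
}
print(dead_end_metabolites(stoich, {}))  # {'A', 'C'}: never produced/consumed
```

In a real reconstruction, A would typically be connected by an uptake pathway and C by further metabolism or a demand reaction.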
This step will require an intensive literature search and may include re-annotation of the genome to find candidate genes and reactions to fill the gap (see Table 1 and Supplemental Methods 1, Table S3, for some example tools). KEGG41 maps, biochemical textbooks, or other available biochemical maps can be used to identify the metabolic ‘environment’ of the dead-end metabolite. If the genome annotation of the target organism is present in KEGG41, one can highlight the dead-end metabolite on the map (Supplemental Methods, Figure S2). This context analysis may give an indication of which enzyme(s) may be able to produce or consume the dead-end metabolite and thus provide a good starting point for a literature and/or genome search.
Gap filling is a tricky business. In some cases, a gap should be filled to ensure that the model is functional, i.e., that biomass precursor synthesis or a certain physiological function can be simulated. In other cases, filling a gap may enable the model to perform a function that the organism is not able to carry out (see Figure 8 for some examples). In general, if no information supports the existence of a particular gap reaction, the gap should only be filled if it is required for the model’s functionality. In such cases, the confidence score should be set to 1, which corresponds to “modeling purpose only” and allows these low-confidence reactions to be retrieved readily, if desired. Earlier, we highlighted that enzymes that are listed in biochemical databases as catalyzing multiple reactions should be included in the reconstruction with care, and that it should be noted whether evidence for all of the reactions could be found. Some of the identified dead-end metabolites will originate from such secondary reactions of these “multitasking” enzymes. Closing these gaps may affect the predictive potential of the reconstruction; therefore, only gaps that are required for network functionality (e.g., biomass precursor synthesis) or that have supporting data should be filled. Keep in mind that adding new reactions to the network may create new gaps. Therefore, when adding reactions, make sure that all of their metabolites are connected to the network.
Stoichiometrically balanced cycles (SBCs), or Type III extreme pathways66, are formed by internal network reactions and can carry flux despite closed exchange reactions (a closed system). Examples of simple and more complex Type III pathways in metabolic networks can be found in refs. 67, 68. These SBCs are artifacts of metabolic reconstructions caused by insufficient constraints (e.g., thermodynamic constraints and regulatory constraints). Recent efforts have concentrated on dealing with these SBCs67. Note that SBCs are not futile cycles. This protocol shows how to identify SBCs and highlights some possible approaches to eliminate them. However, no systematic, universally valid approach has yet been developed to eliminate SBCs. For practical purposes, in simulations one can use the ‘min norm’ option for the LP solver, which minimizes the sum of the squares of the fluxes and thus returns an optimal solution without net flux around SBCs.
The following steps will test whether the model can or cannot grow. This means that we will test for qualitative behavior, not for the correctness of the predicted growth rates.
The composition of the biomass reaction was determined in Stage 2. It is best to test the model’s ability to produce each individual biomass component in a standard medium condition (e.g., minimal medium M9 supplemented with D-glucose) (Figure 4). This sequential approach will facilitate the debugging process and make it easier to find causes of error. It is very likely that these tests will lead to the addition of further reactions by repeating steps listed in the second stage. Furthermore, this step may lead to the addition of reactions for which no experimental evidence or candidate genes can be identified. These reactions should be marked with the tag “modeling purposes only” (confidence score of 1). Be careful with such reactions, as too many of them may change the overall properties of the network (in this or other simulation conditions). Moreover, the overall performance of the model in the standard medium condition is determined and, in some cases, corrected. This step needs great care, since there may be many possible ways of filling a gap.
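The real producibility test adds a demand reaction for each precursor and maximizes its flux with FBA. As a rough, LP-free stand-in, the sketch below uses a reachability heuristic (a reaction fires once all of its substrates are available); it ignores stoichiometric coupling and cofactor balancing, so it is only a screening aid. All names are illustrative:

```python
# Sketch: qualitative screen for biomass precursor producibility from a set
# of seed (medium) metabolites. Toy irreversible network, illustrative names.
def producible(stoich, seeds):
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for mets in stoich.values():
            subs = {m for m, c in mets.items() if c < 0}
            prods = {m for m, c in mets.items() if c > 0}
            if subs <= available and not prods <= available:
                available |= prods
                changed = True
    return available

stoich = {
    "R1": {"glc": -1.0, "g6p": 1.0},
    "R2": {"g6p": -1.0, "pyr": 2.0},
}
reached = producible(stoich, {"glc"})
for precursor in ["pyr", "ala"]:
    print(precursor, precursor in reached)  # pyr True, ala False
```

Any precursor that fails even this permissive screen is guaranteed to fail the FBA test, which makes it a useful first pass before debugging with demand reactions.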
Subsequently, the capability to produce biomass precursors needs to be tested in other growth media. In this way, the correctness of the network content is evaluated with respect to all known growth conditions of the target organism. This includes all known carbon, nitrogen, sulfur, and phosphorus sources. Physiological information is of great value for determining all growth conditions. For example, Gutnick et al.69 tested about 600 compounds and found that 100 can serve as carbon or nitrogen sources for Salmonella typhimurium. The model should be able to produce biomass in the majority of these instances. However, not all known conditions may be reproduced by the model – this is not a problem, as it represents a starting point for experimental studies to identify missing metabolic functions. Nevertheless, great attention should be given to collecting and documenting these cases, thus enabling other researchers to pursue them.
If such information (e.g., measured by-product secretion) is available, it can be used to further refine the model. The first question is whether the model can produce the secretion product(s) given a substrate; a subsequent question could be whether a specific by-product secretion ratio is correct. Classical biochemical studies often report measured secretion products for a given carbon source (e.g., Schroeder et al.70). This information is very helpful for comparing the phenotypic traits of the model with those of the target organism.
Reactions that cannot carry any flux in any simulation condition are called blocked reactions. These reactions are directly or indirectly associated with dead-end metabolites, which cannot be balanced and give rise to so-called blocked compounds71. It is good to be aware of these reactions, especially if one expects different results in a simulation (e.g., a false-negative result in a single-gene-deletion analysis). In the early phase of the debugging stage, the reconstruction may contain many blocked reactions, whose gaps one might decide to fill if supporting information is available or if they are required for the overall function of the network. Targeted use of sink and demand reactions around a pathway of blocked reactions will facilitate the identification of the source of the problem. Other blocked reactions may remain if the terminal dead-end metabolite is beyond the scope of the metabolic reconstruction or if no information and evidence for filling the gap is available. The easiest way to determine blocked reactions is by performing flux variability analysis72, 73.
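Flux variability analysis is the rigorous way to find blocked reactions (minimum and maximum flux both zero). A cruder structural approximation, sketched below for a toy all-irreversible network with illustrative names, iteratively removes reactions that touch dead-end metabolites until none remain:

```python
# Sketch: structural approximation of blocked-reaction detection. A metabolite
# is a dead end if, among the still-active reactions and boundary metabolites,
# it is only produced or only consumed; reactions touching dead ends cannot
# carry steady-state flux and are pruned. (FVA is the rigorous method.)
def blocked_reactions(stoich, boundary_mets):
    active = dict(stoich)
    while True:
        produced, consumed = set(boundary_mets), set(boundary_mets)
        for mets in active.values():
            for m, c in mets.items():
                (produced if c > 0 else consumed).add(m)
        dead = (produced | consumed) - (produced & consumed)
        removable = [r for r, mets in active.items() if dead & set(mets)]
        if not removable:
            return set(stoich) - set(active)
        for r in removable:
            del active[r]

stoich = {
    "R1": {"A": -1.0, "B": 1.0},
    "R2": {"B": -1.0, "C": 1.0},   # C is a dead end, so R2 is blocked
}
print(blocked_reactions(stoich, {"A", "B"}))  # {'R2'}
```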
Analysis of false-positive and false-negative predictions will help to further refine the network content if the relevant information is available, or otherwise provides a basis for experimental studies (Figure 9). Numerous reconstructions have relied on phenotyping data (e.g., Biolog data) or gene essentiality data to improve the network content and thus the predictive potential74, 75.
So far, we have compared whether the model can reproduce growth on certain substrates, secrete a particular by-product, etc. In this step, it should be tested whether known incapabilities of the organism can also be reproduced by the model. For example, Helicobacter pylori is known to be auxotrophic for certain amino acids; consequently, their absence from the medium should abolish in silico growth76. It is important to use such “negative” data (incapabilities) as well, in order to correct for errors. Error cases can be resolved by analyzing the confidence scores associated with the reactions along the pathway. In the example of H. pylori, these would be the biosynthetic reactions leading to amino acid synthesis76. In a more algorithmic approach, a single-reaction-deletion study can be carried out and the results analyzed in terms of which deletions disable growth. This smaller subset of reactions then needs to be manually evaluated. Note that the deletion of a single function may not be sufficient when alternate pathways exist in the network. Missing incapabilities may not only be caused by falsely added reactions in the metabolic network, but may also be a consequence of missing regulatory information. The literature may provide the necessary data.
The model should also be tested for known capabilities, besides the aforementioned growth performance and secretion capability. For instance, this test can include known carbon splits in central metabolic pathways, as observed with a recently published Pseudomonas putida network57. The P/O ratio was investigated for Methanosarcina barkeri77 and Saccharomyces cerevisiae78 and compared to known growth data. Many more examples exist, and the suite of necessary tests depends on the available data as well as the properties of the network.
Growth that is too slow means that at least one precursor of the biomass function cannot be synthesized in sufficient amounts. This implies that the model’s biomass production is carbon-, nitrogen-, oxygen-, sulfur-, or phosphate-limited. Since there are generally fewer active uptake reactions for a particular element than biomass precursors, it is faster to test whether any of the medium components are growth limiting. If the biomass reaction value increases when an uptake reaction flux is increased, that compound is limiting. This gives a hint as to where in the network something must be missing or overly constrained. Furthermore, analysis of shadow prices and reduced costs, which are associated with the LP solution, can be of great help in identifying metabolites or reactions that limit the biomass rate. For example, the P. putida network57 is not able to grow in silico as fast as reported experimentally when toluene is the carbon source. In silico analysis suggested that oxygen is rate limiting and that more oxygen-efficient reactions are missing in the network. Whether such a discrepancy can be resolved by iterative network refinement depends on the specific case, and thus no general solution can be proposed. As in the case of P. putida’s oxygen restriction, such error cases can lead to further experimental investigation that will ultimately increase our biological insight and the reconstruction’s quality.
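The limiting-nutrient scan can be sketched as follows. A toy Liebig-style growth function (minimum over nutrient supply divided by a hypothetical per-biomass demand) stands in for the FBA optimization one would actually run; reaction names and numbers are illustrative:

```python
# Sketch: scanning medium components for a growth-limiting uptake.
# Hypothetical per-unit-biomass nutrient demands (mmol/gDW):
demand = {"EX_glc(e)": 10.0, "EX_o2(e)": 18.5, "EX_nh4(e)": 5.0}

def growth(uptake_limits):
    # Liebig's law of the minimum: the scarcest nutrient sets the growth rate
    return min(uptake_limits[r] / demand[r] for r in demand)

uptake = {"EX_glc(e)": 10.0, "EX_o2(e)": 9.0, "EX_nh4(e)": 100.0}
base = growth(uptake)

# relax each uptake bound by 10% in turn; growth rises only for the limiter
limiting = [r for r in uptake
            if growth({**uptake, r: uptake[r] * 1.1}) > base + 1e-9]
print(limiting)  # ['EX_o2(e)']
```

In a real model, each relaxed bound would be followed by re-running FBA, and the shadow prices of the LP solution give the same information directly.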
When the predicted growth rate is higher than expected, many explanations are possible. 1. The optimization for growth assumes that microbial cells maximize their growth. However, as mentioned above, many other objective functions are possible and may be more appropriate depending on the experimental setup and growth conditions of the target organism6, 18–20, 79–82. 2. The GAM, which is part of the biomass reaction, may have been estimated incorrectly and need adjustment. 3. Constraints may be missing or incorrect (e.g., NGAM, missing regulation). 4. Falsely included reactions may increase the growth rate. Knowledge about the model and the expected flux map is crucial for identifying the errors. Proton-shuttling reactions may be present that circumvent the ATP synthase (e.g., due to a futile cycle). Note that this is only the case under aerobic growth conditions. Such shuttling reactions may be enabled by many reversible transport reactions. Reactions associated with such loops can be readily identified (see Steps 51–59). Also, looking at the flux through the reactions of oxidative phosphorylation may indicate whether they are used under aerobic conditions or not. Alternatively, one can investigate whether a single reaction enables the model to grow too fast. In this case, a single-reaction-deletion study will point you towards the right solution. Another approach is to investigate the directionality of network reactions. As indicated earlier, reaction directionality may play a role in overly fast growth; therefore, improving reaction directionality assignments may be helpful. Make sure that only those reactions that are known to produce ATP are allowed to synthesize ATP, while all other reactions are set to be irreversible (ATP utilization). Similarly, reactions using quinones as electron acceptors should not run reversibly, since this may cause problems by allowing the electron transport chain to be circumvented.
These examples are very specific to a model and problem, and no general rule for corrections can be proposed.
Once the necessary content and the desired in silico capabilities are achieved, one can start to use the reconstruction in a prospective manner, which represents a fifth stage in the reconstruction process that is not addressed here.
The COBRA Toolbox16 should be downloaded and copied into a local folder on the user's computer. Extract the .zip file. After opening Matlab, a path should be set to the local folder containing the COBRA Toolbox (Matlab → File → Set Path → Add with Subfolders; choose the corresponding folder and save). All working files (SBML and xls files) should also be stored in the local folder to allow access to the reconstruction and models. Full documentation of the COBRA Toolbox can be found in the "doc" subfolder within the main Toolbox folder, which contains all help files in HTML format. Furthermore, help for Matlab and COBRA Toolbox functions can be accessed via Matlab's "help" facility by typing "help function_name" at the Matlab command line. See also Becker et al.16.
Comprehensive documentation on SBML, the file format, and model setup, can be found at the official SBML website (http://sbml.org/documents/, level 2 version 1). The SBML file describing the model has to include at least the following information: stoichiometry of each reaction, upper/lower bounds of each reaction, and objective function coefficients for each reaction. Additionally, gene–reaction associations can be added to the "Notes" section.
The first two reconstruction steps are illustrated in this protocol using spreadsheets. It is important that the order of the columns in the spreadsheet matches the example given in Supplemental Methods 2.
The imported model from the spreadsheets is contained in a model structure (see Figure 10 for details on this structure). All functions in the COBRA Toolbox access the information stored in the model structure. The values computed by the COBRA Toolbox are fluxes, which represent reaction rates for all model reactions. The units for fluxes used throughout this protocol are mmol/gDW/h, where gDW is the dry weight of the cell in grams.
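For readers more familiar with Python, the key fields of this model structure can be mirrored as a dictionary. The field names (rxns, mets, S, lb, ub, c) follow the COBRA Toolbox convention; the two-reaction toy content is purely illustrative:

```python
# Sketch: a Python dict mirroring the COBRA Toolbox model structure fields.
# The toy content (one exchange plus one hexokinase-like step, cofactors
# omitted) is illustrative, not a real reconstruction.
model = {
    "rxns": ["EX_glc(e)", "HEX1"],               # reaction abbreviations
    "mets": ["glc-D[e]", "glc-D[c]", "g6p[c]"],  # metabolite abbreviations
    # S: stoichiometric matrix, metabolites (rows) x reactions (columns)
    "S": [[-1.0,  0.0],
          [ 1.0, -1.0],
          [ 0.0,  1.0]],
    "lb": [-10.0, 0.0],       # lower bounds, mmol/gDW/h
    "ub": [1000.0, 1000.0],   # upper bounds, mmol/gDW/h
    "c":  [0.0, 0.0],         # objective coefficients
}

# the matrix dimensions must agree with the reaction and metabolite lists
assert len(model["S"]) == len(model["mets"])
assert all(len(row) == len(model["rxns"]) for row in model["S"])
```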
The Matlab software, the SBML Toolbox, and one or more of the suggested LP solvers should be installed following the instructions of the software providers. Note that the SBML Toolbox and the LP solver also need to be accessible in the Matlab path (see above). Sample installation instructions for the lp_solve LP solver on Windows can be found in Becker et al.16. The SBML Toolbox is downloaded and installed; follow the installation instructions and choose ‘libsbml’ in the dialog field. Once installed, open Matlab and type ‘install’. If you get an error with ‘libsbml’ (when opening Matlab again), go to Set Path and add the folder ‘libsbml’ with subfolders.
The COBRA Toolbox is initiated by typing in the Matlab command window:
SBML Toolbox and the LP solver should be tested for functionality following the software provider's instructions before attempting to use the COBRA Toolbox.
X3 is the software package used to determine stoichiometrically balanced cycles, or Type III pathways. X3.exe needs to be placed and extracted in the local folder. The help can be accessed by opening the DOS command line, changing to the local folder, and typing X3 –h. The extreme pathway tool will be called from Matlab by the COBRA Toolbox.
We will illustrate many steps of the protocol using KEGG41 because it is freely accessible and very helpful for the illustrated pathway-by-pathway reconstruction process. However, one has to keep in mind three properties of KEGG41: 1. Its data are NOT organism-specific; hence, not all reactions associated with an enzyme may be catalyzed by the enzyme of the target organism. 2. KEGG41 may not update the genome annotation of the target organism on a regular basis; hence, the information may be outdated and need a “second opinion” from another, more recent resource. 3. Not all reactions in the KEGG41 database are mass- and charge-balanced, as some omit protons and water molecules, although the KEGG database is continuously updated and improved83, 84.
For each biomass component i, perform the following test:
Perform one or more of the following tests to identify possible errors in the network:
The timing of the entire reconstruction process depends on the properties of the target organism (prokaryote vs. eukaryote, genome size), the quality of the genome annotation, and the availability of experimental data. The timing listed below represents an average and can be used to plan the different stages. All COBRA Toolbox functions described in this protocol finish within a couple of seconds to a few hours on a newer personal computer (Intel Core 2 Duo 6600 2.4 GHz with 4 GB of memory running Windows Vista).
This protocol will result in a reconstruction that covers most of the known metabolic information of the target organism and represents a knowledge database. This reconstruction can be used as a resource for information (query tool), for high-throughput data mapping (context for content), and as a starting point for mathematical models. Table 6 lists a subset of published reconstructions that were constructed based on the presented protocol.
To facilitate the use of the presented COBRA Toolbox commands (Steps 43 to 94), we listed examples of their use in the Supplementary Method 1.
Bibliome – A bibliome is a collection of primary and review literature as well as textbooks.
Biochemical, Genetic and Genomic (BiGG) knowledge base – A BiGG knowledge base is a genome-scale reconstruction, which incorporates in a structured manner genomic, proteomic, biochemical and physiological information of a particular organism or cell.
Biomass reaction – The biomass reaction lumps all known biomass precursors and their fractional distribution to a cell into one network reaction.
Blocked reactions – Network reactions that cannot carry any flux in any simulation condition are called blocked reactions. Generally, these blocked reactions are caused by missing links in the network.
Constraint-based reconstruction and analysis (COBRA) – COBRA is a modeling approach in which manually curated, stoichiometric network reconstructions are constructed. Subsequently, models can be obtained and analyzed by applying equality and inequality constraints and by computing functional states. Constraints include mass conservation and thermodynamics (for directionality) as well as constraints reflecting experimental conditions and regulatory constraints
Dead-end metabolite – A metabolite that is only produced or only consumed in the network.
Demand reaction – When the consumption reaction(s) of a metabolite is not known or outside the scope of the reconstruction it can be represented by this unbalanced, intracellular reaction (e.g., 1 A -->).
Exchange reactions – These reactions are unbalanced, extra-organism reactions that represent the supply to or removal of metabolites from the extra-organism “space” (see Box 3).
Extreme pathways (ExPa’s) – ExPa’s are a unique and minimal set of flux vectors that lie at the edges of the bounded null space. Biochemically meaningful steady-state solutions can be obtained by nonnegative linear combinations of ExPa’s.
Flux-balance analysis (FBA) – FBA is a formalism that defines the metabolic network as a linear programming optimization problem. The main constraints in FBA are imposed by the steady-state mass conservation of metabolites.
Futile cycles – Stoichiometrically unbalanced cycles, which are associated with energy consumption.
Gene-protein-reaction (GPR) association – A GPR association connects genes, proteins, and reactions in a logical relationship (AND, OR).
Genome-scale model (GEM) – A GEM is derived from a GENRE, by converting it into a mathematical form (i.e., an in silico model) and by assessing computationally its phenotypic properties.
Genome-scale network reconstruction (GENRE) – A GENRE is formed from an organism-specific BiGG knowledge base. A GENRE is a collection of biochemical transformations derived from the genome annotation and the bibliome of the target organism. A GENRE is unique to an organism, as its genome is.
Flux variability analysis (FVA) – FVA is a frequently used computational tool for investigating the global capabilities of a network under a given simulation condition (e.g., network redundancy). To this end, every network reaction is chosen in turn as the objective function, and the minimal and maximal possible flux values through the reaction are determined by minimizing and maximizing the objective function.
Linear programming (LP) – LP is an optimization technique, in which a linear objective function is optimized (i.e., minimized or maximized) subject to linear equality and inequality constraints.
Network gap – A network gap is a missing reaction or function in the network, which can connect one or more dead-end metabolites with the remainder of the network.
Objective function – An objective function is a network reaction, or a linear combination of network reactions, which is optimized in the linear programming problem.
Sink reaction – When the synthesis reaction(s) of a metabolite is not known or is outside the scope of the reconstruction, its supply can be represented by this unbalanced, reversible, intracellular reaction (e.g., 1 A <-->).
P/O ratio – This ratio represents the number of ATP molecules (P) which are formed per oxygen atom (O) consumed during respiration.
Reduced cost – A parameter associated with linear programming that can be used to investigate properties of the calculated optimal solution. Each network reaction has an associated reduced cost value, which represents the amount by which the objective value would change if the flux through the reaction were increased by one unit. Note that, by definition, reduced cost values that can increase the objective value are negative numbers.
Type III extreme pathway – These stoichiometrically balanced cycles (SBCs) are a subset of ExPa’s that are composed only of intracellular reactions, i.e., all exchange reactions (i.e., system boundaries) carry zero flux.
We would like to acknowledge R.M.T. Fleming, A. Feist, and N. Jamshidi for valuable discussions. We are thankful to M. Abrahams, S.A. Becker, and F.-C. Cheng for reading the manuscript. We would like to thank S. Burning for preparing the biomass reaction manual, as well as A. Bordbar and R.M.T. Fleming for providing Matlab code. I.T. was supported by National Institutes of Health (NIH) grant R01 GM057089.
Competing interest statement: The authors declare that they have no competing financial interests.
Author contribution: IT and BOP designed concept and wrote manuscript. IT developed protocol.