Constraint-based modeling is a widely used systems biology method that is particularly well suited to predicting the phenotypes of microbial organisms after gene knockouts or during growth on different substrates [1-3]. These conditions are represented as additional constraints on a model, and growth is then predicted by flux balance analysis (FBA) [4]. Because a typical metabolic model does not represent every realistic constraint, it can predict growth under conditions where growth does not actually occur. For example, the organism may not express a gene required for growth, or fluxes may be limited by kinetic or thermodynamic constraints. Such a prediction is called a false positive. Conversely, an incorrect prediction of no growth, called a false negative, can indicate that the model is missing an essential reaction [5]. No current metabolic network reconstruction is entirely complete and realistic, because no organism's metabolism is completely known. Even in very well-studied model organisms such as Escherichia coli, many genes still have unknown functions [6, 7]. As a result, metabolic network reconstructions contain gaps. These gaps take the form of dead-end metabolites, which have either no producing or no consuming reactions [8].
Several different types of gaps can exist in reconstructed metabolic networks [8, 9]. These gaps result in blocked reactions, which are unable to carry flux at steady state, and blocked metabolites, which participate only in blocked reactions and can never be produced or consumed. Root no-production gaps are metabolites that have consuming reactions but are blocked because they have no producing reactions. Metabolites that can only be produced from root no-production metabolites are also blocked, and are referred to as downstream gaps. Likewise, root no-consumption gaps are metabolites with producing reactions but no consuming reactions, and the other metabolites blocked by these gaps are called upstream gaps. The gaps in a metabolic network can also be classified as either scope gaps or knowledge gaps. Scope gaps exist because the scope of most metabolic network models does not include features such as macromolecular degradation or the use of charged tRNAs in protein synthesis. Knowledge gaps, in contrast, result from our incomplete knowledge of the metabolism of any organism [10].
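The gap categories defined above can be illustrated on a toy network. The following is a minimal sketch, not the procedure used in the cited studies: it assumes every reaction is irreversible, uses a purely topological test (a rigorous identification of blocked reactions would require a flux-based analysis), and all reaction and metabolite names are hypothetical.

```python
def find_gaps(reactions):
    """Classify dead-end metabolites in a toy network.

    `reactions` maps a reaction name to {metabolite: coefficient}. Every
    reaction is treated as irreversible (an assumption of this sketch):
    negative coefficients are consumed, positive coefficients are produced.
    """
    active = dict(reactions)
    result = {"root_no_production": set(), "root_no_consumption": set(),
              "downstream": set(), "upstream": set()}
    first_pass = True
    while True:
        produced = {m for r in active.values() for m, c in r.items() if c > 0}
        consumed = {m for r in active.values() for m, c in r.items() if c < 0}
        no_production = consumed - produced
        no_consumption = produced - consumed
        if not (no_production or no_consumption):
            break
        if first_pass:
            # Dead ends in the intact network are the root gaps.
            result["root_no_production"] = no_production
            result["root_no_consumption"] = no_consumption
            first_pass = False
        else:
            # Dead ends exposed after removing blocked reactions.
            result["downstream"] |= no_production
            result["upstream"] |= no_consumption
        dead_ends = no_production | no_consumption
        # Any reaction touching a dead-end metabolite cannot carry flux.
        active = {name: r for name, r in active.items()
                  if dead_ends.isdisjoint(r)}
    result["blocked_reactions"] = set(reactions) - set(active)
    return result


# Hypothetical toy network; exchange reactions have a single metabolite.
toy_network = {
    "EX_glc": {"glc": 1},              # uptake: -> glc
    "R1": {"glc": -1, "pyr": 1},       # glc -> pyr
    "EX_pyr": {"pyr": -1},             # secretion: pyr ->
    "R2": {"x": -1, "y": 1},           # x -> y (x is never produced)
    "R3": {"y": -1, "pyr": 1},         # y -> pyr
    "R4": {"glc": -1, "w": 1},         # glc -> w
    "R5": {"w": -1, "z": 1},           # w -> z (z is never consumed)
}
gaps = find_gaps(toy_network)
```

In this toy case, `x` is a root no-production gap with `y` downstream of it, and `z` is a root no-consumption gap with `w` upstream of it. A metabolite that lies between two root gaps is removed together with its reactions and is not assigned to either category by this simple pass.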
The comparison of model predictions to experimental data can be a useful way to fill network gaps and discover new genes and reactions. There are four possible outcomes when comparing computationally predicted to experimentally measured growth phenotypes: true positives, when the model correctly predicts growth; true negatives, when the model correctly predicts that no growth is possible; false positives, when the model predicts growth under a condition where growth was not observed; and false negatives, when the model fails to predict growth where growth was experimentally observed. Both false positive and false negative results can be useful for refining model content, but it is the false negative cases that can help fill gaps. Several methods have been developed to predict the correct gap-filling reactions based on comparisons to experimental data.
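The four outcomes described above amount to a simple comparison of each prediction with its measurement. The sketch below is illustrative only: the growth-rate threshold, the condition names, and the values are assumptions and are not taken from any dataset discussed here.

```python
def classify_outcome(predicted_growth, observed_growth):
    """Return the outcome label for one predicted/observed pair of booleans."""
    if predicted_growth:
        return "true positive" if observed_growth else "false positive"
    return "false negative" if observed_growth else "true negative"


def compare_to_data(predicted_rates, observed_growth, threshold=1e-6):
    """Label every condition; `threshold` is an assumed growth-rate cutoff."""
    outcomes = {}
    for condition, grew in observed_growth.items():
        predicted = predicted_rates[condition] > threshold
        outcomes[condition] = classify_outcome(predicted, grew)
    return outcomes


# Hypothetical predictions (growth rates) and observations (growth or not).
predicted_rates = {"knockout_1": 0.9, "knockout_2": 0.0,
                   "knockout_3": 0.7, "knockout_4": 0.0}
observed_growth = {"knockout_1": True, "knockout_2": False,
                   "knockout_3": False, "knockout_4": True}

outcomes = compare_to_data(predicted_rates, observed_growth)
# False negatives are the cases that can point to missing reactions.
gap_filling_candidates = sorted(
    c for c, label in outcomes.items() if label == "false negative")
```

Only the false negative conditions are collected at the end, since these are the cases that can help fill gaps.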
The first such method to be published was SMILEY [5], a mixed-integer linear programming algorithm that identifies the minimum number of reactions from a universal reaction database that must be added to a metabolic model to achieve a defined minimum growth rate. SMILEY was first developed and used to predict reactions missing from the iJR904 E. coli reconstruction [11] that caused false negative growth predictions when compared to Biolog growth data [12]. Several results were experimentally verified and new genes were characterized [5]. SMILEY was also recently used to predict gap-filling reactions in the Recon 1 human metabolic reconstruction [13, 14]. The algorithms GapFind/GapFill [9] and GrowMatch [15] were developed later; they predict missing reactions by connecting model gaps and by comparing model predictions to gene essentiality data, respectively. To date, these methods have been used to make predictions for the E. coli and yeast metabolic networks [15, 16], but these predictions have not yet been experimentally verified. Non-constraint-based methods for reconstructing metabolic networks and filling gaps have also been developed. One example is PathoLogic, a component of the Pathway Tools software that has been used to assemble the organism-specific databases of BioCyc [17]. This program fills gaps to complete metabolic pathways and includes a hole-filling algorithm that assigns genes to gap-filling reactions [18, 19]. Another recent procedure uses network expansion to determine the minimum number of reactions that must be added to a network to make it consistent with experimental data [20]. This procedure considered the production of metabolites as macromolecule degradation products, and genes were predicted using hidden Markov models. It was applied to improve metabolic models of E. coli [21] and Chlamydomonas reinhardtii [22].
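The idea shared by these methods, finding the fewest reactions from a universal set that restore a required capability, can be sketched on a toy example. The code below is a brute-force stand-in and not the mixed-integer formulation of SMILEY or the published network expansion procedure: it tests reachability of a target metabolite rather than steady-state flux, its cost grows exponentially with the size of the universal set, and all reaction and metabolite names are hypothetical.

```python
from itertools import combinations


def expand(seeds, reactions):
    """Network expansion: metabolites reachable from `seeds`.

    `reactions` maps a name to (substrates, products). A reaction can fire
    once all of its substrates are in the current scope.
    """
    scope = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= scope and not set(products) <= scope:
                scope |= set(products)
                changed = True
    return scope


def minimal_additions(model, universal, seeds, targets):
    """Return all smallest sets of universal reactions that make every
    target reachable, found by exhaustive search over increasing set sizes."""
    targets = set(targets)
    for size in range(len(universal) + 1):
        solutions = []
        for names in combinations(sorted(universal), size):
            candidate = dict(model)
            candidate.update({n: universal[n] for n in names})
            if targets <= expand(seeds, candidate):
                solutions.append(names)
        if solutions:
            return solutions
    return []  # no combination of universal reactions reaches the targets


# Hypothetical model with a gap between g6p and f6p.
model = {
    "R1": (("glc",), ("g6p",)),
    "R3": (("f6p",), ("biomass",)),
}
# Hypothetical universal reaction set.
universal = {
    "U1": (("g6p",), ("f6p",)),   # direct route
    "U2": (("g6p",), ("x",)),     # first step of a two-step route
    "U3": (("x",), ("f6p",)),     # second step of a two-step route
    "U4": (("q",), ("f6p",)),     # unusable: q is never available
}
solutions = minimal_additions(model, universal, {"glc"}, {"biomass"})
```

Here the single reaction `U1` is the only minimal solution; the two-step route through `U2` and `U3` would also restore the target but requires more additions. An optimization-based formulation avoids the exhaustive enumeration by encoding reaction inclusion as binary variables.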
The present study builds on these methods with a new workflow that includes the SMILEY algorithm. SMILEY was used instead of GapFill or GrowMatch because it could be modified to make predictions from a wider range of experimental data than it was originally applied to. Specifically, it was used to make predictions from gene essentiality data and network gaps in addition to data for growth on different substrates. The iJO1366 metabolic network reconstruction of E. coli K-12, the latest and most complete genome-scale reconstruction of this organism [10], was used in this analysis. To begin, a large dataset of E. coli gene essentiality from the Keio Collection [23], combined from four published datasets [10, 23-25], was assembled. Next, growth predictions made using the iJO1366 model were compared to this dataset, and both false positive and false negative comparisons were analyzed to identify potential errors in the model and in the experimental datasets. The SMILEY algorithm was then used to predict gap-filling reactions and reactions that correct false negative model predictions. The feasibility of these reactions was assessed by comparing the predictions of the augmented model to the experimental dataset. Finally, genes were predicted for the most feasible putative reactions. Several sets of gene function predictions are presented that provide plausible hypotheses for experimental validation. These predictions have the potential to improve the metabolic reconstruction and lead to new metabolic gene discoveries [8]. Knockout strain growth phenotyping experiments were performed to identify a gene involved in myo-inositol metabolism, demonstrating the types of experimental analyses that can validate these biological predictions.