Metabolic pathway information is available in several different databases such as EcoCyc [19
], WIT [20
], BRENDA [21
] and KEGG [12
]. EcoCyc is an E. coli
-specific database and contains the metabolic complement of E. coli
. We chose EcoCyc over the other available databases for three reasons: 1) EcoCyc contains information about the directionality of the enzymatic reactions, 2) Some enzymes have not yet been classified according to the Enzyme Commission (EC) system [22
] (EcoCyc contains both EC-classified enzymes and enzymes that fall outside of the EC classification; There are 1172 enzyme entries in the database and 781 of these have EC numbers.) and 3) EcoCyc is freely available for universities and non-profit research institutes.
The vertices of the graph we use represent the enzymes catalyzing the reactions. One edge represents one or more compounds. There will be an edge leading from an enzyme E1 to an enzyme E2 if E1 catalyzes a reaction where compound A is produced and E2 takes A as a substrate. Reversible reactions, such as A
B, are separated into two reactions, A → B and B → A. There can be at most one edge in each direction between a pair of enzymes. Note that the representation used herein is different from the common representation of metabolic pathways where the substrates and products are the vertices and the enzymes catalyzing the reactions are the edges (Figure ). The type of network representation used in our study has been used before for metabolic network analysis where it has been referred to as 'protein-centric' graphs [23
] and 'reaction graphs' [24
] (Figure ).
Figure 1 Common representation of the biotin metabolism. In the most common representation of biochemical reactions the reactants represent the vertices of the network and the enzymes represent the edges. The drawing of the biotin metabolism was redrawn from EcoCyc (more ...)
Figure 2 Protein-centric representation of the biotin metabolism. The protein-centric or reaction graph representation which was used in our study has enzymes as vertices and reactants as edges. EC 188.8.131.52 is a neighbor of EC 184.108.40.206 because there is an edge (more ...)
Some reactions are physiologically irreversible and should be represented accordingly. To encompass the directionality information we use a directed graph to represent the metabolic network of E. coli. As a result there are not necessarily paths from every enzyme vertex in the network to every other enzyme vertex.
One of the most problematic aspects with metabolic network analysis is how to handle the promiscuous compounds. One may argue that because promiscuous compounds are usually not the limiting factors of reactions, the network would become more biochemically meaningful if these compounds are removed [25
]. However, to our knowledge there is no generally accepted criterion to determine whether a compound participating in a reaction is a current compound, cofactor or main metabolite. In this study, we have chosen to formulate and apply a simple network-based criterion. We count the number of times a compound occurs as part of an edge in the network. The most common compounds are then considered as promiscuous compounds [11
]. As in the study performed by Wagner and Fell [24
] we here conduct our study with one network that includes all compounds, including the promiscuous compounds, and another network where the most promiscuous compounds have been removed.
Determination of network parameters
The directed graph derived from EcoCyc contains 1,172 vertices and 224,972 edges (Table ). The network consists of two components; one contains 2 vertices and a single edge while the other component consists of the remaining vertices. Except for the two enzymes (two chloride ion transporters) the whole metabolic network of E. coli is therefore connected as one would expect from a unicellular, autonomous and non-parasitic organism.
Table 1 Network parameters. The table contains some important characteristics of the network (see text for more detailed description). The second column contains the network parameters for the whole network while the third column contains the network parameters (more ...)
We examined which compounds are the most promiscuous in the metabolic network graph by determining the number of times the compounds occur as part of edges connecting two vertices in the network. The most promiscuous compound is H2O, which appears as part of an edge between 79,485 enzyme pairs (Table ). The compound frequency plot shows some scale-free network characteristics in that there are a few compounds that occur very often and most compounds occur as parts of edges 1, 2 or 3 times (Figure ). We also determine which enzymes are the most highly connected. The most highly connected enzyme has 735 enzyme neighbors (Carbamoyl phosphate synthase) (Table ).
The most promiscuous compounds in the network. The frequency of a compound is the number of times the compound occurs as part of an edge in the network.
Figure 3 Compound frequency distribution histogram for the metabolic network of E. coli. The 20 most promiscuous compounds have been removed in the network analyzed here. The histogram shows how many compounds (y-axis, logged scale) that occur with a certain frequency (more ...)
The enzymes with the highest connectivities (k) of the network for the whole and the reduced networks. The connectivity of an enzyme is here defined as the number of enzymes which it is connected to by an outgoing edge.
It has been shown in previous studies that the metabolic network of E. coli
is a small world network [24
]. The definition of a small world graph is that its average path length (D) is on the same order as the average path length for a random graph, with the same number of vertices (n) and mean connectivity (
), but its clustering coefficient (Cn
) exceeds that of the random graph by far [27
is a measure between 0 and 1 of the cliquishness of the graph [27
]. Each vertex' clustering coefficient (C) is calculated by taking the number of edges between the vertex' neighbors (m) divided by the maximum number of edges between the vertex' neighbors, i.e if u is the number of neighbors and the graph is directed C
- 1)). Cn
is the average clustering coefficient for the graph.
Our network graph has Cn
= 0.72 which is large compared to the clustering coefficient of the random graph Cr
= 0.16 (Table ). The average path length for the Erdös-Rényi random graph [28
] is Dr
) = 1.3, which is smaller than but on the same order as the average path length for the metabolic network (Table ). Given the large Cn
and the small D we can conclude that the metabolic network of E. coli
constructed from EcoCyc shows some characteristics of a small world network.
Correlation between MPL and homology
We used PSI-BLAST [29
] with an E-value cut-off of 10-6
and 3 iterations for an all-against-all sequence comparison of the 1105 protein sequences coding for the metabolic enzymes in E. coli
. We chose the relatively strict E-value cut-off in order to minimize the number of false positives. We also ran PSI-BLAST against the SCOP database [30
] to increase the sensitivity and collect further pairs of homologous enzymes. The proteins in the SCOP database are ordered into a hierarchy consisting of class, fold, super-family and family, with an increasing level of structural similarity between the proteins. In our study, enzymes belonging to the same super-family are considered homologous. Using these methods 8,218 homologous enzyme pairs were found.
There are 72 cases where two homologous genes code for the same enzyme. The reason may for instance be that the genes code for different subunits of the enzyme, that there are multiple copies of one gene or that the genes code for two isozymes. In addition there are 169 pairs of enzymes which are clearly isozymes because they are homologous, identified as separate enzymes in EcoCyc and catalyze the same reaction. It is not clear what relationship these 241 pairs should have in our graph representation of the metabolic network. They may be included at minimal path length (MPL, Figure ) 1 since they catalyze identical reactions, or they may be included as a special category at MPL 0. For our analysis however, since we are interested in the relationship between clearly different enzymes, we chose not to include these probably fairly recent gene duplications.
Figure 4 Minimal path length (MPL) between enzymes. a) The reactions catalyzed by EC 220.127.116.11 (alcohol dehydrogenase), EC 18.104.22.168 (1-pyrroline-5-carboxylate dekydrogenase) and EC 22.214.171.124 (aldehyde dehydrogenase). b) There are severeral paths leading from EC 126.96.36.199 (more ...)
There are 209 E. coli
genes which are each associated with at least two enzymatic functions. Most of these multifunctional enzymes are enzymes with broad substrate specificities. 18 of these genes consist of two separate regions which are clearly associated with two (or more) different enzymatic functions (see Additional file 1
and for instance [31
]). We used EcoCyc to identify these genes and Pfam [33
] to localize the domains on the genes. The domains were separated from each other and included in the analysis as partial genes.
There are 644,656 pairs of enzymes in the whole network (Table ). Of these, only 7,989 (1.2%) are homologous enzyme pairs. There are 121,295 enzyme pairs at MPL 1. 4,662 (3.8%) of these are homologous enzyme pairs. There appears to be a 3-fold over-representation of homologous enzyme pairs at MPL 1 in the whole network. In order to determine whether the observed over-representation of homologous enzyme pairs at MPL 1 is statistically significant randomized networks were constructed. There are many alternative ways to construct randomized networks. We have chosen an approach that preserves the topological properties of the network since it has been shown that the metabolic network of many organisms have some of the properties of scale-free networks [26
]. We constructed many randomized networks by starting from the original real network and shuffling the enzyme identities between randomly chosen pairs of vertices in the graph which preserves the network topology (Figure ).
Table 4 The numbers of homologous enzyme pairs and enzyme pairs found at MPLs 1–12. The fraction is simply the ratio between the number of homologous enzyme pairs and the enzyme pairs. The reduced network is the network where the 20 most promiscuous compounds (more ...)
Figure 5 Preservation of network topology. The left side of the figure shows the numbers of vertices and edges in the original example graph. The right side of the figure shows the numbers of vertices and edges for the randomized graph where the vertex identities (more ...)
The number of homologous enzyme pairs plotted against MPL for the whole metabolic network of E. coli and the standard deviations for the 1,000 randomized networks are shown in Figure . In the real network there are 4,662 pairs of homologous genes at MPL 1, which is 58% of all homologous pairs found (Table ). In contrast typically 1,454 homologous enzyme pairs (18%) are found at MPL 1 in the randomized networks. Compared to the randomized networks there is again a 3-fold over-representation of homologous enzyme pairs at MPL 1. If the homologies were randomly distributed in the network we would expect to see the line representing the real network within the boundaries of the standard deviations for the randomized networks in Figure . Instead, we find that there is a significantly greater likelihood of enzymes at distance 1 from each other to be homologous in the metabolic network of E. coli than in the randomized network.
Figure 6 Homology vs minimal path length (MPL) for the whole metabolic network of E. coli. The plot shows the correlation between homology and MPL when all compounds are included. The solid line represents the real metabolic network of E. coli. The dotted vertical (more ...)
The network includes promiscuous compounds such as H2O, ATP and NAD. There are therefore homologous enzyme pairs containing for instance NAD-binding domains that are at MPL 1 because their gene products catalyze reactions involving that cofactor. It could be argued that such coenzyme binding domains give rise to skewed results in our analysis. To remedy this complication we removed 20 compounds, starting from the most promiscuous compound (H2O) down to the 20th most promiscuous compound (O2). We find that the correlation between MPL and homology is preserved (Figure ), indicating that the abundance of homologous enzyme pairs at MPL 1 is not the result of common cofactor-binding domains alone. We could also detect a marginally significant correlation between MPL and homology at MPL 2 when the 20 most promiscuous compounds had been excluded from the network (Figure ). No correlation could be detected at MPLs greater than 2. These results were robust for variations in the number of compounds that were removed from the network, i.e. removing between 17 and 23 of the most promiscuous compounds generated the same result (data not shown).
Figure 7 Homology vs minimal path length (MPL) without the 20 most promiscuous compounds. The plot shows the correlation between homology and MPL when the 20 most promiscuous compounds (Table ) have been removed. The solid line represents the (more ...)
It could be argued that the over-representation of homologous enzyme pairs at MPL 1 is due to the fact that not all cofactors have been removed by removing the 20 most promiscuous compounds. To investigate this possibility an alternative network was constructed where all the cofactors, as defined by EcoCyc, were removed from the network. In this alternative network the number of homologous enzyme pairs at MPL 2 is just within the boundaries of 3 standard deviations for the randomized networks. The correlation at MPL 1 remained unchanged (data not shown). Hence, the correlation between homology and MPL at MPL 1 is not due to cofactor-binding regions alone.
Rison et al
] found an over-representation within the extended pathways of E. coli
at pathway distances 1, 2 and 3. Alves et al
found that there is clustering of homologous enzyme pairs at MPL 1 and 2 in the metabolic networks of several organisms. We detect a clearly significant over-representation only at MPL 1 in the metabolic network of E. coli
. Two possible explanations for the differences between our results are that Alves et al
performed a multi-organism analysis while we analyze only E. coli
and that Rison et al
looked at extended pathways rather than at the whole network.
Analysis of the homologous enzyme pairs
We wished to investigate to which extent the retrograde evolution model and the patchwork evolution model respectively are responsible for the over-representation of homologous enzyme pairs seen at MPL 1. In order to be able to analyze the homologous enzyme pairs at MPL 1 further we use a rough criterion for discriminating between cases that fit the retrograde evolution model or the patchwork evolution model:
1) Patchwork model: Homologous enzymes with similar functions probably evolved through patchwork evolution events. Therefore homologous enzymes that evolved through the patchwork evolution model should share the same primary EC number.
2) Retrograde evolution: Homologous enzymes with dissimilar functions are less likely to have evolved through patchwork evolution events. Therefore homologous enzymes that have different primary EC numbers are candidates for retrograde evolution.
We find that 304 enzymes (26%) have not been EC classified. Reactions that do not have EC numbers have been classified by EcoCyc according to an EcoCyc-specific scheme. We consider enzymes that belong to the same EcoCyc reaction type as functionally similar. There are some reactions that remain unclassified in EcoCyc. Pairs including such enzymes are regarded as 'undetermined' in our analysis.
240 (56%) of the homologous enzyme pairs at MPL 1 have similar functions (Table ). One instance is 1-phosphofructokinase and 6-phosphofructokinase, which catalyze similar reactions with the difference that 1-phosphofructokinase catalyzes a reaction involving fructose-1-phosphate while 6-phosphofructokinase has fructose-6-phosphate as a substrate (Figure ). These are clearly analogous reactions whose homology can readily be explained by the patchwork evolution model.
The numbers of functionally similar /dissimilar/undetermined homologous enzyme pairs at MPLs 1–11 in the network where the 20 most promiscuous compounds have been removed.
Figure 8 Different types of homologous enzyme pairs at minimal path length (MPL) 1. a) Enzymes that catalyze similar reactions. The enzymes are at MPL 1 because they both catalyse reactions involving the metabolite fructose-1,6-bisphosphate. b) Enzymes that catalyze (more ...)
90 (21%) of the homologous enzyme pairs at MPL 1 are functionally dissimilar (Table ). One instance is component I of anthranilate synthase (trpE) which is homologous to the two isozymes of isochorismate synthase (entC and menF) (Figure ). If substrate depletion was the primary selective pressure for the E. coli ancestor of these enzymes, chorismate was probably the substrate being depleted because it is the only compound that these two reactions have in common. It is primarily among the 297 homologous enzyme pairs with dissimilar functions at MPL 1 and 2 that the candidates for retrograde theory enzymes can be found. However, it should be noted that some of the candidates for retrograde evolution that have been identified before are not included among our retrograde evolution candidates: We classify the enzymes, coding for the last two consecutive steps in the tryptophan biosynthesis (trpA/trpB and trpC) as functionally conserved because these two enzymes have the same primary EC numbers (188.8.131.52 and 184.108.40.206). In the same manner we classify the enzymes, coding for two consecutive steps in methionine biosynthesis (metB and metC) (220.127.116.11 and 18.104.22.168) as functionally similar as well as the four homologous consecutive enzymes in the peptidoglycan biosynthesis (22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206).
The functionally similar, dissimilar and undetermined homologous enzyme pairs were plotted against MPL and normalized against 1,000 randomized networks by the same method as described above. From this procedure we could determine that most of the correlation between homology and MPL at MPL 1 can be attributed to enzymes with similar functions (Figure ) and enzymes with undetermined function (data not shown). However, there is still an over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 (Figure ) indicating that there is a statistically significant proportion of the homologous enzyme pairs whose homology cannot be attributed to chemical similarity. We also note an over-representation at MPL 10 for the functionally similar homologous pairs (Figure ). The over-representation consists of 9 pairs of homologous inner membrane MFS transporters which show 15–20% sequence identity.
Figure 9 Homology vs minimal path length (MPL) without the 20 most promiscuous compounds for functionally similar enzyme pairs. The plots show the correlation between homology and MPL for functionally similar (shared primary EC number) enzyme pairs when the 20 (more ...)
Figure 10 Homology vs minimal path length (MPL) without the 20 most promiscuous compounds for functionally dissimilar enzyme pairs. The plots show the correlation between homology and MPL for functionally dissimilar (different primary EC number) enzyme pairs when (more ...)
The correlation between homology and network distance at MPL 1 is mostly due to the homologous enzyme pairs with similar functions which have evolved through patchwork evolution and to homologous enzyme pairs with undetermined function. While there is some statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model, it appears that the evolution of an enzyme that catalyzes a new type of reaction is rare and increased enzyme specificity driven evolution is more common which is in general agreement with recent studies [16