The MANET database project traces the evolution of protein structure in biomolecular networks with bioinformatic, phylogenetic, and statistical methods. Metabolic MANET links the SCOP and KEGG databases to universal phylogenies of protein fold architecture. The database was assembled in multiple steps. We first reconstructed phylogenomic trees describing the evolution of protein folds in 174 proteomes belonging to Eucarya, Archaea and Bacteria. These trees are large and can be visualized using hyperbolic tree visualization tools [48]. Figure shows cladogram, hyperbolic, and circular tree representations of the tree of protein architecture used in this study. The tree is consistent with phylogenies generated previously from a set of 32 proteomes using the same approach [27]. These tree reconstructions were then used to assign a relative age (ancestry) to each fold based on how many cladogenic events occurred in each lineage (Fig. ). Finally, ancestries were literally painted onto metabolic subnetworks with information derived from SCOP, KEGG and HMM-based fold superfamily prediction tools. Figure describes a representative subnetwork of metabolic MANET showing enzymatic nodes painted with molecular ancestries. Please note that ancestries represent a lower limit on the time at which the fold might have been adopted for a particular enzymatic activity.
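As a rough illustration of how a relative age can be read off a rooted tree, the sketch below counts the branching (cladogenic) events on the path from the root to each fold and rescales the counts to the range 0 to 1. The toy tree, the fold labels, the rescaling by the observed minimum and maximum, and the convention that 0 marks the oldest fold are all assumptions made for illustration; they are not the data or the exact procedure used to build MANET.

```python
# Toy rooted tree written as nested tuples; leaves are hypothetical fold labels.
# Illustrative sketch only, not the MANET pipeline.
TOY_TREE = ("fold_A", ("fold_B", ("fold_C", ("fold_D", "fold_E"))))


def node_distances(tree, depth=0, out=None):
    """Count the internal nodes on the path from the root to each leaf."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = depth  # a leaf: record how many branch points precede it
    else:
        for child in tree:
            node_distances(child, depth + 1, out)
    return out


def relative_ancestry(tree):
    """Rescale node distances to [0, 1]; 0 = oldest (assumed convention)."""
    dist = node_distances(tree)
    lo, hi = min(dist.values()), max(dist.values())
    span = (hi - lo) or 1  # avoid division by zero for a single-level tree
    return {leaf: (d - lo) / span for leaf, d in dist.items()}


ancestry = relative_ancestry(TOY_TREE)
```

In this toy tree, `fold_A` branches off first and receives the value 0, while `fold_D` and `fold_E` share the deepest branch point and receive the value 1.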
Figure 3 Phylogenomic tree reconstruction of protein fold architecture generated from a domain census in 174 completely sequenced genomes. The structural census was defined by advanced HMMs and assigned domain structure to about 60% of genomic sequences.
Figure 4 Representative subnetwork diagram describing molecular ancestries in metabolic MANET. A colored scale is used to assign binned ancestry values to enzyme nodes named with EC numbers. The red color represents enzyme nodes with the oldest ancestry.
Reconstructed trees were based on a genomic census of protein architecture. Consequently, they depend on the accuracy of genomic databases, a balanced genomic sampling of the living world, efficient and accurate assignment of structures to proteins, a structural classification scheme that depicts evolutionary patterns, and methods of phylogenetic tree and character state reconstruction. The influence of these factors has been discussed previously [27]. While there is no possible gold standard that can be used to confirm the validity of phylogenomic statements, the genome census data we used to generate the tree of fold architectures was also used to generate trees of proteomes, and these trees group organisms into the three domains, for the most part according to established organismal classification [Wang and Caetano-Anollés, ms. in preparation]. This observation supports the validity of the phylogenetic signal embedded in the data.
Our study also rests on the accuracy of SCOP, a robust protein classification scheme [18], and on the monophyletic nature of protein folds and superfamilies. Consequently our inferences should be regarded as rough first approximations. While we do not expect major changes in the operational definition of a protein fold, many folds could be better described by "continuous" rather than "discrete" distributions in structure space [50]. Furthermore, we trust SCOP hierarchies reflect true evolutionary groupings. In SCOP, proteins in families express clear evolutionary relationships. They generally exhibit >30% pairwise residue identities or have functions and/or structures that provide definite evidence of common descent. Similarly, fold superfamilies contain proteins with structural and functional features that are highly suggestive of a common evolutionary origin. However, highly popular folds encompass collections of fold superfamilies that share the same arrangement and topology of secondary structures but may not have a common evolutionary origin. Consequently, the monophyletic nature of protein folds needs to be examined case by case, as has been done for the (βα)8
Currently, metabolic MANET contains 23,217 entries linking 1,255 enzymatic activities to PDB entries, folds, ancestry values, and pathways. A total of 6,552 PDB entries are associated with metabolic subnetworks. Based on information derived mostly from crystallographic structural models, 33% of metabolic protein nodes were painted in phylogenetic tracings of the metabolic pathways that are registered in KEGG. Use of HMMs that assign probable fold superfamily identities to protein sequences increased the fraction of painted enzymes to 63%. Individual steps in the analysis and sorting of data can be found in the supplementary data [see Additional file 1]. Among the 132 subnetworks from the MANET database, 122 subnetworks described metabolic pathways and 10 subnetworks described processing of genetic, environmental and cellular information. On average, 72% of enzymes were painted in metabolic MANET [see Additional file 1], ranging from 6% for the monoterpenoid biosynthesis subnetwork to 100% for subnetworks such as aminoacyl-tRNA biosynthesis, reductive carboxylate cycle (CO2 fixation), and novobiocin biosynthesis. Large subnetworks such as those belonging to nucleotide, carbohydrate and amino acid mesonetworks were painted similarly to others. Interestingly, some subnetworks contained more evolutionary information. Subnetworks such as purine metabolism and pyrimidine metabolism that contain many more enzymes than others had about 83% and 79% of enzymes painted, respectively. Only 10 subnetworks (7.6%) in metabolic MANET did not have entries associated with ancestry values. These were beta-lactam resistance and clavulanic acid biosynthesis in mesonetwork "biosynthesis of secondary metabolites", 1,1,1-Trichloro-2,2-bis(4-chlorophenyl) ethane (DDT) degradation and bisphenol A degradation in "biodegradation of xenobiotics", glycosylphosphatidylinositol (GPI)-anchor biosynthesis in "glycan biosynthesis and metabolism", and biosynthesis of ansamycins, biosynthesis of siderophore group nonribosomal peptides, biosynthesis of vancomycin group antibiotics, and biosynthesis of type II polyketide products in mesonetwork "biosynthesis of polyketides and nonribosomal peptides". The efficiency of painting was not biased by subnetwork size (Fig. ).
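The painting efficiency discussed above is simply the fraction of enzymes in a subnetwork that carry an ancestry value. A minimal sketch of that bookkeeping follows; the subnetwork names, the EC-number-to-ancestry mapping, and the use of `None` for an unpainted enzyme are hypothetical choices made for illustration and do not reflect the actual MANET schema.

```python
# Hypothetical records: subnetwork -> {EC number: ancestry value, or None if unpainted}.
SUBNETWORKS = {
    "toy_subnetwork_1": {
        "1.1.1.1": 0.05,
        "2.7.1.1": 0.40,
        "4.2.1.11": None,
        "5.3.1.9": 0.10,
    },
    "toy_subnetwork_2": {"3.5.2.6": None, "2.3.1.9": None},
}


def painting_efficiency(enzymes):
    """Fraction of enzymes in one subnetwork that have an ancestry value."""
    if not enzymes:
        return 0.0
    painted = sum(1 for value in enzymes.values() if value is not None)
    return painted / len(enzymes)


efficiency = {name: painting_efficiency(e) for name, e in SUBNETWORKS.items()}

# Subnetworks with no ancestry information at all.
unpainted = [name for name, frac in efficiency.items() if frac == 0.0]
```

Sorting `efficiency` by the number of enzymes per subnetwork would give the kind of size-versus-coverage comparison the text uses to check for a size bias.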
Figure 5 Painting efficiency in metabolic subnetworks. The plot describes the total number of enzymes (black line) and the total number of painted enzymes (red line) in each of the 132 subnetworks described in KEGG, sorted according to enzyme number.
Evolutionary tracing in MANET reflects information derived from structural models present in the PDB or represents HMM-based inferences of structural classification. To test whether biases in fold superfamily predictions could affect evolutionary tracings in networks, we designed a statistical test that compared frequency distributions of ancestries derived from the join operation defined by structural models (population group A) or derived from HMM-based predictions (population group B). We selected amino acid sequences associated with enzymes that had structural PDB entries and participated in the join operation. A total of 72,354 amino acid sequences within this category were selected, and the resulting ancestry values were calculated and analyzed (Fig. ). The mean (± SE) of the ancestry value distributions was 0.277 ± 0.008 and 0.296 ± 0.006 for population groups A and B, respectively. Basic statistical parameters showed that neither ancestry frequency distribution was normally distributed, but that both had the same shape with almost the same variance (0.072 and 0.078 for groups A and B). However, measurements of skewness (1.068 and 0.990) and kurtosis (0.233 and -0.029) indicate that the distribution of group B was shifted to the right of the distribution of group A. The Wilcoxon rank sum test gave a one-tailed p-value of 0.0553, which is greater than the significance level α = 0.05 (using either the normal or the t-approximation), so we failed to reject the null hypothesis that the ancestry value distributions for groups A and B were identical. We therefore conclude that ancestry value distributions derived from structural models or HMM predictions were not significantly different at the 95% confidence level.
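For readers unfamiliar with the rank sum test, the following sketch shows the normal approximation on two small samples. It is a generic textbook formulation with midranks and a tie correction and without a continuity correction, written with the Python standard library only; the function name and the toy ancestry values are assumptions, and the code is not the statistical package used for the analysis reported here.

```python
import math
from statistics import NormalDist


def rank_sum_test(a, b):
    """Wilcoxon rank sum test (normal approximation, tie-corrected variance).

    Returns (z, p), where p is the one-sided p-value for the alternative
    that values in b tend to be larger than values in a.
    """
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    n1, n2 = len(a), len(b)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        t = j - i + 1  # size of this group of tied values
        tie_term += t**3 - t
        i = j + 1
    w_b = sum(r for r, (_, group) in zip(ranks, pooled) if group == 1)
    mean_w = n2 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w_b - mean_w) / math.sqrt(var_w)
    return z, 1 - NormalDist().cdf(z)


# Toy ancestry values (hypothetical): structural models vs. HMM predictions.
group_a = [0.05, 0.10, 0.20, 0.30, 0.60]
group_b = [0.08, 0.15, 0.25, 0.35, 0.70]
z, p = rank_sum_test(group_a, group_b)
reject_null = p < 0.05
```

With a one-tailed p-value above α, as in the analysis above, `reject_null` is false and the two distributions are treated as indistinguishable at that significance level.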
Figure 6 Box-and-whiskers plot describing global frequency distribution profiles. A. Comparison of ancestry values derived from structural models using the join operation (population group A) and predicted using HMMs (population group B) in metabolic MANET. B. (more ...)
We also tested the accuracy of the HMM prediction. We selected PDB entries corresponding to enzymes in KEGG that participated in the join operation and were classified structurally, assigned PDB sequence records downloaded from ASTRAL [52] to the PDB entries, and analyzed the PDB sequences using the HMM package at an E-value cutoff of 0.02. Superfamily IDs and structural classifications corresponding to the PDB sequences were retrieved. Out of 21,173 PDB entries corresponding to enzymes identified by the join operation, 20,941 PDB entries mapped to ASTRAL. Sequences corresponding to these PDB entries were analyzed further. The HMM-based method rejected 212 sequences, leaving a total of 20,729 PDB entries with an assigned fold structure. Of these, only 67 PDB entries differed from the expected fold assignment. At the fold superfamily level of classification, 106 PDB entries differed from the expected superfamily assignment. These results indicate that HMM-based superfamily prediction can be performed at 98% accuracy levels. The details of this analysis can be found on our website.
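The counts reported above can be turned into agreement rates directly. The text does not state which denominator underlies the 98% figure, so the sketch below computes several plausible variants; treating rejected sequences as misses in the stricter rate is an assumption about scoring, not something stated by the authors.

```python
# Counts reported in the text.
total_entries = 21173         # PDB entries identified by the join operation
mapped_to_astral = 20941      # entries with ASTRAL sequence records
rejected_by_hmm = 212         # sequences left without an HMM assignment
fold_mismatches = 67          # entries differing from the expected fold
superfamily_mismatches = 106  # entries differing from the expected superfamily

# Entries that received a structural assignment from the HMM method.
assigned = mapped_to_astral - rejected_by_hmm

# Agreement among entries that were actually assigned.
fold_agreement = (assigned - fold_mismatches) / assigned
superfamily_agreement = (assigned - superfamily_mismatches) / assigned

# Stricter variant (assumption): count rejected sequences as misses.
superfamily_agreement_strict = (assigned - superfamily_mismatches) / mapped_to_astral

for label, value in [
    ("fold", fold_agreement),
    ("superfamily", superfamily_agreement),
    ("superfamily, strict", superfamily_agreement_strict),
]:
    print(f"{label}: {100 * value:.1f}%")
```

Among assigned entries the agreement is roughly 99.7% at the fold level and 99.5% at the superfamily level; counting rejected sequences as misses lowers the superfamily figure to about 98.5%, which is in line with the accuracy quoted in the text.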
The assignment of numerical ancestry values to enzymes in cellular metabolism uncovers evolutionary patterns of architectural diversification within the metabolic network. A quick examination of ancestry distributions depicted in each subnetwork and mesonetwork diagram of the MANET database reveals that enzymes of old origin generally coexist with those of recent origin (see example subnetwork; Fig. ). A more detailed analysis of individual subnetwork paintings reveals the absence of clear patterns in individual pathways. Enzymes of old origin were generally followed haphazardly by enzymes of recent origin, and vice versa, with no apparent pattern along pathways. The patchy appearance of ancestries in subnetworks belonging to all metabolic mesonetworks strongly supports the enzyme recruitment (patchwork) evolutionary scenario as the major evolutionary force responsible for present-day metabolism. Metabolic MANET makes visually evident enzymatic recruitment patterns that have been observed previously, placing them into a relative evolutionary time frame. This offers the possibility of reconstructing timelines of recruitment episodes in subnetworks and mesonetworks. Other evolutionary alternatives (backward evolution, forward evolution, de novo invention, pathway duplication, etc.) are not readily visible in our evolutionary tracing exercise. A detailed analysis of each subnetwork will be required to reveal the incidence of these possible evolutionary mechanisms. Pathway 'take-over' mechanisms in which new enzymes replace either pre-biotic chemistries or old enzymes, and 'co-option' mechanisms in which old enzymes gain novel functions, are also possible. In this regard, we are currently evaluating possible take-over episodes in metabolic subnetworks that may result from en masse enzymatic recruitment processes occurring in subnetwork pathways. We envision that uncovering take-over patterns in MANET at global levels will require extensive information about possible pre-biotic chemistries and novel phylogenetic approaches.
The evolutionary patterns revealed by MANET have other interesting implications. If we assume that pre-biotic chemistries remain imprinted in modern metabolism as relics of the pre-biotic world, patterns of enzymatic ancestries may reveal fundamental steps in pre-biotic evolution. These evolutionary patterns may still manifest in the subnetworks despite obscuring events such as take-overs. Morowitz [53] proposed that metabolism evolved through the sequential addition of shells to an "energy amphiphile" core (shell A), which consisted of the Krebs cycle, glycolysis, and fatty acid biosynthesis. The amination of 2-ketoglutarate was the gateway to shell B, the synthesis of most amino acids. In shell C, sulfur was incorporated into cysteine and methionine. The gateways to shell D, ring closure and synthesis of nitrogen and dinitrogen heterocycles, gave access to purines, pyrimidines, and many cofactors, including B12. This scenario suggests that compounds in shell D evolved after enzymes (derived from shells B and C) and were not a part of pre-biotic chemistry. The energy amphiphile core is consistent with Wächtershäuser's proposal that life evolved on pyrite (see [5]). According to this theory of an iron-sulfur world, a reductive citric acid cycle that used thio-organic homologues evolved early and was later co-opted for oxidation. The reductive citric acid cycle, an autocatalytic network, expanded by branch reactions into higher homologous cycles. This archaic network included pathways for the synthesis and degradation of phosphorylated sugars, some amino acids (glutamate, aspartate, alanine, lysine), fatty acids and isoprenoids, coenzymes (including tetrapyrroles), and purines.
When ancestry patterns embedded in the subnetworks of MANET were analyzed, sequential evolution of metabolic "shells" was not obvious. However, pervasive enzyme recruitment could have masked the original pre-biotic evolutionary patterns. In fact, we performed a global statistical analysis of the distribution of ancestries of enzymes in metabolism, testing if global evolutionary patterns in metabolism matched possible "shell" scenarios (Fig. ). We calculated mean ancestry levels from frequency distribution patterns of ancestry data for mesonetworks, assuming these values were indicative of an average age of the enzymes examined. The statistics of distribution of ancestries in mesonetworks showed that distributions differed significantly in mean ancestry levels (p < 0.0001; ANOVA, F-test). Furthermore, the analysis revealed that amino acid mesonetworks were the oldest and lipid (including steroid) and glycan mesonetworks were relatively recent evolutionary additions (p < 0.05; Tukey-Kramer multiple comparison). The early evolutionary appearance of mesonetworks related to amino acid metabolism suggests that metabolic routes leading to the synthesis of polypeptides (shells B and C of Morowitz) 'internalized' early into the protein-based enzymatic machinery.
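The comparison of mean ancestry levels across mesonetworks rests on a one-way ANOVA. The sketch below computes the F statistic from its definition using only the Python standard library; the mesonetwork labels and ancestry values are invented toy data, and converting F into a p-value or running the Tukey-Kramer comparison would require a statistics package that is not shown here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ancestry values per mesonetwork (lower = older, by assumption).
toy_mesonetworks = {
    "amino_acid": [0.10, 0.15, 0.20],
    "lipid": [0.40, 0.45, 0.50],
    "glycan": [0.35, 0.45, 0.55],
}
f_stat, df_b, df_w = one_way_anova_f(list(toy_mesonetworks.values()))
group_means = {name: mean(vals) for name, vals in toy_mesonetworks.items()}
```

A large F relative to the F distribution with `df_b` and `df_w` degrees of freedom indicates that at least one group mean differs; the pairwise follow-up then identifies which groups, such as the amino acid mesonetworks in the text, stand apart.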
While mesonetworks may pool subnetworks of different average ancestry, complicating interpretation, our results are nevertheless consistent with the shell hypothesis of Morowitz [54]. In this regard, the early evolution of amino acid metabolic mesonetworks raises an interesting question. Why were the energy amphiphile core pre-biotic functions not the first to be replaced by enzymatic counterparts? These pre-biotic functions were the oldest and probably the most stable. One explanation is that replacement of non-enzymatic amino acid metabolic pathways followed the need to secure amino acid synthesis for protein-based enzymatic activities. It is possible that pre-biotic entities competed with each other for environmental resources during this early stage of metabolic evolution. Within this context, the opening of the gateway to amino acid synthesis proposed by Morowitz could have offered the possibility of creating enzymes that would perform pre-biotic functions more effectively.