|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: ADG JAB RS. Performed the experiments: ADG. Analyzed the data: ADG JAB RS. Contributed reagents/materials/analysis tools: ADG RS. Wrote the paper: ADG JAB RS.
We introduce the concept of metaconsensus and employ it to make high confidence predictions of early enzyme functions and the metabolic properties that they may have produced. Several independent studies have used comparative bioinformatics methods to identify taxonomically broad features of genomic sequence data, protein structure data, and metabolic pathway data in order to predict physiological features that were present in early, ancestral life forms. But all such methods carry with them some level of technical bias. Here, we cross-reference the results of these previous studies to determine enzyme functions predicted to be ancient by multiple methods. We survey modern metabolic pathways to identify those that maintain the highest frequency of metaconsensus enzymes. Using the full set of modern reactions catalyzed by these metaconsensus enzyme functions, we reconstruct a representative metabolic network that may reflect the core metabolism of early life forms. Our results show that ten enzyme functions, four hydrolases, three transferases, one oxidoreductase, one lyase, and one ligase, are determined by metaconsensus to be present at least as late as the last universal common ancestor. Subnetworks within central metabolic processes related to sugar and starch metabolism, amino acid biosynthesis, phospholipid metabolism, and CoA biosynthesis, have high frequencies of these enzyme functions. We demonstrate that a large metabolic network can be generated from this small number of enzyme functions.
The expansion of genomic and biomolecular data has led to large repositories of genetic sequences , , protein structures , , protein functions , , , networks of metabolic reactions , , and networks of molecular interactions . By comparing these features across the universal tree of known life, it is possible to reveal traits that are common to a broad range of organisms and, therefore, are likely to have been present in the Last Universal Common Ancestor (LUCA) and its predecessors . In this study, we focus on three such approaches with distinctly different methodologies.
Harris et al.  performed a universal sequence comparison by identifying Clusters of Orthologous Groups of genes (COGs)  that were present in all completely sequenced genomes. Of the 3100 groups catalogued in the COG database, 80 are common across all of the genomes available at the time. Because any evolutionary pressure must act primarily on sequence, the appearance of these COGs across all sequenced genomes can be directly attributed to the presence of related genes in the genome of LUCA. Early horizontal gene transfers may make a COG look more ancient by this method, while gene losses, even recent ones, will make a COG look less ancient.
The work of the Caetano-Anollés group – has provided a second type of comparative bioinformatics analysis pertaining to protein structure. Protein structure may be the most highly conserved feature of molecular evolution ,  because the development of a new protein structure is unlikely , , whereas the repurposing of an existing protein structure is simple by comparison , . Their method begins with a survey of the distribution of protein fold architectures across all available sequenced genomes. These distributions are then used to create a phylogeny of all protein fold architectures. This method assumes that both broad taxonomic distribution and genomic recurrence are quantitatively related to ancestry. Wang et al.,  also identified a specific branch node that represents the divergence of LUCA into three domains of life. In their phylogeny, 165 protein fold architectures are deeper branching than the divergence of LUCA and are, thus, considered to have been present in the proteome of LUCA. Because fold architecture is more highly conserved than gene sequence, this phylogeny may represent the deepest view of ancient evolution to date. On the other hand, it is not certain that every fold architecture is the result of a single evolutionary origin .
Another substantially different comparative bioinformatics analysis was performed by Srinivasan and Morowitz  in which metabolic reactions were surveyed across organisms without any reliance on enzyme sequence or structure data. To do so, the authors superimposed entries stored in the Kyoto Encyclopedia of Genes and Genomes reactions database (KEGG)  for the five autotrophic organisms, four bacteria and one archaean, with extensive available metabolic network data. Two hundred eighty-six reactions were identified as common among all five organisms and were thus assumed to have been present in LUCA. It should be noted that the authors assumed an autotrophic origin of life and that this analysis was limited accordingly to data from only autotrophic organisms. The function of a particular protein family is perhaps the least conserved feature of molecular evolution , , . However, a protein may evolve a new enzymatic function and replace an unrelated protein with the same function through the process of non-orthologous gene displacement . Thus the conservation of the presence of an enzymatic function may be high, even if the protein family that imparts that function changes during the course of evolution.
While the gene content of modern organisms provides strong evidence for a common ancestor ,  it is not clear that the root of the tree of life represents a single organism or a community of organisms exhibiting rampant horizontal gene transfer , . Furthermore, it is not clear what effects early horizontal gene transfer, non-orthologous gene displacement, or other mechanisms of gene gain and loss have on comparative bioinformatics analyses such as those outlined above. Gene sequence, protein structure, and enzyme function represent separate, but related, features of cellular biology and, consequently, respond differently to evolutionary selection pressures . Each of these methods should not only reveal different features of LUCA, but should also do so with a unique degree of confidence and sensitivity. Here we compare the results of these three studies in order to make high confidence predictions through “metaconsensus” (i.e. a consensus between these independent consensus methods). The results are used to determine a higher confidence minimal catalytic repertoire of LUCA and to extrapolate the metabolic properties produced by these catalytic functions.
In order to compare the three aforementioned datasets to one another, the contents of each were converted to 3-letter Enzyme Commission (EC) codes (See methods; Data S1). We henceforth refer to the converted datasets as the universal sequence (Harris et al. ), universal structure (Kim et al. ) and universal functions (Srinivasan and Morowitz ). The universal sequence dataset contains 12 EC groups, the universal structure dataset contains 155 EC groups, and the universal reaction dataset contains 53 EC groups.
Six enzyme functions are found in all three datasets (Figure 1). Four additional enzyme functions are found in the universal sequence and universal structure datasets, but not the universal reaction dataset. Note that the universal reaction dataset was derived solely from autotrophic organisms and thus its relevance to LUCA may be dependent on LUCA having been autotrophic. We consider these four additional enzyme functions along with the six metaconsensus enzyme functions in the following analyses. Three of the six EC groups found in all three datasets are transferases. The other three are an oxidoreductase, a lyase, and a ligase. The four EC groups found in both the universal sequence and universal structure datasets, but not in the universal reaction dataset, are all hydrolases. Srinivasan and Morowitz's emphasis on autotrophy may have excluded these hydrolase functions, which are catabolic by definition. Details of these enzyme functions are presented in Table 1.
We corroborate the antiquity of these metaconsensus enzyme functions by the association with both metal and nucleotide derived cofactors . It is thought that many catalytic mechanisms of metalloenzymes may have originated during the transition from prebiotic chemistry to genetically directed metabolism –. Early metalloproteins likely tended to use structures that were ambiguous rather than specific in the metals that they bound , . Similarly, nucleotide derived cofactors are believed to reflect a preceding state in which the same reaction was catalyzed by a ribozyme , .
According to Uniprot annotations , all of the ten conserved EC groups have metal cofactors coupled to the catalytic mechanisms of enzymes within the group. These results may reflect previous work implicating iron-sulfur clusters ,  and zinc ,  as having important catalytic functions in prebiotic synthesis reactions, as well as an important early role for divalent cations such as Mg2+. Only three of the ten conserved EC groups also use nucleotide-derived cofactors, although it should be noted that uses of nucleotides and nucleotide-derived coenzymes like CoA as substrates was excluded from this analysis.
Ancestry values  were used to identify the most ancient folds defined by SCOP  associated with each metaconsensus enzyme function. The ancestry value, developed by the Caetano-Anolles group , is a scale of relative evolutionary age where 0% is the most ancient fold architecture and 100% is the most recent fold architecture. The establishment of the DNA genome is constrained at 19% ancestry by the appearance of folds that catalyze the critical ribonucleotide reductase step in deoxyribonucleotide synthesis , . The evolutionary divergence of LUCA can be constrained at 40% ancestry by the first appearance of a fold found exclusively in a single domain of life .
Figure 2 shows that many ancient folds are associated with these ten metaconsensus enzyme functions. All ten metaconsensus enzyme functions can be catalyzed by the triosephosphate isomerase (TIM) beta/alpha barrel (SCOP ID=c.1; ancestry=1.9%). The TIM beta/alpha barrel is a very versatile fold architecture that is able to catalyze a number of disparate enzyme functions . Other ancestral catalytic folds common among these EC groups include the Ferredoxin-like fold (SCOP ID=d.58, ancestry=1.3%), the Ribonuclease H-like motif (SCOP ID=c.55; ancestry=3.7%), the S-adenosyl-L-methionine-dependent methyltransferase fold (SCOP ID=c.66; ancestry=5.0%), the Adenine nucleotide alpha hydrolase-like fold (SCOP ID=c.26; ancestry=5.7%), the UDP-glycosyltransferase/glycogen phosphorylase fold (SCOP ID=c.87; ancestry=11.9%), and the Globin-like fold (SCOP ID=a.1; ancestry=18.8%).
Most proteins are composed of more than one fold within a single peptide chain. In many cases, some ancient folds associated with an EC group are not catalytic domains, themselves. By ancestry value, the P-loop containing nucleoside triphosphate fold (SCOP ID=c.37; ancestry=0.0%) is the most ancient fold associated with any metaconsensus enzyme function, but this fold catalyzes NTP hydrolysis or NDP phosphorylation that is coupled to the enzyme rather than the specific catalytic function of the enzyme. Other ancient folds associated with metaconsensus enzyme functions that do not confer specific catalysis include the DNA/RNA-binding 3-helical bundle (SCOP ID=a.4, ancestry 0.6%), the NAD(P)-binding Rossmann fold (SCOP ID=c.2, ancestry=2.5%), and the Oligonucleotide/oligosaccharide binding (OB) fold (SCOP ID=b.40; ancestry=4.4%), to name a few. In combination with the ancestral catalytic folds, these folds probably allowed ancient enzymes to bind a range of substrates and cofactors and couple reactions to NTP hydrolysis. The combinatorial power of these ancient folds could conceivably produce a robust metabolism in LUCA. The prevalence of nucleotide and nucleic acid binding folds associated with metaconsensus enzymes suggests a strong connection to a preceding RNA world scenario.
Having identified these metaconsensus EC groups, we were able to evaluate the antiquity of modern metabolic pathways based on the presence of these conserved enzyme functions. We performed a survey of enzymes across all metabolic pathways defined by KEGG  (Data S2 and Data S3). The percentages of metaconsensus EC groups within each pathway is presented in Figure 3 and the metaconsensus enzyme functions are highlighted on KEGG pathway images in Figure S1. This analysis reveals a list of conserved metabolic pathways that carry out the synthesis and degradation of important biomolecules ranging in size from CoA and simple sugars to N-glycans and sphingolipids.
The high frequency of conserved EC groups in sphingolipid metabolism is interesting because sphingolipids are found in eukaryotes and some bacteria, but are not taxonomically universal , . This result may reflect a conserved core of phospholipid metabolism that sphingolipid metabolism happens to resemble. Within the sphingolipid metabolism pathway, metaconsensus enzymes are most closely associated with the conversion of ceramides to sphingosines (Figure S1). The high frequency of conserved EC groups that carry out pantothenate and CoA biosynthesis reflects the universality of CoA in the central metabolisms of organisms. It has been proposed that acetyl CoA was a key constituent of prebiotic synthesis –. An alternate, but not mutually exclusive, explanation is that CoA, as a nucleotide-derived cofactor, is a remnant of ribozyme catalyzed reactions that preceded modern metabolism . The appearance of drug metabolism is curious, although most metaconsensus enzyme functions in this category are involved in fluorouracil metabolism, and may reflect a propensity toward nucleobase chemistry in general (Figure S1). It is important to note that all of these pathways are modern and that their use of metaconsensus enzymes probably does not identify their current configurations as ancient, but rather, illustrates the general metabolic priorities of LUCA.
To extend our analysis beyond the constraints of modern pathways, we reconstructed a representative ancient metabolism using only the metaconsensus enzyme functions. These ten enzyme functions perform nearly three hundred reactions as defined in the KEGG reactions database  (Data S4). After generating networks based on these reactions, we identified a relatively large network composed of 119 nodes and 135 edges (Figure 4). The major hub nodes (defined here as nodes connected to five or more edges) include nucleotide triphosphate, glucose, sucrose, starch, glutamate, alanine, glutathione, and UDP-acetylmuramoyl peptide. These hub nodes are circumscribed by subnetworks related to NTP phosphorylation/dephosphorylation and RNA synthesis/degradation, sugar synthesis and starch polymerization/degradation, amino acid synthesis/interconversion, phospholipid synthesis/degradation, and enzymatic peptide synthesis/degradation.
This reconstructed ancient metabolism includes component subnetworks spanning the breadth of modern central metabolism, from monomer synthesis and interconversion, to the synthesis of RNA, proteins, starches, and phospholipids. Even though it is thought that LUCA had a DNA genome , this network does not include DNA synthesis. DNA polymerase is included in the general metaconsensus enzyme function (EC 2.7.7.-), but the reduction of ribonucleotides to form deoxyribonucleotides (EC 220.127.116.11) is not a metaconsensus function. The ribonucleotide reductase enzyme is thought to have been the limiting enzyme function in establishing a DNA genome  during the development of LUCA from the RNA-protein world. DNA replication does not exhibit a conserved universal core of enzymes, although excision repair DNA polymerases ,  and some RNA polymerase catalytic domains ,  are universally distributed across the tree of life.
It is also curious that the reconstructed metabolism includes enzymatic peptide synthesis. We have previously shown that the translation system reached a modern level of sophistication during the RNA-protein stage of early evolution . Perhaps enzymatic peptide synthesis supplemented this system as new amino acids became available, but before these amino acids were incorporated into the translation system. Alternatively, these enzyme functions may have been involved in the production of peptide-derived biomolecules as they often are now. Taken together, the presence of ancient nucleic acid/nucleotide binding folds in metaconsensus enzymes, the absence of DNA synthesis from the reconstructed metabolic network, and the presence of enzyme functions related to peptide synthesis in the reconstructed metabolic network, indicate that our metaconsensus method is revealing trends more ancient than the divergence of LUCA, perhaps perhaps from the time of the development of the RNA-protein system. Previous bioinformatics-based studies have also found a propensity of ancient enzymes that corroborate an RNA-protein stage of early life , .
A substantial and general conclusion from this work is that a large metabolic network can be produced with only these ten metaconsensus enzyme functions. This observation demonstrates the strength of the patchwork model of primitive metabolism in which a small number of functionally ambiguous enzymes can generate the groundwork for a complex metabolism , . Furthermore, the enzyme functions, themselves, can be produced by variants of one protein fold that imparts the direct enzymatic function and several auxiliary catalytic and noncatalytic protein folds, all of which appeared early in the development of genetically encoded proteins. Thus, whatever the catalytic repertoire of early life, we assert that a relatively complex metabolism can be produced from a small number of enzymatic functions.
The dataset of universal COGs was copied directly from Table 1 of Harris et al. . The dataset of universal protein structures was downloaded from the MANET database (http://www.manet.uiuc.edu/) . The dataset of universal metabolic reactions was downloaded from the supplemental online information of Srinivasan and Morowitz . The universal structure dataset uses folds from the MANET database with ancestry values ≤0.399, which are considered to have been present in LUCA . EC codes for COGs were identified by searching annotations on the COG database (http://www.ncbi.nlm.nih.gov/COG/) . EC codes for protein fold architectures were extracted from the MANET data file. EC codes for metabolic reactions were identified by searching the KEGG database (http://www.genome.jp/kegg/) . A protein was not included in this analysis if no EC code was available, either because the protein is non-enzymatic or it is an enzyme that has not yet been incorporated into the EC system. Thus, our analysis is limited to enzyme-mediated metabolism and is less likely to incorporate recently discovered enzyme functions.
In the resulting datasets, the final (fourth) term of each EC code was deleted in order to remove a layer of specificity from the conserved enzyme functions. This final term is defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) as the “serial number of the enzyme in its sub-subclass“. For example the sub-subclass of EC 1.2.3.- is [Oxidoreductase. Acting on the aldehyde or oxo group of donors. With oxygen as acceptor]. The serial number 18.104.22.168 designates oxalate oxidase, while the serial number 22.214.171.124 designates glyoxylate oxidase, a very a similar reaction. Our generalization of enzyme functions is consistent with the patchwork hypothesis of early metabolic pathway evolution , .
These sets of conserved enzyme functions were made nonredundant by removing all but one instance of repeated 3-term EC codes. The ancestry of protein fold architectures associated with these ten conserved enzymes were extracted from the MANET data file . These folds were identified as catalytic or associated noncatalytic folds by their SCOP database functional annotations . Metal and nucleotide derived cofactors associated with each conserved enzyme function were identified by surveying Uniprot annotations  for proteins with metaconsensus EC classifications.
Lists of metabolic pathways and their associated enzymes were downloaded from the KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/). These pathway lists were generalized to 3-term EC codes and made nonredundant in the same manner described previously for metaconsensus enzyme datasets. Percentages of conserved enzymes were calculated with respect to the total number of conserved enzymes and the total number of enzyme groups within each pathway to produce Figure 3. The same list of metabolic reactions was used to reconstruct a network of reactions imparted by only the metaconsensus enzyme functions. Small molecules such as H2O and CO2 were removed from the reactions as these would impose false connectivity between reactants and products. Similarly, the conversion of ATP to ADP was removed from reactions when its presence was only attributable to energetic coupling and did not otherwise contribute to the reaction product. Networks were generated with reactants and products as nodes and the enzyme functions that catalyze them as edges. The resulting networks were visualized using Osprey . The 119-node network presented in Figure 4 was identified visually. Nearly all other networks were 3 to 5 nodes in size. The 119-node network was hand annotated and relabeled for clarity using graphics editors.
A text file containing nonredundant lists of generalized enzyme functions (by Enzyme Commission code) that are predicted to be ancient by previous comparative bioinformatics studies.
A text file containing nonredundant lists of generalized enzyme functions (by Enzyme Commission code) in metabolic pathways defined by the KEGG database.
A text file listing percentage overlap between the metaconsensus enzyme set and metabolic pathways defined by the KEGG database.
A text file listing all reactions stored in the KEGG database that are catalyzed by metaconsensus enzyme functions.
Thanks to the University of Washington Origin of Life Research Consortium and members of the Samudrala lab for useful discussion.