The ever-accelerating pace at which DNA sequencing and annotation information is generated [1] is spearheading the global inventorying of metabolic functions across all kingdoms of life. Increasingly, metabolite and reaction information is organized in the form of community [2], organism, or even tissue-specific genome-scale metabolic reconstructions. These reconstructions account for reaction stoichiometry and directionality, gene-protein-reaction associations, organelle reaction localization, transporter information, transcriptional regulation and biomass composition. Already over 75 genome-scale models are in place for eukaryotic, prokaryotic and archaeal species [3], and they are becoming indispensable for computationally driving engineering interventions in microbial strains for targeted overproduction [4-7], elucidating the organizing principles of metabolism [8-11] and even pinpointing drug targets [12,13]. A key bottleneck in the reconstruction of new high-quality metabolic models is our inability to directly make use of metabolite/reaction information from biological databases [14] (e.g., BRENDA [15], KEGG [16], MetaCyc, EcoCyc, BioCyc [17], BKM-react [18], UM-BBD [19], Reactome.org, Rhea, PubChem, ChEBI, etc.) or other models [20] due to incompatibilities of representation, duplications and errors, as illustrated in Figure .
A major impediment is the presence of metabolites with multiple names across databases and models, and in some cases within the same resource, which significantly slows down the pooling of information from multiple sources. The resulting, almost unavoidable inclusion of multiple replicates of the same metabolite can lead to missed opportunities to reveal (synthetic) lethal gene deletions, repair network gaps and quantify metabolic flows. Moreover, most data sources inadvertently include some reactions that are stoichiometrically inconsistent [21] and/or elementally or charge unbalanced [22,23], which can adversely affect the prediction quality of the resulting models if used directly. Finally, a large number of metabolites in reactions are only partly specified with respect to structural information and may contain generic side groups (e.g., alkyl groups -R), a varying number of repeat units in oligomers, or even just a compound class identification such as "an amino acid" or "electron acceptor". Over 3% of all metabolites and 8% of all reactions in the aforementioned databases and models exhibit one or more of these problems.
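To make the notion of an elementally and charge balanced reaction concrete, the following is a minimal Python sketch of such a check. It is an illustration only and not the procedure used by any of the resources discussed here. The function names, the metabolite identifiers and the formulas and charges assigned to them are assumptions chosen for the example, and the formula parser handles only plain formulas without parentheses or generic groups such as -R.

```python
import re
from collections import defaultdict

def parse_formula(formula):
    """Count atoms in a plain formula such as 'C6H12O6' (no parentheses, no -R groups)."""
    counts = defaultdict(int)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)

def imbalance(reaction, metabolites):
    """Return (element imbalance, net charge imbalance) as products minus reactants.

    reaction:    {metabolite_id: integer coefficient}, negative for reactants
    metabolites: {metabolite_id: (formula, charge)}
    A balanced reaction yields ({}, 0).
    """
    totals = defaultdict(int)
    net_charge = 0
    for met_id, coeff in reaction.items():
        formula, charge = metabolites[met_id]
        for element, n in parse_formula(formula).items():
            totals[element] += coeff * n
        net_charge += coeff * charge
    return {e: n for e, n in totals.items() if n != 0}, net_charge

# Illustrative (assumed) formulas and charges for the hexokinase reaction.
METS = {
    "glc": ("C6H12O6", 0),
    "atp": ("C10H12N5O13P3", -4),
    "g6p": ("C6H11O9P", -2),
    "adp": ("C10H12N5O10P2", -3),
    "h":   ("H", 1),
}
balanced = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1, "h": 1}
missing_proton = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1}

print(imbalance(balanced, METS))        # ({}, 0)
print(imbalance(missing_proton, METS))  # ({'H': -1}, -1)
```

In this sketch, omitting the proton leaves the reaction short by one hydrogen and one unit of charge on the product side, which is the kind of error that is easy to introduce when reactions are pooled from sources that assume different protonation states.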
There have already been a number of efforts aimed at addressing some of these limitations. The Rhea database, hosted by the European Bioinformatics Institute, aggregates reaction data primarily from IntEnz [24] and ENZYME [25], whereas Reactome.org is a collection of reactions primarily focused on human metabolism [26,27]. Even though they crosslink their data to one or more popular databases such as KEGG, ChEBI, NCBI, Ensembl, UniProt, etc., both retain their own representation formats. More recently, the BKM-react database was introduced as a non-redundant biochemical reaction database containing known enzyme-catalyzed reactions compiled from BRENDA, KEGG, and MetaCyc [18]; it currently contains 20,358 reactions. Additionally, the contents of five frequently used human metabolic pathway databases have been compared [28]. An important step forward for models was the BiGG database, which includes seven genome-scale models from the Palsson group in a consistent nomenclature and exportable in SBML format [29-31]. Research towards integrating genome-scale metabolic models with large databases has so far been even more limited. Notable exceptions include the partial reconciliation of the latest E. coli genome-scale model iAF1260 with EcoCyc [32] and the aggregation of data from the Arabidopsis thaliana database and KEGG for generating genome-scale models [33] in a semi-automated fashion. Additionally, ReMatch integrates some metabolic models, although its primary focus is on carbon mappings for metabolic flux analysis [34]. Also, many metabolic models retain the KEGG identifiers of metabolites and reactions extracted during their construction [35,36]. An important recent development is the web resource Model SEED, which can generate draft genome-scale metabolic models drawing from an internal database that integrates KEGG with 13 genome-scale models (including six of the models in the BiGG database) [37]. All of the reactions in Model SEED and BiGG are charge and elementally balanced.
In this paper, we describe the development and highlight applications of the web-based resource MetRxn, which integrates, using internally consistent descriptions, metabolite and reaction information from 8 databases and 44 metabolic models. The MetRxn knowledgebase (as of October 2011) contains over 76,000 metabolites and 72,000 reactions (including unresolved entries) that are charge and elementally balanced. By conforming to standardized metabolite and reaction descriptions, MetRxn enables users to efficiently perform queries and comparisons across models and/or databases. For example, lists of metabolites and/or reactions common to models and databases can be rapidly generated, along with connected paths that link source to target metabolites. MetRxn supports export of models in SBML format. New models are being added as they are published or made available to us. MetRxn is available as a web-based resource at http://metrxn.che.psu.edu.
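The two kinds of queries mentioned above, finding metabolites shared between models and finding a connected path from a source to a target metabolite, can be sketched in a few lines of Python once all entries share standardized identifiers. This is an illustrative sketch under stated assumptions and not the MetRxn implementation: the function names, the toy models and their identifiers are hypothetical, and the path search is a plain breadth-first search over substrate-to-product links.

```python
from collections import deque

def metabolites_of(model):
    """All metabolite identifiers appearing in a model.

    model: {reaction_id: (set of substrates, set of products)}
    """
    found = set()
    for substrates, products in model.values():
        found |= substrates | products
    return found

def shared_metabolites(model_a, model_b):
    """Metabolites present in both models (meaningful only with standardized identifiers)."""
    return metabolites_of(model_a) & metabolites_of(model_b)

def find_path(model, source, target):
    """Breadth-first search for a shortest chain of reactions linking source to target.

    Returns a list of reaction identifiers, or None if target is unreachable.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        metabolite, path = queue.popleft()
        if metabolite == target:
            return path
        for rxn_id, (substrates, products) in model.items():
            if metabolite in substrates:
                for product in products:
                    if product not in seen:
                        seen.add(product)
                        queue.append((product, path + [rxn_id]))
    return None

# Hypothetical toy models using shared identifiers.
model_a = {
    "HEX1": ({"glc", "atp"}, {"g6p", "adp"}),
    "PGI":  ({"g6p"}, {"f6p"}),
    "PFK":  ({"f6p", "atp"}, {"fdp", "adp"}),
}
model_b = {
    "PGI":   ({"g6p"}, {"f6p"}),
    "G6PDH": ({"g6p", "nadp"}, {"6pgl", "nadph"}),
}

print(sorted(shared_metabolites(model_a, model_b)))  # ['f6p', 'g6p']
print(find_path(model_a, "glc", "fdp"))              # ['HEX1', 'PGI', 'PFK']
```

If the same compound appeared as "g6p" in one model and under a different name in the other, the intersection would silently miss it, which is why the comparison depends on a consistent nomenclature. A practical path search would also need to treat highly connected cofactors such as ATP specially, since they can create biologically meaningless shortcuts.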