Related Articles

1.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases 
Nucleic Acids Research  2011;40(Database issue):D742-D753.
The MetaCyc database provides a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains more than 1800 pathways derived from more than 30 000 publications, and is the largest curated collection of metabolic pathways currently available. Most reactions in MetaCyc pathways are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes and literature citations. BioCyc is a collection of more than 1700 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference database, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs contain additional features, including predicted operons, transport systems and pathway-hole fillers. The BioCyc website and Pathway Tools software offer many tools for querying and analysis of PGDBs, including Omics Viewers and comparative analysis. New developments include a zoomable web interface for diagrams; flux-balance analysis model generation from PGDBs; web services; and a new tool called Web Groups.
PMCID: PMC3245006  PMID: 22102576
2.  The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases 
Nucleic Acids Research  2007;36(Database issue):D623-D631.
MetaCyc is a universal database of metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are curated from the primary scientific literature, and are experimentally determined small-molecule metabolic pathways. Each reaction in a MetaCyc pathway is annotated with one or more well-characterized enzymes. Because MetaCyc contains only experimentally elucidated knowledge, it provides a uniquely high-quality resource for metabolic pathways and enzymes. BioCyc is a collection of more than 350 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the predicted metabolic network of one organism, including metabolic pathways, enzymes, metabolites and reactions predicted by the Pathway Tools software using MetaCyc as a reference database. BioCyc PGDBs also contain predicted operons and predicted pathway hole fillers (predictions of which enzymes may catalyze pathway reactions that have not been assigned to an enzyme). The BioCyc website offers many tools for computational analysis of PGDBs, including comparative analysis and analysis of omics data in a pathway context. The BioCyc PGDBs generated by SRI are offered for adoption by any interested party for the ongoing integration of metabolic and genome-related information about an organism.
PMCID: PMC2238876  PMID: 17965431
3.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases 
Nucleic Acids Research  2009;38(Database issue):D473-D479.
The MetaCyc database is a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. With more than 1400 pathways, MetaCyc is the largest collection of metabolic pathways currently available. Pathway reactions are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes, and literature citations. BioCyc is a collection of more than 500 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs also contain additional features, such as predicted operons, transport systems, and pathway hole-fillers. The BioCyc Web site offers several tools for the analysis of the PGDBs, including Omics Viewers that enable visualization of omics datasets on two different genome-scale diagrams and tools for comparative analysis. The BioCyc PGDBs generated by SRI are offered for adoption by any party interested in curation of metabolic, regulatory, and genome-related information about an organism.
PMCID: PMC2808959  PMID: 19850718
4.  Expansion of the BioCyc collection of pathway/genome databases to 160 genomes 
Nucleic Acids Research  2005;33(19):6083-6089.
The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics Viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.
PMCID: PMC1266070  PMID: 16246909
5.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases 
Nucleic Acids Research  2013;42(Database issue):D459-D471.
The MetaCyc database is a comprehensive and freely accessible database describing metabolic pathways and enzymes from all domains of life. MetaCyc pathways are experimentally determined, mostly small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains >2100 pathways derived from >37 000 publications, and is the largest curated collection of metabolic pathways currently available. BioCyc is a collection of >3000 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems and pathway-hole fillers. Additions to BioCyc over the past 2 years include YeastCyc, a PGDB for Saccharomyces cerevisiae, and 891 new genomes from the Human Microbiome Project. The BioCyc Web site offers a variety of tools for querying and analysis of PGDBs, including Omics Viewers and tools for comparative analysis. New developments include atom mappings in reactions, a new representation of glycan degradation pathways, improved compound structure display, better coverage of enzyme kinetic data, enhancements of the Web Groups functionality, improvements to the Omics viewers, a new representation of the Enzyme Commission system and, for the desktop version of the software, the ability to save display states.
PMCID: PMC3964957  PMID: 24225315
6.  MetaCyc: a multiorganism database of metabolic pathways and enzymes 
Nucleic Acids Research  2005;34(Database issue):D511-D516.
MetaCyc is a database of metabolic pathways and enzymes. Its goal is to serve as a metabolic encyclopedia, containing a collection of non-redundant pathways central to small molecule metabolism, which have been reported in the experimental literature. Most of the pathways in MetaCyc occur in microorganisms and plants, although animal pathways are also represented. MetaCyc contains metabolic pathways, enzymatic reactions, enzymes, chemical compounds, genes and review-level comments. Enzyme information includes substrate specificity, kinetic properties, activators, inhibitors, cofactor requirements and links to sequence and structure databases. Data are curated from the primary literature by curators with expertise in biochemistry and molecular biology. MetaCyc serves as a readily accessible comprehensive resource on microbial and plant pathways for genome analysis, basic research, education, metabolic engineering and systems biology. Querying, visualization and curation of the database are supported by SRI's Pathway Tools software. The PathoLogic component of Pathway Tools is used in conjunction with MetaCyc to predict the metabolic network of an organism from its annotated genome. SRI and the European Bioinformatics Institute employed this tool to create pathway/genome databases (PGDBs) for 165 organisms, available at the website. These PGDBs also include predicted operons and pathway hole fillers.
PMCID: PMC1347490  PMID: 16381923
7.  Metabolic pathways for the whole community 
BMC Genomics  2014;15(1):619.
A convergence of high-throughput sequencing and computational power is transforming biology into information science. Despite these technological advances, converting bits and bytes of sequence information into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent properties, diagnose system states, or predict responses to environmental change.
Here we adopt the MetaPathways annotation and analysis pipeline and Pathway Tools to construct environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a highly curated database of metabolic pathways and components covering all domains of life. We evaluate Pathway Tools’ performance on three datasets with different complexity and coding potential, including simulated metagenomes, a symbiotic system, and the Hawaii Ocean Time-series. We define accuracy and sensitivity relationships between read length, coverage and pathway recovery and evaluate the impact of taxonomic pruning on ePGDB construction and interpretation. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated with biosynthesis and energy production and differentiate between genomic potential and phenotypic expression across defined environmental gradients.
This multi-tiered analysis provides the user community with specific operating guidelines, performance metrics and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-619) contains supplementary material, which is available to authorized users.
PMCID: PMC4137073  PMID: 25048541
8.  Enhancing a Pathway-Genome Database (PGDB) to capture subcellular localization of metabolites and enzymes: the nucleotide-sugar biosynthetic pathways of Populus trichocarpa 
Understanding how cellular metabolism works and is regulated requires that the underlying biochemical pathways be adequately represented and integrated with large metabolomic data sets to establish a robust network model. Genetically engineering energy crops to be less recalcitrant to saccharification requires detailed knowledge of plant polysaccharide structures and a thorough understanding of the metabolic pathways involved in forming and regulating cell-wall synthesis. Nucleotide-sugars are building blocks for synthesis of cell wall polysaccharides. The biosynthesis of nucleotide-sugars is catalyzed by a multitude of enzymes that reside in different subcellular organelles, and precise representation of these pathways requires accurate capture of this biological compartmentalization. The lack of simple localization cues in genomic sequence data and annotations, however, leads to missing compartmentalization information for eukaryotes in automatically generated databases, such as the Pathway-Genome Databases (PGDBs) of the SRI Pathway Tools software that drives much biochemical knowledge representation on the internet. In this report, we provide an informal mechanism using the existing Pathway Tools framework to integrate protein and metabolite sub-cellular localization data with the existing representation of the nucleotide-sugar metabolic pathways in a prototype PGDB for Populus trichocarpa. The enhanced pathway representations have been successfully used to map SNP abundance data to individual nucleotide-sugar biosynthetic genes in the PGDB. The manually curated pathway representations are more conducive to the construction of a computational platform that will allow the simulation of natural and engineered nucleotide-sugar precursor fluxes into specific recalcitrant polysaccharide(s).
Database: The curated Populus PGDB is available in the BESC public portal, and the nucleotide-sugar biosynthetic pathways can be accessed directly.
PMCID: PMC3316911  PMID: 22465851
9.  MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information 
BMC Bioinformatics  2013;14:202.
A central challenge to understanding the ecological and biogeochemical roles of microorganisms in natural and human engineered ecosystems is the reconstruction of metabolic interaction networks from environmental sequence information. The dominant paradigm in metabolic reconstruction is to assign functional annotations using BLAST. Functional annotations are then projected onto symbolic representations of metabolism in the form of KEGG pathways or SEED subsystems.
Here we present MetaPathways, an open source pipeline for pathway inference that uses the PathoLogic algorithm to map functional annotations onto the MetaCyc collection of reactions and pathways, and construct environmental Pathway/Genome Databases (ePGDBs) compatible with the editing and navigation features of Pathway Tools. The pipeline accepts assembled or unassembled nucleotide sequences, performs quality assessment and control, predicts and annotates noncoding genes and open reading frames, and produces inputs to PathoLogic. In addition to constructing ePGDBs, MetaPathways uses MLTreeMap to build phylogenetic trees for selected taxonomic anchor and functional gene markers, converts General Feature Format (GFF) files into concatenated GenBank files for ePGDB construction based on third-party annotations, and generates useful file formats including Sequin files for direct GenBank submission and gene feature tables summarizing annotations, MLTreeMap trees, and ePGDB pathway coverage summaries for statistical comparisons.
MetaPathways provides users with a modular annotation and analysis pipeline for predicting metabolic interaction networks from environmental sequence information using an alternative to KEGG pathways and SEED subsystems mapping. It is extensible to genomic and transcriptomic datasets from a wide range of sequencing platforms, and generates useful data products for microbial community structure and function analysis. The MetaPathways software package, installation instructions, and example data are available online.
PMCID: PMC3695837  PMID: 23800136
Environmental pathway/Genome Database (ePGDB); Metagenome; Pathway tools; PathoLogic; MetaCyc; Microbial community; Metabolism; Metabolic interaction networks
10.  Data Mining in the MetaCyc Family of Pathway Databases 
Pathway databases collect the bioreactions and molecular interactions that define the processes of life. The MetaCyc family of pathway databases consists of thousands of databases that were derived through computational inference of metabolic pathways from the MetaCyc Pathway/Genome Database (PGDB). In some cases these DBs underwent subsequent manual curation. Curated pathway DBs are now available for most of the major model organisms. Databases in the MetaCyc family are managed using the Pathway Tools software. This chapter presents methods for performing data mining on the MetaCyc family of pathway DBs. We discuss the major data access mechanisms for the family, which include data files in multiple formats; application programming interfaces (APIs) for the Lisp, Java, and Perl languages; and web services. We present an overview of the Pathway Tools schema, an understanding of which is needed to query the DBs. The chapter also presents several interactive data mining tools within Pathway Tools for performing omics data analysis.
PMCID: PMC3694719  PMID: 23192547
Metabolic pathways; pathway databases; systems biology
11.  PGDB: a curated and integrated database of genes related to the prostate 
Nucleic Acids Research  2003;31(1):291-293.
The Prostate Gene Database (PGDB) is a curated and integrated database of genes or genomic loci related to the human prostate and prostatic diseases. Currently, PGDB covers genes involved in a number of molecular and genetic events of the prostate including gene amplification, mutation, gross deletion, methylation, polymorphism, linkage and over-expression, as published in the literature. Genes that are specifically expressed in prostate, as evidenced by analysis of data from expressed sequence tags (ESTs) and serial analysis of gene expression (SAGE), are also included. There are a total of 165 unique entries in the database. Users can either browse or query the PGDB through a web interface. For each gene, in addition to basic gene information and rich cross-references to other databases, inclusive and relevant literature references are provided to support the inclusion of the gene in the database. Detailed expression data calculated from the UniGene and SAGEmap databases are also presented.
PMCID: PMC165455  PMID: 12520005
12.  LeishCyc: a biochemical pathways database for Leishmania major 
BMC Systems Biology  2009;3:57.
Leishmania spp. are sandfly-transmitted protozoan parasites that cause a spectrum of diseases in more than 12 million people worldwide. Much research is now focusing on how these parasites adapt to the distinct nutrient environments they encounter in the digestive tract of the sandfly vector and the phagolysosome compartment of mammalian macrophages. While data mining and annotation of the genomes of three Leishmania species have provided an initial inventory of predicted metabolic components and associated pathways, resources for integrating this information into metabolic networks and incorporating data from transcript, protein, and metabolite profiling studies are currently lacking. The development of a reliable, expertly curated, and widely available model of Leishmania metabolic networks is required to facilitate systems analysis, as well as discovery and prioritization of new drug targets for this important human pathogen.
The LeishCyc database was initially built from the genome sequence of Leishmania major (v5.2), based on the annotation published by the Wellcome Trust Sanger Institute. LeishCyc was manually curated to remove errors, correct automated predictions, and add information from the literature. The ongoing curation is based on public sources, literature searches, and our own experimental and bioinformatics studies. In a number of instances we have improved on the original genome annotation, and, in some ambiguous cases, collected relevant information from the literature in order to help clarify gene or protein annotation in the future. All genes in LeishCyc are linked to the corresponding entry in GeneDB (Wellcome Trust Sanger Institute).
The LeishCyc database describes Leishmania major genes, gene products, metabolites, their relationships and biochemical organization into metabolic pathways. LeishCyc provides a systematic approach to organizing the evolving information about Leishmania biochemical networks and is a tool for analysis, interpretation, and visualization of Leishmania Omics data (transcriptomics, proteomics, metabolomics) in the context of metabolic pathways. LeishCyc is the first such database for the Trypanosomatidae family, which includes a number of other important human parasites. Flexible query/visualization capabilities are provided by the Pathway Tools software and its Web interface. The LeishCyc database is made freely available over the Internet.
PMCID: PMC2700086  PMID: 19497128
13.  EcoCyc: A comprehensive view of Escherichia coli biology 
Nucleic Acids Research  2008;37(Database issue):D464-D470.
EcoCyc provides a comprehensive encyclopedia of Escherichia coli biology. EcoCyc integrates information about the genome, genes and gene products; the metabolic network; and the regulatory network of E. coli. Recent EcoCyc developments include a new initiative to represent and curate all types of E. coli regulatory processes such as attenuation and regulation by small RNAs. EcoCyc has started to curate Gene Ontology (GO) terms for E. coli and has made a dataset of E. coli GO terms available through the GO Web site. The curation and visualization of electron transfer processes have been significantly improved. Other software and Web site enhancements include the addition of tracks to the EcoCyc genome browser, in particular a type of track designed for the display of ChIP-chip datasets, and the development of a comparative genome browser. A new Genome Omics Viewer enables users to paint omics datasets onto the full E. coli genome for analysis. A new advanced query page guides users in interactively constructing complex database queries against EcoCyc. A Macintosh version of EcoCyc is now available. A series of Webinars is available to instruct users in the use of EcoCyc.
PMCID: PMC2686493  PMID: 18974181
14.  CycADS: an annotation database system to ease the development and update of BioCyc databases 
In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and update of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including, for example, KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum) whose genome was recently sequenced. For comparative analyses, the AcypiCyc database webpage also includes two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Owing to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms.
PMCID: PMC3072769  PMID: 21474551
15.  MetaCyc: a multiorganism database of metabolic pathways and enzymes 
Nucleic Acids Research  2004;32(Database issue):D438-D442.
The MetaCyc database is a collection of metabolic pathways and enzymes from a wide variety of organisms, primarily microorganisms and plants. The goal of MetaCyc is to contain a representative sample of each experimentally elucidated pathway, and thereby to catalog the universe of metabolism. MetaCyc also describes reactions, chemical compounds and genes. Many of the pathways and enzymes in MetaCyc contain extensive information, including comments and literature citations. SRI’s Pathway Tools software supports querying, visualization and curation of MetaCyc. With its wide breadth and depth of metabolic information, MetaCyc is a valuable resource for a variety of applications. MetaCyc is the reference database of pathways and enzymes that is used in conjunction with SRI’s metabolic pathway prediction program to create Pathway/Genome Databases that can be augmented with curation from the scientific literature and published on the world wide web. MetaCyc also serves as a readily accessible comprehensive resource on microbial and plant pathways for genome analysis, basic research, education, metabolic engineering and systems biology. In the past 2 years the data content and the Pathway Tools software used to query, visualize and edit MetaCyc have been expanded significantly. These enhancements are described in this paper.
PMCID: PMC308834  PMID: 14681452
16.  Reconstruction of metabolic pathways for the cattle genome 
BMC Systems Biology  2009;3:33.
Metabolic reconstruction of microbial, plant and animal genomes is a necessary step toward understanding the evolutionary origins of metabolism and species-specific adaptive traits. The aims of this study were to reconstruct conserved metabolic pathways in the cattle genome and to identify metabolic pathways with missing genes and proteins. The MetaCyc database and PathwayTools software suite were chosen for this work because they are widely used and easy to implement.
An amalgamated cattle genome database was created using the NCBI and Ensembl cattle genome databases (based on build 3.1) as data sources. PathwayTools was used to create a cattle-specific pathway genome database, which was followed by comprehensive manual curation for the reconstruction of metabolic pathways. The curated database, CattleCyc 1.0, consists of 217 metabolic pathways. A total of 64 mammalian-specific metabolic pathways were modified from the reference pathways in MetaCyc, and two pathways previously identified but missing from MetaCyc were added. Comparative analysis of metabolic pathways revealed the absence of mammalian genes for 22 metabolic enzymes whose activity was reported in the literature. We also identified six human metabolic protein-coding genes for which the cattle ortholog is missing from the sequence assembly.
CattleCyc is a powerful tool for understanding the biology of ruminants and other cetartiodactyl species. In addition, the approach used to develop CattleCyc provides a framework for the metabolic reconstruction of other newly sequenced mammalian genomes. It is clear that metabolic pathway analysis strongly reflects the quality of the underlying genome annotations. Thus, having well-annotated genomes from many mammalian species hosted in BioCyc will facilitate the comparative analysis of metabolic pathways among different species and a systems approach to comparative physiology.
PMCID: PMC2669051  PMID: 19284618
17.  A genome-scale metabolic flux model of Escherichia coli K–12 derived from the EcoCyc database 
BMC Systems Biology  2014;8:79.
Constraint-based models of Escherichia coli metabolic flux have played a key role in computational studies of cellular metabolism at the genome scale. We sought to develop a next-generation constraint-based E. coli model that achieved improved phenotypic prediction accuracy while being frequently updated and easy to use. We also sought to compare model predictions with experimental data to highlight open questions in E. coli biology.
We present EcoCyc–18.0–GEM, a genome-scale model of the E. coli K–12 MG1655 metabolic network. The model is automatically generated from the current state of EcoCyc using the MetaFlux software, enabling the release of multiple model updates per year. EcoCyc–18.0–GEM encompasses 1445 genes, 2286 unique metabolic reactions, and 1453 unique metabolites. We demonstrate a three-part validation of the model that breaks new ground in breadth and accuracy: (i) Comparison of simulated growth in aerobic and anaerobic glucose culture with experimental results from chemostat culture and simulation results from the E. coli modeling literature. (ii) Essentiality prediction for the 1445 genes represented in the model, in which EcoCyc–18.0–GEM achieves an improved accuracy of 95.2% in predicting the growth phenotype of experimental gene knockouts. (iii) Nutrient utilization predictions under 431 different media conditions, for which the model achieves an overall accuracy of 80.7%. The model’s derivation from EcoCyc enables query and visualization via the EcoCyc website, facilitating model reuse and validation by inspection. We present an extensive investigation of disagreements between EcoCyc–18.0–GEM predictions and experimental data to highlight areas of interest to E. coli modelers and experimentalists, including 70 incorrect predictions of gene essentiality on glucose, 80 incorrect predictions of gene essentiality on glycerol, and 83 incorrect predictions of nutrient utilization.
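The accuracy figures quoted above can be cross-checked against the error counts in the same abstract. The sketch below is only an arithmetic sanity check, not part of the published analysis. It assumes that the 95.2% essentiality accuracy refers to the glucose condition (70 incorrect predictions out of 1445 genes), a pairing the abstract does not state explicitly, and that the 83 incorrect nutrient-utilization predictions are counted over the 431 media conditions. The constant and function names are illustrative.

```python
# Figures quoted in the EcoCyc-18.0-GEM abstract.
GENES_IN_MODEL = 1445
WRONG_ESSENTIALITY_GLUCOSE = 70   # assumption: the 95.2% figure is for glucose
MEDIA_CONDITIONS = 431
WRONG_NUTRIENT_UTILIZATION = 83


def accuracy_percent(total, incorrect):
    """Return the share of correct predictions as a percentage, to one decimal."""
    return round(100 * (total - incorrect) / total, 1)


essentiality_accuracy = accuracy_percent(GENES_IN_MODEL, WRONG_ESSENTIALITY_GLUCOSE)
nutrient_accuracy = accuracy_percent(MEDIA_CONDITIONS, WRONG_NUTRIENT_UTILIZATION)

print(essentiality_accuracy, nutrient_accuracy)  # 95.2 80.7
```

Under these assumptions both computed values agree with the reported accuracies of 95.2% and 80.7%.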
Significant advantages can be derived from the combination of model organism databases and flux balance modeling represented by MetaFlux. Interpretation of the EcoCyc database as a flux balance model results in a highly accurate metabolic model and provides a rigorous consistency check for information stored in the database.
PMCID: PMC4086706  PMID: 24974895
Escherichia coli; Flux balance analysis; Constraint-based modeling; Metabolic network reconstruction; Metabolic modeling; Genome-scale model; Gene essentiality; Systems biology; EcoCyc; Pathway Tools
18.  A computational platform to maintain and migrate manual functional annotations for BioCyc databases 
BMC Systems Biology  2014;8:115.
BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database.
We have developed CycTools, a software tool that allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user-provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers.
Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain-specific databases for metabolic engineering.
Electronic supplementary material
The online version of this article (doi:10.1186/s12918-014-0115-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4203924  PMID: 25304126
Annotation tool; BioCyc; Pathway/Genome database; JavaCycO
19.  Semi-automated Curation of Metabolic Models via Flux Balance Analysis: A Case Study with Mycoplasma gallisepticum 
PLoS Computational Biology  2013;9(9):e1003208.
Primarily used for metabolic engineering and synthetic biology, genome-scale metabolic modeling shows tremendous potential as a tool for fundamental research and curation of metabolism. Through a novel integration of flux balance analysis and genetic algorithms, a strategy to curate metabolic networks and facilitate identification of metabolic pathways that may not be directly inferable solely from genome annotation was developed. Specifically, metabolites involved in unknown reactions can be determined, and potentially erroneous pathways can be identified. The procedure developed allows for new fundamental insight into metabolism, as well as acting as a semi-automated curation methodology for genome-scale metabolic modeling. To validate the methodology, a genome-scale metabolic model for the bacterium Mycoplasma gallisepticum was created. Several reactions not predicted by the genome annotation were postulated and validated via the literature. The model predicted an average growth rate of 0.358±0.12 h−1, closely matching the experimentally determined growth rate of M. gallisepticum of 0.244±0.03 h−1. This work presents a powerful algorithm for facilitating the identification and curation of previously known and new metabolic pathways, as well as presenting the first genome-scale reconstruction of M. gallisepticum.
Author Summary
Flux balance analysis (FBA) is a powerful approach for genome-scale metabolic modeling. It provides metabolic engineers with a tool for manipulating, predicting, and optimizing metabolism for biotechnological and biomedical purposes. However, we posit that it can also be used as a tool for fundamental research in understanding and curating metabolic networks. Specifically, by using a genetic algorithm integrated with FBA, we developed a curation approach to identify missing reactions, incomplete reactions, and erroneous reactions. Additionally, it was possible to take advantage of the ensemble information from the genetic algorithm to identify the most critical reactions for curation. We tested our strategy using Mycoplasma gallisepticum as our model organism. Using the genome annotation as the basis, the preliminary genome-scale metabolic model consisted of 446 metabolites involved in 380 reactions. Carrying out our analysis, we found over 80 incorrect reactions and 16 missing reactions. Based upon the guidance of the algorithm, we were able to curate and resolve all discrepancies. The model predicted an average bacterial growth rate of 0.358±0.12 h−1 compared to the experimentally observed 0.244±0.03 h−1. Thus, our approach facilitated the curation of a genome-scale metabolic network and generated a high quality metabolic model.
PMCID: PMC3764002  PMID: 24039564
20.  Browsing Metabolic and Regulatory Networks with BioCyc 
The BioCyc database collection integrates genome and cellular network information for more than 500 organisms. This method article describes Web-based tools for browsing metabolic and regulatory networks within BioCyc. These tools allow visualization of complete metabolic and regulatory networks, and allow the user to zoom in on regions of the network of interest. The user can find objects of interest such as genes and metabolites within the networks, and can selectively examine the connectivity of the network.
The EcoCyc database within the BioCyc collection has been extensively curated. The descriptions within EcoCyc of the Escherichia coli metabolic network and regulatory network were derived from thousands of publications. Other BioCyc databases received moderate levels of curation, or no curation at all. Those databases receiving no curation contain metabolic networks that were computationally inferred from the annotated genome sequences of each organism.
PMCID: PMC3549617  PMID: 22144155
Regulatory Network; Metabolic Network; Cellular Network; Web Interface; Highlighting; Regulatory Subnetwork; Browsing; Genome Database; Metabolic Database
21.  Dead End Metabolites - Defining the Known Unknowns of the E. coli Metabolic Network  
PLoS ONE  2013;8(9):e75210.
The EcoCyc database is an online scientific database which provides an integrated view of the metabolic and regulatory network of the bacterium Escherichia coli K-12 and facilitates computational exploration of this important model organism. We have analysed the occurrence of dead end metabolites within the database – these are metabolites which lack the requisite reactions (either metabolic or transport) that would account for their production or consumption within the metabolic network. 127 dead end metabolites were identified from the 995 compounds that are contained within the EcoCyc metabolic network. Their presence reflects either a deficit in our representation of the network or in our knowledge of E. coli metabolism. Extensive literature searches resulted in the addition of 38 transport reactions and 3 metabolic reactions to the database and led to an improved representation of the pathway for Vitamin B12 salvage. 39 dead end metabolites were identified as components of reactions that are not physiologically relevant to E. coli K-12 – these reactions are properties of purified enzymes in vitro that would not be expected to occur in vivo. Our analysis led to improvements in the software that underpins the database and to the program that finds dead end metabolites within EcoCyc. The remaining dead end metabolites in the EcoCyc database likely represent deficiencies in our knowledge of E. coli metabolism.
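The definition of a dead end metabolite given above (a compound that the network can produce but not consume, or consume but not produce) can be illustrated on a toy network. This is a minimal sketch, not the program used for EcoCyc: the reaction encoding, the metabolite names, and the function `find_dead_ends` are all hypothetical, and transport is modeled simply as a reaction with an empty substrate or product list.

```python
from typing import Dict, List, Set, Tuple

# reaction id -> (substrates, products, reversible); a toy encoding.
Reaction = Tuple[List[str], List[str], bool]


def find_dead_ends(reactions: Dict[str, Reaction]) -> Set[str]:
    """Return metabolites that are never produced or never consumed."""
    produced: Set[str] = set()
    consumed: Set[str] = set()
    for substrates, products, reversible in reactions.values():
        consumed.update(substrates)
        produced.update(products)
        if reversible:
            # A reversible reaction can run in either direction.
            consumed.update(products)
            produced.update(substrates)
    metabolites = produced | consumed
    return {m for m in metabolites if m not in produced or m not in consumed}


toy_network: Dict[str, Reaction] = {
    "uptake_A": ([], ["A"], False),       # transport into the cell
    "r1": (["A"], ["B"], False),
    "r2": (["B"], ["C", "X"], False),     # X is produced but never consumed
    "r3": (["C"], ["D"], False),
    "r4": (["Y"], ["B"], False),          # Y is consumed but never produced
    "export_D": (["D"], [], False),       # transport out of the cell
}
print(sorted(find_dead_ends(toy_network)))  # ['X', 'Y']
```

In this toy case, adding a transport reaction for X or Y would remove it from the list, which mirrors how the literature searches described above resolved some dead ends by adding transport reactions.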
PMCID: PMC3781023  PMID: 24086468
22.  The comprehensive updated regulatory network of Escherichia coli K-12 
BMC Bioinformatics  2006;7:5.
Escherichia coli is the model organism for which our knowledge of its regulatory network is the most extensive. Over the last few years, our project has been collecting and curating the literature concerning E. coli transcription initiation and operons, providing in both the RegulonDB and EcoCyc databases the largest electronically encoded network available. A paper published recently by Ma et al. (2004) showed several differences in the versions of the network present in these two databases.
Discrepancies have been corrected, annotations from this and other groups (Shen-Orr et al., 2002) have been added, making the RegulonDB and EcoCyc databases the largest comprehensive and constantly curated regulatory network of E. coli K-12.
Several groups have been using these curated data as part of their bioinformatics and systems biology projects, in combination with external data obtained from other sources, thus enlarging the dataset initially obtained from either RegulonDB or EcoCyc of the E. coli K12 regulatory network. We kindly obtained from the groups of Uri Alon and Hong-Wu Ma the interactions they have added to enrich their public versions of the E. coli regulatory network. These were used to search for original references and curate them with the same standards we use regularly, adding in several cases the original references (instead of reviews or missing references), as well as adding the corresponding experimental evidence codes. We also corrected all discrepancies in the two databases available as explained below.
One hundred and fifty new interactions have been added to our databases as a result of this specific curation effort, in addition to those added as a result of our continuous curation work. RegulonDB gene names are now based on those of EcoCyc to avoid confusion due to gene names and synonyms, and the public releases of RegulonDB and EcoCyc are henceforth synchronized to avoid confusion due to different versions. Public flat files are available providing direct access to the regulatory network interactions thus avoiding errors due to differences in database modelling and representation. The regulatory network available in RegulonDB and EcoCyc is the most comprehensive and regularly updated electronically-encoded regulatory network of E. coli K-12.
PMCID: PMC1382256  PMID: 16398937
23.  Insight into human alveolar macrophage and M. tuberculosis interactions via metabolic reconstructions 
A human alveolar macrophage genome-scale metabolic reconstruction was built by tailoring a global human metabolic network, Recon 1, using computational algorithms and manual curation. A genome-scale host–pathogen network of the human alveolar macrophage and Mycobacterium tuberculosis is presented. This involved integrating two genome-scale network reconstructions. The reaction activity and gene essentiality predictions of the host–pathogen model represent a more accurate depiction of infection. Integration of high-throughput data into a host–pathogen model followed by systems analysis was performed in order to elucidate major metabolic differences under different types of M. tuberculosis infection.
Mycobacterium tuberculosis (M. tb) is an insidious and highly persistent pathogen that affects one-third of the world's population (WHO, 2009). Metabolism is foundational to M. tb's infection ability and the ensuing host–pathogen interactions. In addition, M. tb has a heterogeneous clinical presentation and can infect virtually every tissue. Depending on the location of the infection, different metabolic pathways are active and inactive in both the host and pathogen cells. In this study, we sought to model the host–pathogen interactions of the human alveolar macrophage and M. tb as well as detail the metabolic differences in specific infection types using genome-scale metabolic reconstructions (Figure 4A).
Genome-scale metabolic reconstructions are knowledge bases of all known metabolic reactions of a given organism. Reconstructions have been shown to elucidate the mechanistic genotype-to-phenotype relationship through the integration of high-throughput and physiological data (Oberhardt et al, 2009). Genome-scale reconstructions are converted into mathematical models under the constraints-based reconstruction and analysis (COBRA) platform (Becker et al, 2007). COBRA models use network stoichiometry and steady-state mass balances to define a solution space of potential flux states that a network can take. Thus, the COBRA approach does not require kinetic parameters.
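The constraint-based formulation described above can be written compactly. The notation below is the conventional one and is an assumption here, since the abstract defines no symbols: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, and v_min and v_max are lower and upper flux bounds. The solution space is the set of flux vectors satisfying steady-state mass balance and the bounds:

```latex
\begin{aligned}
S\,v &= 0 \\
v_{\min} \le v &\le v_{\max}
\end{aligned}
```

Flux balance analysis then selects a point in this space by optimizing a linear objective, for example a biomass reaction, with c a hypothetical weight vector:

```latex
\max_{v} \; c^{\mathsf{T}} v
\quad \text{subject to} \quad
S\,v = 0, \qquad v_{\min} \le v \le v_{\max}
```

Only stoichiometry and bounds appear in these constraints, which is why no kinetic parameters are needed.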
Recently, the global human metabolic network, Recon 1, has been reconstructed (Duarte et al, 2007). To understand the metabolic host–pathogen integrations of M. tb with its human host, we first tailored the global human metabolic network into a cell-specific metabolic reconstruction of the human alveolar macrophage. This was carried out using established computational algorithms (Becker and Palsson, 2008; Shlomi et al, 2008) and manual curation to confirm the included and excluded reactions. The human alveolar macrophage reconstruction, iAB-AMØ-1410, accounts for 1410 genes, 3012 intracellular reactions, and 2572 metabolites (Figure 4C). iAB-AMØ-1410 was able to accurately predict maximum ATP and NO production rates obtained from experimental data (Griscavage et al, 1993; Newsholme et al, 1999).
The second step to studying host–pathogen interactions was integration of the human alveolar macrophage reconstruction with an existing genome-scale metabolic model of M. tb, iNJ661 (Jamshidi and Palsson, 2007). Interfacial constraints were set to create a phagosomal environment that was hypoxic, nitrosative, rich in fatty acids, and poor in carbohydrates. From the onset, it was apparent that some oxygen (<15% of in vitro uptake) was required for proper simulations. In addition, algorithmic tailoring of the M. tb biomass objective function was performed to better represent an infectious state. The integrated host–pathogen metabolic reconstruction was dubbed iAB-AMØ-1410-Mt-661.
Analysis of the integrated host–pathogen metabolic reconstruction resulted in three main findings. First, by setting interfacial constraints and tailoring the biomass objective function, the solution space better represents an infectious state. Without adding artificial constraints to the host portion of the integrated model, the iAB-AMØ-1410 solution space is greatly reduced (Figure 4B). Macrophage glycolysis and nitric oxide production are up-regulated and macrophage ATP production, nucleotide synthesis, and amino-acid metabolism are suppressed. In addition, M. tb glycolysis is suppressed and isocitrate lyase is up-regulated for generation of acetyl-CoA. Fatty acid oxidation pathways and production of mycolic acids are increased, while production of nucleotides, peptidoglycans, and phenolic glycolipids are reduced. The modified solution space of the alveolar macrophage and M. tb better represents the infectious state.
Second, the host-pathogen model more accurately predicts M. tb gene deletion tests than the current in vitro model, iNJ661. The host-pathogen model predicted 11 essential genes and 37 unessential genes differently than iNJ661. A total of 22 of the differentially predicted genes have been experimentally characterized (Sassetti and Rubin, 2003; Sohaskey, 2008). The host-pathogen model correctly predicted 18 of the 22 genes. Thus, iAB-AMØ-1410-Mt-661 is a more accurate platform for studying infectious states of M. tb.
Finally, we sought to determine metabolic differences in both the macrophage and M. tb between three different types of infection: latent, pulmonary, and meningeal. Transcription profiling data of the macrophage for the three infections (Thuong et al, 2008) were integrated in the context of the host–pathogen network to elucidate the reaction activity of the three infections. There was wide heterogeneity in the three infection states; some of these differences are highlighted. Macrophage hyaluronan synthase and export were only active in the pulmonary infection. This is potentially interesting from a pharmaceutical viewpoint as hyaluronan has been implicated as a potential carbon source for extracellular M. tb (Hirayama et al, 2009). In addition, we detected metabolic activity differences in M. tb pathways that have been previously discussed as potential drug targets (Eoh et al, 2007; Boshoff et al, 2008). Polyprenyl metabolic reactions were only active in the latent state infection, while de novo synthesis of nicotinamide cofactors was only active in latent and meningeal M. tb infections.
Host-pathogen modeling represents a novel approach for studying metabolic interactions during infection. iAB-AMØ-1410-Mt-661 is a more accurate platform for understanding the biology and pathophysiology of M. tb infection. Most importantly, genome-scale metabolic reconstructions can act as scaffolds for integrating high-throughput data. Particularly, in this study we were able to discern reaction activity differences between different infection types.
Metabolic coupling of Mycobacterium tuberculosis to its host is foundational to its pathogenesis. Computational genome-scale metabolic models have shown utility in integrating -omic as well as physiologic data for systemic, mechanistic analysis of metabolism. To date, integrative analysis of host–pathogen interactions using in silico mass-balanced, genome-scale models has not been performed. We, therefore, constructed a cell-specific alveolar macrophage model, iAB-AMØ-1410, from the global human metabolic reconstruction, Recon 1. The model successfully predicted experimentally verified ATP and nitric oxide production rates in macrophages. This model was then integrated with an M. tuberculosis H37Rv model, iNJ661, to build an integrated host–pathogen genome-scale reconstruction, iAB-AMØ-1410-Mt-661. The integrated host–pathogen network enables simulation of the metabolic changes during infection. The resulting reaction activity and gene essentiality targets of the integrated model represent an altered infectious state. High-throughput data from infected macrophages were mapped onto the host–pathogen network and were able to describe three distinct pathological states. Integrated host–pathogen reconstructions thus form a foundation upon which understanding the biology and pathophysiology of infections can be developed.
PMCID: PMC2990636  PMID: 20959820
computational biology; host–pathogen; Mycobacterium tuberculosis; systems biology; macrophage
24.  The Gaggle: An open-source software system for integrating bioinformatics software and data sources 
BMC Bioinformatics  2006;7:176.
Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as of microarrays, of metabolic networks and of predicted protein structure. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing.
In this paper we describe the Gaggle, a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily-extended environment.
We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein, a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools.
We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and PubMed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies.
PMCID: PMC1464137  PMID: 16569235
25.  Metabolic network reconstruction of Chlamydomonas offers insight into light-driven algal metabolism 
A comprehensive genome-scale metabolic network of Chlamydomonas reinhardtii, including a detailed account of light-driven metabolism, is reconstructed and validated. The model provides a new resource for research of C. reinhardtii metabolism and in algal biotechnology.
The genome-scale metabolic network of Chlamydomonas reinhardtii (iRC1080) was reconstructed, accounting for >32% of the estimated metabolic genes encoded in the genome, and including extensive details of lipid metabolic pathways. This is the first metabolic network to explicitly account for stoichiometry and wavelengths of metabolic photon usage, providing a new resource for research of C. reinhardtii metabolism and developments in algal biotechnology. Metabolic functional annotation and the largest transcript verification of a metabolic network to date were performed, at least partially verifying >90% of the transcripts accounted for in iRC1080. Analysis of the network supports hypotheses concerning the evolution of latent lipid pathways in C. reinhardtii, including very long-chain polyunsaturated fatty acid and ceramide synthesis pathways. A novel approach for modeling light-driven metabolism was developed that accounts for both light source intensity and spectral quality of emitted light. The constructs resulting from this approach, termed prism reactions, were shown to significantly improve the accuracy of model predictions, and their use was demonstrated for evaluation of light source efficiency and design.
Algae have garnered significant interest in recent years, especially for their potential application in biofuel production. The hallmark model eukaryotic microalga Chlamydomonas reinhardtii has been widely used to study photosynthesis, cell motility and phototaxis, cell wall biogenesis, and other fundamental cellular processes (Harris, 2001). Characterizing algal metabolism is key to engineering production strains and understanding photobiological phenomena. Based on extensive literature on C. reinhardtii metabolism, its genome sequence (Merchant et al, 2007), and gene functional annotation, we have reconstructed and experimentally validated the genome-scale metabolic network for this alga, iRC1080, the first network to account for detailed photon absorption permitting growth simulations under different light sources. iRC1080 accounts for 1080 genes, associated with 2190 reactions and 1068 unique metabolites and encompasses 83 subsystems distributed across 10 cellular compartments (Figure 1A). Its >32% coverage of estimated metabolic genes is a tremendous expansion over previous algal reconstructions (Boyle and Morgan, 2009; Manichaikul et al, 2009). The lipid metabolic pathways of iRC1080 are considerably expanded relative to existing networks, and chemical properties of all metabolites in these pathways are accounted for explicitly, providing sufficient detail to completely specify all individual molecular species: backbone molecule and stereochemical numbering of acyl-chain positions; acyl-chain length; and number, position, and cis–trans stereoisomerism of carbon–carbon double bonds. Such detail in lipid metabolism will be critical for model-driven metabolic engineering efforts.
We experimentally verified transcripts accounted for in the network under permissive growth conditions, detecting >90% of tested transcript models (Figure 1B) and providing validating evidence for the contents of iRC1080. We also analyzed the extent of transcript verification by specific metabolic subsystems. Some subsystems stood out as more poorly verified, including chloroplast and mitochondrial transport systems and sphingolipid metabolism, all of which exhibited <80% of transcripts detected, reflecting incomplete characterization of compartmental transporters and supporting a hypothesis of latent pathway evolution for ceramide synthesis in C. reinhardtii. Additional lines of evidence from the reconstruction effort similarly support this hypothesis including lack of ceramide synthetase and other annotation gaps downstream in sphingolipid metabolism. A similar hypothesis of latent pathway evolution was established for very long-chain fatty acids (VLCFAs) and their polyunsaturated analogs (VLCPUFAs) (Figure 1C), owing to the absence of this class of lipids in previous experimental measurements, lack of a candidate VLCFA elongase in the functional annotation, and additional downstream annotation gaps in arachidonic acid metabolism.
The network provides a detailed account of metabolic photon absorption by light-driven reactions, including photosystems I and II, light-dependent protochlorophyllide oxidoreductase, provitamin D3 photoconversion to vitamin D3, and rhodopsin photoisomerase; this network accounting permits the precise modeling of light-dependent metabolism. iRC1080 accounts for effective light spectral ranges through analysis of biochemical activity spectra (Figure 3A), either reaction activity or absorbance at varying light wavelengths. Defining effective spectral ranges associated with each photon-utilizing reaction enabled our network to model growth under different light sources via stoichiometric representation of the spectral composition of emitted light, termed prism reactions. Coefficients for different photon wavelengths in a prism reaction correspond to the ratios of photon flux in the defined effective spectral ranges to the total emitted photon flux from a given light source (Figure 3B). This approach distinguishes the amount of emitted photons that drive different metabolic reactions. We created prism reactions for most light sources that have been used in published studies for algal and plant growth including solar light, various light bulbs, and LEDs. We also included regulatory effects resulting from lighting conditions, insofar as published studies allowed. Light and dark conditions have been shown to affect metabolic enzyme activity in C. reinhardtii on multiple levels: transcriptional regulation, chloroplast RNA degradation, translational regulation, and thioredoxin-mediated enzyme regulation. Through application of our light model and prism reactions, we were able to closely recapitulate experimental growth measurements under solar, incandescent, and red LED lights. Through unbiased sampling, we were able to establish the tremendous statistical significance of the accuracy of growth predictions achievable through implementation of prism reactions.
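The prism reaction coefficients described above are ratios: the photon flux falling inside an effective spectral range divided by the total photon flux emitted by the light source. The sketch below illustrates that ratio on invented numbers. It is not the iRC1080 implementation; the spectrum values, the band names, the band limits, and the function `prism_coefficients` are all hypothetical.

```python
from typing import Dict, Tuple


def prism_coefficients(
    spectrum: Dict[int, float],
    ranges: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Fraction of total emitted photon flux inside each spectral range.

    spectrum maps wavelength (nm) to photon flux; ranges maps a band name
    to its inclusive (low, high) wavelength limits in nm.
    """
    total = sum(spectrum.values())
    if total <= 0:
        raise ValueError("total emitted photon flux must be positive")
    coefficients: Dict[str, float] = {}
    for name, (low, high) in ranges.items():
        in_range = sum(flux for wl, flux in spectrum.items() if low <= wl <= high)
        coefficients[name] = in_range / total
    return coefficients


# Toy light source and effective ranges (invented values).
toy_spectrum = {450: 10.0, 550: 20.0, 650: 30.0, 680: 40.0}
toy_ranges = {"blue_band": (400, 500), "red_band": (640, 700)}

for band, coeff in prism_coefficients(toy_spectrum, toy_ranges).items():
    print(band, round(coeff, 3))
```

Photons outside every effective range (here the 550 nm bin) contribute to the total but to no coefficient, so the coefficients need not sum to one.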
Finally, application of the photosynthetic model was demonstrated prospectively to evaluate light utilization efficiency under different light sources. The results suggest that, of the existing light sources, red LEDs provide the greatest efficiency, about three times as efficient as sunlight. Extending this analysis, the model was applied to design a maximally efficient LED spectrum for algal growth. The result was a 677-nm peak LED spectrum with a total incident photon flux of 360 μE/m2/s, suggesting that for the simple objective of maximizing growth efficiency, LED technology has already reached an effective theoretical optimum.
In summary, the C. reinhardtii metabolic network iRC1080 that we have reconstructed offers insight into the basic biology of this species and may be employed prospectively for genetic engineering design and light source design relevant to algal biotechnology. iRC1080 was used to analyze lipid metabolism and generate novel hypotheses about the evolution of latent pathways. The predictive capacity of metabolic models developed from iRC1080 was demonstrated in simulating mutant phenotypes and in evaluation of light source efficiency. Our network provides a broad knowledgebase of the biochemistry and genomics underlying global metabolism of a photoautotroph, and our modeling approach for light-driven metabolism exemplifies how integration of largely unvisited data types, such as physicochemical environmental parameters, can expand the diversity of applications of metabolic networks.
Metabolic network reconstruction encompasses existing knowledge about an organism's metabolism and genome annotation, providing a platform for omics data analysis and phenotype prediction. The model alga Chlamydomonas reinhardtii is employed to study diverse biological processes from photosynthesis to phototaxis. Recent heightened interest in this species results from an international movement to develop algal biofuels. Integrating biological and optical data, we reconstructed a genome-scale metabolic network for this alga and devised a novel light-modeling approach that enables quantitative growth prediction for a given light source, resolving wavelength and photon flux. We experimentally verified transcripts accounted for in the network and physiologically validated model function through simulation and generation of new experimental growth data, providing high confidence in network contents and predictive applications. The network offers insight into algal metabolism and potential for genetic engineering and efficient light source design, a pioneering resource for studying light-driven metabolism and quantitative systems biology.
PMCID: PMC3202792  PMID: 21811229
Chlamydomonas reinhardtii; lipid metabolism; metabolic engineering; photobioreactor
