1.  Data Mining in the MetaCyc Family of Pathway Databases 
Pathway databases collect the bioreactions and molecular interactions that define the processes of life. The MetaCyc family of pathway databases consists of thousands of databases that were derived through computational inference of metabolic pathways from the MetaCyc Pathway/Genome Database (PGDB). In some cases these DBs underwent subsequent manual curation. Curated pathway DBs are now available for most of the major model organisms. Databases in the MetaCyc family are managed using the Pathway Tools software. This chapter presents methods for performing data mining on the MetaCyc family of pathway DBs. We discuss the major data access mechanisms for the family, which include data files in multiple formats; application programming interfaces (APIs) for the Lisp, Java, and Perl languages; and web services. We present an overview of the Pathway Tools schema, an understanding of which is needed to query the DBs. The chapter also presents several interactive data mining tools within Pathway Tools for performing omics data analysis.
PMCID: PMC3694719  PMID: 23192547
Metabolic pathways; pathway databases; systems biology
2.  PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools 
Nucleic Acids Research  2013;42(D1):D677-D684.
PortEco aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a ‘virtual’ model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.
PMCID: PMC3965092  PMID: 24285306
3.  Dead End Metabolites - Defining the Known Unknowns of the E. coli Metabolic Network  
PLoS ONE  2013;8(9):e75210.
The EcoCyc database is an online scientific database which provides an integrated view of the metabolic and regulatory network of the bacterium Escherichia coli K-12 and facilitates computational exploration of this important model organism. We have analysed the occurrence of dead end metabolites within the database – these are metabolites which lack the requisite reactions (either metabolic or transport) that would account for their production or consumption within the metabolic network. 127 dead end metabolites were identified from the 995 compounds that are contained within the EcoCyc metabolic network. Their presence reflects either a deficit in our representation of the network or in our knowledge of E. coli metabolism. Extensive literature searches resulted in the addition of 38 transport reactions and 3 metabolic reactions to the database and led to an improved representation of the pathway for Vitamin B12 salvage. 39 dead end metabolites were identified as components of reactions that are not physiologically relevant to E. coli K-12 – these reactions are properties of purified enzymes in vitro that would not be expected to occur in vivo. Our analysis led to improvements in the software that underpins the database and to the program that finds dead end metabolites within EcoCyc. The remaining dead end metabolites in the EcoCyc database likely represent deficiencies in our knowledge of E. coli metabolism.
PMCID: PMC3781023  PMID: 24086468
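The definition of a dead end metabolite given in the abstract above (a compound with no reaction accounting for its production, or none accounting for its consumption) can be illustrated with a minimal sketch. The network representation, the function name `find_dead_ends`, and the toy data are assumptions made for illustration only; this is not the EcoCyc or Pathway Tools implementation.

```python
def find_dead_ends(reactions):
    """Return metabolites that are only produced or only consumed.

    `reactions` maps a reaction ID to a (substrates, products, reversible)
    tuple. This data layout is hypothetical, chosen for the example.
    """
    produced = set()
    consumed = set()
    for substrates, products, reversible in reactions.values():
        consumed.update(substrates)
        produced.update(products)
        if reversible:
            # A reversible reaction can run in either direction, so each
            # participant counts as both produced and consumed.
            consumed.update(products)
            produced.update(substrates)
    all_metabolites = produced | consumed
    return sorted(
        m for m in all_metabolites if m not in produced or m not in consumed
    )


# Toy network (hypothetical): A is imported, converted to B, then to C.
# Nothing consumes C, so C is reported as a dead end.
toy_network = {
    "R1": (["A"], ["B"], False),
    "R2": (["B"], ["C"], False),
    "T1": (["A_ext"], ["A"], True),  # reversible transport reaction
}

dead_ends = find_dead_ends(toy_network)
```

Adding a transport or metabolic reaction that consumes C would remove it from the result, which mirrors the curation step the abstract describes.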
4.  Groups: knowledge spreadsheets for symbolic biocomputing 
Knowledge spreadsheets (KSs) are a visual tool for interactive data analysis and exploration. They differ from traditional spreadsheets in that rather than being oriented toward numeric data, they work with symbolic knowledge representation structures and provide operations that take into account the semantics of the application domain. ‘Groups’ is an implementation of KSs within the Pathway Tools system. Groups allows Pathway Tools users to define a group of objects (e.g. groups of genes or metabolites) from a Pathway/Genome Database. Groups can be transformed (e.g. by transforming a metabolite group to the group of pathways in which those metabolites are substrates); combined through set operations; analysed (e.g. through enrichment analysis); and visualized (e.g. by painting onto a metabolic map diagram). Users of the Pathway Tools-based website have made extensive use of Groups, and an informal survey of Groups users suggests that Groups has achieved the goal of allowing biologists themselves to perform some data manipulations that previously would have required the assistance of a programmer.
Database URL:
PMCID: PMC3773185  PMID: 24037025
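The abstract above mentions enrichment analysis of a group without naming the statistical test. A common choice for over-representation is the one-sided hypergeometric test, sketched below under that assumption. The function name and the example counts are hypothetical and are not taken from Pathway Tools.

```python
from math import comb


def enrichment_p_value(group_hits, group_size, background_hits, background_size):
    """Upper-tail hypergeometric probability P(X >= group_hits).

    X is the number of annotated objects found when `group_size` objects are
    drawn without replacement from a background of `background_size` objects,
    of which `background_hits` carry the annotation.
    """
    max_hits = min(group_size, background_hits)
    total = comb(background_size, group_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, group_size - k)
        for k in range(group_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical example: 5 of 20 background genes belong to a pathway,
# and 3 of the 4 genes in the user's group are among them.
p = enrichment_p_value(group_hits=3, group_size=4,
                       background_hits=5, background_size=20)
```

In practice such p-values are usually corrected for multiple testing when many pathways are examined at once.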
5.  The Pathway Tools Pathway Prediction Algorithm 
Standards in Genomic Sciences  2011;5(3):424-429.
The PathoLogic component of the Pathway Tools software performs prediction of metabolic pathways in sequenced and annotated genomes. This article provides a detailed presentation of the PathoLogic algorithm. The algorithm consists of two phases. The reactome inference phase infers the reactions catalyzed by the organism from the set of enzymes present in the annotated genome. The pathway inference phase infers the metabolic pathways present in the organism from the reactions catalyzed by the organism. Both phases draw on the MetaCyc database of metabolic reactions and pathways. MetaCyc contains two data fields to support pathway inference: the expected taxonomic range of each pathway, and a list of key reactions for pathways. These fields have significantly increased the predictive accuracy of PathoLogic.
PMCID: PMC3368424  PMID: 22675592
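The two phases described in the abstract above can be sketched as follows. This is a toy illustration only: the acceptance rule (organism within the expected taxonomic range, all key reactions present, and at least half of the pathway's reactions present) and every name and data value are assumptions, not the published PathoLogic scoring rules.

```python
def infer_reactome(genome_enzymes, enzyme_to_reactions):
    """Phase 1 (sketch): collect the reactions catalyzed by annotated enzymes."""
    reactions = set()
    for enzyme in genome_enzymes:
        reactions.update(enzyme_to_reactions.get(enzyme, ()))
    return reactions


def infer_pathways(reactome, organism_taxa, reference_pathways, min_fraction=0.5):
    """Phase 2 (sketch): keep reference pathways supported by the reactome.

    `min_fraction` is a hypothetical threshold chosen for this example.
    """
    predicted = []
    for name, pathway in reference_pathways.items():
        in_range = bool(organism_taxa & pathway["taxonomic_range"])
        has_key_reactions = pathway["key_reactions"] <= reactome
        fraction = len(pathway["reactions"] & reactome) / len(pathway["reactions"])
        if in_range and has_key_reactions and fraction >= min_fraction:
            predicted.append(name)
    return sorted(predicted)


# Hypothetical reference data standing in for MetaCyc.
enzyme_to_reactions = {"EC 1": {"R1"}, "EC 2": {"R2"}, "EC 3": {"R3", "R4"}}
reference_pathways = {
    "pathway-A": {"reactions": {"R1", "R2", "R5"}, "key_reactions": {"R1"},
                  "taxonomic_range": {"Bacteria"}},
    "pathway-B": {"reactions": {"R1", "R3"}, "key_reactions": {"R3"},
                  "taxonomic_range": {"Bacteria"}},
    "pathway-C": {"reactions": {"R1", "R2"}, "key_reactions": set(),
                  "taxonomic_range": {"Viridiplantae"}},
}

reactome = infer_reactome(["EC 1", "EC 2"], enzyme_to_reactions)
pathways = infer_pathways(reactome, {"Bacteria", "Proteobacteria"}, reference_pathways)
```

In this toy run pathway-B is rejected because its key reaction is missing, and pathway-C because the organism lies outside its expected taxonomic range, the two data fields the abstract credits with improving accuracy.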
6.  Computing minimal nutrient sets from metabolic networks via linear constraint solving 
BMC Bioinformatics  2013;14:114.
As more complete genome sequences become available, bioinformatics challenges arise in how to exploit genome sequences to make phenotypic predictions. One type of phenotypic prediction is to determine sets of compounds that will support the growth of a bacterium from the metabolic network inferred from the genome sequence of that organism.
We present a method for computationally determining alternative growth media for an organism based on its metabolic network and transporter complement. Our method predicted 787 alternative anaerobic minimal nutrient sets for Escherichia coli K–12 MG1655 from the EcoCyc database. The program automatically partitioned the nutrients within these sets into 21 equivalence classes, most of which correspond to compounds serving as sources of carbon, nitrogen, phosphorus, and sulfur, or combinations of these essential elements. The nutrient sets were predicted with 72.5% accuracy as evaluated by comparison with 91 growth experiments. Novel aspects of our approach include (a) exhaustive consideration of all combinations of nutrients rather than assuming that all element sources can substitute for one another (an assumption that can be invalid in general); (b) leveraging the notion of a machinery-duplicating constraint, namely, that all intermediate metabolites used in active reactions must be produced in increasing concentrations to prevent successive dilution from cell division; (c) the use of Satisfiability Modulo Theory solvers rather than Linear Programming solvers, because our approach cannot be formulated as linear programming; and (d) the use of Binary Decision Diagrams to produce an efficient implementation.
Our method for generating minimal nutrient sets from the metabolic network and transporters of an organism combines linear constraint solving with binary decision diagrams to efficiently produce solution sets to provided growth problems.
PMCID: PMC3644277  PMID: 23537498
Binary decision diagrams; Computational biology; Linear constraint solving; Minimal nutrient sets; SMT solvers; Metabolic and regulatory networks; Cellular metabolism
7.  A systematic comparison of the MetaCyc and KEGG pathway databases 
BMC Bioinformatics  2013;14:112.
The MetaCyc and KEGG projects have developed large metabolic pathway databases that are used for a variety of applications including genome analysis and metabolic engineering. We present a comparison of the compound, reaction, and pathway content of MetaCyc version 16.0 and a KEGG version downloaded on Feb-27-2012 to increase understanding of their relative sizes, their degree of overlap, and their scope. To assess their overlap, we must know the correspondences between compounds, reactions, and pathways in MetaCyc, and those in KEGG. We devoted significant effort to computational and manual matching of these entities, and we evaluated the accuracy of the correspondences.
KEGG contains 179 module pathways versus 1,846 base pathways in MetaCyc; KEGG contains 237 map pathways versus 296 super pathways in MetaCyc. KEGG pathways contain 3.3 times as many reactions on average as do MetaCyc pathways, and the databases employ different conceptualizations of metabolic pathways. KEGG contains 8,692 reactions versus 10,262 for MetaCyc. 6,174 KEGG reactions are components of KEGG pathways versus 6,348 for MetaCyc. KEGG contains 16,586 compounds versus 11,991 for MetaCyc. 6,912 KEGG compounds act as substrates in KEGG reactions versus 8,891 for MetaCyc. MetaCyc contains a broader set of database attributes than does KEGG, such as relationships from a compound to enzymes that it regulates, identification of spontaneous reactions, and the expected taxonomic range of metabolic pathways. MetaCyc contains many pathways not found in KEGG, from plants, fungi, metazoa, and actinobacteria; KEGG contains pathways not found in MetaCyc, for xenobiotic degradation, glycan metabolism, and metabolism of terpenoids and polyketides. MetaCyc contains fewer unbalanced reactions, which facilitates metabolic modeling such as using flux-balance analysis. MetaCyc includes generic reactions that may be instantiated computationally.
KEGG contains significantly more compounds than does MetaCyc, whereas MetaCyc contains significantly more reactions and pathways than does KEGG; in particular, KEGG modules are quite incomplete. The numbers of reactions occurring in pathways in the two DBs are quite similar.
PMCID: PMC3665663  PMID: 23530693
Pathway databases; Database comparison
8.  What we can learn about Escherichia coli through application of Gene Ontology 
Trends in microbiology  2009;17(7):269-278.
How we classify the genes, products, and complexes that are present or absent in genomes, transcriptomes, proteomes, and other datasets helps us place biological objects into subsystems with common functions, see how molecular functions are used to implement biological processes, and compare the biology of different species and strains. Gene Ontology (GO) is one of the most successful systems for classifying biological function. Although GO is widely used for eukaryotic genomics, it has not yet been widely used for bacterial systems. The potential applications of GO are currently limited by the need to improve the annotation of bacterial genomes with GO and to improve how prokaryotic biology is represented in the ontology. In this review, we will discuss why GO should be adopted by microbiologists, and describe recent efforts to build and maintain high-quality GO annotation for Escherichia coli as a model system.
PMCID: PMC3575750  PMID: 19576778
9.  Browsing Metabolic and Regulatory Networks with BioCyc 
The BioCyc database collection integrates genome and cellular network information for more than 500 organisms. This method article describes Web-based tools for browsing metabolic and regulatory networks within BioCyc. These tools allow visualization of complete metabolic and regulatory networks, and allow the user to zoom in on regions of the network of interest. The user can find objects of interest such as genes and metabolites within the networks, and can selectively examine the connectivity of the network.
The EcoCyc database within the BioCyc collection has been extensively curated. The descriptions within EcoCyc of the Escherichia coli metabolic network and regulatory network were derived from thousands of publications. Other BioCyc databases received moderate levels of curation, or no curation at all. Those databases receiving no curation contain metabolic networks that were computationally inferred from the annotated genome sequences of each organism.
PMCID: PMC3549617  PMID: 22144155
Regulatory Network; Metabolic Network; Cellular Network; Web Interface; Highlighting; Regulatory Subnetwork; Browsing; Genome Database; Metabolic Database
10.  Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology 
Briefings in Bioinformatics  2009;11(1):40-79.
Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB). A PGDB such as EcoCyc integrates the evolving understanding of the genes, proteins, metabolic network and regulatory network of an organism. This article provides an overview of Pathway Tools capabilities. The software performs multiple computational inferences including prediction of metabolic pathways, prediction of metabolic pathway hole fillers and prediction of operons. It enables interactive editing of PGDBs by DB curators. It supports web publishing of PGDBs, and provides a large number of query and visualization tools. The software also supports comparative analyses of PGDBs, and provides several systems biology analyses of PGDBs including reachability analysis of metabolic networks, and interactive tracing of metabolites through a metabolic network. More than 800 PGDBs have been created using Pathway Tools by scientists around the world, many of which are curated DBs for important model organisms. Those PGDBs can be exchanged using a peer-to-peer DB sharing system called the PGDB Registry.
PMCID: PMC2810111  PMID: 19955237
Genome informatics; Metabolic pathways; Pathway bioinformatics; Model organism databases; Genome databases; Biological networks; Regulatory networks
11.  An advanced web query interface for biological databases 
Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects—that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skillful enough in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet that is user friendly enough to be used by biologists. It supports queries containing multiple conditions, and connecting multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the site that contains more than 500 Pathway/Genome DBs.
PMCID: PMC2911841  PMID: 20624715
12.  Regulatory network operations in the Pathway Tools software 
BMC Bioinformatics  2012;13:243.
Biologists are elucidating complex collections of genetic regulatory data for multiple organisms. Software is needed for such regulatory network data.
The Pathway Tools software supports storage and manipulation of regulatory information through a variety of strategies. The Pathway Tools regulation ontology captures transcriptional and translational regulation, substrate-level regulation of enzyme activity, post-translational modifications, and regulatory pathways. Regulatory visualizations include a novel diagram that summarizes all regulatory influences on a gene; a transcription-unit diagram, and an interactive visualization of a full transcriptional regulatory network that can be painted with gene expression data to probe correlations between gene expression and regulatory mechanisms. We introduce a novel type of enrichment analysis that asks whether a gene-expression dataset is over-represented for known regulators. We present algorithms for ranking the degree of regulatory influence of genes, and for computing the net positive and negative regulatory influences on a gene.
Pathway Tools provides a comprehensive environment for manipulating molecular regulatory interactions that integrates regulatory data with an organism’s genome and metabolic network. Curated collections of regulatory data authored using Pathway Tools are available for Escherichia coli, Bacillus subtilis, and Shewanella oneidensis.
PMCID: PMC3473263  PMID: 22998532
Regulatory networks; Regulatory interactions; Regulation ontology; Bioinformatics
13.  Discovering novel subsystems using comparative genomics 
Bioinformatics  2011;27(18):2478-2485.
Motivation: Key problems for computational genomics include discovering novel pathways in genome data, and discovering functional interaction partners for genes to define new members of partially elucidated pathways.
Results: We propose a novel method for the discovery of subsystems from annotated genomes. For each gene pair, a score measuring the likelihood that the two genes belong to the same subsystem is computed using genome context methods. Genes are then grouped based on these scores, and the resulting groups are filtered to keep only high-confidence groups. Since the method is based on genome context analysis, it relies solely on structural annotation of the genomes. The method can be used to discover new pathways, find missing genes from a known pathway, find new protein complexes or other kinds of functional groups and assign function to genes. We tested the accuracy of our method in Escherichia coli K-12. In one configuration of the system, we find that 31.6% of the candidate groups generated by our method match a known pathway or protein complex closely, and that we rediscover 31.2% of all known pathways and protein complexes of at least 4 genes. We believe that a significant proportion of the candidates that do not match any known group in E. coli K-12 correspond to novel subsystems that may represent promising leads for future laboratory research. We discuss in-depth examples of these findings.
Availability: Predicted subsystems are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3167049  PMID: 21775308
14.  A Survey of Metabolic Databases Emphasizing the MetaCyc Family 
Archives of Toxicology  2011;85(9):1015-1033.
Thanks to the confluence of genome sequencing and bioinformatics, the number of metabolic databases has expanded from a handful in the mid 1990s to several thousand today. These databases lie within distinct families that have common ancestry and common attributes. The main families are the MetaCyc, KEGG, Reactome, Model SEED, and BiGG families. We survey these database families, as well as important individual metabolic databases, including multiple human metabolic databases. The MetaCyc family is described in particular detail. It contains well over 1,000 databases, including highly curated databases for Escherichia coli, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. These databases are available through a number of web sites that offer a range of software tools for querying and visualizing metabolic networks. These web sites also provide multiple tools for analysis of gene expression and metabolomics data, including visualization of those datasets on metabolic network diagrams, and overrepresentation analysis of gene sets and metabolite sets.
PMCID: PMC3352032  PMID: 21523460
15.  Metabolomics Reveals Amino Acids Contribute to Variation in Response to Simvastatin Treatment 
PLoS ONE  2012;7(7):e38386.
Statins are widely prescribed for reducing LDL-cholesterol (C) and risk for cardiovascular disease (CVD), but there is considerable variation in therapeutic response. We used a gas chromatography-time-of-flight mass-spectrometry-based metabolomics platform to evaluate global effects of simvastatin on intermediary metabolism. Analyses were conducted in 148 participants in the Cholesterol and Pharmacogenetics study who were profiled pre and six weeks post treatment with 40 mg/day simvastatin: 100 randomly selected from the full range of the LDL-C response distribution and 24 each from the top and bottom 10% of this distribution (“good” and “poor” responders, respectively). The metabolic signature of drug exposure in the full range of responders included essential amino acids, lauric acid (p<0.0055, q<0.055), and alpha-tocopherol (p<0.0003, q<0.017). Using the HumanCyc database and pathway enrichment analysis, we observed that the metabolites of drug exposure were enriched for the pathway class amino acid degradation (p<0.0032). Metabolites whose change correlated with LDL-C lowering response to simvastatin in the full range responders included cystine, urea cycle intermediates, and the dibasic amino acids ornithine, citrulline and lysine. These dibasic amino acids share plasma membrane transporters with arginine, the rate-limiting substrate for nitric oxide synthase (NOS), a critical mediator of cardiovascular health. Baseline metabolic profiles of the good and poor responders were analyzed by orthogonal partial least square discriminant analysis so as to determine the metabolites that best separated the two response groups and could be predictive of LDL-C response. Among these were xanthine, 2-hydroxyvaleric acid, succinic acid, stearic acid, and fructose. 
Together, the findings from this study indicate that clusters of metabolites involved in multiple pathways not directly connected with cholesterol metabolism may play a role in modulating the response to simvastatin treatment.
Trial Registration NCT00451828
PMCID: PMC3392268  PMID: 22808006
16.  The Pathway Tools cellular overview diagram and Omics Viewer 
Nucleic Acids Research  2006;34(13):3771-3778.
The Pathway Tools cellular overview diagram is a visual representation of the biochemical network of an organism. The overview is automatically created from a Pathway/Genome Database describing that organism. The cellular overview includes metabolic, transport and signaling pathways, and other membrane and periplasmic proteins. Pathway Tools supports interrogation and exploration of cellular biochemical networks through the overview diagram. Furthermore, a software component called the Omics Viewer provides visual analysis of whole-organism datasets using the overview diagram as an organizing framework. For example, gene expression and metabolomics measurements, alone or in combination, can be painted onto the overview, as can computed whole-organism datasets, such as predicted reaction-flux values. The cellular overview and Omics Viewer provide a mechanism whereby biologists can apply the pattern-recognition capabilities of the human visual system to analyze large-scale datasets in a biologically meaningful context. SRI's website provides overview diagrams for more than 200 organisms. This article describes enhancements to the overview made since a 1999 publication, including the automatic layout capability, expansion of the cellular machinery that it includes, new semantic zooming and poster-generating capabilities, and extension of the Omics Viewer to support painting of metabolites, animations and zooming to individual pathway diagrams.
PMCID: PMC1557788  PMID: 16893960
17.  Construction and completion of flux balance models from pathway databases 
Bioinformatics  2012;28(3):388-396.
Motivation: Flux balance analysis (FBA) is a well-known technique for genome-scale modeling of metabolic flux. Typically, an FBA formulation requires the accurate specification of four sets: biochemical reactions, biomass metabolites, nutrients and secreted metabolites. The development of FBA models can be time consuming and tedious because of the difficulty in assembling completely accurate descriptions of these sets, and in identifying errors in the composition of these sets. For example, the presence of a single non-producible metabolite in the biomass will make the entire model infeasible. Other difficulties in FBA modeling are that model distributions, and predicted fluxes, can be cryptic and difficult to understand.
Results: We present a multiple gap-filling method to accelerate the development of FBA models using a new tool, called MetaFlux, based on mixed integer linear programming (MILP). The method suggests corrections to the sets of reactions, biomass metabolites, nutrients and secretions. The method generates FBA models directly from Pathway/Genome Databases. Thus, FBA models developed in this framework are easily queried and visualized using the Pathway Tools software. Predicted fluxes are more easily comprehended by visualizing them on diagrams of individual metabolic pathways or of metabolic maps. MetaFlux can also remove redundant high-flux loops, solve FBA models once they are generated and model the effects of gene knockouts. MetaFlux has been validated through construction of FBA models for Escherichia coli and Homo sapiens.
Availability: Pathway Tools with MetaFlux is freely available to academic users, and for a fee to commercial users. Download from:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268246  PMID: 22262672
18.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases 
Nucleic Acids Research  2011;40(D1):D742-D753.
The MetaCyc database provides a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains more than 1800 pathways derived from more than 30 000 publications, and is the largest curated collection of metabolic pathways currently available. Most reactions in MetaCyc pathways are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes and literature citations. BioCyc is a collection of more than 1700 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference database, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs contain additional features, including predicted operons, transport systems and pathway-hole fillers. The BioCyc website and Pathway Tools software offer many tools for querying and analysis of PGDBs, including Omics Viewers and comparative analysis. New developments include a zoomable web interface for diagrams; flux-balance analysis model generation from PGDBs; web services; and a new tool called Web Groups.
PMCID: PMC3245006  PMID: 22102576
19.  Web-based metabolic network visualization with a zooming user interface 
BMC Bioinformatics  2011;12:176.
Displaying complex metabolic-map diagrams in Web browsers, and allowing users to query them and overlay expression data on them, is challenging.
We present a Web-based metabolic-map diagram, which can be interactively explored by the user, called the Cellular Overview. The main characteristic of this application is the zooming user interface enabling the user to focus on appropriate granularities of the network at will. Various searching commands are available to visually highlight sets of reactions, pathways, enzymes, metabolites, and so on. Expression data from single or multiple experiments can be overlaid on the diagram, which we call the Omics Viewer capability. The application provides Web services to highlight the diagram and to invoke the Omics Viewer. This application is written entirely in JavaScript for client browsers and connects to a Pathway Tools Web server to retrieve data and diagrams. It uses the OpenLayers library to display tiled diagrams.
This new online tool is capable of displaying large and complex metabolic-map diagrams in a very interactive manner. This application is available as part of the Pathway Tools software that powers multiple metabolic databases. The Cellular Overview is accessible under the Tools menu.
PMCID: PMC3113945  PMID: 21595965
20.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases 
Nucleic Acids Research  2013;42(D1):D459-D471.
The MetaCyc database is a comprehensive and freely accessible database describing metabolic pathways and enzymes from all domains of life. MetaCyc pathways are experimentally determined, mostly small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains >2100 pathways derived from >37 000 publications, and is the largest curated collection of metabolic pathways currently available. BioCyc is a collection of >3000 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems and pathway-hole fillers. Additions to BioCyc over the past 2 years include YeastCyc, a PGDB for Saccharomyces cerevisiae, and 891 new genomes from the Human Microbiome Project. The BioCyc Web site offers a variety of tools for querying and analysis of PGDBs, including Omics Viewers and tools for comparative analysis. New developments include atom mappings in reactions, a new representation of glycan degradation pathways, improved compound structure display, better coverage of enzyme kinetic data, enhancements of the Web Groups functionality, improvements to the Omics viewers, a new representation of the Enzyme Commission system and, for the desktop version of the software, the ability to save display states.
PMCID: PMC3964957  PMID: 24225315
22.  A systematic study of genome context methods: calibration, normalization and combination 
BMC Bioinformatics  2010;11:493.
Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use.
We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented.
We propose applying normalization procedures, such as those used on microarray data, to the genome context scores. We show that substantial gains can be achieved with a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method improves by around 25% after normalization, resulting, to our knowledge, in the best-performing phylogenetic profile system in the literature.
Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.
PMCID: PMC3247869  PMID: 20920312
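As a rough illustration of the phylogenetic profile idea and the score normalization described in entry 22, the sketch below scores gene pairs by the overlap of their presence/absence profiles across reference genomes and then standardizes the scores. The abstract does not say which similarity measure or which normalization the authors used, so the Jaccard index and the z-score here are assumptions chosen for simplicity, and the gene names and profiles are made-up toy data.

```python
from itertools import combinations
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity of two presence/absence profiles (assumed measure)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def pairwise_scores(profiles):
    """Score every unordered gene pair by profile similarity."""
    return {
        (g, h): jaccard(profiles[g], profiles[h])
        for g, h in combinations(sorted(profiles), 2)
    }


def zscore(scores):
    """Standardize scores to zero mean and unit variance (assumed normalization)."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {pair: 0.0 for pair in scores}
    return {pair: (value - mu) / sigma for pair, value in scores.items()}


# Toy profiles: 1 means a homolog exists in that reference genome (hypothetical data).
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 0, 1, 0, 1),
    "geneD": (1, 0, 0, 1, 1),
}

raw = pairwise_scores(profiles)
normalized = zscore(raw)

for pair in sorted(raw, key=raw.get, reverse=True):
    print(pair, round(raw[pair], 3), round(normalized[pair], 3))
```

Genes with identical profiles (geneA and geneB in the toy data) receive the highest score, which is the signal a phylogenetic profile method reads as evidence of functional relatedness.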
23.  Machine learning methods for metabolic pathway prediction 
BMC Bioinformatics  2010;11:15.
A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism.
To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. We compared the ML methods to the previous PathoLogic algorithm for pathway prediction using the gold standard dataset. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways.
ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.
PMCID: PMC3146072  PMID: 20064214
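Entry 23 reports accuracy and F-measure for pathway prediction against a gold standard of present/absent pathway instances. The sketch below shows how those two metrics are conventionally computed from predicted and actual labels. It assumes the balanced F-measure (F1), which the abstract does not specify, and the labels are toy values rather than data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # pathway predicted present and actually present
    fp: int  # predicted present but actually absent
    tn: int  # predicted absent and actually absent
    fn: int  # predicted absent but actually present


def confusion(predicted, actual):
    """Count the four outcomes for paired boolean labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and not a:
            tn += 1
        else:
            fn += 1
    return Confusion(tp, fp, tn, fn)


def accuracy(c):
    """Fraction of all instances labeled correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def f_measure(c):
    """Harmonic mean of precision and recall (F1, an assumption here)."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: True means "pathway present" (hypothetical data).
predicted = [True, True, False, False, True, False]
actual = [True, False, False, False, True, True]

counts = confusion(predicted, actual)
print(counts, round(accuracy(counts), 3), round(f_measure(counts), 3))
```

Because absent pathways usually outnumber present ones in such a dataset, accuracy alone can look high even when few present pathways are recovered, which is one reason an F-measure is typically reported alongside it.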
24.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases 
Nucleic Acids Research  2009;38(Database issue):D473-D479.
The MetaCyc database is a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. With more than 1400 pathways, MetaCyc is the largest collection of metabolic pathways currently available. Pathway reactions are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes, and literature citations. BioCyc is a collection of more than 500 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs also contain additional features, such as predicted operons, transport systems, and pathway hole-fillers. The BioCyc Web site offers several tools for the analysis of the PGDBs, including Omics Viewers that enable visualization of omics datasets on two different genome-scale diagrams and tools for comparative analysis. The BioCyc PGDBs generated by SRI are offered for adoption by any party interested in curation of metabolic, regulatory, and genome-related information about an organism.
PMCID: PMC2808959  PMID: 19850718
25.  EcoCyc: A comprehensive view of Escherichia coli biology 
Nucleic Acids Research  2008;37(Database issue):D464-D470.
EcoCyc provides a comprehensive encyclopedia of Escherichia coli biology. EcoCyc integrates information about the genome, genes and gene products; the metabolic network; and the regulatory network of E. coli. Recent EcoCyc developments include a new initiative to represent and curate all types of E. coli regulatory processes such as attenuation and regulation by small RNAs. EcoCyc has started to curate Gene Ontology (GO) terms for E. coli and has made a dataset of E. coli GO terms available through the GO Web site. The curation and visualization of electron transfer processes has been significantly improved. Other software and Web site enhancements include the addition of tracks to the EcoCyc genome browser, in particular a type of track designed for the display of ChIP-chip datasets, and the development of a comparative genome browser. A new Genome Omics Viewer enables users to paint omics datasets onto the full E. coli genome for analysis. A new advanced query page guides users in interactively constructing complex database queries against EcoCyc. A Macintosh version of EcoCyc is now available. A series of Webinars is available to instruct users in the use of EcoCyc.
PMCID: PMC2686493  PMID: 18974181
