Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB). A PGDB such as EcoCyc integrates the evolving understanding of the genes, proteins, metabolic network and regulatory network of an organism. This article provides an overview of Pathway Tools capabilities. The software performs multiple computational inferences including prediction of metabolic pathways, prediction of metabolic pathway hole fillers and prediction of operons. It enables interactive editing of PGDBs by DB curators. It supports web publishing of PGDBs, and provides a large number of query and visualization tools. The software also supports comparative analyses of PGDBs, and provides several systems biology analyses of PGDBs including reachability analysis of metabolic networks, and interactive tracing of metabolites through a metabolic network. More than 800 PGDBs have been created using Pathway Tools by scientists around the world, many of which are curated DBs for important model organisms. Those PGDBs can be exchanged using a peer-to-peer DB sharing system called the PGDB Registry.
Genome informatics; Metabolic pathways; Pathway bioinformatics; Model organism databases; Genome databases; Biological networks; Regulatory networks
EcoCyc is a bioinformatics database available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene, metabolite, reaction, operon, and metabolic pathway. The database also includes information on E. coli gene essentiality, and on nutrient conditions that do or do not support the growth of E. coli. The web site and downloadable software contain tools for analysis of high-throughput datasets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. This chapter provides a detailed description of the data content of EcoCyc, and of the procedures by which this content is generated.
The sets of compounds that can support growth of an organism are defined by the presence of transporters and metabolic pathways that convert nutrient sources into cellular components and energy for growth. A collection of known nutrient sources can therefore serve both as an impetus for investigating new metabolic pathways and transporters and as a reference for computational modeling of known metabolic pathways. To establish such a collection for Escherichia coli K-12, we have integrated data on the growth or nongrowth of E. coli K-12 obtained from published observations using a variety of individual media and from high-throughput phenotype microarrays into the EcoCyc database. The assembled collection revealed a substantial number of discrepancies between the high-throughput data sets, which we investigated where possible using low-throughput growth assays on soft agar and in liquid culture. We also integrated six data sets describing 16,119 observations of the growth of single-gene knockout mutants of E. coli K-12 into EcoCyc, which are relevant to antimicrobial drug design, provide clues regarding the roles of genes of unknown function, and are useful for validating metabolic models. To make this information easily accessible to EcoCyc users, we developed software for capturing, querying, and visualizing cellular growth assays and gene essentiality data.
Constraint-based models of Escherichia coli metabolic flux have played a key role in computational studies of cellular metabolism at the genome scale. We sought to develop a next-generation constraint-based E. coli model that achieved improved phenotypic prediction accuracy while being frequently updated and easy to use. We also sought to compare model predictions with experimental data to highlight open questions in E. coli biology.
We present EcoCyc–18.0–GEM, a genome-scale model of the E. coli K–12 MG1655 metabolic network. The model is automatically generated from the current state of EcoCyc using the MetaFlux software, enabling the release of multiple model updates per year. EcoCyc–18.0–GEM encompasses 1445 genes, 2286 unique metabolic reactions, and 1453 unique metabolites. We demonstrate a three-part validation of the model that breaks new ground in breadth and accuracy: (i) Comparison of simulated growth in aerobic and anaerobic glucose culture with experimental results from chemostat culture and simulation results from the E. coli modeling literature. (ii) Essentiality prediction for the 1445 genes represented in the model, in which EcoCyc–18.0–GEM achieves an improved accuracy of 95.2% in predicting the growth phenotype of experimental gene knockouts. (iii) Nutrient utilization predictions under 431 different media conditions, for which the model achieves an overall accuracy of 80.7%. The model’s derivation from EcoCyc enables query and visualization via the EcoCyc website, facilitating model reuse and validation by inspection. We present an extensive investigation of disagreements between EcoCyc–18.0–GEM predictions and experimental data to highlight areas of interest to E. coli modelers and experimentalists, including 70 incorrect predictions of gene essentiality on glucose, 80 incorrect predictions of gene essentiality on glycerol, and 83 incorrect predictions of nutrient utilization.
Significant advantages can be derived from the combination of model organism databases and flux balance modeling represented by MetaFlux. Interpretation of the EcoCyc database as a flux balance model results in a highly accurate metabolic model and provides a rigorous consistency check for information stored in the database.
Escherichia coli; Flux balance analysis; Constraint-based modeling; Metabolic network reconstruction; Metabolic modeling; Genome-scale model; Gene essentiality; Systems biology; EcoCyc; Pathway Tools
Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers, as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then validating that the data found in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by Ph.D.-level scientists is highly accurate.
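The overall error rate quoted above follows directly from the two reported counts (10 errors among 633 validated facts). A minimal sketch of that arithmetic is given here; the function name is hypothetical, and the per-database rates are not recomputed because the abstract does not report the per-database counts.

```python
def error_rate_percent(errors: int, validated: int) -> float:
    """Return the error rate as a percentage of validated assertions."""
    if validated <= 0:
        raise ValueError("validated must be a positive count")
    return 100.0 * errors / validated


# Counts taken from the abstract: 10 errors in 633 validated facts.
overall = error_rate_percent(10, 633)
print(f"{overall:.2f}%")  # prints 1.58%
```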
The MetaCyc database (MetaCyc.org) is a comprehensive and freely accessible database describing metabolic pathways and enzymes from all domains of life. MetaCyc pathways are experimentally determined, mostly small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains >2100 pathways derived from >37 000 publications, and is the largest curated collection of metabolic pathways currently available. BioCyc (BioCyc.org) is a collection of >3000 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems and pathway-hole fillers. Additions to BioCyc over the past 2 years include YeastCyc, a PGDB for Saccharomyces cerevisiae, and 891 new genomes from the Human Microbiome Project. The BioCyc Web site offers a variety of tools for querying and analysis of PGDBs, including Omics Viewers and tools for comparative analysis. New developments include atom mappings in reactions, a new representation of glycan degradation pathways, improved compound structure display, better coverage of enzyme kinetic data, enhancements of the Web Groups functionality, improvements to the Omics viewers, a new representation of the Enzyme Commission system and, for the desktop version of the software, the ability to save display states.
The EcoCyc database is an online scientific database which provides an integrated view of the metabolic and regulatory network of the bacterium Escherichia coli K-12 and facilitates computational exploration of this important model organism. We have analysed the occurrence of dead-end metabolites within the database: these are metabolites which lack the requisite reactions (either metabolic or transport) that would account for their production or consumption within the metabolic network. We identified 127 dead-end metabolites among the 995 compounds contained within the EcoCyc metabolic network. Their presence reflects a deficit either in our representation of the network or in our knowledge of E. coli metabolism. Extensive literature searches resulted in the addition of 38 transport reactions and 3 metabolic reactions to the database and led to an improved representation of the pathway for vitamin B12 salvage. Thirty-nine dead-end metabolites were identified as components of reactions that are not physiologically relevant to E. coli K-12; these reactions are properties of purified enzymes in vitro that would not be expected to occur in vivo. Our analysis led to improvements in the software that underpins the database and to the program that finds dead-end metabolites within EcoCyc. The remaining dead-end metabolites in the EcoCyc database likely represent deficiencies in our knowledge of E. coli metabolism.
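The definition of a dead-end metabolite given above (a compound with no reaction accounting for its production, or none accounting for its consumption) can be illustrated on a toy network. The following is only a sketch under stated assumptions: the reaction and metabolite names are hypothetical, every reaction is treated as irreversible, and transport is modelled as a reaction with an empty side. It is not the program used by EcoCyc.

```python
# Toy network: reaction name -> (substrates, products).
# All names are hypothetical; every reaction is assumed irreversible.
REACTIONS = {
    "transport_A_in": ((), ("A",)),    # uptake: supplies A from outside the cell
    "r1": (("A",), ("B",)),
    "r2": (("B",), ("C", "D")),
    "transport_C_out": (("C",), ()),   # secretion: removes C from the cell
    "r3": (("E",), ("B",)),
}


def find_dead_ends(reactions):
    """Return metabolites that are never produced or never consumed."""
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    metabolites = produced | consumed
    return {
        "never_produced": sorted(metabolites - produced),
        "never_consumed": sorted(metabolites - consumed),
    }


dead_ends = find_dead_ends(REACTIONS)
print(dead_ends)
```

In this toy network, E is consumed but never produced and D is produced but never consumed, so both are reported. Adding a transport reaction for either one would remove it from the list, which mirrors the kind of database correction described in the abstract.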
As more complete genome sequences become available, bioinformatics challenges arise in how to exploit genome sequences to make phenotypic predictions. One type of phenotypic prediction is to determine sets of compounds that will support the growth of a bacterium from the metabolic network inferred from the genome sequence of that organism.
We present a method for computationally determining alternative growth media for an organism based on its metabolic network and transporter complement. Our method predicted 787 alternative anaerobic minimal nutrient sets for Escherichia coli K-12 MG1655 from the EcoCyc database. The program automatically partitioned the nutrients within these sets into 21 equivalence classes, most of which correspond to compounds serving as sources of carbon, nitrogen, phosphorus, and sulfur, or combinations of these essential elements. The nutrient sets were predicted with 72.5% accuracy as evaluated by comparison with 91 growth experiments. Novel aspects of our approach include (a) exhaustive consideration of all combinations of nutrients rather than assuming that all element sources can substitute for one another (an assumption that can be invalid in general); (b) leveraging the notion of a machinery-duplicating constraint, namely, that all intermediate metabolites used in active reactions must be produced in increasing concentrations to prevent successive dilution from cell division; (c) the use of Satisfiability Modulo Theory solvers rather than Linear Programming solvers, because our approach cannot be formulated as linear programming; and (d) the use of Binary Decision Diagrams to produce an efficient implementation.
Our method for generating minimal nutrient sets from the metabolic network and transporters of an organism combines linear constraint solving with binary decision diagrams to efficiently produce solution sets to provided growth problems.
Binary decision diagrams; Computational biology; Linear constraint solving; Minimal nutrient sets; SMT solvers; Metabolic and regulatory networks; Cellular metabolism
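The notion of a minimal nutrient set described in the abstract above can be illustrated with a toy example. This sketch deliberately swaps in brute-force enumeration with simple forward reachability in place of the SMT solvers and binary decision diagrams used by the authors, and it ignores stoichiometry and the machinery-duplicating constraint. The network, the nutrient names and the "biomass" target are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy network: (substrates, products), all reactions irreversible.
REACTIONS = [
    ({"glucose"}, {"carbon_pool"}),
    ({"glycerol"}, {"carbon_pool"}),
    ({"ammonium"}, {"nitrogen_pool"}),
    ({"carbon_pool", "nitrogen_pool"}, {"biomass"}),
]
CANDIDATE_NUTRIENTS = ["glucose", "glycerol", "ammonium"]
TARGET = "biomass"


def reachable(seed):
    """Forward-propagate from the seed compounds until nothing new is produced."""
    available = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in REACTIONS:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def minimal_nutrient_sets():
    """Enumerate nutrient sets by increasing size, keeping only minimal ones."""
    minimal = []
    for size in range(1, len(CANDIDATE_NUTRIENTS) + 1):
        for combo in combinations(CANDIDATE_NUTRIENTS, size):
            candidate = frozenset(combo)
            # Skip supersets of a set already known to support growth.
            if any(found <= candidate for found in minimal):
                continue
            if TARGET in reachable(candidate):
                minimal.append(candidate)
    return minimal


result = minimal_nutrient_sets()
print([sorted(s) for s in result])
```

Here the two minimal sets differ only in their carbon source, so glucose and glycerol would fall into the same equivalence class in the sense used in the abstract. Exhaustive enumeration of this kind grows exponentially with the number of candidate nutrients, which is one reason a solver-based formulation is attractive for a genome-scale network.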
How we classify the genes, products, and complexes that are present or absent in genomes, transcriptomes, proteomes, and other datasets helps us place biological objects into subsystems with common functions, see how molecular functions are used to implement biological processes, and compare the biology of different species and strains. Gene Ontology (GO) is one of the most successful systems for classifying biological function. Although GO is widely used for eukaryotic genomics, it has not yet been widely used for bacterial systems. The potential applications of GO are currently limited by the need to improve the annotation of bacterial genomes with GO and to improve how prokaryotic biology is represented in the ontology. In this review, we will discuss why GO should be adopted by microbiologists, and describe recent efforts to build and maintain high-quality GO annotation for Escherichia coli as a model system.
EcoCyc (http://EcoCyc.org) is a model organism database built on the genome sequence of Escherichia coli K-12 MG1655. Expert manual curation of the functions of individual E. coli gene products in EcoCyc has been based on information found in the experimental literature for E. coli K-12-derived strains. Updates to EcoCyc content continue to improve the comprehensive picture of E. coli biology. The utility of EcoCyc is enhanced by new tools available on the EcoCyc web site, and the development of EcoCyc as a teaching tool is increasing the impact of the knowledge collected in EcoCyc.
The MetaCyc database (http://metacyc.org/) provides a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains more than 1800 pathways derived from more than 30 000 publications, and is the largest curated collection of metabolic pathways currently available. Most reactions in MetaCyc pathways are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes and literature citations. BioCyc (http://biocyc.org/) is a collection of more than 1700 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference database, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs contain additional features, including predicted operons, transport systems and pathway-hole fillers. The BioCyc website and Pathway Tools software offer many tools for querying and analysis of PGDBs, including Omics Viewers and comparative analysis. New developments include a zoomable web interface for diagrams; flux-balance analysis model generation from PGDBs; web services; and a new tool called Web Groups.
EcoCyc (http://EcoCyc.org) is a comprehensive model organism database for Escherichia coli K-12 MG1655. From the scientific literature, EcoCyc captures the functions of individual E. coli gene products; their regulation at the transcriptional, post-transcriptional and protein level; and their organization into operons, complexes and pathways. EcoCyc users can search and browse the information in multiple ways. Recent improvements to the EcoCyc Web interface include combined gene/protein pages and a Regulation Summary Diagram displaying a graphical overview of all known regulatory inputs to gene expression and protein activity. The graphical representation of signal transduction pathways has been updated, and the cellular and regulatory overviews were enhanced with new functionality. A specialized undergraduate teaching resource using EcoCyc is being developed.
EcoCyc (http://EcoCyc.org) provides a comprehensive encyclopedia of Escherichia coli biology. EcoCyc integrates information about the genome, genes and gene products; the metabolic network; and the regulatory network of E. coli. Recent EcoCyc developments include a new initiative to represent and curate all types of E. coli regulatory processes such as attenuation and regulation by small RNAs. EcoCyc has started to curate Gene Ontology (GO) terms for E. coli and has made a dataset of E. coli GO terms available through the GO Web site. The curation and visualization of electron transfer processes has been significantly improved. Other software and Web site enhancements include the addition of tracks to the EcoCyc genome browser, in particular a type of track designed for the display of ChIP-chip datasets, and the development of a comparative genome browser. A new Genome Omics Viewer enables users to paint omics datasets onto the full E. coli genome for analysis. A new advanced query page guides users in interactively constructing complex database queries against EcoCyc. A Macintosh version of EcoCyc is now available. A series of Webinars is available to instruct users in the use of EcoCyc.
The annotation of the Escherichia coli K-12 genome in the EcoCyc database is one of the most accurate, complete and multidimensional genome annotations. Of the 4460 E. coli genes, EcoCyc assigns biochemical functions to 76%, and 66% of all genes had their functions determined experimentally. EcoCyc assigns E. coli genes to Gene Ontology and to MultiFun. Seventy-five percent of gene products contain reviews authored by the EcoCyc project that summarize the experimental literature about the gene product. EcoCyc information was derived from 15 000 publications. The database contains extensive descriptions of E. coli cellular networks, describing its metabolic, transport and transcriptional regulatory processes. A comparison to genome annotations for other model organisms shows that the E. coli genome contains the most experimentally determined gene functions in both relative and absolute terms: 2941 (66%) for E. coli, 2319 (37%) for Saccharomyces cerevisiae, 1816 (5%) for Arabidopsis thaliana, 1456 (4%) for Mus musculus and 614 (4%) for Drosophila melanogaster. Database queries to EcoCyc survey the global properties of E. coli cellular networks and illuminate the extent of information gaps for E. coli, such as dead-end metabolites. EcoCyc provides a genome browser with novel properties, and a novel interactive display of transcriptional regulatory networks.
The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product on the basis of experimental evidence or sequence analysis. Since both kinds of evidence are constantly expanding, no annotation is complete at any moment in time. This is a snapshot analysis based on the most recent genome sequences of two E. coli K-12 bacteria. An accurate and up-to-date description of E. coli K-12 genes is of particular importance to the scientific community because experimentally determined properties of its gene products provide fundamental information for annotation of innumerable genes of other organisms. Availability of the complete genome sequence of two K-12 strains allows comparison of their genotypes and mutant status of alleles.
Myxococcus xanthus cells carrying the Ω4408 Tn5lac insertion at the sde locus show defects in fruiting body development and sporulation. Our analysis of sde expression patterns showed that this locus is induced early in the developmental program (0 to 2 h) and that expression increases approximately fivefold after 12 h of development. Further studies showed that expression of sde is induced as growing cells enter stationary phase, suggesting that activation of the sde locus is not limited to the developmental process. Because the peak levels of sde expression in both an sde+ and an sde mutant background were similar, we conclude that the sde locus is not autoregulated. Characterization of the sde locus by DNA sequence analysis indicated that the Ω4408 insertion occurred within the sdeK gene. Primer extension analyses localized the 5′ end of the sde transcript to a guanine nucleotide 307 bp upstream of the proposed start for the SdeK coding sequence. The DNA sequence in the −12 and −24 regions upstream of the sde transcriptional start site shows similarity to the σ54 family of promoters. The results of complementation studies suggest that the defects in development and sporulation caused by the Ω4408 insertion are due to an inactivation of sdeK. The predicted amino acid sequence of SdeK was found to have similarity to the sequences of the histidine protein kinases of two-component regulatory systems. Based on our results, we propose that SdeK may be part of a signal transduction pathway required for the activation and propagation of the early developmental program.