Pathway Tools is a production-quality software environment for creating a type of model-organism database called a Pathway/Genome Database (PGDB). A PGDB such as EcoCyc integrates the evolving understanding of the genes, proteins, metabolic network and regulatory network of an organism. This article provides an overview of Pathway Tools capabilities. The software performs multiple computational inferences including prediction of metabolic pathways, prediction of metabolic pathway hole fillers and prediction of operons. It enables interactive editing of PGDBs by DB curators. It supports web publishing of PGDBs, and provides a large number of query and visualization tools. The software also supports comparative analyses of PGDBs, and provides several systems biology analyses of PGDBs including reachability analysis of metabolic networks, and interactive tracing of metabolites through a metabolic network. More than 800 PGDBs have been created using Pathway Tools by scientists around the world, many of which are curated DBs for important model organisms. Those PGDBs can be exchanged using a peer-to-peer DB sharing system called the PGDB Registry.
Pathway Tools [1–3] is a software environment for management, analysis and visualization of integrated collections of genome, pathway and regulatory data. Pathway Tools handles many types of information beyond pathways, and its capabilities are very extensive. The software has been under continuous development within the Bioinformatics Research Group within SRI International since the early 1990s. Pathway Tools serves several different use cases in bioinformatics and systems biology, which are articulated in ‘Pathway Tools use cases’ section.
This article provides a comprehensive description of Pathway Tools. It describes both what the software does, and how it does it. Where possible it references earlier publications that provide more algorithmic details. However, in some cases those earlier publications are outdated by new developments in the software that are described here. This article also emphasizes new aspects of the software that have not been reported in earlier publications.
Pathway Tools is focused on a type of model-organism database (MOD) called a Pathway/Genome Database (PGDB). A PGDB integrates information about the genes, proteins, metabolic network and regulatory network of an organism.
Pathway Tools has several components. The PathoLogic component allows users to create a new PGDB from the annotated genome of an organism. PathoLogic generates a new PGDB that contains the genes, proteins, biochemical reactions and predicted metabolic pathways and operons of the organism.
The Pathway/Genome Editors let PGDB developers interactively refine the contents of a PGDB, such as editing a metabolic pathway or an operon, or defining the function of a newly characterized gene.
The Pathway/Genome Navigator supports querying, visualization and analysis of PGDBs. The Navigator can run as a local desktop application and as a web server. The Navigator allows scientists to find information quickly, to display that information in familiar graphical forms and to publish a PGDB to the scientific community via the web. The Navigator provides a platform for systems-level analysis of functional-genomics data by providing tools for painting combinations of gene expression, protein expression and metabolomics data onto a full metabolic map of the cell, onto the full genome, and onto a diagram of the regulatory network of the cell.
Pathway Tools includes a sophisticated ontology and DB application programming interface (API) that allows programs to perform complex queries, symbolic computations and data mining on the contents of a PGDB. For example, the software has been used for global studies of the Escherichia coli metabolic network and genetic network.
Pathway Tools is seeing widespread use across the bioinformatics community to create PGDBs in all domains of life. The software has been licensed by more than 1700 users to date. As well as supporting the development of the EcoCyc and MetaCyc DBs at SRI, and SRI's BioCyc collection of 500 PGDBs, the software is in use by genome centers, by experimental biologists, and by groups that are creating curated MODs for bacteria (such as the National Institute of Allergy and Infectious Diseases Bioinformatics Resource Centers PATRIC, BioHealthBase, Pathema and EuPathDB), for fungi (such as the Saccharomyces Genome Database and the Candida Genome Database), for mammals (such as the Jackson Laboratory's MouseCyc) and for plants (such as Arabidopsis thaliana). See Section 9 for a more detailed listing of available PGDBs.
The organization of this article is as follows. Section ‘Pathway Tools use cases’ articulates in more detail the use cases for which Pathway Tools was designed. ‘Creating and curating a PGDB’ section relates how a new PGDB is created, and describes the computational inference procedures within Pathway Tools. It summarizes the interactive editing capabilities of Pathway Tools, and the associated author crediting system. It also describes tools for automatic upgrading of a PGDB schema, and for bulk updating of the genome annotation within a PGDB. ‘The Pathway Tools schema’ section describes the schema of a PGDB. ‘Visualization and querying of PGDBs’ section relates the querying and visualization facilities of Pathway Tools. ‘Computational access to PGDBs’ section summarizes the mechanisms for importing and exporting data from Pathway Tools, and for accessing and updating PGDB data via APIs. ‘Systems biology analyses’ section describes multiple Pathway Tools modules for performing systems analyses of PGDBs including a tool for interactively tracing metabolites through the metabolic network, tools for performing network reachability analysis and for identifying dead-end metabolites, a tool for predicting antimicrobial drug targets by identifying metabolic network choke points and a set of comparative analysis tools. ‘Software and DB architecture’ section describes the software architecture of Pathway Tools. ‘Survey of Pathway Tools compatible DBs’ section lists the large family of PGDBs that have been created by Pathway Tools users outside SRI International, and describes a peer-to-peer data sharing facility within Pathway Tools that allows users to easily exchange their PGDBs. ‘Comparison with related software environments’ section compares Pathway Tools to related efforts.
This section articulates the objectives for which Pathway Tools was designed. Please note that when we assert that Pathway Tools supports a given type of use case, it does not mean that Pathway Tools provides every type of computational tool needed in that area. For example, omics data analysis is a huge field, and although Pathway Tools contributes novel and useful omics data analysis capabilities, it does not provide every omics data analysis method, and in fact it is intended to be used in conjunction with other omics analysis tools (such as for data normalization). Similarly, although Pathway Tools can contribute a number of useful capabilities to metabolic engineering, by no means does it solve every problem in metabolic engineering. ‘Limitations and future work’ section summarizes the limitations of Pathway Tools.
Organism-specific DBs (also known as MODs) describe the genome and other information about an organism [8–19]. We posit that every organism with a completely sequenced genome and an experimental community of significant size requires an organism-specific DB to fully exploit the genome sequence. Such DBs should provide a central information resource about the genome, molecular parts and cellular networks of the organism, and as such they must be able to capture a diverse range of information types. A critical role of organism-specific DBs is to integrate information that is scattered across the biomedical literature, both to assemble as complete a picture of the organism as possible, and to permit cross-checking and validation of isolated pieces of information. These DBs should both direct and accelerate further scientific investigations.
Pathway Tools facilitates rapid initial computational construction of organism-specific DBs, followed by manual refinement of the PGDB, to produce an extremely rich and accurate DB in minimal time. Our approach tracks experimental versus computationally inferred information whenever possible. Rapid construction of PGDBs is achieved by importing an annotated genome into a PGDB in the form of a GenBank file, and by applying several computational inference tools to infer new information within the PGDB, such as metabolic pathways. Scientists can then employ the Pathway/Genome Editors to correct and supplement computational inferences when necessary, and to perform ongoing manual curation of the PGDB if desired. Further details of our approach can be found in ‘Creating and curating a PGDB’ section.
The Pathway Tools DB schema (for definition see ‘The Pathway Tools schema’ section) is significant in both its breadth and its depth: It models an unusually broad set of bioinformatics datatypes ranging from genomes to pathways to regulatory networks, and it provides high-fidelity representations of those datatypes that allow PGDBs to accurately capture complex biology.
To speed user comprehension of the complex information within PGDBs, the Pathway/Genome Navigator provides many scientific visualization services including a genome browser, visualization of single metabolic pathways and entire metabolic maps, visualization of single operons and of entire regulatory networks and visualization of chemical compounds and reactions (see ‘Visualization and querying of PGDBs’ section for more details). These visualization tools operate within a web server, permitting developers of PGDBs to publish their PGDBs to the scientific community through a web site. This form of PGDB publishing supports interactive querying and browsing by individual scientists using a three-tiered series of web query interfaces (see ‘Query tools’ section) including a quick search, a set of object-specific query tools and a tool for interactively constructing queries whose power is comparable with that of SQL.
We have developed other publishing paradigms to support computational analysis and dissemination of PGDBs. Pathway Tools APIs exist in three languages. Web services access to PGDBs is under development now. PGDBs can be exported in several formats and imported into the BioWarehouse DB integration system. Finally, users can easily share and exchange PGDBs using a peer-to-peer DB sharing system that we have developed.
Pathway Tools extends the paradigm of genome analysis. After traditional analyses such as gene calling and gene function prediction are performed by external software packages, Pathway Tools provides additional computational genome analyses that layer additional information above the traditional genome annotation. Pathway Tools predicts the operons of the organism. It predicts the metabolic pathways of the organism. It also predicts which genes in the organism code for missing enzymes in the predicted metabolic pathways, thus using pathway information to predict additional gene functions. See ‘Creating and curating a PGDB’ section for more details.
Pathway Tools was the first software system to provide pathway-based analysis of gene expression data. Pathway Tools provides three genome-scale viewers for animated visualization of omics datasets in the context of the full metabolic network, full transcriptional regulatory network and full genome (see ‘System-level visualization of metabolic networks, system-level visualization of regulatory networks and system-level visualization of genome maps’ sections for more details).
As well as serving as an online reference for researchers, a PGDB is a computational model of the organism. It must be possible to employ such models to test and extend our understanding of the organism, by checking models for internal consistency, and for their consistency with experimental data. Both this use case and the next are concerned with PGDBs as models.
A symbolic systems biology workbench supports users in developing global analyses of a biological system that are symbolic (qualitative) in nature. ‘Symbolic computing is concerned with the representation and manipulation of information in symbolic form. It is often contrasted with numeric representation’. General examples of symbolic computation include string matching for DNA and protein sequences, symbolic algebra programs (e.g. Mathematica and the Graphing Calculator), compilers and interpreters for programming languages, DB query languages, web crawlers and many Artificial Intelligence (AI) algorithms (e.g. expert systems and symbolic logic).
A strong motivation for applying symbolic computing techniques in systems biology is that these techniques can provide insight in areas where numerical techniques fail because of the unavailability of quantitative system parameters.
An example symbolic systems biology problem is: Let C be the set of carbohydrates for which E. coli has transporters. Which members of C are not the inputs to a degradative metabolic pathway in E. coli? Which members of C are not consumed by any metabolic reaction in E. coli? These questions are of interest because they may indicate incomplete or incorrect knowledge of a cell's transport and metabolic networks. Pathway Tools assists users in answering queries of this sort by providing a rich schema (ontology) for PGDBs that makes a wide array of information accessible for computational analysis, and by providing a rich library of Lisp functions (callable through the Lisp, Perl and Java APIs) for computing symbolic relationships among information in a PGDB, such as for retrieving all transported substrates in the cell.
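To make the example query concrete, the following Python sketch answers both questions with plain set operations over a toy network. All compound names, reactions and pathway assignments are invented for illustration and are not taken from EcoCyc, and the sketch does not use the Pathway Tools APIs.

```python
# Toy data, invented for illustration only (not EcoCyc content).
carbohydrates = {"lactose", "maltose", "xylose", "trehalose", "glucose"}
transported = {"lactose", "maltose", "xylose", "trehalose"}

# Each reaction is a (substrates, products) pair.
reactions = [
    ({"lactose"}, {"glucose", "galactose"}),
    ({"maltose"}, {"glucose"}),
    ({"xylose"}, {"xylulose"}),
]

# Input compounds of each degradative pathway.
degradative_pathway_inputs = {
    "lactose degradation": {"lactose"},
    "xylose degradation": {"xylose"},
}

# C: the carbohydrates for which a transporter is known.
C = carbohydrates & transported

pathway_inputs = set().union(*degradative_pathway_inputs.values())
consumed = set().union(*(substrates for substrates, _ in reactions))

# Members of C that are not the input to any degradative pathway.
not_pathway_input = C - pathway_inputs
# Members of C that are not consumed by any reaction at all.
not_consumed = C - consumed

print(sorted(not_pathway_input))
print(sorted(not_consumed))
```

In this toy network, a transported compound that no reaction consumes is the kind of result that would prompt a curator to look for missing reactions or a misannotated transporter.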
Pathway Tools includes programs for symbolic analysis of biological networks (see ‘Systems biology analyses’ section for more details) that rely on the detailed biological network ontology underlying Pathway Tools. That ontology provides high-fidelity representations of a wide range of metabolic and regulatory interactions (see ‘Metabolites, reactions and pathways’ and ‘Pathway Tool regulation ontology’ sections).
Two tools perform consistency checking of metabolic networks: (i) The software identifies dead-end metabolites, which are metabolites that are only synthesized by the metabolic network, or only consumed by the metabolic network, and are not transported into or out of the cell. Although occasionally dead-end metabolites are biologically valid, usually they reflect errors or incompleteness of our knowledge of a metabolic network. (ii) The software performs forward qualitative propagation of metabolites through the metabolic network, which we call reachability analysis. The intuition here is that by tracing the paths of metabolites from a known growth medium through the metabolic network, we should be able to reach essential compounds that the cell must be producing from those starting metabolites. Failure to reach those essential compounds (such as the amino acids and cell wall components) usually indicates gaps in the metabolic network model and indicates the need for further model curation or basic research.
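Both checks can be illustrated on a small reaction list. The Python sketch below is a simplified reading of the two definitions above, not the Pathway Tools implementation: reactions are treated as irreversible, a reaction fires once all of its substrates have been reached, and the toy network and compound names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Reaction(NamedTuple):
    substrates: FrozenSet[str]
    products: FrozenSet[str]


def rxn(substrates: Iterable[str], products: Iterable[str]) -> Reaction:
    return Reaction(frozenset(substrates), frozenset(products))


def dead_end_metabolites(reactions: List[Reaction], transported: Set[str]) -> Set[str]:
    """Metabolites that are only produced or only consumed, and not transported."""
    produced = set().union(*(r.products for r in reactions))
    consumed = set().union(*(r.substrates for r in reactions))
    # Symmetric difference: in exactly one of the two sets.
    return (produced ^ consumed) - transported


def reachable(reactions: List[Reaction], seed: Set[str]) -> Set[str]:
    """Forward propagation: fire every reaction whose substrates are all reached."""
    reached = set(seed)
    changed = True
    while changed:
        changed = False
        for r in reactions:
            if r.substrates <= reached and not r.products <= reached:
                reached |= r.products
                changed = True
    return reached


# Hypothetical toy network.
network = [
    rxn(["glucose"], ["g6p"]),
    rxn(["g6p"], ["pyruvate"]),
    rxn(["pyruvate", "nh3"], ["alanine"]),
    rxn(["x"], ["y"]),
]

dead_ends = dead_end_metabolites(network, transported={"glucose", "nh3"})
with_nitrogen = reachable(network, {"glucose", "nh3"})
without_nitrogen = reachable(network, {"glucose"})
```

With both glucose and nh3 in the seed set, alanine is reached; with glucose alone it is not, which is the kind of failure that would point to a gap in either the growth medium definition or the network model.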
Pathway Tools indirectly supports a two-phased pathway-based paradigm for drug discovery. Phase I is the search for essential in vivo metabolic pathways: pathways whose function is essential for microbial growth in the host. Phase II is the search for targets within essential in vivo pathways. Both phases are supported by a Pathway Tools module that predicts choke point reactions within the metabolic network as likely drug targets.
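The text above does not define a choke point. Assuming the commonly used definition, namely a reaction that is the sole consumer of some metabolite or the sole producer of some metabolite, a minimal Python sketch of the idea looks as follows. The reaction names and network are hypothetical, and the actual Pathway Tools module may apply additional filters.

```python
from collections import defaultdict


def choke_points(reactions):
    """reactions: dict mapping reaction name -> (substrates, products).

    Returns reactions that uniquely consume or uniquely produce a metabolite
    (assumed definition, see the lead-in).
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    result = set()
    for index in (consumers, producers):
        for reaction_names in index.values():
            if len(reaction_names) == 1:
                result |= reaction_names
    return result


# Hypothetical toy network: r1 and r2 are redundant, r3 is not.
toy = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"A"}, {"B"}),
    "r3": ({"B"}, {"C"}),
}
toy_choke_points = choke_points(toy)
```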
Pathway Tools provides a suite of comparative analysis operations that can be applied to multiple user-selected PGDBs (see ‘Comparative tools’ section for more details). Pathway Tools emphasizes comparisons at the functional level, rather than the sequence level. Example comparisons include (i) highlighting on the cellular overview of one organism the reactions that it shares (or does not share) with one or more other organisms; (ii) a tabular comparison of the reaction complements of several organisms, organized by substrate type (e.g. small molecules, RNAs, and proteins) or by number of isozymes per reaction; (iii) a comparison of the pathway complements of several organisms, where the tabular pathway comparison is organized by a pathway ontology; (iv) a table showing which genes have orthologs in which PGDBs; and (v) a comparison of the genome organization of orthologs using the genome browser.
Metabolic engineering is a discipline that seeks to modify the metabolic network of an organism in a desired fashion, such as to achieve overproduction of desired end products, or degradation of specified compounds. Pathway Tools is designed to assist metabolic engineers in several respects. Its inference capabilities aid in rapid characterization of a host organism for metabolic engineering. Its editing tools permit refinement of that metabolic model. Its omics analysis capabilities aid metabolic engineers in understanding the activity levels of different portions of the metabolic network under different growth conditions. It provides a tool for tracing metabolites forward and backward through the metabolic network to aid understanding of the metabolic fate of specific molecules (see ‘Metabolite tracing’ section for more details).
The life cycle of a PGDB typically includes the following three types of procedures.
The following subsections describe the Pathway Tools components for addressing these procedures.
PathoLogic performs a series of computational inferences that are summarized in Figure 1. These inferences can be performed in an interactive mode, in which the user guides the system through each step, and can review and modify the inferences made by the system using interactive tools. PathoLogic can also execute in a batch mode in which all processing is automated. In batch mode, PathoLogic can process hundreds of genomes.
The input to PathoLogic is the annotated genome of an organism. PathoLogic does not perform genome annotation; its input must supply the genome sequence, the locations of genes and identified functions of gene products. The sequence is supplied as a set of FASTA-format files, one per replicon. The annotation is supplied as a set of files in GenBank format or PathoLogic format, each of which describes the annotation of one replicon (chromosome or plasmid), or of one contig for genomes that are not fully assembled.
The annotation specified in a GenBank or PathoLogic file can include the start and stop positions of the coding region for each gene, and intron positions. It can also include a description of the function of the gene product as a text string, one or more Enzyme Commission (EC) numbers and one or more GO terms. The annotation can also include a gene name, synonyms for the gene name and the product name, links to other bioinformatics DBs, and comments.
PathoLogic initializes the schema of the new PGDB by copying from MetaCyc the definitions of the approximately 3200 classes and 250 slots (DB attributes) that define the schema of a PGDB.
PathoLogic next creates a PGDB object for every replicon and contig defined by the input files, and for every gene and gene product defined in the input files. It populates these new objects with data from the input files, such as gene names and their sequence coordinates and gene product names. As a result of these operations, the new PGDB now mirrors the information in the input files.
Pathway Tools predicts the metabolic pathway complement of an organism by assessing which known pathways from the MetaCyc PGDB are present in the annotated genome of that organism's PGDB. This inference is performed in two steps that are described and evaluated further in Paley and Karp, and in Karp et al.
Enzymes in the PGDB are assigned to their corresponding reactions in MetaCyc, thus defining the reactome of the organism. PathoLogic performs this assignment by matching the gene-product names (enzyme names), the EC numbers and the GO terms assigned to genes in the genome against MetaCyc reactions. The program can use whatever combination of these three information types is available in a given genome. For example, the fabD gene in Bacillus anthracis was annotated with the function ‘malonyl CoA-acyl carrier protein transacylase.’ That name was recognized by PathoLogic as corresponding to the MetaCyc reaction whose EC number is 2.3.1.39. PathoLogic therefore imported that reaction and its substrates into the B. anthracis PGDB, and created an enzymatic-reaction object linking that reaction to that B. anthracis protein.
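The matching step can be pictured as a lookup that accepts any of the three evidence types. The Python sketch below is only an illustration under simplifying assumptions: the reaction index, its identifiers and the placeholder GO strings are hypothetical stand-ins for MetaCyc, and name matching is reduced to exact comparison after a simple normalization.

```python
def normalize(name):
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


# Hypothetical miniature reaction index standing in for MetaCyc.
# Names are stored in normalized form; GO identifiers are placeholders.
REACTION_INDEX = {
    "RXN-TOY-1": {
        "names": {"alcohol dehydrogenase"},
        "ec": {"1.1.1.1"},
        "go": {"GO:TOY0001"},
    },
    "RXN-TOY-2": {
        "names": {"malonyl coa acyl carrier protein transacylase"},
        "ec": set(),
        "go": set(),
    },
}


def match_reactions(gene, index):
    """Return reaction ids supported by the gene's product name, EC numbers or GO terms."""
    hits = set()
    product = normalize(gene.get("product", ""))
    for rxn_id, rxn in index.items():
        if product in rxn["names"]:
            hits.add(rxn_id)
        if set(gene.get("ec", ())) & rxn["ec"]:
            hits.add(rxn_id)
        if set(gene.get("go", ())) & rxn["go"]:
            hits.add(rxn_id)
    return hits


# A gene annotated only with a product name, and one annotated only with an EC number.
fabD = {"name": "fabD", "product": "malonyl CoA-acyl carrier protein transacylase"}
toy_gene = {"name": "toyA", "product": "hypothetical protein", "ec": ["1.1.1.1"]}

fabD_hits = match_reactions(fabD, REACTION_INDEX)
toy_hits = match_reactions(toy_gene, REACTION_INDEX)
```

Each hit would then correspond to importing the reaction and creating an enzymatic-reaction object that links it to the protein.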
Although hundreds of such enzyme-reaction assignments are performed automatically by PathoLogic, it typically does not recognize on the order of 20% of the enzyme names in a genome. Therefore, PathoLogic includes an interactive tool that presents names of putative metabolic enzymes (all proteins whose name ends in ‘ase’, with exclusion of certain nonspecific and nonmetabolic enzyme names) to the user, and aids the user in assigning those enzymes to reactions in MetaCyc. For example, PathoLogic provides an operation that runs an inexact string comparison search between the enzyme name and all enzyme names in MetaCyc, which sometimes allows the user to identify a match based on scrambled word orders within complex enzyme names.
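One simple way to make a name comparison tolerant of scrambled word order is to sort the words of each name before comparing. The Python sketch below uses the standard-library `difflib.SequenceMatcher` on such sorted-token keys; the algorithm that PathoLogic actually uses is not described here, so both the technique and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def token_key(name):
    """Split a name into words and sort them, so that word order no longer matters."""
    tokens = name.lower().replace(",", " ").replace("-", " ").split()
    return " ".join(sorted(tokens))


def best_matches(query, known_names, cutoff=0.8):
    """Return known names whose sorted-token key is similar to the query's, best first."""
    query_key = token_key(query)
    scored = [
        (SequenceMatcher(None, query_key, token_key(known)).ratio(), known)
        for known in known_names
    ]
    return [known for score, known in sorted(scored, reverse=True) if score >= cutoff]


known = ["glucose-6-phosphate dehydrogenase", "alcohol dehydrogenase"]
matches = best_matches("dehydrogenase, glucose-6-phosphate", known)
```

Here the inverted query name yields the same key as the first known name, while the second known name falls below the cutoff; in an interactive setting the ranked list would be shown to the user rather than accepted automatically.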
Once the reactome of the organism has been established in the preceding manner, PathoLogic imports all MetaCyc pathways that contain at least one reaction in the organism's reactome into the new PGDB. Once the pathways are imported, PathoLogic attempts to prune out those that are likely to be false positive predictions. That pruning process considers both the fraction of reaction steps in the pathway that have assigned enzymes, and how many of the reactions with assigned enzymes are unique to that pathway (as opposed to being used in additional metabolic pathways in that organism). The remaining pathways are those that are predicted to occur in the organism under analysis.
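The import-then-prune procedure can be sketched in a few lines of Python. The two evidence measures follow the description above, but the decision rule that combines them, its thresholds and the toy pathways are assumptions made for illustration; the actual PathoLogic scoring is described in the cited publications.

```python
def pathway_evidence(pathway_rxns, reactome, rxn_to_pathways):
    """Return (fraction of steps with an enzyme, number of such steps unique to this pathway)."""
    present = [r for r in pathway_rxns if r in reactome]
    unique = [r for r in present if len(rxn_to_pathways[r]) == 1]
    return len(present) / len(pathway_rxns), len(unique)


def predict_pathways(pathways, reactome, min_fraction=0.5, min_unique=1):
    """Import candidate pathways, then prune those with weak evidence.

    The keep rule (enough unique reactions OR a high enough fraction) and its
    thresholds are hypothetical.
    """
    # Import step: every pathway with at least one reaction in the reactome.
    candidates = {
        name: rxns for name, rxns in pathways.items()
        if any(r in reactome for r in rxns)
    }
    # Which candidate pathways use each reaction?
    rxn_to_pathways = {}
    for name, rxns in candidates.items():
        for r in rxns:
            rxn_to_pathways.setdefault(r, set()).add(name)

    kept = {}
    for name, rxns in candidates.items():
        fraction, n_unique = pathway_evidence(rxns, reactome, rxn_to_pathways)
        if n_unique >= min_unique or fraction >= min_fraction:
            kept[name] = (fraction, n_unique)
    return kept


# Hypothetical reference pathways and organism reactome.
reference_pathways = {
    "P1": ["r1", "r2", "r3"],
    "P2": ["r3", "r4", "r5", "r6"],
    "P3": ["r7", "r8"],
}
organism_reactome = {"r1", "r2", "r3"}
predicted = predict_pathways(reference_pathways, organism_reactome)
```

In this toy case P3 is never imported, and P2 is imported but pruned because its only supported reaction is shared with P1.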
As MetaCyc has grown in size, we have seen a significant increase in the number of false positive predictions made by PathoLogic; thus, we have recently altered the pruning procedure to prune a predicted pathway from organism X if organism X is outside the expected taxonomic distribution of that pathway. MetaCyc records curated information about the taxonomic groups in which a pathway is expected to occur, based on experimental observations of that pathway to date. For example, many pathways are expected to occur in plants only. This rule has significantly increased the accuracy of PathoLogic.
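The taxonomic rule amounts to testing whether the organism's lineage intersects the curated range of each pathway. A minimal Python sketch follows; the pathway names and taxon labels are toy values, and the choice to keep pathways that have no curated range is an assumption.

```python
def prune_by_taxonomy(predicted, expected_taxa, organism_lineage):
    """Keep a pathway only if the organism falls inside its expected taxonomic range.

    predicted: list of pathway names.
    expected_taxa: dict pathway name -> set of taxa in which the pathway is expected.
    organism_lineage: taxa from the root down to the organism.
    Pathways without a curated range are kept (assumption).
    """
    lineage = set(organism_lineage)
    kept = []
    for pathway in predicted:
        expected = expected_taxa.get(pathway, set())
        if not expected or expected & lineage:
            kept.append(pathway)
    return kept


# Toy values for illustration only.
predicted_pathways = [
    "toy pathway A",
    "toy plant-only pathway",
    "toy pathway without curated range",
]
expected_ranges = {
    "toy pathway A": {"Bacteria"},
    "toy plant-only pathway": {"Viridiplantae"},
}
bacterial_lineage = ["Bacteria", "Proteobacteria"]

after_pruning = prune_by_taxonomy(predicted_pathways, expected_ranges, bacterial_lineage)
```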
The Pathway Tools operon predictor identifies operon boundaries by examining pairs of adjacent genes A and B and using information such as intergenic distance, and whether it can identify a functional relationship between A and B, such as membership in the same pathway, membership in the same multimeric protein complex, or whether A is a transporter for a substrate within a metabolic pathway in which B is an enzyme.
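The pairwise character of this procedure can be illustrated with a short Python sketch. The decision rule below is hypothetical: it requires adjacent genes to be on the same strand (an added assumption), joins them when the intergenic gap is small, and tolerates a larger gap when the two genes share a pathway. The thresholds and gene records are invented, and only one of the functional relationships listed above is modeled.

```python
def likely_same_operon(a, b, close_gap=50, loose_gap=200):
    """Decide whether adjacent genes a and b (a upstream of b) share an operon.

    Hypothetical rule: same strand, and either a short intergenic gap or a
    moderate gap plus shared pathway membership.
    """
    if a["strand"] != b["strand"]:
        return False
    gap = b["start"] - a["end"] - 1
    related = bool(a["pathways"] & b["pathways"])
    return gap <= close_gap or (related and gap <= loose_gap)


def predict_operons(genes):
    """Walk the genes in coordinate order and cut an operon boundary where the test fails."""
    operons = []
    for gene in sorted(genes, key=lambda g: g["start"]):
        if operons and likely_same_operon(operons[-1][-1], gene):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return [[g["name"] for g in operon] for operon in operons]


# Invented gene records.
toy_genes = [
    {"name": "A", "start": 100, "end": 400, "strand": "+", "pathways": {"P1"}},
    {"name": "B", "start": 420, "end": 900, "strand": "+", "pathways": {"P1"}},
    {"name": "C", "start": 1050, "end": 1500, "strand": "+", "pathways": {"P1"}},
    {"name": "D", "start": 1600, "end": 2000, "strand": "+", "pathways": set()},
    {"name": "E", "start": 2010, "end": 2500, "strand": "-", "pathways": set()},
]
toy_operons = predict_operons(toy_genes)
```

Gene C is joined to B despite a 149-base gap because the two share a pathway, whereas D, with no functional link and a 99-base gap, starts a new operon.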
A pathway hole is a reaction in a metabolic pathway for which no enzyme has been identified in the genome that catalyzes that reaction. Typical microbial genomes contain 200–300 pathway holes. Although some pathway holes are probably genuine, we believe that the majority are likely to result from the failure of the genome annotation process to identify the genes corresponding to those pathway holes. For example, genome annotation systems systematically under-annotate genes with multiple functions, and we believe that the enzyme functions for many pathway holes are unidentified second functions for genes that have one assigned function.
The pathway hole filling program PHFiller (a component of PathoLogic) generates hypotheses as to which genes code for these missing enzymes using the following method. Given a reaction that is a pathway hole, the program first queries the UniProt DB to find all known sequences for enzymes that catalyze that same reaction in other organisms. The program then uses the BLAST tool to compare that set of sequences against the full proteome of the organism in which we are seeking hole fillers. It scores the resulting BLAST hits using a Bayesian classifier that considers information such as genome localization, that is, is a potential hole filler in the same operon as another gene in the same metabolic pathway? At a stringent probability score cutoff, our method finds potential hole fillers for ~45% of the pathway holes in a microbial genome.
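To give a feel for how a Bayesian score can combine sequence similarity with genome context, the Python sketch below applies a naive-Bayes-style odds update to two binary evidence features. This is a deliberate simplification and not the PHFiller classifier: the features, prior, likelihood ratios and cutoff are all hypothetical values.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update prior odds with independent likelihood ratios and return a probability."""
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for two pieces of evidence.
LR_STRONG_BLAST_HIT = 20.0
LR_SAME_OPERON_AS_PATHWAY_GENE = 5.0


def score_candidates(candidates, prior=0.1, cutoff=0.9):
    """Return (gene, probability) pairs that pass the cutoff, best first."""
    accepted = []
    for candidate in candidates:
        ratios = []
        if candidate["strong_blast_hit"]:
            ratios.append(LR_STRONG_BLAST_HIT)
        if candidate["same_operon_as_pathway_gene"]:
            ratios.append(LR_SAME_OPERON_AS_PATHWAY_GENE)
        probability = posterior_probability(prior, ratios)
        if probability >= cutoff:
            accepted.append((candidate["gene"], probability))
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)


# Invented candidates for a single pathway hole.
hole_candidates = [
    {"gene": "geneA", "strong_blast_hit": True, "same_operon_as_pathway_gene": True},
    {"gene": "geneB", "strong_blast_hit": True, "same_operon_as_pathway_gene": False},
    {"gene": "geneC", "strong_blast_hit": False, "same_operon_as_pathway_gene": False},
]
hole_fillers = score_candidates(hole_candidates)
```

With these made-up numbers, a strong BLAST hit alone does not pass the stringent cutoff, but the same hit located in an operon with another gene of the pathway does.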
PHFiller includes a graphical interface that optionally presents each inferred hole filler to the user along with information that helps the user evaluate the hole fillers, and allows the user to accept or reject the hole fillers that it has proposed.
Membrane transport proteins typically make up 5–15% of the gene content of organisms sequenced to date. Transporters import nutrients into the cell, thus determining the environments in which cell growth is possible. The development of the PathoLogic Transport Inference Parser (TIP) was motivated by the need to perform symbolic inferences on cellular transport systems, and by the need to include transporters on the Cellular Overview diagram. The motivating symbolic inferences include the problems of computing answers to the following queries: What chemicals can the organism import or export? For which cellular metabolites that are consumed by metabolic reactions but never produced by a reaction is there no known transporter (meaning that the origin of such metabolites is a mystery, and indicates missing knowledge about transporters or reactions that produce the compound)?
To answer such queries, we must have a representation of transporter function that is computable (ontology based). Pathway Tools has such a representation, in which transport events are represented as reactions in which the transported compound(s) are substrates. Each substrate is labeled with the cellular compartment in which it resides, and each substrate is a controlled-vocabulary term from the extensive set of chemical compounds in MetaCyc. The TIP program converts the free-text descriptions of transporter functions found in genome annotations (examples: ‘predicted ATP transporter of cyanate’ and ‘sodium/proline symporter’) into computable transport reactions.
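A compartment-labeled reaction of this kind is easy to picture as a small data structure. The Python sketch below is an illustrative encoding, not the Pathway Tools schema: the class names, the compartment labels and the compound spellings are hypothetical. It encodes a sodium/proline symporter and then answers the two motivating queries on toy data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Substrate:
    compound: str      # controlled-vocabulary compound name
    compartment: str   # where this copy of the compound resides


@dataclass(frozen=True)
class TransportReaction:
    left: Tuple[Substrate, ...]
    right: Tuple[Substrate, ...]


def imported_compounds(reactions, outside="extracellular", inside="cytosol"):
    """Compounds that some reaction moves from the outside to the inside compartment."""
    imported = set()
    for reaction in reactions:
        entering = {s.compound for s in reaction.left if s.compartment == outside}
        arrived = {s.compound for s in reaction.right if s.compartment == inside}
        imported |= entering & arrived
    return imported


def unexplained_inputs(consumed, produced, imported):
    """Metabolites that are consumed but neither produced nor imported."""
    return set(consumed) - set(produced) - set(imported)


# Sodium/proline symporter: both substrates cross in the same direction.
symporter = TransportReaction(
    left=(Substrate("Na+", "extracellular"), Substrate("L-proline", "extracellular")),
    right=(Substrate("Na+", "cytosol"), Substrate("L-proline", "cytosol")),
)

can_import = imported_compounds([symporter])
mystery = unexplained_inputs(
    consumed={"L-proline", "toy cofactor"},
    produced=set(),
    imported=can_import,
)
```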
TIP performs the following operations that are explained more fully in Lee et al. Starting with the full set of monomeric proteins encoded by the genome, TIP first identifies the likely transport proteins by searching for proteins that include various keywords indicative of transport function (such as ‘transport’ and ‘channel’), and that lack certain counter-indicator keywords (such as ‘regulator’). Then, for each such identified transport protein T, the program performs these steps.
An evaluation showed that 67.5% of TIP predictions were correct; the remainder had an error in the substrate, in the directionality of transport, or in the energy coupling. TIP includes a graphical interface that allows the user to interactively review and revise its predictions.
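The initial keyword screen that TIP applies to protein descriptions can be sketched as a simple substring filter. The Python example below uses only the keywords quoted above; the full keyword lists used by TIP are longer and are not reproduced here, and the protein descriptions other than the cyanate example are invented.

```python
# Keywords quoted in the text; the real lists are longer.
INDICATOR_KEYWORDS = ("transport", "channel")
COUNTER_INDICATOR_KEYWORDS = ("regulator",)


def is_candidate_transporter(description):
    """True if the description has an indicator keyword and no counter-indicator keyword."""
    text = description.lower()
    has_indicator = any(keyword in text for keyword in INDICATOR_KEYWORDS)
    has_counter = any(keyword in text for keyword in COUNTER_INDICATOR_KEYWORDS)
    return has_indicator and not has_counter


descriptions = [
    "predicted ATP transporter of cyanate",
    "toy sugar transport system regulator",
    "toy DNA polymerase subunit",
]
candidates = [d for d in descriptions if is_candidate_transporter(d)]
```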
The Editors support PGDB curation through interactive modification and updating of all the major datatypes supported by Pathway Tools. They can be invoked quickly from every Navigator window through a single mouse operation so that a user who sees within the Navigator an object that needs to be updated can quickly invoke an editing tool to make the required change. When the user exits from the editing tool, the modified version of the object is then displayed within the Navigator.
The Editors allow the user to invoke an external spelling checker (ispell) to check spelling within comment fields.
Curators typically become proficient at these tools after a day of training and a few weeks of experience.
The editing tools included in Pathway Tools are as follows:
Often, many curators collaborate on a given PGDB, and it is desirable to attribute their contributions accordingly. This not only helps to find out who should be asked if questions about particular entries arise, but more important, it will provide an incentive for high-quality contributions, because contributors will be able to clearly demonstrate their accomplishments.
The editing tools for the most important objects thus support attaching credits of several kinds. When an object such as a pathway is first created, by default, a ‘created’ credit is attached to the object, along with a timestamp. The curator is described by an author DB object, and a DB object describing the author's organization. The author frame records the name, email address and the organization(s) with which the curator is affiliated. Editing tools exist for authors and organizations, and substring search allows convenient retrieval. A given credit for an object can be attached to either authors, organizations or both, in a flexible manner. Every author and organization has a ‘home’ page that lists all the objects that have been credited.
Other kinds of credit are ‘revised’ when a curator substantially edits an object that was created some time ago, and a ‘last curated’ flag can be set to indicate when a curator has last researched the literature available for a given object. The last-curated flag is useful for those objects about which almost nothing is known, to distinguish between the case where no curator ever looked at the object, versus where an extensive search was performed but still nothing new was found.
Credits are included with pathways exported to a file, which allows exchange of pathway contributions between PGDBs, complete with proper credit attribution. An additional kind of credit called ‘reviewed’ can be used when such external contributions have been reviewed by a receiving curator, or to also attribute reviews of various objects by invited, external domain experts.
During the PGDB life cycle, a number of types of PGDB updates are required that would be extremely onerous to perform if the user were forced to perform them manually, one at a time. Therefore, Pathway Tools provides several facilities for performing bulk updates of a PGDB. The most general facility is that users can write their own programs to perform arbitrary types of updates through the Pathway Tools APIs in the Perl, Java and Lisp languages (see ‘Computational access to PGDBs’ section).
Some groups choose to store the authoritative version of their genome annotation in a DB external to the PGDB, such as groups that developed their own genome DB system prior to adopting Pathway Tools. Such users need the ability to update their PGDB with data from a revised genome annotation without overwriting or otherwise losing any manual curation that has been added to the PGDB. Pathway Tools provides an interface for doing just that. It takes as input one or more update files, either in GenBank format or PathoLogic file format. The files can contain either a complete revised annotation for the organism, or they can contain just the information that has changed. The software will parse the update files and determine all differences between the new data and the old. Types of changes that are detected include new genes, as well as updated gene positions, names, synonyms, comments, links to external DBs and updated functional assignments. None of the changes will be propagated automatically. Instead, a pop-up dialog will summarize different classes of changes. For example, it will list the number of new genes, the number of genes with name changes, the number of previously unassigned genes that now match a reaction and the number of previously assigned genes that now match a different reaction. For each class of changes, the curator has the option of either accepting all updates (e.g. creating DB objects for all the new genes), or of checking each proposed update. Once this phase is complete and any changes to functional assignments have been made, the software will re-run the pathway inference procedure described in ‘Pathologic inference of metabolic pathways’ section, identify any new pathways that are inferred to be present and any existing pathways that no longer have sufficient evidence and allow the curator to review those changes.
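The core of such an update is a comparison that classifies differences without applying them. The Python sketch below illustrates that idea on toy gene records; the record layout, field names and change classes are hypothetical simplifications of what the update interface reports.

```python
def diff_annotations(old, new):
    """Classify differences between two annotations keyed by gene id.

    Nothing is applied here; the result is a set of proposals for a curator to review.
    """
    changes = {
        "new_genes": [],
        "name_changes": [],
        "position_changes": [],
        "function_changes": [],
    }
    for gene_id, record in new.items():
        previous = old.get(gene_id)
        if previous is None:
            changes["new_genes"].append(gene_id)
            continue
        if record["name"] != previous["name"]:
            changes["name_changes"].append(gene_id)
        if (record["start"], record["end"]) != (previous["start"], previous["end"]):
            changes["position_changes"].append(gene_id)
        if record["function"] != previous["function"]:
            changes["function_changes"].append(gene_id)
    return changes


def summarize(changes):
    """Counts per class of change, as a summary dialog might display them."""
    return {kind: len(gene_ids) for kind, gene_ids in changes.items()}


# Invented records.
old_annotation = {
    "g1": {"name": "abcA", "start": 100, "end": 400, "function": "hypothetical protein"},
    "g2": {"name": "abcB", "start": 500, "end": 900, "function": "toy kinase"},
}
new_annotation = {
    "g1": {"name": "abcA", "start": 100, "end": 400, "function": "toy transporter"},
    "g2": {"name": "abcC", "start": 500, "end": 930, "function": "toy kinase"},
    "g3": {"name": "abcD", "start": 1000, "end": 1300, "function": "toy lyase"},
}

proposed = diff_annotations(old_annotation, new_annotation)
summary = summarize(proposed)
```

Because the sketch iterates over the new records only, it also works when the update file contains just the genes that changed.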
Pathway Tools contains an extensive set of programs for performing consistency checking of a PGDB to detect structural defects that sometimes arise within PGDBs. Also included in this component are tools for computing and caching aggregate statistics for a PGDB, such as computing the molecular weights of all proteins from their amino acid sequences. The statistics are cached so that they can be displayed quickly. At SRI, we run these programs as part of the quarterly release process for EcoCyc and MetaCyc.
Roughly half of the programs automatically repair the PGDB problems that they find. Such problems could be caused by user data entry errors, or by errors in Pathway Tools itself. Example checks ensure that inverse relationship links are set properly (e.g. that a gene is linked to its gene product, and that the product links back to the gene); make sure that pathways do not contain duplicate reactions; validate and update GO term assignments with respect to the latest version of GO; perform formatting checks in comment text; search gene reading frames for internal stop codons; and remove redundant bonds from chemical structures.
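As an illustration of a self-repairing check, the following Python sketch verifies inverse links and adds any that are missing. The frame layout (a dictionary of slot names to lists of object IDs), the slot names `PRODUCT` and `GENE`, and the function name are assumptions made for this example, not the actual Pathway Tools checker.

```python
def repair_inverse_links(frames, slot, inverse_slot):
    """Ensure every value stored in `slot` has a back-link in `inverse_slot`.

    frames: dict mapping object ID -> dict mapping slot name -> list of object IDs.
    Returns the list of (source, target) pairs whose back-link was missing
    and has now been added.
    """
    repaired = []
    for source_id, slots in frames.items():
        for target_id in slots.get(slot, []):
            target = frames.get(target_id)
            if target is None:
                continue  # dangling reference; a separate check would report it
            back_links = target.setdefault(inverse_slot, [])
            if source_id not in back_links:
                back_links.append(source_id)
                repaired.append((source_id, target_id))
    return repaired

# Toy PGDB fragment: the gene points at its product, but the back-link is missing.
frames = {
    "gene-1": {"PRODUCT": ["protein-1"]},
    "protein-1": {},
}
fixed = repair_inverse_links(frames, "PRODUCT", "GENE")
print(fixed)
```

Running the check a second time returns an empty list, which is a convenient way to confirm that the repair left the data consistent.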
The other checker programs generate listings of every error detected, and allow the user to click on each problematic object in the listing to enter the editor for that object to repair it.
Most new releases of Pathway Tools include additions or modifications to the Pathway Tools schema. Schema changes are made to model the underlying biology more accurately (such as adding support for introns and exons), extend the datatypes within Pathway Tools (such as adding support for features on protein sequences) or to increase the speed of the software. Because each new version of the software depends on finding data within the fields defined by the associated version of the schema, existing user PGDBs created by older versions of the software will be incompatible with these new software versions.
Therefore, every release of Pathway Tools contains a program to upgrade PGDBs whose schema corresponds to the previous version of the software, to the new version of the software. When a user opens a PGDB under a new version of the software, the software detects that the schema of the PGDB is out of date, and offers to run this schema upgrade program for the user. For users who have not upgraded the software for several releases, several upgrade operations are performed consecutively. Example upgrade operations include adding new classes to the PGDB from the MetaCyc PGDB, adding new slots to PGDB classes, deleting PGDB classes, moving data values from one slot to another and moving objects from one class to another. The schema upgrade leaves the user's curated data intact.
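The consecutive application of upgrade operations can be sketched in Python as a chain of single-version steps. The version numbers, the slot names (`FEATURES`, `NOTE`, `COMMENT`) and all function names below are hypothetical; the sketch only illustrates the chaining idea, under the assumption that each release ships one step from the preceding schema version.

```python
def upgrade_schema(pgdb, upgrades, target_version):
    """Apply single-step upgrades in order until pgdb reaches target_version.

    pgdb: dict with a "schema-version" key and an "objects" dict.
    upgrades: dict mapping from-version -> (to-version, upgrade function).
    """
    while pgdb["schema-version"] != target_version:
        current = pgdb["schema-version"]
        if current not in upgrades:
            raise ValueError(f"no upgrade path from schema version {current}")
        next_version, step = upgrades[current]
        step(pgdb)
        pgdb["schema-version"] = next_version
    return pgdb

def add_features_slot(pgdb):
    # Example operation: add a new, empty slot to every object.
    for obj in pgdb["objects"].values():
        obj.setdefault("FEATURES", [])

def move_note_to_comment(pgdb):
    # Example operation: move data values from one slot to another.
    for obj in pgdb["objects"].values():
        if "NOTE" in obj:
            obj.setdefault("COMMENT", []).extend(obj.pop("NOTE"))

upgrades = {1: (2, add_features_slot), 2: (3, move_note_to_comment)}
pgdb = {"schema-version": 1, "objects": {"gene-1": {"NOTE": ["curated remark"]}}}
upgrade_schema(pgdb, upgrades, target_version=3)
print(pgdb)
```

A PGDB that is two releases behind passes through both steps in order, and the curated value survives the move from one slot to the other.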
Every new release of Pathway Tools includes a new version of the MetaCyc DB, which, in addition to providing new data content, typically contains updates and corrections to existing pathways, reactions and compounds. Pathway Tools includes an option to propagate such updates and corrections to an existing organism PGDB. However, because we do not want to override any manual edits made to a PGDB, this tool does not run automatically. Much like the tool for incorporating a revised genome annotation, described in ‘Bulk PGDB updating’ section, this tool organizes the changes into logical groups (such as all compounds with newly added structures, or all reactions with changed reaction equations), and allows the user to either accept an entire group of changes, or to examine and confirm each member of a group.
The Pathway Tools schema encodes conceptualizations of biological information. The schema is a key part of Pathway Tools' ability to make many types of biological data accessible to computational analysis.
The Pathway Tools schema defines a set of classes and a set of slots. Classes describe types of biological entities, such as genes and pathways, and are arranged in a class–subclass hierarchy. Many of the important Pathway Tools classes are shown in Figures 2 and 3. DB slots store properties of the classes and objects within a PGDB: both attributes of PGDB objects and relationships between PGDB objects. Figure 4 provides an overview of the relationships among PGDB classes. For example, user queries can follow the relationship from a gene to the protein that it codes for, from a protein to a reaction that it catalyzes and from a reaction to a metabolic pathway in which it is a component, to answer questions such as ‘find all metabolic pathways in which the products of a given gene play a role’.
Every PGDB object has a stable unique identifier (ID)—a symbol that uniquely identifies that object within the PGDB. Example unique IDs include TRP (an identifier for a metabolite), Rxn0-2382 (an identifier for a reaction) and Pwy0-1280 (an identifier for a pathway). Relationships within a PGDB are implemented by storing object IDs within slots. For example, to state that the TRP (l-tryptophan) object is a reactant in the reaction Rxn0-2382, a slot of Rxn0-2382 called LEFT (meaning reactants) contains the value TRP. Many PGDB relationships exist in both forward and backward directions, for example, the TRP object contains a field called APPEARS-IN-LEFT-SIDE-OF that lists all reactions in which TRP is a reactant. The slots LEFT and APPEARS-IN-LEFT-SIDE-OF are called inverses.
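The following Python sketch illustrates the idea of relationships stored as object IDs in slots, with an inverse slot kept in step automatically. The class `MiniPGDB` and its methods are hypothetical constructs for this example; only the IDs `TRP` and `Rxn0-2382` and the slot names `LEFT` and `APPEARS-IN-LEFT-SIDE-OF` come from the text.

```python
from collections import defaultdict

# Slot -> inverse slot, as named in the text.
INVERSES = {"LEFT": "APPEARS-IN-LEFT-SIDE-OF"}

class MiniPGDB:
    """A toy frame store: object ID -> slot name -> list of object IDs."""

    def __init__(self):
        self.frames = defaultdict(lambda: defaultdict(list))

    def add_value(self, obj_id, slot, value_id):
        """Store value_id in obj_id's slot and update the inverse slot, if any."""
        values = self.frames[obj_id][slot]
        if value_id not in values:
            values.append(value_id)
        inverse = INVERSES.get(slot)
        if inverse is not None:
            back_links = self.frames[value_id][inverse]
            if obj_id not in back_links:
                back_links.append(obj_id)

    def get_values(self, obj_id, slot):
        return list(self.frames[obj_id][slot])

db = MiniPGDB()
db.add_value("Rxn0-2382", "LEFT", "TRP")  # TRP is a reactant of Rxn0-2382
print(db.get_values("TRP", "APPEARS-IN-LEFT-SIDE-OF"))
```

Because the back-link is written at the same time as the forward link, a query can start from either the reaction or the metabolite.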
There are two alternative ways in which one might choose to represent the metabolic network in a computer: as a simple listing of all metabolic reactions that occur in the cell, or by partitioning the reaction list into a carefully delineated set of metabolic pathways that describe small, functionally linked subsets of reactions. Which approach is preferred? Both approaches have value, and they are not mutually exclusive; therefore, Pathway Tools supports both views of metabolism in a PGDB.
Pathway Tools conceptualizes the metabolic network in three layers. The first layer consists of the small molecule substrates upon which metabolism operates. The second layer consists of the reactions that interconvert the small molecule metabolites. The third layer consists of the metabolic pathways whose components are the metabolic reactions of the second layer. Note that not all reactions in the second layer are included in pathways in the third layer, because some metabolic reactions have not been assigned to any metabolic pathway by biologists.
Scientists who choose to view the metabolic network within a PGDB solely as a reaction list can operate on the second layer directly without interference from the third layer. But for a scientist for whom the pathway definitions are important, the pathway layer is available in the PGDB.
The pathways in PGDBs are modules of the metabolic network of a single organism. Pathway boundaries are defined by considering the following factors. Pathways are often regulated as a unit (based on substrate-level regulation of key enzymes, on regulation of gene expression and on other types of regulation). Pathway boundaries are often defined at high connectivity, stable metabolites. Pathway conservation across multiple species is also considered, as are pathway definitions from the experimental literature.
The compounds, reactions and pathways in levels 1–3 are each represented as distinct DB objects within a PGDB. The relationships among the metabolic datatypes in a PGDB are depicted by the blue region of Figure 4.
The pathway for biosynthesis of l-tryptophan shown in Figure 5 is represented within a PGDB as shown in Figure 6. An object representing the pathway, at the top of the figure, is connected via slot Reaction List to objects representing every reaction within the pathway. One of those reactions (Rxn0-2382) is shown in the figure. It is connected, via slots Left and Right, to objects that represent each substrate of the reaction. In addition, it is connected to the enzymatic reaction object EnzRxn0-3701, which in turn is connected to an object representing the enzyme complex that catalyzes the reaction, Cplx0-2401. Note that every slot shown in this diagram has an inverse slot, meaning a slot that represents the inverse relationship. For example, slot In-Pathway represents the relationship from a reaction to the pathway containing that reaction.
Our discussion of PGDB definitions of metabolism thus far has not considered the metabolic enzymes, nor the genome that encodes those proteins. PGDBs define the proteome and the genome of an organism in the following manner, as depicted by the green region of Figure 4.
The proteome of the organism is described as a set of PGDB objects, one for each gene product in the organism, and one for each complex formed from two or more (identical or nonidentical) polypeptides. Furthermore, every chemically modified form of a monomer or of a multimer is encoded by a distinct PGDB object. For example, we might create one object representing an unmodified protein and another representing the phosphorylated form. Each protein object is in turn linked, through a slot in the object, to the metabolic reactions that it catalyzes. Proteins can also be substrates of reactions. Additional PGDB objects define features on proteins, as described in ‘Pathway Tools protein feature ontology’ section.
Each protein product resulting from alternatively spliced forms of a gene is also represented by a distinct protein object. Each protein object records the exons of the gene that encodes it.
Protein objects are also linked to gene objects that define the gene encoding each protein. Each gene in the genome is defined by a distinct PGDB object, as is every replicon (chromosome or plasmid) in the genome. Genes are linked to the replicon on which they reside. In addition, other features on the genome, such as operons, promoters and transcription factor binding sites, are described by PGDB objects.
The associations between enzymes and the reactions they catalyze are implemented using an intermediary object called an enzymatic reaction, as shown in Figure 6. This arrangement allows us to capture the many-to-many relationship that exists between enzymes and reactions—one reaction can be catalyzed by multiple enzymes, and multifunctional enzymes catalyze multiple reactions. The purpose of the enzymatic reaction is to encode information that is specific to the pairing of the enzyme with the reaction, such as cofactors, activators and inhibitors. Consider a bifunctional enzyme with two active sites, where one of the active sites is inhibited by pyruvate, and the second active site is inhibited by lactate. We would represent this situation with two enzymatic reactions linking the enzyme to the two reactions it catalyzes, and each enzymatic reaction would specify a different inhibitor.
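The bifunctional-enzyme example can be written out as a small Python sketch. The class name, field names and the enzyme and reaction IDs are hypothetical; the point is only that inhibitors attach to the enzyme-reaction pairing rather than to the enzyme or the reaction alone.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EnzymaticReaction:
    """One pairing of an enzyme with a reaction, plus pairing-specific data."""
    enzyme: str
    reaction: str
    inhibitors: frozenset = field(default_factory=frozenset)

# A bifunctional enzyme with a different inhibitor for each activity,
# and a second enzyme that catalyzes one of the same reactions.
enzrxns = [
    EnzymaticReaction("enzyme-AB", "reaction-1", frozenset({"pyruvate"})),
    EnzymaticReaction("enzyme-AB", "reaction-2", frozenset({"lactate"})),
    EnzymaticReaction("enzyme-C", "reaction-1"),
]

def reactions_of(enzyme):
    return sorted(er.reaction for er in enzrxns if er.enzyme == enzyme)

def enzymes_of(reaction):
    return sorted(er.enzyme for er in enzrxns if er.reaction == reaction)

def inhibitors_of(enzyme, reaction):
    found = set()
    for er in enzrxns:
        if er.enzyme == enzyme and er.reaction == reaction:
            found |= er.inhibitors
    return found

print(reactions_of("enzyme-AB"), enzymes_of("reaction-1"))
```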
Continuing the explanation of the example shown in Figure 6, the homodimer enzyme Cplx0-2401 is connected to Trypsyn-BProtein, which represents the monomer. The monomer in turn is connected to Eg11025, which represents the gene encoding it. The gene is connected to an object representing the E. coli chromosome, Ecoli-K12-Chromosome.
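A query such as ‘find all metabolic pathways in which the products of a given gene play a role’ amounts to following a chain of slots from object to object. The Python sketch below walks that chain for the objects named in the Figure 6 example. The slot names other than IN-PATHWAY, the pathway ID `tryptophan-biosynthesis-pathway` and the function `follow` are placeholders invented for this illustration.

```python
def follow(frames, start_ids, slot_path):
    """Follow a sequence of slots from a set of starting objects.

    frames: dict mapping object ID -> dict mapping slot name -> list of object IDs.
    Returns the set of object IDs reached after the last slot.
    """
    current = set(start_ids)
    for slot in slot_path:
        current = {
            value
            for obj_id in current
            for value in frames.get(obj_id, {}).get(slot, [])
        }
    return current

# Object IDs are taken from the text; slot names and the pathway ID are hypothetical.
frames = {
    "Eg11025": {"PRODUCT": ["Trypsyn-BProtein"]},
    "Trypsyn-BProtein": {"COMPONENT-OF": ["Cplx0-2401"]},
    "Cplx0-2401": {"CATALYZES": ["EnzRxn0-3701"]},
    "EnzRxn0-3701": {"REACTION": ["Rxn0-2382"]},
    "Rxn0-2382": {"IN-PATHWAY": ["tryptophan-biosynthesis-pathway"]},
}
path = ["PRODUCT", "COMPONENT-OF", "CATALYZES", "REACTION", "IN-PATHWAY"]
pathways = follow(frames, ["Eg11025"], path)
print(pathways)
```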
The Pathway Tools schema can represent all important regulatory interaction types within E. coli, ranging from substrate-level regulation of enzyme activity, to the transcriptional control of gene expression, to regulation of gene expression by small RNAs. In each case, a regulation object within a PGDB captures essential information about one regulatory interaction. A schema class called Regulation defines the class of all regulatory interactions; each of its subclasses defines a different mechanism of cellular regulation. Those subclasses are as follows.
Regulation of protein activity by chemical modification, such as by phosphorylation, is represented by a reaction that converts the unmodified form of the protein to the modified form.
MOD users want to know the type(s) of evidence that support assertions within a MOD, and they want to know the strength of that evidence. We have developed an evidence ontology that can encode information about why we believe certain assertions in a PGDB, the sources of those assertions and the degree of confidence scientists hold in those assertions (although in practice the latter field is rarely populated). An example assertion is the existence of a biological object described in a PGDB: we would like to be able to encode the evidence supporting the existence of a gene, an operon or a pathway that is described within a PGDB. Has the operon been predicted using a computational operon finder? Or is it supported by wet-lab experiments? Our evidence ontology builds upon and substantially extends the GO evidence ontology, which applies only to gene products.
Evidence about object existence in PGDBs is recorded as a structured evidence tuple. An evidence tuple allows us to associate several types of information within one piece of evidence. Each evidence tuple has five fields: Evidence-code, Citation, Curator, Timestamp and Probability.
Evidence-code is a unique ID for the type of evidence, within a hierarchy of 48 evidence types described in Karp et al. and in the Pathway Tools Evidence Ontology. Citation is an optional citation identifier such as a PubMed ID that indicates the source of the evidence. For computational evidence, the citation refers to an article describing the algorithm used. Curator identifies the curator who created this evidence tuple; Timestamp encodes when this evidence tuple was created. Probability is an optional real number indicating the probability that the assertion supported by this evidence is correct, such as a probability provided by an algorithm.
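The evidence tuple maps naturally onto a small record type. The Python sketch below is one possible rendering, with the optional fields allowed to be empty; the class and function names, the curator name and the code string `EV-COMP` are illustrative values assumed for this example, and no claim is made about how Pathway Tools stores or serializes the tuple.

```python
from datetime import datetime
from typing import NamedTuple, Optional

class EvidenceTuple(NamedTuple):
    evidence_code: str            # ID of an evidence type in the ontology
    citation: Optional[str]       # e.g. a PubMed ID; optional
    curator: str                  # who created this tuple
    timestamp: datetime           # when this tuple was created
    probability: Optional[float] = None  # optional confidence from an algorithm

def make_evidence(code, curator, timestamp, citation=None, probability=None):
    """Build an evidence tuple, rejecting probabilities outside [0, 1]."""
    if probability is not None and not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return EvidenceTuple(code, citation, curator, timestamp, probability)

# Hypothetical computational evidence for a predicted operon.
predicted = make_evidence(
    "EV-COMP", "example-curator", datetime(2009, 5, 1), probability=0.9
)
print(predicted)
```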
The Pathway Tools editors allow users to manually enter evidence codes, and the PathoLogic pathway and operon predictors annotate objects that they create with appropriate computational evidence codes. The Navigator supports display and querying of evidence codes.
The Cell Component Ontology (CCO) is a controlled vocabulary of terms describing cellular components and compartments, and relationships between these terms. It was developed to provide a controlled vocabulary of terms for annotating the subcellular locations of enzymes, and compartments involved in transport reactions, in PGDBs. CCO spans all domains of life, and includes terms such as cytoplasm, cell wall and chloroplast. The ontology currently contains 150 terms. CCO includes many terms and their definitions from the GO, but substantially extends GO.
We have developed an ontology of protein features in order to identify and represent post-translational modifications, binding sites, active sites, conserved regions and other regions of interest on a protein. Starting from the list of feature types described in the UniProt User Manual, with some suggested additions from the SRI EcoCyc and MetaCyc DB curators, we created an ontology of 40 feature classes.
Features fall into two major classes. For amino acid site features, the feature location is a list of one or more amino acid residue numbers (or residue types, if the feature is associated with a generic protein whose precise sequence is unspecified). For protein segment features, the feature location is a range defined by its starting and ending residue numbers.
Feature types that are classified as binding features (either covalent or noncovalent) permit specification of an attached group. The attached group could be a compound or compound fragment, as in the case of a protein that binds a small molecule. The attached group can also be another protein feature, as in the case of disulfide bond or other cross-link between two features on different proteins, or any other type of molecule or binding site (such as a DNA binding site).
A different protein object is created in a PGDB for each biologically relevant modified form of a protein, and a single feature may be linked to multiple forms of the same protein. Some feature types are capable of existing in multiple states. For example, an amino acid modification feature can be in either the modified or the unmodified state (as in the case of a phosphorylation feature, which will be in the modified state when associated with the phosphorylated protein and the unmodified state when associated with the unphosphorylated protein), and a binding feature can be in either the bound or unbound state (as in the case of a metal-binding feature whose state indicates whether or not the metal ion is bound to the protein). We consider the state to be not an attribute of the feature, but rather an attribute of the pairing between a particular form of a protein and the feature. Thus, a reaction may convert a protein with a feature in the unmodified state to another form of the protein with the same feature in the modified state, making it clear that the only change was to the state of a single feature. Feature states may also be left unspecified—this enables us to avoid the combinatorial explosion of different protein forms that would otherwise result when a protein has multiple modification features, and a change in state of one feature does not depend on the state of other features.
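To make the design choice concrete, the Python sketch below stores feature state on the pairing between a protein form and a feature, not on the feature itself. All class, field and object names are hypothetical, and the residue number is made up. An unspecified state is represented as `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Feature:
    feature_id: str
    feature_type: str
    residues: Tuple[int, ...]  # one or more residue numbers

@dataclass
class ProteinForm:
    form_id: str
    # feature ID -> state for this form; None means the state is unspecified
    feature_states: Dict[str, Optional[str]] = field(default_factory=dict)

def changed_features(before, after):
    """Return IDs of shared features whose specified state differs between forms."""
    shared = before.feature_states.keys() & after.feature_states.keys()
    return sorted(
        fid for fid in shared
        if before.feature_states[fid] is not None
        and after.feature_states[fid] is not None
        and before.feature_states[fid] != after.feature_states[fid]
    )

# One feature object is shared by both forms of the protein.
phospho_site = Feature("phospho-site-1", "phosphorylation", (57,))
unmodified = ProteinForm(
    "protein-X", {"phospho-site-1": "unmodified", "metal-site-1": None}
)
phosphorylated = ProteinForm(
    "protein-X-phosphorylated", {"phospho-site-1": "modified", "metal-site-1": None}
)
print(changed_features(unmodified, phosphorylated))
```

Leaving the metal-binding state unspecified on both forms means that two protein objects suffice here, instead of four covering every combination of states.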
The Pathway/Genome Navigator component of Pathway Tools provides mechanisms for interrogating PGDBs, and for visualizing the results of those queries. We begin by describing the query tools. We then describe visualization tools for individual biological entities (such as genes and pathways), followed by systems-level visualization tools that graphically display the entire metabolic network, entire regulatory network and entire genome map of an organism.
The Navigator runs as both a desktop application and a web server. The desktop mode is faster, and has more overall functionality (see  for details), but the web mode has some functionality not present in the desktop mode.
Version 13.0 of Pathway Tools, released in March 2009, introduced a completely redesigned web-based query interface. It provides a three-tiered query paradigm, meaning that three different types of query tools are available, each of which represents a different tradeoff between ease of use and query power. For example, the quick search is designed to provide a fast and simple way for new or casual users to find general information in the site. Statistics from our web logs presented below support the notion that the simpler search tools are used more frequently.
The ‘Quick Search’ box that appears at the top of most web pages generated by a Pathway Tools server is extremely easy to use. The user enters a search term and selects the organism whose PGDB the user wants to query. Pathway Tools searches that PGDB for objects whose primary name or synonyms contain the search term as a substring, and presents the list of results, organized by object type. The user can click on an object name to navigate to the display page for that object. A total of 62 349 quick searches were performed at BioCyc.org in May 2009.
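The matching rule of the quick search, a substring test against the primary name and synonyms with results grouped by object type, can be sketched in a few lines of Python. The record layout and the example objects are invented for illustration, and the case-insensitive comparison is an assumption of this sketch, not something the text states.

```python
from collections import defaultdict

def quick_search(objects, term):
    """Return object IDs whose name or synonyms contain term, grouped by type.

    Matching is case-insensitive here (an assumption of this sketch).
    """
    needle = term.lower()
    results = defaultdict(list)
    for obj in objects:
        names = [obj["name"], *obj.get("synonyms", [])]
        if any(needle in name.lower() for name in names):
            results[obj["type"]].append(obj["id"])
    return dict(results)

# Made-up objects standing in for the contents of a PGDB.
objects = [
    {"id": "G-1", "type": "Genes", "name": "abcA", "synonyms": ["transport gene A"]},
    {"id": "P-1", "type": "Proteins", "name": "AbcA transporter", "synonyms": []},
    {"id": "C-1", "type": "Compounds", "name": "glucose", "synonyms": ["dextrose"]},
]
print(quick_search(objects, "abca"))
```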
A set of intermediate-level query tools provides the ability to construct more powerful and precise searches against objects of a single class. One such query page exists for genes, proteins and RNAs (Supplementary Figure S1); there are additional query pages for pathways, for reactions and for chemical compounds. A total of 3476 object searches were performed at BioCyc.org in May 2009.
Finally, the next section describes a tool called the Structured Advanced Query Page (SAQP) that allows advanced users to construct extremely powerful searches (approximately as powerful as those provided by the SQL language). The graphical interactive nature of this web form makes these searches much easier to construct than using the SQL language. A total of 674 SAQP searches were performed at BioCyc.org in May 2009.
The SAQP enables a biologist to search a large number of DBs in a precise manner. The queries can be as simple as looking up a gene given a name, or as complex as searching several DBs and several object types interconnected by several relations. The SAQP allows biologists to formulate queries whose power and expressiveness closely approach SQL, but without having to learn SQL. The SAQP translates a formulated query into BioVelo, an OQL-like language, before sending it to the Web server.
The following explanation presents the elements of this Web user interface using one example. Figure 7 shows an example query against the class of protein monomers (polypeptides) in the EcoCyc DB.
The first step in building a query is to specify at least one DB and the class of objects to search.
Most queries include one or more conditions on the desired objects within the class. By clicking the button labeled add a condition in the initial blank SAQP, a where clause is added—visually boxed—in the search component. This operation adds a selector for an attribute (e.g. name) of the objects and a selector for a relational operator (e.g. contains the substring). It also adds a free text box to enter a number or string. Several other relational operators are provided, such as is equal to, is not equal to and is a substring of. Regular expression matching is also available as an operator, such as to allow wildcards within query strings.
This new field forms an atomic condition. Additional atomic conditions can be added to the query by using the button labeled ‘add a condition’.
When clicking the drop-down selector for a relational operator, the list of relational operators provided is compatible with the type of the selected attribute. In the case of the attribute name, the selectable operators are for strings, since the type of the attribute name is string. This notion of type extends to all biological objects such as genes, proteins, metabolic pathways, reactions and compounds. Thus, the user can select only those operators that are compatible with the selected attribute. The query in Figure 7 has three atomic conditions to filter the selected polypeptides.
Quantifiers on relations within the SAQP allow a join-like capability. For example, imagine that we want to extend the query with an additional restriction that depends on the gene encoding the polypeptide, not on the polypeptide itself.
To do so, the user would add an and condition, and then select the gene attribute, which represents the gene encoding the polypeptide. We then select the quantifier operator for some object … , meaning that we want to define a condition that applies to some of the genes in the gene attribute of this polypeptide (although in the majority of cases only one gene will be present).
At this point, the SAQP adds a new indented query clause, to allow a condition to be defined on the gene. We have specified a constraint that its nucleotide coordinate must lie after the first 500 kb of the genome. Since several attributes and logical connectors can be specified in this new clause, forming a complex condition by itself, the Web interface draws a box around this condition and introduces it with the we have keyword. A new unique variable, named Z2, is also introduced. This variable represents every value of the gene attribute.
The section titled Select attributes to include in the query output allows the user to describe the contents of the query results by selecting the attributes to display for each result object. The result of a query is always a table of at least one column. The tables have zero or more rows, one for each query result, and each column is a selected attribute. A new column can be added by clicking the button add a column. In the case of Figure 7, three columns are specified, two using variable Z1 (for the polypeptides) and one using the Z2 variable for the genes encoding them.
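The semantics of such a query, atomic conditions on one variable, a ‘for some’ quantifier over a related variable and a choice of output columns, can be paraphrased in Python. This is not BioVelo syntax and not the query of Figure 7: the data, attribute names (`left_end`, `genes`) and the single name condition are invented for illustration. Only the 500 kb coordinate constraint and the variable names Z1 and Z2 come from the text.

```python
# Made-up polypeptides (Z1) and the genes that encode them (Z2).
polypeptides = [
    {"id": "P1", "name": "example kinase", "genes": ["G1"]},
    {"id": "P2", "name": "example transporter", "genes": ["G2"]},
    {"id": "P3", "name": "example kinase 2", "genes": ["G3"]},
]
genes = {
    "G1": {"name": "gene1", "left_end": 612_000},
    "G2": {"name": "gene2", "left_end": 900_000},
    "G3": {"name": "gene3", "left_end": 120_000},
}

def run_query(polypeptides, genes, substring, min_position):
    """Select polypeptides by name, requiring some encoding gene past min_position.

    Returns rows of (Z1 id, Z1 name, names of the matching Z2 genes).
    """
    rows = []
    for z1 in polypeptides:
        if substring not in z1["name"]:  # atomic condition on Z1
            continue
        matching = [
            genes[gene_id] for gene_id in z1["genes"]
            if genes[gene_id]["left_end"] > min_position  # condition on Z2
        ]
        if matching:  # "for some object": at least one gene satisfies the condition
            rows.append((z1["id"], z1["name"], [z2["name"] for z2 in matching]))
    return rows

rows = run_query(polypeptides, genes, "kinase", 500_000)
for row in rows:
    print(row)
```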
The selector provided in each column contains the list of accessible attributes for the object class selected for this query. When more than one search component is specified or a subquery is used with a quantifier, a variable selector is provided to select the desired variable. The interface provides the number of possible objects having at least one value for each attribute.
The output table produced by the SAQP can be formatted in two possible styles: tabulated and HTML. For the tabulated format, column entries are separated by a tab. It can be used as input to such software as Excel. The HTML format is the preferred format to navigate and analyze the results using a web browser.
Once the user submits the query, a web page similar to Figure 8 is returned. The rows of the resulting table can be sorted, and re-sorted at will, on any user-selected column in the output page.
Pathway Tools can automatically generate organism-specific metabolic charts that we call Cellular Overview diagrams. The diagram can be generated as an on-screen graphic that can be interrogated interactively and used to analyze omics datasets. It can also be generated as a PDF file for printing as a large-format poster. Supplementary File S1 contains such a poster for Caulobacter crescentus.
Figure 9 depicts the entire diagram at low resolution painted with gene expression data. It contains all known metabolic pathways and transporters of an organism (online example: ; example with animated display of omics data: ). Each node in the diagram represents a single metabolite, and each line represents a single bioreaction. Omics data (e.g. gene expression or metabolomics measurements) for a given organism can be painted onto the cellular overview to place these data in a pathway context and to allow the user to discern the coordinated expression of entire pathways [such as the tricarboxylic acid (TCA) cycle], or of important steps within a pathway. Omics data may be loaded from a data file and superimposed on the Overview diagram for that organism.
Cellular Overview diagrams are generated automatically using an advanced layout algorithm. Automated layout is essential to allow the diagram to accurately depict the underlying DB content as that content evolves, without requiring time-consuming manual updates by curators that are bound to overlook some updates. In addition, automated layout allows generation of organism-specific cellular overviews that reflect the exact pathway content of each organism-specific PGDB in large PGDB collections such as BioCyc.
The Cellular Overview in the desktop version of Pathway Tools has many capabilities that are described in more detail in Paley and Karp. These capabilities include semantic zooming of the diagram (where the highest magnification corresponds to the detail shown in the poster version), highlighting of user-requested elements of the diagram (such as metabolites or pathways), highlighting large biologically relevant subnetworks [such as all reactions regulated by a given transcription factor, and the results of a reachability analysis (see ‘Network reachability analysis and dead-end metabolite analysis’ section)] and highlighting comparative analysis results such as comparison of the metabolic networks of two or more PGDBs.
The Pathway Tools Regulatory Overview depicts the full transcriptional regulatory network stored in a PGDB in one screen, and allows the user to interrogate and explore relationships within the network. Figure 10 shows the Regulatory Overview for EcoCyc, after the user has asked the system to highlight all genes annotated under GO term GO:0001539 (ciliary or flagellar motility). We can see that a few transcription factors control all E. coli motility genes.
The user can also request that the system display a separate diagram containing only those genes that are highlighted in the full Regulatory Overview. The resulting ‘layer cake layout’, shown in Figure 11, shows the regulators in a set of layers, such that no two genes in the same layer regulate one another.
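The text does not describe how the layers are computed. One simple way to obtain a layering with the stated property is to place each gene one layer below all of its regulators, as in the Python sketch below. This is an assumed technique chosen for illustration, not the Pathway Tools layout algorithm; it ignores autoregulation and assumes the network has no other regulatory cycles. The gene names are made up.

```python
def layer_regulators(regulates):
    """Assign genes to layers so that no gene regulates another in its own layer.

    regulates: dict mapping regulator -> iterable of regulated genes.
    Self-regulation is ignored; any other cycle raises ValueError,
    because this sketch assumes an acyclic network.
    """
    genes = set(regulates)
    for targets in regulates.values():
        genes.update(targets)
    regulators_of = {gene: set() for gene in genes}
    for regulator, targets in regulates.items():
        for target in targets:
            if target != regulator:
                regulators_of[target].add(regulator)

    layers, placed = [], set()
    while len(placed) < len(genes):
        # A gene is ready once all of its regulators sit in earlier layers.
        layer = sorted(g for g in genes - placed if regulators_of[g] <= placed)
        if not layer:
            raise ValueError("regulatory cycle found; this sketch assumes none")
        layers.append(layer)
        placed.update(layer)
    return layers

network = {
    "tfA": ["tfA", "tfB", "tfC"],  # tfA also regulates itself
    "tfB": ["geneX"],
    "tfC": ["geneX", "geneY"],
}
layers = layer_regulators(network)
print(layers)
```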
The Pathway Tools genome browser displays a selected replicon, and allows the user to zoom into a region of the chromosome by gene name or by coordinates. The browser supports semantic zooming: as the user moves deeper into the genome, additional features are displayed, such as promoters and terminators. It can be used in a comparative mode that displays replicon regions centered on orthologous genes across a user-specified set of genomes to show the genomic context of those genes (e.g. ). In comparative mode, the user retains the ability to navigate left or right in the genome, and to zoom in and out. The genome browser can also generate large-format genome posters in PDF format; an example for C. crescentus is provided as Supplementary File S2.
The genome browser also supports display of tracks, meaning the ability to view positional data from external files along the genome, such as viewing predicted transcription factor binding sites.
A user who zooms out far enough is presented with a depiction of all the genes on the replicon called the Genome Overview, shown in Figure 12. This diagram can be painted with omics data to provide a global genome view of large-scale datasets.
In addition to the user-friendly graphical interfaces to PGDBs provided through the web and desktop versions of Pathway Tools, the software supports several formats for importing and exporting data, and allows Perl, Java and Lisp programmers to construct programs that access and update PGDB data (Figure 13).
This section describes Pathway Tools modules for performing system-level analyses of biological networks.
The Metabolite Tracing facility enables users to interactively trace the path of a metabolite through the metabolic network and to view it on the cellular overview diagram. Since the metabolic network is highly interconnected, there will typically be many such paths. Rather than attempting to trace all of them at once, this facility stops at branch points to allow the user to select which one or more paths should be followed. This decision is not irrevocable—at any point, the user can elect to return to one of the previously not selected branches and follow it instead.
The user specifies a starting metabolite and a trace direction (either forward or backward). The software will highlight the path from that metabolite until it reaches a branch point on the cellular overview diagram. At this point, all possible steps that can be followed will be highlighted in a different color, and a checklist of resulting compounds will appear in the control panel. The user can select which path(s) to follow either by clicking in the overview diagram or by selecting compounds in the control panel. This process proceeds interactively until a dead end is reached or the user decides to stop. The beginnings of paths that were not followed continue to be shown in another color, in case the user changes their mind about which path to follow.
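The stop-at-branch-point behavior can be sketched in Python as follows. The function name, the adjacency-map representation and the single-letter metabolite names are hypothetical. Tracing backward would use the same function on a map from each metabolite to its precursors.

```python
def trace_until_branch(successors, start):
    """Follow the unique next metabolite from start until a branch or dead end.

    successors: dict mapping metabolite -> list of metabolites one step away
    in the chosen trace direction.
    Returns (path, choices); choices lists the candidates at a branch point
    and is empty at a dead end.
    """
    path = [start]
    seen = {start}
    current = start
    while True:
        candidates = [m for m in successors.get(current, []) if m not in seen]
        if len(candidates) != 1:
            return path, candidates
        current = candidates[0]
        path.append(current)
        seen.add(current)

# Toy network: A -> B -> C, then C branches to D or E; D -> F is a dead end.
successors = {"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": ["F"]}
path, choices = trace_until_branch(successors, "A")
print(path, choices)

# The user picks one branch and tracing resumes from there.
next_path, next_choices = trace_until_branch(successors, "D")
print(next_path, next_choices)
```

Returning the unfollowed candidates, instead of discarding them, is what lets an interface keep those branches visible so that the user can come back to them later.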
Alternatively, a user can request to follow all paths from the specified metabolite, for a certain number of iterations. In this mode, for many starting metabolites, the overview diagram rapidly becomes so thoroughly colored that it is difficult to follow any single path. The user can then select from a list of metabolites encountered during the search, and ask to just show the path to that metabolite from the starting metabolite (there is no guarantee, however, that all such paths will be shown).
The overview can become hard to read when many connections exist between reactions in different pathways along a path. Thus, Pathway Tools provides a command to display a specified path in a window by itself. This command creates a temporary pathway object, consisting of just the reactions in the current path. This temporary pathway is shown in a new window and can be viewed or printed the same way as any other pathway. A sample traced path, showing it both as it appears on the overview diagram and as a pathway object, is shown in Figure 14.
Both of the tools described here support validation of metabolic network models by computation of systems properties of those models. One application of these tools is to check whether a metabolic-network model is sufficiently well formed for flux-balance analysis [55, 56]. For example, if a model contains metabolites that are not reachable from a given growth medium, those metabolites could not be produced in a flux-balance model generated from that metabolic network.
The reachability analysis tool allows the user to ask what product metabolites are reachable through a series of reaction transformations from a specified set of input metabolites. The tool can be used to identify gaps in the metabolic network, and to identify discrepancies between experimentally determined growth media for an organism, and computationally determined growth media.
The user specifies a set of starting metabolites through a graphical interface; these form the initial metabolite pool. Next, the system converts the reaction network within the PGDB into a system of production rules. It repeatedly chooses an unfired rule, checks whether all of its inputs are present in the metabolite pool, and if so fires the rule by adding all of its products to the metabolite pool. The metabolite pool is qualitative; it includes no concentrations. This process repeats until no additional rules fire.
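The rule-firing procedure just described can be sketched in a few lines. This is an illustrative sketch only, not the code used by Pathway Tools: the function name `reachable_metabolites`, the reaction identifiers and the toy network are assumptions made for the example, and each reaction is treated as a single unidirectional production rule.

```python
def reachable_metabolites(reactions, seed_metabolites):
    """Qualitative forward propagation over production rules.

    `reactions` maps a hypothetical reaction id to a (reactants, products) pair.
    Returns the final metabolite pool and the ids of rules that never fired.
    """
    pool = set(seed_metabolites)
    unfired = dict(reactions)
    fired_something = True
    while fired_something:  # repeat until no additional rules fire
        fired_something = False
        for rxn_id in list(unfired):
            reactants, products = unfired[rxn_id]
            if set(reactants) <= pool:  # all inputs present in the pool
                pool |= set(products)   # fire: add all products to the pool
                del unfired[rxn_id]
                fired_something = True
    return pool, set(unfired)


# Toy network (illustrative only).
toy_reactions = {
    "R1": ({"glucose", "ATP"}, {"G6P", "ADP"}),
    "R2": ({"G6P"}, {"F6P"}),
    "R3": ({"X"}, {"Y"}),
}
pool, never_fired = reachable_metabolites(toy_reactions, {"glucose", "ATP"})
expected_outputs = {"F6P", "Y"}
unproduced = expected_outputs - pool
```

Because the pool only grows and each rule fires at most once, the loop always terminates. The set difference on the last line corresponds to comparing a user-supplied list of expected outputs with the metabolites actually produced.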
The results of a reachability analysis can be displayed on the cellular overview diagram as shown in Supplementary Figure S8. Furthermore, the cellular overview becomes a debugging tool: a user who is curious why a given reaction did not fire moves the mouse over a reaction line in the diagram, and the system displays a pop-up window that shows the full reaction equation and lists which reactants are present in the metabolite pool and which are not.
The reachability tool will also compute the difference between a set of expected output metabolites provided by the user and the actual list of produced metabolites. It records that difference over a series of reachability runs so that the user can monitor progress in resolving unproduced compounds.
The chief limitation of the reachability tool is that its qualitative approach has difficulty with cycles in the metabolic network. For example, consider ATP, which is required for its own biosynthesis (glycolysis consumes two ATP before producing four ATP). Thus, ATP must be provided as an input for all glycolysis reactions to fire, yet doing so raises the possibility that ATP could be broken down to supply carbon, nitrogen or phosphorus to the cell.
A related form of metabolic-network analysis is detection of dead-end metabolites, which informally are metabolites that are only produced by the metabolic network, or only consumed by the network. We provide two tools that identify dead-end metabolites that have complementary strengths and weaknesses.
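The informal definition above can be illustrated with a short sketch. It is deliberately simpler than either of the two tools: it assumes every reaction is unidirectional, ignores cellular compartments and transport, and uses hypothetical names (`dead_end_metabolites`, the reaction ids) that do not come from Pathway Tools.

```python
def dead_end_metabolites(reactions):
    """Return (only_produced, only_consumed) under the informal definition.

    `reactions` maps a hypothetical reaction id to a (reactants, products)
    pair; every reaction is assumed to run left to right only.
    """
    produced, consumed = set(), set()
    for reactants, products in reactions.values():
        consumed |= set(reactants)
        produced |= set(products)
    only_produced = produced - consumed  # made but never used
    only_consumed = consumed - produced  # used but never made
    return only_produced, only_consumed


# Toy linear chain A -> B -> C (illustrative only).
chain = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
}
only_produced, only_consumed = dead_end_metabolites(chain)
```

In the toy chain, C is only produced and A is only consumed. This naive version would also flag nutrients that are in reality supplied by the growth medium, which is one reason a practical definition needs additional conditions.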
The first tool is useful when a PGDB is not expected to contain complete or reliable information on reaction directions. (The accuracy of reaction direction information is largely a function of how extensively the PGDB has been curated. Although reaction direction can be inferred from the direction of a reaction within a pathway, many reactions in MetaCyc are not part of a metabolic pathway and have no assigned reaction direction. Furthermore, a reaction that was predicted to occur in one direction in the organism based on its stored direction in MetaCyc might in fact occur in the opposite direction in that organism. A recent paper by Maranas and colleagues notes that even for E. coli metabolic models, reaction direction information is sometimes incorrect.) This tool computes dead-end metabolites according to the following more limited definition. A small-molecule metabolite M is a dead-end metabolite in the cellular compartment C if and only if all the following conditions are true:
The second tool uses a more comprehensive definition of dead-end metabolites that requires more comprehensive information on reaction direction. A small-molecule metabolite M is a dead-end metabolite in the cellular compartment C if and only if one of the following conditions is true:
One application of a metabolic network model is to find network bottlenecks, which if blocked could kill the cell. Such bottlenecks could constitute antimicrobial drug targets. We have developed a tool for predicting these so-called choke points.
The Pathway Tools choke-point detection algorithm examines the reactions attached to a given metabolite, and processes one metabolite at a time. The first step is to assemble the list of metabolites to examine. This is done by collecting (i) all reactions that are in pathways, plus (ii) reactions that stand alone, but which use only small molecule metabolites. The reactions that came from pathways may use some macromolecular substrates, such as proteins that are modified by the reaction. From this list of reactions, the algorithm collects all of their substrate metabolites (meaning their reactant or product metabolites).
Definition: A ‘choke point reaction’ is a reaction that either uniquely consumes a specific substrate or uniquely produces a specific product in a metabolic network, and is also balanced by at least one reaction that respectively produces or consumes that substrate. Specifically, the algorithm searches for two types of choke point reactions: (a) Reactions R1 such that only a single reaction R1 produces metabolite M, and at least one reaction consumes M. (b) Reactions R2 such that only a single reaction R2 consumes metabolite M, and at least one reaction produces M. These definitions imply that to find a choke point, all reactions involving M must be unidirectional. These choke point reactions are collected and returned as the result. Note that the definition excludes reactions directly connected to dead-end metabolites.
The resulting candidate choke point reactions can be painted onto the cellular overview to facilitate further analysis.
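The choke point definition given above translates directly into a count of producing and consuming reactions per metabolite. The following is a minimal sketch under stated assumptions rather than the Pathway Tools code: the name `choke_point_reactions` and the reaction ids are hypothetical, all reactions are assumed to be unidirectional, and the step that assembles the list of reactions to examine is omitted.

```python
from collections import defaultdict


def choke_point_reactions(reactions):
    """Return ids of reactions that uniquely produce or uniquely consume
    some metabolite that is also consumed or produced, respectively.

    `reactions` maps a hypothetical reaction id to a (reactants, products) pair.
    """
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for rxn_id, (reactants, products) in reactions.items():
        for metabolite in reactants:
            consumers[metabolite].add(rxn_id)
        for metabolite in products:
            producers[metabolite].add(rxn_id)

    choke_points = set()
    for metabolite in set(producers) | set(consumers):
        n_producers = len(producers.get(metabolite, ()))
        n_consumers = len(consumers.get(metabolite, ()))
        # (a) single producer, at least one consumer
        if n_producers == 1 and n_consumers >= 1:
            choke_points |= producers[metabolite]
        # (b) single consumer, at least one producer
        if n_consumers == 1 and n_producers >= 1:
            choke_points |= consumers[metabolite]
    return choke_points


# Toy network (illustrative only).
toy_network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
    "R4": ({"C"}, {"D"}),
    "R5": ({"C"}, {"E"}),
}
candidates = choke_point_reactions(toy_network)
```

In the toy network only R3 qualifies: it is the sole consumer of B and the sole producer of C. R4 and R5 are the sole producers of D and E, but because nothing consumes those dead-end metabolites, they are excluded, as the definition requires.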
Pathway Tools contains a rich set of operations for comparing the information in two or more PGDBs. These operations range from comparison of genome-related information to comparison of pathway information. These comparisons are of several types.
The comparative genome browser discussed in ‘Systems-level visualization of genome maps’ section displays replicon regions centered on orthologous genes across a set of genomes (Supplementary Figure S9).
The user can generate a comparative table for a given metabolic pathway across a specified set of organisms. For each organism, the table shows the presence of pathway enzymes and operon structures of genes within the pathway.
A global comparison of the metabolic networks of multiple PGDBs can be performed by highlighting on the Cellular Overview diagram (see ‘System-level visualization of metabolic networks’ section). This tool allows the user to highlight in the Cellular Overview reactions that are shared, or not shared, among a specified set of organisms.
Finally, a general comparative analysis facility allows the user to generate comparative report tables for many aspects of a PGDB. As well as being used for comparative analyses, these tools can be used to generate statistics regarding the content of a single PGDB. These tools are general in that they present their results in a standard format, and they allow the user to drill down to specific results in a consistent fashion. The initial report page shows summary statistics, but the user can drill down to compare all instances of a category by clicking on elements of a report table.
For example, consider the transporter report page in Supplementary Figure S10. Table 2 within that report summarizes the number of uptake transporters found in two organisms. A user who wants to see the actual transported substrates clicks on the text ‘Compounds transported into the cell’ to generate a new report page containing a table listing the union of all substrates imported by both organisms, along with an indication of which organisms transport each substrate, and which transporter is utilized. If the user clicks on a data cell within Table 2, such as the number of imported substrates in E. coli K-12 (156), a page is generated that lists those substrates only. Similar functionality applies to most tables in these reports.
The following report types are provided. An example comparative report is available at URL .
Pathway Tools is implemented in the Common Lisp programming language (we use the Allegro Common Lisp implementation from Franz Inc., Oakland, CA, USA). We chose Common Lisp because it is a high-productivity programming environment. Because Lisp is a very high-level language, one line of Lisp code is equivalent to several lines of code in a language such as Java or C++. Therefore, the same program can be written more quickly in Lisp, with fewer bugs. A study by Gat found that compiled Lisp programs generally run faster than Java programs, and that a given program can be developed two to seven times faster in Lisp than in Java. Common Lisp also has a very powerful interactive debugging environment.
Lisp has powerful dynamic capabilities that are illustrated by a Pathway Tools feature called auto-patch. Imagine that a Pathway Tools user site has reported a bug in the software. Once our group has found a fix for the bug, we put a patch file that re-defines the offending Lisp function(s) on the SRI web site. The next time Pathway Tools is started at remote sites, it automatically downloads the patch (in compiled form) from the SRI web site, puts the patch in an appropriate directory and dynamically loads the patch file into the running Pathway Tools.
The architecture of Pathway Tools is depicted in Figure 15. The main bioinformatics modules of Pathway Tools are the Navigator, Editors and PathoLogic, plus a chemoinformatics subsystem that includes tools such as SMILES generation and parsing and a chemical substructure matcher, plus a large set of shared utilities that we call the Pathway Tools core. Pathway Tools uses an object-oriented DB system called Ocelot. The Pathway Tools user interface relies on a graph layout and display package called Grasper, and web and desktop graphics packages called CWEST and CLIM (the Common Lisp Interface Manager).
Ocelot is an object/relational DB management system (DBMS) developed at SRI [1, 28]. Ocelot combines the expressive power of frame knowledge representation systems developed within the AI community with the scalability of relational DB management systems (RDBMSs). [The object data model of frame systems is far superior to the relational data model for representing biological data. Superior aspects of the object data model include the following. The object data model is better at managing very complex schemas. That is, if the same domain is represented within the object data model and within the relational model, the object schema is usually much more compact and easier to comprehend. One reason is that inheritance allows the object data model to define subclasses by extending existing classes (e.g. the class Polypeptides is a subclass of the class Proteins), whereas the relational model would force attributes shared between the two tables to be duplicated in each, which both obscures the fact that the two tables are related, and complicates schema evolution. Relational normalization also increases the size of the schema by forcing the creation of new tables for every multivalued attribute, which is not required in the object data model. The object data model used by Ocelot is particularly flexible in supporting any type of schema evolution without forcing the entire DB to be reloaded (unlike relational DBMSs), which is important in bioinformatics because the complexity of biological data forces never-ending enhancements to the schema (note that not every object DBMS provides such flexibility).] Ocelot DBs are persistently stored within an Oracle or MySQL RDBMS. Ocelot objects are faulted on demand from the RDBMS, and in addition are faulted by a background process during idle time. Objects that were modified during a user session are tracked and saved to the RDBMS during a save operation.
Ocelot uses optimistic concurrency control: during a save operation it checks for conflicts between the updates made by the user and updates saved by other users since the saving user began their session or last performed a save operation. This approach avoids the overhead of locking, which becomes problematic in object DBs because modifications to one object often cascade to related objects and could require a large number of lock operations. Optimistic concurrency control works well in practice because curators tend to focus on different biological areas and therefore rarely update the same objects at the same time.
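The general idea of optimistic concurrency control can be sketched with per-object version counters. This is a generic illustration, not a description of how Ocelot is implemented: the classes `Store`, `Session` and `ConflictError`, the version counters, and the choice to record a version when an object is first touched (rather than at session start) are all assumptions made for the example.

```python
class ConflictError(Exception):
    """Raised when another session saved an object this session also changed."""


class Store:
    """Shared persistent store: object values plus a version counter per key."""

    def __init__(self):
        self.values = {}
        self.versions = {}


class Session:
    """One user's editing session. No locks are taken while editing."""

    def __init__(self, store):
        self.store = store
        self.seen = {}     # key -> version observed when first touched
        self.pending = {}  # key -> new value not yet saved

    def set(self, key, value):
        self.seen.setdefault(key, self.store.versions.get(key, 0))
        self.pending[key] = value

    def save(self):
        # Conflict check happens only at save time.
        conflicts = sorted(
            key for key in self.pending
            if self.store.versions.get(key, 0) != self.seen[key]
        )
        if conflicts:
            raise ConflictError(conflicts)
        for key, value in self.pending.items():
            self.store.values[key] = value
            self.store.versions[key] = self.seen[key] + 1
            self.seen[key] = self.store.versions[key]
        self.pending.clear()


# Two curators edit the same hypothetical object; the second save is rejected.
store = Store()
curator_a, curator_b = Session(store), Session(store)
curator_a.set("gene-1", "annotation from A")
curator_b.set("gene-1", "annotation from B")
curator_a.save()
try:
    curator_b.save()
    second_save_rejected = False
except ConflictError:
    second_save_rejected = True
```

The design trade-off matches the reasoning above: editing is never blocked by locks, and the cost of a conflict is paid only in the rare case where two sessions change the same object.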
Ocelot DBs can also be saved to disk files, in which case the RDBMS is not needed (Figure 16). The file persistence configuration is simpler to use, since it does not require purchase or installation of an RDBMS. It provides an easy and low-cost way to begin a PGDB project; a project can switch to an RDBMS configuration as its complexity grows. The advantage of an RDBMS configuration is that it provides Ocelot with multi-user update capabilities, and it permits incremental (and therefore faster) saving of DB updates. The RDBMS configuration also allows Ocelot to maintain a history of all DB transactions—DB curators can examine the history of all updates to a given object to determine when a given change was made, and by whom. This functionality is very useful when diagnosing mistakes within a PGDB.
Figure 17 shows the graphics architecture of Pathway Tools. The Grasper graph toolkit is used in pathway layouts, and in the cellular overview and regulatory overview. Grasper graphics, and all other graphics generated by Pathway Tools, are rendered using the CLIM Common Lisp graphics system, which is implemented using the X window system on Linux and Mac, and the native Windows API on Windows. When Pathway Tools runs as a desktop application, CLIM graphics directly update the user's screen.
Pathway Tools can also run as a web server, which is how it powers web sites such as BioCyc.org. Pathway Tools uses a somewhat nonstandard model of servicing web requests, and it does not run with an associated HTTP server such as Apache. Instead, Pathway Tools provides a fully functional web server that includes services such as compression and connection keep alive.
HTTP servers typically start a new process for each incoming web request that terminates after the request has been serviced. For bioinformatics DBs, the web server often issues a query to a relational DB server that runs as a separate process. In contrast, Pathway Tools starts one long-lived web server process that can service many thousands of web requests, with internal threads used to service overlapping requests. All DBs are stored in Pathway Tools virtual memory, and thus access is extremely fast. To date, this model has scaled to hundreds of genomes, although we are investigating other approaches to provide additional scalability.
Pathway Tools processes an incoming web request in the following manner. The top-level directory name within an incoming URL indicates whether the operation is requesting a static file or dynamically generated page.
PGDBs for many hundreds of organisms from all domains of life are available for use in conjunction with Pathway Tools. This summary lists what DBs are available, and their sources.
With highly curated PGDBs available for many important organisms, it is not clear why users would consider using the uncurated (and therefore lower quality) pathway DBs available for these same organisms from other pathway DB providers such as KEGG. For example, consider the highly curated AraCyc pathway DB for A. thaliana [66, 67]. AraCyc contains minireview summaries for enzymes and metabolic pathways; thousands of literature references; evidence codes for enzyme functions and metabolic pathways (indicating which pathways are supported by experimental evidence); and information on enzyme subunit structure, activators, inhibitors and cofactors. KEGG contains none of the preceding information. In addition, AraCyc curators have carefully refined the metabolic reactions and pathways present in AraCyc, including removing false positive computational predictions and adding Arabidopsis reactions and pathways from the biomedical literature. Although KEGG updates its reference pathway map diagrams periodically to contain new pathways and reactions from different organisms, the KEGG approach of computationally coloring reactions within pathway maps based on the presence of enzymes for those reactions within a genome results in significant ambiguity. If AraCyc curators are reasonably certain that a reaction or pathway is absent from Arabidopsis, they remove it from the DB. The KEGG model does not allow such removal, so it is never clear within KEGG whether an uncolored reaction is truly absent from an organism, or whether the gene for its enzyme has not yet been identified in the genome.
This situation results in a real conundrum for a scientist who wishes to assemble the list of reactions likely to be present in Arabidopsis from KEGG, since there is no way to distinguish the many uncolored reactions that are likely present but for which no gene has been identified, from the many uncolored reactions that are clearly known to be absent from Arabidopsis (which curators have deleted from AraCyc).
Available PGDBs include the following, with curated PGDBs in bold.
To facilitate sharing of PGDBs among multiple users, we have created a PGDB Registry that enables peer-to-peer sharing. PGDB sharing is desirable because a user whose own computer has a copy of a PGDB can use Pathway Tools functionality that would not be available through a remote Pathway Tools web server, such as functionality that exists in desktop mode only or comparative operations. Comparative analysis of two or more PGDBs is possible only when they are loaded into the same instance of Pathway Tools.
The PGDB Registry uses a server maintained by SRI that tracks the locations of available PGDBs that PGDB authors have registered for downloading. The author of a PGDB can register that PGDB by using a command within Pathway Tools that creates an entry for the PGDB in the Registry server, and places the PGDB on an FTP or HTTP server of the author's choosing. Users who want to download a PGDB from the Registry can view available PGDBs by using a Web browser (see URL in ) or using Pathway Tools itself. With a few mouse clicks, a user can download a PGDB from the registry using Pathway Tools.
Pathway Tools stands out with respect to related software tools in the breadth of the functionality and the high level of integration that it provides. It addresses a very large number of use cases. And it provides schema, visualization and editing support for an unusually large number of datatypes in addition to pathways, including chromosomes, genes, enzymes, transporters and regulatory networks. Pathway Tools is particularly well adapted for microbes, with its support for operons and for prokaryotic gene-regulation mechanisms, and its genome browser is optimized for depicting prokaryotic genomes. The following comparison is organized according to the use cases presented in ‘Pathway Tools use cases’ section, although we consider the first two use cases together because they are strongly related.
Other software systems for managing metabolic pathway information are KEGG [89, 90], PUMA2 (inactive), Amaze (inactive), GenMAPP [93, 94], PathCase [95, 96], VisANT and Reactome [98–100]. KEGG, VisANT, PUMA2 and GenMAPP are based on static, predrawn pathway diagrams, a model that does not scale to produce custom pathway diagrams for tens of thousands of different pathways in different organisms. Nor can the static approach produce multiple views of a given pathway at different levels of detail, as can the Customize Pathway option in Pathway Tools that allows the user to choose exactly which graphical elements (e.g. gene names, EC numbers, metabolite structures, activators and inhibitors) appear in the pathway diagram.
PathCase and Reactome do have pathway layout capabilities, but the resulting diagrams bear little resemblance to those found in the biomedical literature, nor are they particularly compelling visually. They do not offer the customization or multiple-detail views offered by Pathway Tools.
Cytoscape is a general tool for display of biological networks that embodies the philosophy that general graph layout techniques can satisfactorily depict any biological network. Although the Cytoscape layout algorithms are a terrific fit for display of protein interaction maps, we assert that they do not produce useful results for metabolic pathways. We believe that superior visualization results are obtained when the layout algorithm is specifically tailored to the data at hand. For example, Pathway Tools provides separate layout algorithms for circular, linear and tree-structured pathways to make the structure of those pathways stand out prominently to the biologist. Biologists developed their pre-computer depictions of metabolic pathways for important reasons, namely, to accurately depict subtleties of the data.
Most of the preceding tools lack pathway editing capabilities, exceptions being GenMAPP, Reactome, VisANT and PathCase. This limitation is a fundamental one for tools such as KEGG, for which users cannot introduce new organism-specific pathways, nor modify a reference pathway definition to customize it to a specific organism, thus eliminating the possibility of removing erroneous reaction steps from a pathway, or of adding missing reactions to a pathway.
No other tool except Reactome has analogs of our Cellular Overview diagram (which we introduced in 1999 [23, 44]), nor of our Omics Viewer capabilities. No other software system lays out its complete metabolic map diagram algorithmically as Pathway Tools does, providing the ability to generate custom diagrams for hundreds of genomes. KEGG provides a single overview metabolic map for all organisms in KEGG, as opposed to the organism-specific overviews that Pathway Tools generates through advanced layout algorithms. The KEGG diagram is not queryable or interactive, as the desktop version of the Pathway Tools diagram is.
All the preceding tools lack the metabolite tracing capabilities of Pathway Tools.
Many existing bioinformatics systems include genome browsers and gene pages. A representative sample of larger systems includes GBrowse [102, 103], IMG, Entrez Genome, CMR, the UC Santa Cruz Genome Browser, Ensembl and PATRIC. Here we compare the salient features of these genome browsers.
A number of bioinformatics DBs include regulatory network information; however, the majority of these DBs and their associated software environments can represent information on transcription factor-based regulation only, such as RegTransBase, TRANSFAC, CoryneRegNet, ProdoNet and DBTBS. The exception is RegulonDB, which can also capture RNA-based regulation including riboswitches, attenuators and small RNA regulators.
We are not aware of tools comparable with the Regulatory Overview in being able to display and interrogate large complete cellular regulatory networks, although CoryneRegNet and ProdoNet display smaller regulatory networks. CoryneRegNet also displays omics data onto its regulatory network diagrams. Cytoscape could probably display regulatory network data using its generic graph display capabilities.
Other bioinformatics DBs provide a subset of the three tiers of queries provided by Pathway Tools (quick search, object-specific searches and Structured Advanced Query Page). Virtually all provide a quick search. Sites providing particularly extensive object-specific searches are FlyBase, Mouse Genome Informatics, EuPathDB and BioMart. BioMart is used by bioinformatics DBs including WormBase, Rat Genome Database, UniProt, Reactome and Galaxy. Its underlying query language is Perl using the BioMart libraries. However, none of the preceding systems provides the query power of the Pathway Tools SAQP. For example, BioMart does not allow the user to construct arbitrary queries that perform joins (queries that combine multiple data types); it provides only the ‘and’ logical operator (the ‘or’ operator is not available); and it includes only a limited form of ‘not’.
Biozon (biozon.org) integrates several biological DBs and provides a web interface that is the closest in power to the SAQP. A query is created by first selecting an object type, entering some constraints for this type and then proceeding to another related object type if desired. That is, join operations between different types of objects are supported, making Biozon one of the few other bioinformatics web interfaces that allow joins. However, Biozon does not allow logical operators such as ‘or’ to be specified among all query components.
KEGG and Reactome are the only other tools that can predict pathways from genome data. The pathway hole filler and transport inference parser tools are unique to Pathway Tools. Many genome annotation pipelines include operon predictors.
Kono et al. introduced a SVG-based tool for painting omics data onto individual KEGG pathway maps, although it does not paint onto whole-organism overview diagrams. This tool also does not produce animations as our omics viewers do. The Reactome Skypainter can paint omics data onto a human pathway overview that is customizable to other model organisms only by graying out regions of the overview (whereas Pathway Tools produces customized overviews for each organism). We argue that the utility of the Skypainter tool is compromised by its small size, and furthermore, it cannot display metabolomics data as the cellular overview can. GenMAPP, VitaPad, VisANT and ArrayXPath paint omics data onto single pathways, rather than onto a full metabolic overview.
The Palsson group has developed tools for detecting dead-end metabolites. We are not aware of other groups that have developed tools for reachability analysis, although flux-balance analysis techniques are able to predict whether a metabolic network will support growth under a given growth medium.
Singh et al. produced another implementation of our chokepoint method, although the availability of that software is unclear. Rahman and Schomburg enhanced the chokepoint method with the additional concept of load points, which are the number of k-shortest paths and nearest neighbor links for a metabolite. Flux-balance models can also be used to predict essential reactions. Kim et al. combined flux-balance models with chokepoint analysis to predict drug targets.
No other tools include comparative pathway analysis functionality such as that provided by Pathway Tools.
Comparative genomics is a very large area in bioinformatics. The ‘Genome and proteome information’ section compares Pathway Tools' comparative genome browser with other tools. In general, other tools include a range of comparative genomics capabilities not found in Pathway Tools [104, 106].
Pathway Tools contributes many relevant capabilities for metabolic engineering, such as fast development of comprehensive genome-scale models of the metabolic network of an engineered organism, and a tool for tracing the fates of metabolites through the metabolic network. It lacks the optimization and pathway design tools that have been developed by metabolic engineering researchers.
Here we summarize limitations of Pathway Tools, organized by use case. Some of these limitations are being addressed in current research; many of the others will be addressed in future work.
Pathway Tools has an emphasis on prokaryotic biology, although over time we have added, and plan to add, more support for eukaryotic biology. For example, although the software can represent introns and exons internally, the genome browser does not yet depict intron/exon structure. The software can capture many types of prokaryotic regulation, but we have not attempted comprehensive coverage of eukaryotic regulation. Similarly, the ontology of cellular compartments used by Pathway Tools is oriented toward bacteria and plants, and does not describe mammalian compartments, nor can Pathway Tools define the variations in metabolic pathways across different cell types or developmental stages. Its biological sequence manipulation capabilities are limited; for example, editing of biological sequences is not supported. The editing tools within Pathway Tools are not web based, but require installation of Pathway Tools on every computer that will be used for editing.
To date, Pathway Tools has scaled to manage the BioCyc collection of 500 PGDBs, which includes three vertebrate genomes (human, mouse and cattle). However, we have been concerned that the current approach of loading all PGDBs into Common Lisp virtual memory will not continue to scale. Therefore, we have investigated an alternative approach in which PGDBs are stored in the Allegro Cache DB system from Franz Inc. Preliminary experimental results (unpublished) indicate that this approach will scale to 10 000 PGDBs with little degradation in performance.
Not all capabilities of Pathway Tools are available in both the web and desktop modes. For example, many comparative tools function in web mode only, whereas all aspects of PathoLogic are available in desktop mode only.
We would like to see many additional bioinformatics inference tools interfaced with Pathway Tools, such as for inference of protein cellular location and regulatory network inference.
Pathway Tools is not a general-purpose environment for analysis of omics data. Our assumption is that scientists will use one of the many other software packages for the early stages of omics data analysis (such as normalization), and provide the output of those analyses to Pathway Tools for display with the omics viewers. That said, we are working to supplement its existing omics analysis capabilities, such as with tools for computing over-representation analysis (e.g. are particular pathways over represented in a gene-expression experiment or metabolomics experiment?).
We would like to see many additional network analysis tools present within Pathway Tools, such as tools for computing the scaling properties of metabolic networks and for identifying functional modules within metabolic networks. Pathway Tools does not perform flux-balance analysis of metabolic networks, but we are actively working on the ability to automatically generate flux-balance models from PGDBs [55, 129].
We are not aware of striking limitations in comparative analysis.
Pathway Tools treats a genome as far more than a sequence and a set of annotations. Instead, it links the molecular parts list of the cell to the genome, and to a carefully constructed web of functional interactions. The Pathway Tools ontology defines an extensive set of object attributes and object relations that allows a rich conceptualization of biology to be represented within a PGDB, and queried and manipulated by the user.
Pathway Tools provides a broad range of functionality. It can manipulate genome data, metabolic networks and regulatory networks. For each datatype it provides query, visualization, editing and analysis functions. It provides MOD development capabilities including computational inferences that support fast generation of comprehensive DBs, editors that allow for refinement of a PGDB, web publishing and comparative analysis. A family of curated PGDBs has been developed using these tools for important model organisms.
The software also provides visual tools for analysis of omics datasets, and tools for the analysis of biological networks.
Pathway Tools runs on Macintosh, Windows and Linux. It is freely available to academic and government researchers; a license fee applies to commercial use. See http://BioCyc.org/download.shtml.
National Institutes of Health (grants GM75742, GM080746, GM077678 and LM009651). The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.
Pathway Tools has benefited from advice, input and contributions from many scientists during its lifetime. We particularly wish to recognize contributions from Robert Gunsalus, Monica Riley, John Ingraham, Jean-Francois Tomb and Peifen Zhang. We gratefully acknowledge guidance from Richard Gourse, Julio Collado-Vides and Charles Yanofsky, in developing the Pathway Tools treatment of cellular regulation. Lukas Mueller developed PerlCyc and has provided many helpful suggestions. Thomas Yan developed JavaCyc. Jeremy Zucker developed the SBML generation module, and contributed many other ideas. Christos Ouzounis was a co-developer of the original metabolic pathway prediction algorithm, contributed an early version of the import/export system, and has been a source of much sound advice. This paper draws on figures from a chapter entitled ‘Cellular Organization of Metabolism’ in the Wiley Encyclopedia of Chemical Biology, published by John Wiley and Sons Inc., with permission.
Peter D. Karp is the director of the Bioinformatics Research Group at SRI International. He received the PhD degree in Computer Science from Stanford University.
Suzanne Paley is a computer scientist in the Bioinformatics Research Group at SRI International.
Markus Krummenacker is a scientific programmer in the Bioinformatics Research Group at SRI International. He has worked on Pathway Tools for over 8 years and has interests ranging from computers to molecular life sciences and nanotechnology.
Mario Latendresse is a Computer Scientist at SRI International. He received the PhD degree in Computer Science from Université de Montréal.
Joseph M. Dale is a computer scientist in the Bioinformatics Research Group at SRI International.
Thomas J. Lee is a Senior Research Engineer at SRI International. He received his MS degree in Computer Science from the University of Wisconsin (Madison).
Pallavi Kaipa is a scientific programmer in the Bioinformatics Research Group at SRI International.
Fred Gilham has been a Software Engineer at SRI International for 20 years. He received the Masters Degree in Computer Science from Stanford University.
Aaron Spaulding is a computer scientist in the Artificial Intelligence Center at SRI International.
Ian T. Paulsen is Professor in Genomics at Macquarie University, Sydney, Australia. He received his PhD in Microbiology from Monash University, Melbourne, Australia.
Ingrid M. Keseler is a Scientific Database Curator in the Bioinformatics Research Group at SRI International. She received an M.S. degree in Microbiology from the University of Georgia and a Ph.D. in Biochemistry from Stanford University.
Ron Caspi is a curator of the MetaCyc database. He received his PhD degree in Marine Biology from the Scripps Institution of Oceanography, UC San Diego.