The life cycle of a PGDB typically includes the following three types of procedures.
- Initial creation of the PGDB: PGDB creation starts with one or more input files describing the functionally annotated genome of an organism. The PathoLogic component of Pathway Tools transforms the genome into an Ocelot DB structured according to the Pathway Tools schema. Next, the user applies one or more computational inference tools within PathoLogic to the genome to infer new information such as metabolic pathways. For several of the PathoLogic inference tools, we have created graphical user interfaces that allow the user to review the inferences made by these tools, and to accept, reject or modify those inferences.
- PGDB curation: Manual refinement and updating of a PGDB is performed using the Pathway/Genome Editors. This phase can last for years, or for decades, as in the case of EcoCyc. Curation can be based on information found about the organism in the experimental literature, on information from in-house experiments or on information inferred by the curator, perhaps with help from other computational tools. PGDB curation is multidimensional, involving addition and/or deletion of genes or metabolic pathways to/from the PGDB; changing gene functions; altering the structure of metabolic pathways; authoring of summary comments for genes or pathways; attachment of MultiFun or Gene Ontology (GO) terms to genes and gene products; entry of chemical structures for small molecules; defining regulatory relationships; and entry of data into many different PGDB fields including protein molecular weights, pIs and cellular locations.
- Bulk updating of a PGDB: A PGDB developer might run an external program that predicts cellular locations for hundreds of genes within the genome, and want to load those predictions into the PGDB. Or, although most users of Pathway Tools keep their authoritative genome annotation within the PGDB, some groups store the authoritative genome annotation in another genome data management system, and want to periodically import the latest genome annotation into Pathway Tools. Another type of bulk PGDB update is that applied by the Pathway Tools consistency checker, which scans a PGDB for noncompliant data (for example, see ‘Consistency checker and aggregate statistics’ section), and either repairs the problem automatically, or notifies the user of problems. In addition, most of the individual components within PathoLogic that were used to initially create a PGDB can be run again at a later date to take advantage of updated information.
The following subsections describe the Pathway Tools components for addressing these procedures.
PathoLogic PGDB creation
PathoLogic performs a series of computational inferences that are summarized in Figure 1. These inferences can be performed in an interactive mode, in which the user guides the system through each step, and can review and modify the inferences made by the system using interactive tools. PathoLogic can also execute in a batch mode in which all processing is automated. In batch mode, PathoLogic can process hundreds of genomes.
Figure 1: Inputs and outputs of the computational inference modules within PathoLogic. The initial input to PathoLogic is either a GenBank or a PathoLogic-format file. The boxes labeled “PGDB” all indicate that a PGDB is an input to or an output ...
The input to PathoLogic is the annotated genome of an organism. PathoLogic does not perform genome annotation; its input must supply the genome sequence, the locations of genes and the identified functions of gene products. The sequence is supplied as a set of FASTA-format files, one per replicon. The annotation is supplied as a set of files in GenBank format or PathoLogic format, each of which describes the annotation of one replicon (chromosome or plasmid), or of one contig for genomes that are not fully assembled.
The annotation specified in a GenBank or PathoLogic file can include the start and stop positions of the coding region for each gene, and intron positions. It can also include a description of the function of the gene product as a text string, one or more Enzyme Commission (EC) numbers and one or more GO terms. The annotation can also include a gene name, synonyms for the gene name and the product name, links to other bioinformatics DBs, and comments.
PathoLogic initializes the schema of the new PGDB by copying from MetaCyc into the new PGDB, the definitions of the approximately 3200 classes and 250 slots (DB attributes) that define the schema of a PGDB.
PathoLogic next creates a PGDB object for every replicon and contig defined by the input files, and for every gene and gene product defined in the input files. It populates these new objects with data from the input files, such as gene names and their sequence coordinates and gene product names. As a result of these operations, the new PGDB now mirrors the information in the input files.
PathoLogic inference of metabolic pathways
Pathway Tools predicts the metabolic pathway complement of an organism by assessing which known pathways from the MetaCyc PGDB [29] are present in the annotated genome of that organism's PGDB. This inference is performed in two steps that are described and evaluated further in Paley and Karp [30] and Karp et al.
Enzymes in the PGDB are assigned to their corresponding reactions in MetaCyc, thus defining the reactome of the organism. PathoLogic performs this assignment by matching the gene-product names (enzyme names), the EC numbers and the GO terms assigned to genes in the genome against MetaCyc reactions. The program can use whatever combination of these three information types is available in a given genome. For example, the fabD gene in Bacillus anthracis was annotated with the function ‘malonyl CoA-acyl carrier protein transacylase.’ That name was recognized by PathoLogic as corresponding to the MetaCyc reaction whose EC number is 2.3.1.39. PathoLogic therefore imported that reaction and its substrates into the B. anthracis PGDB, and created an enzymatic-reaction object linking that reaction to that B. anthracis protein.
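The assignment step can be pictured as a lookup against three indexes, one per evidence type. The following is only a minimal sketch under stated assumptions: the index layout, the `assign_reactions` function and the reaction identifier `RXN-MALONYL-TRANSACYLASE` are hypothetical illustrations, not the Pathway Tools implementation or real MetaCyc identifiers.

```python
def assign_reactions(gene, index):
    """Collect every reaction supported by any available evidence type.

    `gene` is a dict that may carry "ec_numbers", "go_terms" and "product";
    `index` maps each evidence type to {key: set of reaction ids}.
    Both layouts are hypothetical, chosen only for this illustration.
    """
    reactions = set()
    for ec in gene.get("ec_numbers", ()):
        reactions |= index["ec"].get(ec, set())
    for go in gene.get("go_terms", ()):
        reactions |= index["go"].get(go, set())
    # Exact, case-insensitive match on the gene-product (enzyme) name.
    name = gene.get("product", "").strip().lower()
    if name:
        reactions |= index["name"].get(name, set())
    return reactions


# Tiny stand-in for the MetaCyc lookup tables (reaction id is made up).
metacyc_index = {
    "ec": {"2.3.1.39": {"RXN-MALONYL-TRANSACYLASE"}},
    "go": {},
    "name": {
        "malonyl coa-acyl carrier protein transacylase": {"RXN-MALONYL-TRANSACYLASE"}
    },
}

fabD = {"name": "fabD", "product": "malonyl CoA-acyl carrier protein transacylase"}
reactome = assign_reactions(fabD, metacyc_index)
```

Taking the union of the evidence types mirrors the statement that whatever combination is available can be used; how the real program weighs conflicting evidence is not described here.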
Although hundreds of such enzyme-reaction assignments are performed automatically by PathoLogic, it typically does not recognize on the order of 20% of the enzyme names in a genome. Therefore, PathoLogic includes an interactive tool that presents names of putative metabolic enzymes (all proteins whose name ends in ‘ase’, with exclusion of certain nonspecific and nonmetabolic enzyme names) to the user, and aids the user in assigning those enzymes to reactions in MetaCyc. For example, PathoLogic provides an operation that runs an inexact string comparison search between the enzyme name and all enzyme names in MetaCyc, which sometimes allows the user to identify a match based on scrambled word orders within complex enzyme names.
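One way to make a name comparison tolerant of scrambled word order is to compare sets of words rather than character sequences. The sketch below uses Jaccard similarity over word tokens; this particular measure is an assumption made for illustration, since the text does not specify which inexact comparison PathoLogic uses, and the function names are hypothetical.

```python
import re


def name_tokens(name):
    """Lower-case a name and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def token_similarity(a, b):
    """Jaccard similarity of the two token sets (1.0 means the same words)."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_matches(query, known_names, top_n=3):
    """Rank known enzyme names by descending similarity to the query name."""
    scored = [(token_similarity(query, k), k) for k in known_names]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


known = [
    "malonyl CoA-acyl carrier protein transacylase",
    "acetyl-CoA carboxylase",
    "pyruvate kinase",
]
# Same words as the first known name, but in a different order.
matches = best_matches("acyl carrier protein malonyl-CoA transacylase", known)
```

A ranked list, rather than a single answer, fits the interactive use described above, where the user makes the final assignment.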
Once the reactome of the organism has been established in the preceding manner, PathoLogic imports into the new PGDB all MetaCyc pathways that contain at least one reaction in the organism's reactome. It then attempts to prune those imported pathways that are likely to be false positive predictions. The pruning process considers both the fraction of reaction steps in the pathway that have assigned enzymes, and how many of the reactions with assigned enzymes are unique to that pathway (as opposed to being used in additional metabolic pathways in that organism). The remaining pathways are those that are predicted to occur in the organism under analysis.
As MetaCyc has grown in size, we have seen a significant increase in the number of false positive predictions made by PathoLogic; thus, we have recently altered the pruning procedure to prune a predicted pathway from organism X if organism X is outside the expected taxonomic distribution of that pathway. MetaCyc records curated information about the taxonomic groups in which a pathway is expected to occur, based on experimental observations of that pathway to date. For example, many pathways are expected to occur in plants only. This rule has significantly increased the accuracy of PathoLogic.
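The import-then-prune logic can be sketched as follows. This is an illustrative approximation only: the thresholds `min_fraction` and `min_unique`, the way the two criteria are combined, and the `Pathway` record are assumptions, not the published scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    name: str
    reactions: frozenset
    expected_taxa: frozenset = frozenset()  # empty = no restriction recorded


def candidate_pathways(pathways, reactome):
    """Import step: keep every pathway sharing at least one reaction with the reactome."""
    return [p for p in pathways if p.reactions & reactome]


def pathway_evidence(pathway, candidates, reactome):
    """Return (fraction of steps with enzymes, number of present reactions unique to this pathway)."""
    present = pathway.reactions & reactome
    used_elsewhere = set()
    for other in candidates:
        if other.name != pathway.name:
            used_elsewhere |= other.reactions
    unique = present - used_elsewhere
    return len(present) / len(pathway.reactions), len(unique)


def predict_pathways(pathways, reactome, lineage, min_fraction=0.5, min_unique=1):
    """Prune candidates by taxonomic range, then by (hypothetical) evidence thresholds."""
    candidates = candidate_pathways(pathways, reactome)
    kept = []
    for p in candidates:
        if p.expected_taxa and not (p.expected_taxa & lineage):
            continue  # organism lies outside the expected taxonomic distribution
        fraction, n_unique = pathway_evidence(p, candidates, reactome)
        if fraction >= min_fraction or n_unique >= min_unique:
            kept.append(p.name)
    return kept


pathways = [
    Pathway("pathway-A", frozenset({"R1", "R2", "R3", "R4"})),
    Pathway("pathway-B", frozenset({"R1", "R5", "R6", "R7"})),
    Pathway("plant-only", frozenset({"R8", "R9"}), frozenset({"Viridiplantae"})),
]
reactome = {"R1", "R2", "R3", "R8", "R9"}
predicted = predict_pathways(pathways, reactome, lineage={"Bacteria", "Firmicutes"})
```

In this toy input, pathway-B is imported because it shares R1 but is pruned because R1 is its only supported step and is not unique to it, while the plant-only pathway is pruned on taxonomic grounds despite having all of its reactions present.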
PathoLogic inference of operons
The Pathway Tools operon predictor identifies operon boundaries by examining pairs of adjacent genes A and B and using information such as intergenic distance, and whether it can identify a functional relationship between A and B, such as membership in the same pathway [31], membership in the same multimeric protein complex, or whether A is a transporter for a substrate within a metabolic pathway in which B is an enzyme.
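A pairwise scan of this kind might look like the sketch below. The decision rule is entirely an assumption made for illustration: the distance thresholds `close_gap` and `relaxed_gap`, the same-strand requirement and the way distance is combined with functional evidence are hypothetical, and the transporter relationship is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str
    pathways: set = field(default_factory=set)
    complexes: set = field(default_factory=set)


def functionally_related(a, b):
    """True if the two genes share a pathway or a multimeric protein complex."""
    return bool((a.pathways & b.pathways) or (a.complexes & b.complexes))


def same_operon(a, b, close_gap=50, relaxed_gap=300):
    """Hypothetical rule: a short gap suffices; a longer gap needs functional evidence."""
    if a.strand != b.strand:
        return False
    gap = b.start - a.end - 1  # intergenic distance in base pairs
    if gap <= close_gap:
        return True
    return gap <= relaxed_gap and functionally_related(a, b)


def predict_operons(genes):
    """Walk adjacent gene pairs in genome order and split at predicted boundaries."""
    ordered = sorted(genes, key=lambda g: g.start)
    operons = []
    for gene in ordered:
        if operons and same_operon(operons[-1][-1], gene):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return [[g.name for g in operon] for operon in operons]


genes = [
    Gene("a", 1, 900, "+", {"P1"}),
    Gene("b", 920, 1800, "+", {"P1"}),
    Gene("c", 2000, 2900, "+", {"P1"}),
    Gene("d", 3500, 4000, "+"),
    Gene("e", 4100, 4500, "-"),
]
operons = predict_operons(genes)
```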
PathoLogic inference of pathway holes
A pathway hole is a reaction in a metabolic pathway for which no enzyme has been identified in the genome that catalyzes that reaction. Typical microbial genomes contain 200–300 pathway holes. Although some pathway holes are probably genuine, we believe that the majority are likely to result from the failure of the genome annotation process to identify the genes corresponding to those pathway holes. For example, genome annotation systems systematically under-annotate genes with multiple functions, and we believe that the enzyme functions for many pathway holes are unidentified second functions for genes that have one assigned function.
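Given the definition above, listing the pathway holes of a PGDB reduces to finding reactions with no assigned enzyme. The sketch below assumes a simple dictionary representation; the names `pathway_holes` and `enzymes_by_reaction` are hypothetical.

```python
def pathway_holes(pathways, enzymes_by_reaction):
    """Map each pathway name to its reactions that have no assigned enzyme.

    `pathways` maps a pathway name to its ordered list of reaction ids;
    `enzymes_by_reaction` maps a reaction id to a list of enzyme ids.
    """
    holes = {}
    for name, reactions in pathways.items():
        missing = [r for r in reactions if not enzymes_by_reaction.get(r)]
        if missing:
            holes[name] = missing
    return holes


pathways = {"pathway-A": ["R1", "R2", "R3"], "pathway-B": ["R4"]}
enzymes_by_reaction = {"R1": ["enzyme-1"], "R2": [], "R4": ["enzyme-2"]}
holes = pathway_holes(pathways, enzymes_by_reaction)
```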
The pathway hole filling program PHFiller [32] (a component of PathoLogic) generates hypotheses as to which genes code for these missing enzymes using the following method. Given a reaction that is a pathway hole, the program first queries the UniProt DB to find all known sequences for enzymes that catalyze that same reaction in other organisms. The program then uses the BLAST tool to compare that set of sequences against the full proteome of the organism in which we are seeking hole fillers. It scores the resulting BLAST hits using a Bayesian classifier that considers information such as genome localization (for example, whether a potential hole filler lies in the same operon as another gene in the same metabolic pathway). At a stringent probability score cutoff, our method finds potential hole fillers for ~45% of the pathway holes in a microbial genome [32].
PHFiller includes a graphical interface that optionally presents each inferred hole filler to the user along with information that helps the user evaluate the hole fillers, and allows the user to accept or reject the hole fillers that it has proposed.
PathoLogic inference of transport reactions
Membrane transport proteins typically make up 5–15% of the gene content of organisms sequenced to date. Transporters import nutrients into the cell, thus determining the environments in which cell growth is possible. The development of the PathoLogic Transport Inference Parser (TIP) [33] was motivated by the need to perform symbolic inferences on cellular transport systems, and by the need to include transporters on the Cellular Overview diagram. The motivating symbolic inferences include the problems of computing answers to the following queries: What chemicals can the organism import or export? For which cellular metabolites that are consumed by metabolic reactions but never produced by a reaction is there no known transporter (meaning that the origin of such metabolites is a mystery, and indicates missing knowledge about transporters or reactions that produce the compound)?
To answer such queries, we must have a representation of transporter function that is computable (ontology based). Pathway Tools has such a representation, in which transport events are represented as reactions in which the transported compound(s) are substrates. Each substrate is labeled with the cellular compartment in which it resides, and each substrate is a controlled-vocabulary term from the extensive set of chemical compounds in MetaCyc [7]. The TIP program converts the free-text descriptions of transporter functions found in genome annotations (examples: ‘predicted ATP transporter of cyanate’ and ‘sodium/proline symporter’) into computable transport reactions.
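To illustrate why a reaction-based representation makes the motivating queries computable, the sketch below models substrates as (compound, compartment) pairs. The `Reaction` record, the compartment labels and the two query functions are hypothetical simplifications, not the Pathway Tools schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    name: str
    reactants: frozenset  # of (compound, compartment) pairs
    products: frozenset   # of (compound, compartment) pairs


def importable(reactions, inside="cytosol"):
    """Compounds that some reaction moves from another compartment to the inside."""
    result = set()
    for r in reactions:
        from_outside = {c for c, comp in r.reactants if comp != inside}
        to_inside = {c for c, comp in r.products if comp == inside}
        result |= from_outside & to_inside
    return result


def unexplained_metabolites(reactions, inside="cytosol"):
    """Compounds consumed inside the cell that no reaction or transporter supplies."""
    consumed = {c for r in reactions for c, comp in r.reactants if comp == inside}
    supplied = {c for r in reactions for c, comp in r.products if comp == inside}
    return consumed - supplied


reactions = [
    Reaction(
        "sodium/proline symport",
        frozenset({("L-proline", "periplasm"), ("Na+", "periplasm")}),
        frozenset({("L-proline", "cytosol"), ("Na+", "cytosol")}),
    ),
    Reaction(
        "proline utilization (toy)",
        frozenset({("L-proline", "cytosol")}),
        frozenset({("L-glutamate", "cytosol")}),
    ),
    Reaction(
        "biotin-consuming step (toy)",
        frozenset({("biotin", "cytosol")}),
        frozenset({("product-X", "cytosol")}),
    ),
]
imports = importable(reactions)
mystery = unexplained_metabolites(reactions)
```

Because a transport reaction yields the transported compound as a product in the inner compartment, a single set difference answers the second query: anything consumed inside but never supplied there has neither a producing reaction nor a known importer.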
TIP performs the following operations, which are explained more fully in Lee et al. [33]. Starting with the full set of monomeric proteins encoded by the genome, TIP first identifies the likely transport proteins by searching for proteins that include various keywords indicative of transport function (such as ‘transport’ and ‘channel’), and that lack certain counter-indicator keywords (such as ‘regulator’). Then, for each such identified transport protein T, the program performs these steps.
- It identifies the reaction substrates of T. The program parses the descriptions of transporter function to find the names of small molecules from the dictionary of compound names in MetaCyc.
- It determines the energy coupling for T (e.g. is T a passive channel, or an ATP-driven transporter?) Energy coupling is inferred by a number of rules that include analysis of keywords and identified substrates.
- It assigns a compartment to each substrate of T by searching for keywords such as ‘uptake’, ‘efflux’, ‘symport’, and ‘antiport’.
- It constructs a multimeric protein complex for T if so indicated. Most transporters are multimeric systems. A multimeric complex will be created for T if its gene is located within an operon containing other proteins annotated as transporting the same substrate, and if all proteins share the energy coupling mechanism of ATP or of the phosphotransferase system.
- It constructs a transport reaction for T by defining a new reaction object within the PGDB with appropriate reactants and products. If the coupling mechanism is phosphoenol pyruvate, the program creates a product that is a phosphorylated form of the transported substrate.
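The steps above can be approximated in a few lines for the two example annotations quoted earlier. This is a deliberately small sketch: the keyword lists, the three-entry compound dictionary, the default of treating a transporter as an importer unless ‘efflux’ appears, and the compartment labels are all assumptions, and complex construction and phosphotransferase handling are omitted.

```python
import re

# Hypothetical keyword lists; 'symport' and 'antiport' are added here so that
# the second example annotation is recognized.
INDICATORS = ("transport", "channel", "symport", "antiport")
COUNTER_INDICATORS = ("regulator",)
# Tiny stand-in for the MetaCyc compound-name dictionary.
COMPOUNDS = {"cyanate": "cyanate", "proline": "L-proline", "sodium": "Na+"}


def is_transporter(description):
    """Keyword filter: an indicator is present and no counter-indicator is."""
    d = description.lower()
    return any(k in d for k in INDICATORS) and not any(k in d for k in COUNTER_INDICATORS)


def find_substrates(description):
    """Look up each word of the description in the compound dictionary."""
    words = re.findall(r"[a-z0-9+]+", description.lower())
    return [COMPOUNDS[w] for w in words if w in COMPOUNDS]


def energy_coupling(description):
    """Very rough keyword rules for the coupling mechanism."""
    d = description.lower()
    if re.search(r"\batp\b", d):
        return "ATP"
    if "symport" in d or "antiport" in d:
        return "secondary"
    if "channel" in d:
        return "channel"
    return "unknown"


def transport_reaction(description):
    """Build a transport reaction dict, or None if nothing can be inferred."""
    if not is_transporter(description):
        return None
    substrates = find_substrates(description)
    if not substrates:
        return None
    # Assumption: import unless the annotation mentions efflux.
    if "efflux" in description.lower():
        source, target = "cytosol", "periplasm"
    else:
        source, target = "periplasm", "cytosol"
    return {
        "coupling": energy_coupling(description),
        "reactants": [(s, source) for s in substrates],
        "products": [(s, target) for s in substrates],
    }


cyanate = transport_reaction("predicted ATP transporter of cyanate")
symporter = transport_reaction("sodium/proline symporter")
```

The reported share of incorrect predictions suggests how much such keyword rules can miss, which is consistent with the tool offering interactive review of its output.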
An evaluation showed that 67.5% of TIP predictions were correct; the remainder had an error in the substrate, in the directionality of transport, or in the energy coupling [33]. TIP includes a graphical interface that allows the user to interactively review and revise its predictions.
The Pathway/Genome Editors support PGDB curation through interactive modification and updating of all the major datatypes supported by Pathway Tools. They can be invoked quickly from every Navigator window through a single mouse operation so that a user who sees within the Navigator an object that needs to be updated can quickly invoke an editing tool to make the required change. When the user exits from the editing tool, the modified version of the object is then displayed within the Navigator.
The Editors allow the user to invoke an external spelling checker (ispell) to check spelling within comment fields.
Curators typically become proficient at these tools after a day of training and a few weeks of experience.
The editing tools included in Pathway Tools are as follows:
- Gene editor: This supports editing of gene name, synonyms, DB links and start and stop position within the sequence.
- Protein editor: This supports editing of protein attributes as well as of protein subunit structure and protein complexes (Supplementary Figure S12), and also allows users to assign terms from the GO and MultiFun controlled vocabularies. Pathway Tools can store, edit and display features of interest on a protein; see ‘Pathway Tools protein feature ontology’ section for more details. When editing a protein feature, the user selects a feature type (e.g. phosphorylation site), defines the location of the feature on the sequence, a bound or attached moiety where appropriate, a textual label, an optional comment, citations and sequence motif. The feature location can be specified either by typing in the residue number(s) or by selecting a portion of the amino acid sequence with the mouse. In addition, the sequence can be searched for specific residue combinations, which may include wild cards.
- Reaction editor: This supports editing of metabolic reactions, transport reactions and signaling reactions.
- Pathway editor: This allows users to interactively construct and edit a metabolic pathway from its component reactions (Supplementary Figure S11).
- Regulation editor: This allows definition of regulatory interactions including regulation of gene expression by control of transcription initiation, attenuation and by control of translation by proteins and small RNAs (Supplementary Figure S13). This editor also allows creation of operons and definition of their member genes, as well as specifying the positions of promoters and transcription factor binding sites.
- Compound editor: This supports editing of compound names, citations and DB links. Pathway Tools has been interfaced to two external chemical structure editors: Marvin and JME. A chemical compound duplicate checker runs whenever chemical structures are entered or modified, to inform the user if the resulting structure duplicates another compound in that user's PGDB or in MetaCyc.
- Publication editor: This supports entry of bibliographic references.
- Organism editor: This supports editing information about the organism described by a PGDB, including species name, strain name and synonyms, and taxonomic rank within the NCBI Taxonomy.
Author crediting system
Often, many curators collaborate on a given PGDB, and it is desirable to attribute their contributions accordingly. This not only helps to find out who should be asked if questions about particular entries arise, but more important, it will provide an incentive for high-quality contributions, because contributors will be able to clearly demonstrate their accomplishments.
The editing tools for the most important objects thus support attaching credits of several kinds. When an object such as a pathway is first created, by default, a ‘created’ credit is attached to the object, along with a timestamp. The curator is described by an author DB object, and a DB object describing the author's organization. The author frame records the name, email address and the organization(s) with which the curator is affiliated. Editing tools exist for authors and organizations, and substring search allows convenient retrieval. A given credit for an object can be attached to either authors, organizations or both, in a flexible manner. Every author and organization has a ‘home’ page that lists all the objects that have been credited.
Other kinds of credit are ‘revised’ when a curator substantially edits an object that was created some time ago, and a ‘last curated’ flag can be set to indicate when a curator has last researched the literature available for a given object. The last-curated flag is useful for those objects about which almost nothing is known, to distinguish between the case where no curator ever looked at the object, versus where an extensive search was performed but still nothing new was found.
Credits are included with pathways exported to a file, which allows exchange of pathway contributions between PGDBs, complete with proper credit attribution. An additional kind of credit called ‘reviewed’ can be used when such external contributions have been reviewed by a receiving curator, or to also attribute reviews of various objects by invited, external domain experts.
Bulk PGDB updating
During the PGDB life cycle, a number of types of PGDB updates are required that would be extremely onerous to perform if the user were forced to perform them manually, one at a time. Therefore, Pathway Tools provides several facilities for performing bulk updates of a PGDB. The most general facility is that users can write their own programs to perform arbitrary types of updates through the Pathway Tools APIs in the Perl, Java and Lisp languages (see ‘Computational access to PGDBs’ section).
Some groups choose to store the authoritative version of their genome annotation in a DB external to the PGDB, such as groups that developed their own genome DB system prior to adopting Pathway Tools. Such users need the ability to update their PGDB with data from a revised genome annotation without overwriting or otherwise losing any manual curation that has been added to the PGDB. Pathway Tools provides an interface for doing just that. It takes as input one or more update files, either in GenBank format or PathoLogic file format. The files can contain either a complete revised annotation for the organism, or just the information that has changed. The software parses the update files and determines all differences between the new data and the old. Types of changes that are detected include new genes, as well as updated gene positions, names, synonyms, comments, links to external DBs and updated functional assignments. None of the changes is propagated automatically. Instead, a pop-up dialog summarizes the different classes of changes. For example, it lists the number of new genes, the number of genes with name changes, the number of previously unassigned genes that now match a reaction and the number of previously assigned genes that now match a different reaction. For each class of changes, the curator has the option of either accepting all updates (e.g. creating DB objects for all the new genes), or of checking each proposed update. Once this phase is complete and any changes to functional assignments have been made, the software re-runs the pathway inference procedure described in ‘PathoLogic inference of metabolic pathways’ section, identifies any new pathways that are inferred to be present and any existing pathways that no longer have sufficient evidence, and allows the curator to review those changes.
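The detect, group and selectively accept workflow can be sketched as below. The record layout, the change-class names and both functions are hypothetical; the sketch covers only a few of the change types listed above and leaves out the per-update review option.

```python
# Which gene fields each (hypothetical) change class is allowed to overwrite.
FIELDS = {
    "name changes": ("name",),
    "position changes": ("start", "end"),
    "function changes": ("function",),
}


def diff_annotations(old, new):
    """Group differences between the PGDB genes (`old`) and an update file (`new`).

    Genes absent from `new` are not treated as deleted, because an update file
    may contain only the information that has changed.
    """
    changes = {"new genes": [], "name changes": [],
               "position changes": [], "function changes": []}
    for gene_id, record in new.items():
        previous = old.get(gene_id)
        if previous is None:
            changes["new genes"].append(gene_id)
            continue
        for change_class, fields in FIELDS.items():
            if any(record.get(f) != previous.get(f) for f in fields):
                changes[change_class].append(gene_id)
    return changes


def apply_accepted(pgdb, new, changes, accepted_classes):
    """Apply only the classes the curator accepted; other fields stay untouched."""
    for change_class in accepted_classes:
        for gene_id in changes.get(change_class, []):
            if change_class == "new genes":
                pgdb[gene_id] = dict(new[gene_id])
            else:
                for f in FIELDS[change_class]:
                    pgdb[gene_id][f] = new[gene_id].get(f)
    return pgdb


pgdb = {
    "g1": {"name": "fabD", "start": 10, "end": 900,
           "function": "hypothetical protein", "comment": "curator note"},
}
update = {
    "g1": {"name": "fabD", "start": 10, "end": 900,
           "function": "malonyl CoA-acyl carrier protein transacylase"},
    "g3": {"name": "pykA", "start": 2000, "end": 2500, "function": "pyruvate kinase"},
}
changes = diff_annotations(pgdb, update)
apply_accepted(pgdb, update, changes, accepted_classes=["function changes"])
```

Overwriting only the fields tied to an accepted class is what keeps curated content, such as the comment on g1, from being lost.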
Consistency checker and aggregate statistics
Pathway Tools contains an extensive set of programs for performing consistency checking of a PGDB to detect structural defects that sometimes arise within PGDBs. Also included in this component are tools for computing and caching aggregate statistics for a PGDB, such as computing the molecular weights of all proteins from their amino acid sequences. The statistics are cached so that they can be displayed quickly. At SRI, we run these programs as part of the quarterly release process for EcoCyc and MetaCyc.
Roughly half of the programs automatically repair PGDB problems that they find. Such problems could be caused by user data entry errors, or by errors in Pathway Tools itself. Example checks ensure that inverse relationship links are set properly (e.g. that a gene is linked to its gene product, and that the product links back to the gene); make sure that pathways do not contain duplicate reactions; validate and update GO term assignments with respect to the latest version of GO; perform formatting checks in comment text; search gene reading frames for internal stop codons; and remove redundant bonds from chemical structures.
The other checker programs generate listings of every error detected, and allow the user to click on each problematic object in the listing to enter the editor for that object to repair it.
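The two styles of checker, automatic repair and report-only, can be illustrated with two of the checks named above. The dictionary layout and function names are hypothetical, and which real checks repair and which only report is not specified in the text; the split below is chosen only for illustration.

```python
def repair_inverse_links(genes, products):
    """Auto-repair style: make each product point back at the gene that names it.

    `genes` maps gene id -> {"product": product id or None};
    `products` maps product id -> {"gene": gene id or None}.
    Returns the (gene, product) pairs that were repaired.
    """
    repaired = []
    for gene_id, gene in genes.items():
        product_id = gene.get("product")
        if product_id in products and products[product_id].get("gene") != gene_id:
            products[product_id]["gene"] = gene_id
            repaired.append((gene_id, product_id))
    return repaired


def duplicate_reactions(pathways):
    """Report-only style: list reactions that occur more than once in a pathway."""
    report = {}
    for name, reactions in pathways.items():
        seen, duplicates = set(), []
        for reaction in reactions:
            if reaction in seen and reaction not in duplicates:
                duplicates.append(reaction)
            seen.add(reaction)
        if duplicates:
            report[name] = duplicates
    return report


genes = {"g1": {"product": "p1"}, "g2": {"product": "p2"}, "g3": {"product": None}}
products = {"p1": {"gene": "g1"}, "p2": {"gene": None}}
repaired = repair_inverse_links(genes, products)
report = duplicate_reactions({"pathway-A": ["R1", "R2", "R1"], "pathway-B": ["R3"]})
```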
Most new releases of Pathway Tools include additions or modifications to the Pathway Tools schema. Schema changes are made to model the underlying biology more accurately (such as adding support for introns and exons), extend the datatypes within Pathway Tools (such as adding support for features on protein sequences) or to increase the speed of the software. Because each new version of the software depends on finding data within the fields defined by the associated version of the schema, existing user PGDBs created by older versions of the software will be incompatible with these new software versions.
Therefore, every release of Pathway Tools contains a program to upgrade PGDBs whose schema corresponds to the previous version of the software, to the new version of the software. When a user opens a PGDB under a new version of the software, the software detects that the schema of the PGDB is out of date, and offers to run this schema upgrade program for the user. For users who have not upgraded the software for several releases, several upgrade operations are performed consecutively. Example upgrade operations include adding new classes to the PGDB from the MetaCyc PGDB, adding new slots to PGDB classes, deleting PGDB classes, moving data values from one slot to another and moving objects from one class to another. The schema upgrade leaves the user's curated data intact.
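Consecutive upgrades of this kind are commonly organized as a chain of per-version steps. The sketch below shows that pattern under stated assumptions: the integer version numbers, the slot names and the helper functions are hypothetical and only two of the operation types listed above are modeled.

```python
def add_slot(slot, default=None):
    """Return an upgrade step that adds a slot to every object lacking it."""
    def upgrade(pgdb):
        for obj in pgdb["objects"].values():
            obj.setdefault(slot, default)
    return upgrade


def move_slot(old_slot, new_slot):
    """Return an upgrade step that moves values from one slot to another."""
    def upgrade(pgdb):
        for obj in pgdb["objects"].values():
            if old_slot in obj:
                obj[new_slot] = obj.pop(old_slot)
    return upgrade


# Steps that take a PGDB from version N to version N + 1 (hypothetical numbers).
UPGRADES = {
    1: [add_slot("cellular-location")],
    2: [move_slot("mw", "molecular-weight")],
}
CURRENT_VERSION = 3


def upgrade_pgdb(pgdb):
    """Apply each missing upgrade in order until the schema is current."""
    while pgdb["schema_version"] < CURRENT_VERSION:
        version = pgdb["schema_version"]
        for step in UPGRADES[version]:
            step(pgdb)
        pgdb["schema_version"] = version + 1
    return pgdb


old_pgdb = {
    "schema_version": 1,
    "objects": {"protein-1": {"mw": 42.0, "comment": "curated summary"}},
}
upgrade_pgdb(old_pgdb)
```

Running the steps in version order means a PGDB that skipped several releases passes through every intermediate schema, and steps that only add or move slots leave existing values such as the comment unchanged.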
Every new release of Pathway Tools includes a new version of the MetaCyc DB, which, in addition to providing new data content, typically contains updates and corrections to existing pathways, reactions and compounds. Pathway Tools includes an option to propagate such updates and corrections to an existing organism PGDB. However, because we do not want to override any manual edits made to a PGDB, this tool does not run automatically. Much like the tool for incorporating a revised genome annotation, described in ‘Bulk PGDB updating’ section, this tool organizes the changes into logical groups (such as all compounds with newly added structures, or all reactions with changed reaction equations), and allows the user to either accept an entire group of changes, or to examine and confirm each member of a group.