Currently the best organized part of the KEGG/PATHWAY database is metabolism, which is represented by ~90 graphical diagrams for the reference metabolic pathways. Each reference pathway can be viewed as a network of enzymes or a network of EC numbers. Once enzyme genes are identified in the genome based on sequence similarity and positional correlation of genes, and the EC numbers are properly assigned, organism-specific pathways can be constructed computationally by correlating genes in the genome with gene products (enzymes) in the reference pathways according to the matching EC numbers. We are trying to extend this mechanism to include various regulatory pathways, such as signal transduction, cell cycle and apoptosis. There are, however, two major problems in automating the construction of regulatory pathways.
The first problem is the divergence of regulatory pathways. Because the metabolic pathway, especially for intermediary metabolism, is well conserved among most organisms from mammals to bacteria, it is possible to manually draw one reference pathway and then to computationally generate many organism-specific pathways. In contrast, the regulatory pathways are far more divergent and are difficult to combine into common reference pathway diagrams. Thus, we generally draw a separate pathway diagram for each organism. At the same time, we are trying to identify groups of organisms that share common pathways or assemblies and whose diagrams may be combined. Examples include one common apoptosis pathway diagram for human and mouse, and three ribosome assembly diagrams, one each for bacteria, archaea and eukaryotes.
The other related problem is the absence of proper identifiers for functions in the regulatory pathways. The EC numbers in the metabolic pathways play roles as identifiers of the nodes (enzymes) and also as keys for linking with the genomic information. We are preparing for the introduction of the ortholog identifiers to extend such capabilities of the EC numbers. The ortholog identifiers will be used to identify nodes (proteins) in the regulatory pathways and also to link with the genomic information. In addition, the ortholog identifiers will replace the EC numbers in the metabolic pathways in order to distinguish multiple genes that match one EC number, for example, different subunits of an enzyme complex or different genes expressed under different conditions.
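The EC-number matching described above, and the ambiguity that motivates ortholog identifiers, can be illustrated with a minimal sketch. The gene names, the annotation dictionary and the function `build_organism_pathway` are hypothetical and chosen only for illustration. They do not reflect the actual KEGG data structures or software.

```python
from collections import defaultdict

# Hypothetical reference pathway: each node is identified by an EC number.
reference_pathway = ["2.7.1.1", "5.3.1.9", "2.7.1.11"]

# Hypothetical genome annotation: gene identifier -> assigned EC number.
genome_annotation = {
    "geneA": "2.7.1.1",
    "geneB": "5.3.1.9",
    "geneC": "2.7.1.11",
    "geneD": "2.7.1.11",  # a second gene that matches the same EC number
    "geneE": "1.1.1.1",   # an enzyme that is not a node of this pathway
}


def build_organism_pathway(reference, annotation):
    """Map each reference node (EC number) to the genes annotated with it."""
    genes_by_ec = defaultdict(list)
    for gene, ec in annotation.items():
        genes_by_ec[ec].append(gene)
    # Nodes without a matching gene are kept with an empty list.
    return {ec: sorted(genes_by_ec.get(ec, [])) for ec in reference}


organism_pathway = build_organism_pathway(reference_pathway, genome_annotation)

# EC numbers matched by more than one gene cannot be told apart by the
# EC number alone; this is the case ortholog identifiers are meant to resolve.
ambiguous = {ec: genes for ec, genes in organism_pathway.items() if len(genes) > 1}

print(organism_pathway)
print(ambiguous)
```

In this sketch the node 2.7.1.11 is matched by two genes, which the EC number alone cannot distinguish. A finer-grained identifier per node would be needed to separate them.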
Ortholog group tables
Orthologs are identified in KEGG not only by sequence similarity of individual genes but also by examining whether all constituent members are found for a functional group, such as a conserved subpathway or a molecular complex. The KEGG ortholog group table represents three features: whether an organism contains a complete set of genes that constitutes a functional group, whether those genes are physically coupled on the chromosome, and which genes are orthologous among different organisms. Currently there are 61 ortholog group tables, which contain, for example, a gene cluster in the genome coding for a functionally related enzyme cluster in the metabolic pathway. In KEGG such correlated clusters are first detected by a heuristic graph comparison algorithm, and then manually edited and compiled into the ortholog group tables. There are two types of graph comparisons that we use: genome–pathway and genome–genome comparisons (1). An ortholog group table is a composite of such pairwise comparisons, representing a conserved portion of the pathway, or what we call a pathway motif.
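The three features recorded in an ortholog group table can be made concrete with a small sketch. Everything here is an assumption made for illustration: the table layout, the organism and gene names, the use of gene order indices as chromosomal positions, and the `max_gap` adjacency threshold. It is not the heuristic graph comparison algorithm used in KEGG, only a simple reading of a table once it has been compiled.

```python
# Hypothetical ortholog group table: rows are organisms, columns are the
# members of one functional group. Each cell holds (gene name, gene order
# index on the chromosome), or None when no ortholog was found.
COLUMNS = ["subunitA", "subunitB", "subunitC"]

table = {
    "org1": {"subunitA": ("g101", 101), "subunitB": ("g102", 102), "subunitC": ("g103", 103)},
    "org2": {"subunitA": ("h010", 10), "subunitB": ("h250", 250), "subunitC": ("h251", 251)},
    "org3": {"subunitA": ("k007", 7), "subunitB": None, "subunitC": ("k008", 8)},
}


def is_complete(row):
    """Feature 1: does the organism have every member of the functional group?"""
    return all(row[col] is not None for col in COLUMNS)


def coupled_clusters(row, max_gap=1):
    """Feature 2: group the genes found into runs of neighbouring genes.

    Two genes fall in the same run when their gene order indices differ by
    at most max_gap (an assumed threshold; strand and circularity are ignored).
    """
    found = sorted((cell[1], cell[0]) for cell in row.values() if cell is not None)
    clusters, current = [], []
    for position, gene in found:
        if current and position - current[-1][0] > max_gap:
            clusters.append([g for _, g in current])
            current = []
        current.append((position, gene))
    if current:
        clusters.append([g for _, g in current])
    return clusters


def orthologs(column):
    """Feature 3: the genes in one column are orthologs across organisms."""
    return {org: row[column][0] for org, row in table.items() if row[column] is not None}


for org, row in table.items():
    print(org, is_complete(row), coupled_clusters(row))
print(orthologs("subunitB"))
```

Reading a row gives completeness and physical coupling for one organism, while reading a column gives the orthologous genes across organisms.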
Generalized protein–protein interaction
The KEGG pathway representation focuses on the network of gene products, mostly proteins but including functional RNAs. As illustrated in Figure 2, the metabolic pathway is a network of indirect protein–protein interactions, which is actually a network of enzyme–enzyme relations. In contrast, the regulatory pathway often consists of direct protein–protein interactions, such as binding and phosphorylation, and another class of indirect protein–protein interactions, which are relations between transcription factors and transcribed gene products via gene expression. The generalized protein–protein interaction network that includes these three types of interactions is an abstract network, but it is especially useful for linking with genomic information because the nodes (gene products) of this network can be directly correlated with the nodes (genes) in the genome. With this concept of a generalized protein–protein interaction network, we are expanding the collection of manually drawn reference pathway diagrams.
Figure 2. The generalized protein–protein interaction includes an indirect protein–protein interaction by two successive enzymes, a direct protein–protein interaction, and another indirect protein–protein interaction by gene expression.
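One way to picture the generalized protein–protein interaction network is as a single edge list in which every edge carries one of the three interaction types. The sketch below is only an illustration under stated assumptions: the input lists, the node names, the type labels and the function `generalized_network` are hypothetical and do not describe the KEGG implementation.

```python
from itertools import product

# Hypothetical inputs, one list per kind of relation.
reactions = [  # (enzyme, substrate compound, product compound)
    ("E1", "cpdA", "cpdB"),
    ("E2", "cpdB", "cpdC"),
]
direct_interactions = [  # (protein, protein, kind of direct interaction)
    ("K1", "P1", "phosphorylation"),
    ("P1", "P2", "binding"),
]
expression_relations = [  # (transcription factor, transcribed gene product)
    ("TF1", "E1"),
]


def generalized_network(reactions, direct, expression):
    """Return typed edges (source, target, interaction type, detail)."""
    edges = []
    # Indirect: two successive enzymes, linked through a shared compound.
    for (enz1, _, made), (enz2, used, _) in product(reactions, repeat=2):
        if enz1 != enz2 and made == used:
            edges.append((enz1, enz2, "enzyme-enzyme", made))
    # Direct: binding, phosphorylation and similar interactions.
    for a, b, kind in direct:
        edges.append((a, b, "direct", kind))
    # Indirect: transcription factor to transcribed gene product.
    for factor, target in expression:
        edges.append((factor, target, "gene-expression", None))
    return edges


edges = generalized_network(reactions, direct_interactions, expression_relations)
nodes = sorted({n for source, target, _, _ in edges for n in (source, target)})

for edge in edges:
    print(edge)
print(nodes)
```

Because every node is a gene product, the node list can be joined directly against a gene catalogue, which is the property that makes this abstract network convenient for linking with genomic information.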