As of January 2008, KEGG comprises 19 databases, categorized into systems information, genomic information and chemical information as shown in . The six databases in the chemical information category are collectively called KEGG LIGAND. The six databases in the lower part of the genomic information category are computationally generated, but all the other 13 databases are manually curated.
The KEGG databases are highly integrated. In fact, KEGG should be viewed as a computer representation of the biological system, where biological objects and their relationships at the molecular, cellular and organism levels are computerized as separate database entries. Each database entry, called a KEGG object, is given a unique identifier within KEGG. summarizes the naming convention of such KEGG object identifiers for the 13 core databases. Except for GENES and ENZYME, which use the standard names of locus_tag and EC number, and for GENOME, which distinguishes organisms with 3–4 letter KEGG organism codes, the KEGG object identifier is a five-digit number prefixed by an upper-case letter or a 2–4 letter code (map, br or organism code). Examples are: C00047 for lysine, K04527 for insulin receptor and hsa05210 for colorectal cancer pathway.
These identifiers may be used to directly obtain corresponding database entries with the ‘Get Entry’ option in the KEGG website (http://www.genome.jp/kegg/). Interestingly, these identifiers may also be used in web search engines, such as Google and Yahoo, to obtain corresponding KEGG database entries. There are already many databases that are linked to/from KEGG. Such outside links will continue to be added to better integrate KEGG with various other web resources.
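The identifier convention described above can be illustrated with a short Python sketch. It covers only the five-digit form (an upper-case letter, or a 2–4 letter code such as map, br or an organism code, followed by five digits). The function name `parse_kegg_id` is hypothetical, and the restriction of the 2–4 letter codes to lower case is an assumption based on the examples given in the text. GENES, ENZYME and GENOME identifiers follow other conventions and are deliberately not handled.

```python
import re

# One upper-case letter followed by exactly five digits, e.g. C00047.
LETTER_ID = re.compile(r"([A-Z])(\d{5})")
# A 2-4 letter code (map, br or an organism code such as hsa) followed by
# exactly five digits, e.g. hsa05210. Lower case is an assumption taken
# from the examples in the text.
CODE_ID = re.compile(r"([a-z]{2,4})(\d{5})")


def parse_kegg_id(identifier):
    """Split a five-digit KEGG object identifier into its parts.

    Returns (kind, prefix, number), where kind is "letter" or "code",
    or None when the identifier does not follow the five-digit form
    (for example locus_tags, EC numbers or bare organism codes).
    """
    match = LETTER_ID.fullmatch(identifier)
    if match:
        return ("letter", match.group(1), match.group(2))
    match = CODE_ID.fullmatch(identifier)
    if match:
        return ("code", match.group(1), match.group(2))
    return None


if __name__ == "__main__":
    for example in ["C00047", "K04527", "hsa05210", "1.1.1.1"]:
        print(example, parse_kegg_id(example))
```

Returning `None` for anything outside the five-digit form keeps the sketch from guessing at locus_tag or EC number syntax, which the text does not specify.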
Genome annotation in KEGG assigns KO (KEGG Orthology) identifiers or K numbers to genes in a single genome or simultaneously to genes in multiple genomes. With the addition or revision of a KEGG pathway map or BRITE hierarchy, KO groups (K numbers) are defined for the pathway nodes (boxes) or the hierarchy nodes (bottom leaves). Then the corresponding genes in selected organisms (usually in the literature) are manually annotated with the new K numbers, which are reflected in KEGG GENES. Thus, KEGG GENES can be used as a reference database for genome annotation. The number of KO groups has been increasing at a rate of about 2000 per year, and it is now over 10 000.
The KO assignment is applied to a new genome as follows. First, the new genome is subject to SSDB computation, a comparison of protein coding genes against all existing genomes by the SSEARCH program. The result is stored in KEGG SSDB containing sequence similarity scores and best-hit information for all gene pairs. Then, computational KO assignment is done by the KAAS-SSDB program, followed by manual verification and additional assignment with the GFIT tool. An automated version of this genome annotation procedure is made available as the KAAS web service (3), which utilizes BLAST rather than SSEARCH for pairwise genome comparisons.
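The general idea of transferring K numbers from annotated genes to a new genome through similarity scores can be sketched as follows. This is a deliberately simplified illustration and not the KAAS-SSDB or KAAS algorithm, whose details are not given here: each new gene simply receives the K number of its highest-scoring annotated hit. The function name, the input layout and the score threshold `min_score` are all assumptions, and the gene names in the toy data are made up.

```python
def assign_ko_by_best_hit(hits, known_ko, min_score=100.0):
    """Assign each query gene the K number of its best annotated hit.

    hits      -- iterable of (query_gene, subject_gene, score) tuples,
                 e.g. parsed from pairwise comparison output (assumed layout)
    known_ko  -- dict mapping already annotated genes to K numbers
    min_score -- hypothetical cut-off below which hits are ignored
    """
    best = {}  # query gene -> (score, subject gene)
    for query, subject, score in hits:
        # Skip weak hits and hits to genes that carry no K number.
        if score < min_score or subject not in known_ko:
            continue
        if query not in best or score > best[query][0]:
            best[query] = (score, subject)
    return {query: known_ko[subject] for query, (_, subject) in best.items()}


if __name__ == "__main__":
    # Toy data: gene names and scores are invented for illustration.
    toy_hits = [
        ("new:g1", "orgA:x1", 250.0),
        ("new:g1", "orgB:y1", 900.0),
        ("new:g2", "orgA:x2", 50.0),   # below the threshold
        ("new:g3", "orgB:y9", 400.0),  # subject has no K number
    ]
    toy_known = {"orgA:x1": "K00001", "orgB:y1": "K04527", "orgA:x2": "K00002"}
    print(assign_ko_by_best_hit(toy_hits, toy_known))
```

Assignments produced this way would still need the manual verification step described above.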
The KO system is the basis for linking genomes to biological systems through the process of pathway mapping and BRITE mapping. For each organism in KEGG, organism-specific pathways and BRITE hierarchies are computationally generated based on its assigned K numbers. Microarray gene expression profile data may then be mapped to these pathways and hierarchies to infer systemic functions of the cell or the organism. In addition to the hierarchies of genes and proteins (K numbers), KEGG BRITE contains the hierarchies of chemical substances (C, D, G, R numbers) together with known relationships to K numbers, such as ligand–receptor interactions and drug–target relationships. By using these relationships, the BRITE mapping will be improved to present clues for understanding the interactions with the environments.
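As a minimal sketch of pathway mapping, the following Python code intersects the K numbers assigned to a genome with the K numbers attached to each reference pathway, and derives an organism-specific identifier by replacing the map prefix with an organism code (compare hsa05210). The function names and the dictionary layout are hypothetical, and the K numbers in the toy data are placeholders rather than the real content of any map.

```python
def map_to_pathways(pathway_kos, genome_kos):
    """Return, per pathway, the sorted K numbers also found in the genome.

    pathway_kos -- dict: reference pathway id -> set of K numbers on its nodes
    genome_kos  -- set of K numbers assigned to the genes of one organism
    Pathways without any matching K number are omitted.
    """
    mapped = {}
    for pathway_id, kos in pathway_kos.items():
        present = kos & genome_kos
        if present:
            mapped[pathway_id] = sorted(present)
    return mapped


def organism_pathway_id(reference_id, organism_code):
    """Turn a reference map id such as map05210 into e.g. hsa05210."""
    if not reference_id.startswith("map"):
        raise ValueError("expected a reference pathway id starting with 'map'")
    return organism_code + reference_id[len("map"):]


if __name__ == "__main__":
    # Placeholder K number sets, not the real content of these maps.
    toy_pathways = {
        "map01030": {"K00001", "K00002"},
        "map01031": {"K00003"},
    }
    toy_genome = {"K00002", "K04527"}
    for ref_id, kos in map_to_pathways(toy_pathways, toy_genome).items():
        print(organism_pathway_id(ref_id, "hsa"), kos)
```

Expression values keyed by K number could be attached to the matched nodes in the same loop.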
The KO system can also be used for chemical annotation, which is the linking of genomic or transcriptomic contents of genes to chemical structures of endogenous molecules. This is achieved by finer classifications of KO groups for specific classes of enzymes distinguishing different substrate specificity, as well as accumulating knowledge of biosynthetic pathways. For example, glycans are synthesized by a series of reactions catalyzed by glycosyltransferases. With the KEGG pathway maps for glycan structures (map01030 and map01031) or the KEGG GLYCAN composite structure map (4), where edges (glycosidic linkages) correspond to K numbers (glycosyltransferase orthologs), the gene content in the genome can be converted to possible glycan structures. In a similar but more sophisticated way, glycan structures can be predicted from microarray gene expression data (5). The KEGG resource will be made suitable to cope with the diversity of other molecules as well, including polyketides/non-ribosomal peptides (6), polyunsaturated fatty acids and terpenoids.
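The conversion of gene content to possible glycan structures can be pictured as a graph traversal: a linkage (edge) is usable only if the genome contains the corresponding K number, and the structures that can be reached from a starting structure are the candidates. The sketch below is an assumption-laden illustration of that idea, not the method used with the composite structure map. The edge list, structure labels and K number placeholders are invented.

```python
from collections import deque


def reachable_structures(edges, genome_kos, root):
    """Collect structures reachable from root using only enabled linkages.

    edges      -- iterable of (parent_structure, child_structure, k_number);
                  each edge stands for one glycosidic linkage (toy layout)
    genome_kos -- set of K numbers present in the genome
    root       -- label of the starting structure
    """
    # Keep only linkages whose glycosyltransferase ortholog is present.
    adjacency = {}
    for parent, child, k_number in edges:
        if k_number in genome_kos:
            adjacency.setdefault(parent, []).append(child)

    # Breadth-first search over the enabled linkages.
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in adjacency.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Invented labels: "Ka", "Kb", "Kc" stand in for real K numbers.
    toy_edges = [
        ("core", "core+GlcNAc", "Ka"),
        ("core+GlcNAc", "core+GlcNAc+Gal", "Kb"),
        ("core", "core+Fuc", "Kc"),
    ]
    print(sorted(reachable_structures(toy_edges, {"Ka", "Kc"}, "core")))
```

With expression data, the set `genome_kos` could be restricted to K numbers whose genes are expressed, which is one simple way to approach the prediction from microarray data mentioned above.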
Another type of chemical annotation is to characterize biological meaning in the chemical structures of small molecules. As reported previously (2), the knowledge of enzymatic reactions and associated chemical structure transformations is stored in KEGG REACTION and KEGG RPAIR. Each structure transformation is characterized by the RDM pattern (7), and most of the patterns are found uniquely or preferentially in specific categories of KEGG pathways (8). This tendency was used to predict the metabolic fate of xenobiotic chemical compounds. Software for reaction/pathway prediction is being developed as an upgrade of e-zyme and PathComp in KEGG LIGAND.
Enhancements to KEGG PATHWAY
KEGG PATHWAY has been significantly expanded over the last 2 years with the addition of about 50 new pathway maps, mostly for signal transduction, cellular processes and human diseases. However, the traditional KEGG metabolic pathway maps are still most widely used including the KGML (KEGG XML) version. They are now supplemented with two new features introduced as a response to user feedback. The first feature is a global map shown in , which is created as an SVG file by manually combining about 120 existing maps. Each node (circle) is a chemical compound and each line (curved or straight) connecting two nodes is a series of reactions (one to several reactions), which is also manually defined as a segment lacking branches. The new KEGG metabolism map allows the user to view and compare the entire metabolism, such as by mapping metagenomics data or microarray data. KGML users should also find the new KEGG metabolism map much easier to manipulate.
The new KEGG metabolism map created as an SVG file.
The other feature is KEGG MODULE, a new database that collects pathway modules and other functional units as a set of K numbers. Pathway modules are smaller pieces of subpathways (see the BRITE hierarchy ko00002), manually defined as consecutive reaction steps, operon or other regulatory units, phylogenetic units obtained by genome comparisons, etc. This new database also contains molecular complexes, facilitating better organization of data and knowledge, especially in KEGG BRITE. The hierarchy of molecular organization, such as the subunit organization of transporters or receptors, is represented by the M number that corresponds to a set of K numbers. Incidentally, a line segment in the new KEGG metabolism map that also corresponds to a set of K numbers is identified by the N number, representing a mechanistically defined network segment.
KEGG for medical and pharmaceutical applications
As of September 2007, KEGG PATHWAY contains 26 maps for human diseases, among which 19 were introduced in the last 2 years. The disease pathway maps are classified into four subcategories: 6 as neurodegenerative disorders (9), 3 each as infectious diseases and metabolic disorders, and 14 as cancers. Although such maps will continue to be added, they will never be sufficient to represent our knowledge of molecular mechanisms of diseases because in many cases it is too fragmentary to represent as pathways. KEGG DISEASE is another addition to the KEGG suite of databases accumulating molecular-level knowledge on diseases including genes, drugs and biomarkers. Our current effort is focused on the four subcategories of diseases mentioned above.
The number of entries in KEGG DRUG has also significantly increased over the last 2 years, and now covers all approved drugs in the US and Japan. KEGG DRUG is a structure-based database. Each entry is a unique chemical structure that is linked to standard generic names, and is associated with efficacy and target information as well as drug classifications. Target information is presented in the context of KEGG pathways and drug classifications are part of KEGG BRITE. The generic names are linked to trade names and subsequently to outside resources of package insert information (patient information) whenever available. This reflects our effort to make KEGG more useful to the general public.