KEGG GENES and ortholog annotation
One of the main objectives of the KEGG project has been to uncover higher level systemic functions of the cell and the organism from genomic and molecular-level information. The basis for genome annotation in KEGG, which is continuously performed for all sequenced genomes, is the KO system consisting of manually defined ortholog groups that correspond to individual nodes in the KEGG pathway maps and the BRITE functional hierarchies. Once genes are assigned KO identifiers or K numbers by the ortholog annotation procedure described below, the collective body of K numbers can be mapped to KEGG pathway maps and BRITE functional hierarchies, highlighting any subsystems present and enabling higher level functional interpretation of the genome.
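The mapping step described above can be pictured with a small sketch. It is a minimal illustration rather than the KEGG implementation: the gene names, the K numbers and the map memberships are toy placeholders, and `map_kos_to_pathways` is a hypothetical helper.

```python
# Toy data (assumed for illustration): gene -> K number assignments for one genome.
gene_to_ko = {
    "geneA": "K00844",
    "geneB": "K01810",
    "geneC": "K00850",
    "geneD": None,  # gene without a KO assignment
}

# Toy reference maps (assumed): map identifier -> K numbers that appear as nodes.
pathway_nodes = {
    "map00010": {"K00844", "K01810", "K00850", "K01623"},
    "map00020": {"K01647", "K01681"},
}


def map_kos_to_pathways(gene_to_ko, pathway_nodes):
    """Return, for each map, the sorted K numbers of the genome found on that map."""
    genome_kos = {ko for ko in gene_to_ko.values() if ko}
    highlighted = {}
    for map_id, nodes in pathway_nodes.items():
        present = nodes & genome_kos
        if present:  # keep only maps with at least one highlighted node
            highlighted[map_id] = sorted(present)
    return highlighted


highlighted = map_kos_to_pathways(gene_to_ko, pathway_nodes)
```

Here `highlighted` lists the nodes of each map that would be colored for the genome; maps with no matching K number are omitted.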
During the past two years, the ortholog annotation procedure has been significantly improved by the newly developed KOALA (KEGG Orthology and Links Annotation) tool. There are two types of annotation in KEGG. One is a genome-based annotation, assigning K numbers to genes in a given genome. The other is a KO-based annotation, assigning a given K number (such as in a pathway map) to genes in all organisms. In order to cope with an increasing number of complete genomes, the first annotation is now partially automated (except for selected reference organisms) with continuous efforts to manually improve the second cross-species annotation. The current KEGG annotation procedure is as follows.
- Gene information for completely sequenced genomes is computationally generated from RefSeq (7) and other public resources, and stored in the KEGG GENES database.
- Sequence similarity scores and best-hit relations are computationally generated from KEGG GENES by pair-wise genome comparisons using SSEARCH, and stored in the KEGG SSDB database.
- Automatic genome-based annotation is performed for a limited set (currently, about one-third) of K numbers, which are considered safe for such purpose based on the result of SSDB computation and the criteria of the KOALA tool.
- Manual annotation is performed across species for other K numbers using the KOALA and GFIT (8) tools. This step may involve addition/revision of ortholog groups, which is essential to increase the number of safe K numbers.
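The best-hit relations and the restriction of automatic annotation to safe K numbers can be sketched as follows. This is only an illustration under stated assumptions: the scores are invented, a bidirectional best hit stands in for the actual KOALA criteria (which are not specified here), and all function names are hypothetical.

```python
def best_hits(scores):
    """scores maps (query_gene, target_gene) -> similarity score.

    Returns query_gene -> best-scoring target_gene (first one wins on ties).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores_ab, scores_ba):
    """Gene pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(scores_ab)
    reverse = best_hits(scores_ba)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def transfer_safe_k_numbers(pairs, ko_of_reference_gene, safe_kos):
    """Assign a K number automatically only if it belongs to the safe set."""
    assigned, needs_manual_review = {}, []
    for new_gene, ref_gene in pairs:
        ko = ko_of_reference_gene.get(ref_gene)
        if ko in safe_kos:
            assigned[new_gene] = ko
        else:
            needs_manual_review.append(new_gene)
    return assigned, needs_manual_review


# Invented scores between a new genome (a*) and an annotated genome (b*).
scores_ab = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b2"): 650, ("a3", "b1"): 400}
scores_ba = {("b1", "a1"): 900, ("b1", "a3"): 400, ("b2", "a2"): 650, ("b2", "a1"): 300}

pairs = bidirectional_best_hits(scores_ab, scores_ba)
assigned, needs_manual_review = transfer_safe_k_numbers(
    pairs, {"b1": "K00001", "b2": "K00002"}, safe_kos={"K00001"}
)
```

In this toy case `a1` receives its K number automatically, while `a2` is left for manual curation because its candidate K number is not in the safe set.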
A glimpse of this procedure can be seen through the read-only versions of KOALA and GFIT tools available on the KO and GENES entry pages, respectively. The quality of KEGG ortholog annotation can be examined by two additional tools. The ortholog table tool displays the status of KO assignment for a given set of K numbers, which is useful to check the completeness of a pathway or a complex. The gene cluster tool displays the status of KO assignment along the chromosomal position of a given genome, which is useful to check the consistency of annotation for operon-like structures in bacterial genomes.
As of 3 September 2009, the KEGG GENES database contains 4.8 million genes in 1049 genomes. In comparison, the UniProt database (9) contains 9.4 million proteins from about half a million species. KEGG already covers half of the known protein universe and >90% of protein sequence families (Kanehisa,M., unpublished data). As the number of complete genomes increases, the coverage of the protein universe will also increase, but some protein families will remain uncovered, such as those of plant and viral proteins. These protein families are useful for analyzing, for example, EST data and metagenomics data, and they will be incorporated in the KO system.
KEGG PATHWAY and BRITE: reference knowledge bases
The KEGG reference pathway maps and BRITE reference hierarchies are created in a general way to be applicable to all organisms; namely, in terms of the orthologs defined by K numbers. The organism-specific pathways and hierarchies can then be generated by converting K numbers to gene identifiers in a given organism. In the past year, the KEGG PATHWAY database has been completely renovated. All the pathway maps have been redrawn using a newly developed tool called KegSketch, which generates KGML+ (meaning KGML + SVG) files. Internally the database update procedure is now based on the text manipulation of these files rather than the color manipulation of image files. For outside services, the coloring procedure continues to be done on image files, but the image file format has been changed from GIF to PNG to accommodate more colors. As a result, there is now no distinction between the global map (6) and the regular pathway maps; they can be manipulated in the same way both in the new KEGG Atlas tool and the traditional image map viewer. Another new feature in the outside service is the XML version of KEGG pathway maps, which is made available in both the original KGML format and the converted BioPAX level 2 format (10).
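The conversion from a reference map to an organism-specific map can be sketched on top of the XML representation. The fragment below is a simplified, hand-written imitation of a KGML file, not a real download; the organism code `xyz`, the gene identifiers and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified KGML-like fragment (assumed for illustration, heavily abridged).
KGML_SNIPPET = """<pathway name="path:ko00010" org="ko" number="00010">
  <entry id="1" name="ko:K00844" type="ortholog"/>
  <entry id="2" name="ko:K01810 ko:K06859" type="ortholog"/>
  <entry id="3" name="ko:K01623" type="ortholog"/>
  <entry id="4" name="cpd:C00031" type="compound"/>
</pathway>"""


def ortholog_entries(kgml_text):
    """Return entry id -> list of K numbers for entries of type 'ortholog'."""
    root = ET.fromstring(kgml_text)
    entries = {}
    for entry in root.findall("entry"):
        if entry.get("type") == "ortholog":
            names = entry.get("name", "").split()
            # Each name looks like 'ko:K00844'; keep the part after the prefix.
            entries[entry.get("id")] = [n.split(":", 1)[1] for n in names]
    return entries


def to_organism_specific(entries, ko_to_genes):
    """Replace the K numbers of each node by the genes of one organism."""
    return {
        entry_id: [gene for ko in kos for gene in ko_to_genes.get(ko, [])]
        for entry_id, kos in entries.items()
    }


# Hypothetical K number -> gene identifier table for an organism 'xyz'.
ko_to_genes = {"K00844": ["xyz:0001", "xyz:0002"], "K01810": ["xyz:0003"]}

reference_nodes = ortholog_entries(KGML_SNIPPET)
organism_nodes = to_organism_specific(reference_nodes, ko_to_genes)
```

Nodes that end up with an empty gene list correspond to map positions for which the organism has no assigned gene.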
KEGG LIGAND for chemical bioinformatics
The KEGG LIGAND database contains information about chemical structures and chemical reactions of endogenous molecules, from small molecules to larger biopolymers. Certain KEGG pathway maps contain reference chemical structures that can be used to link genomes to the chemical diversity of endogenous molecules. For example, the KEGG pathway map for N-glycan biosynthesis (map00510) contains both the biosynthetic pathway and the synthesized glycan structure. By mapping the genomic content of glycosyltransferases, such as for human (hsa00510), the organism-specific pathway and the organism-specific glycan structure can be seen. This type of structural mapping has been done more extensively in eukaryotic genomes to characterize the chemical structural diversity of glycans (11) and lipids (12). A potentially more interesting, but more difficult, problem is to link plant genomes to plant secondary metabolites. Plants are known to produce diverse chemical compounds, including those with medicinal and nutritional value, but their chemical architecture is more complex than that of simple biopolymers such as glycans and lipids. We have introduced KEGG PLANT, a new interface to the KEGG resource for plant research, especially for understanding relationships between genomic and chemical information of plant natural products.
We have also been trying to expand our knowledge of biochemical reactions from experimentally characterized reactions in Enzyme Nomenclature (KEGG ENZYME) to pathway-based definitions of reactions (KEGG REACTION) and to chemical structural motifs, called RDM patterns, that characterize reactions (KEGG RPAIR). The RDM patterns have been used to predict microbial biodegradation pathways from chemical structures of environmental compounds (13). As an extension of this line of research, the E-zyme tool for reaction prediction from a pair (or pairs) of chemical structures has been upgraded by introducing a new algorithm (14).
KEGG MEDICUS for analysis of network–disease associations
In KEGG, disease and drug information is being organized in more computable forms, especially for the analysis of molecular networks. As shown in the accompanying figure, the disease/drug resource, called KEGG MEDICUS, consists of the KEGG DISEASE and KEGG DRUG databases, a specific category of the KEGG PATHWAY database and a specific category of the KEGG BRITE database. Disease information is computerized in two forms: pathway maps and gene/molecule lists. The Human Diseases category of the KEGG PATHWAY database contains about 40 pathway maps for cancers, immune disorders, neurodegenerative diseases, circulatory diseases, metabolic disorders and infectious diseases. When the detail of the molecular network is not known but disease genes are identified, we use the gene/molecule list representation and create a KEGG DISEASE entry. The entry contains a list of known disease genes and other relevant molecules including environmental factors, diagnostic markers and therapeutic drugs. The list simply defines membership in the underlying molecular system, but is still useful for computational analysis.
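One simple way in which such membership lists lend themselves to computation is set overlap with a query gene list. The sketch below assumes a reduced record layout and invented identifiers; `DiseaseEntry` and `rank_by_gene_overlap` are hypothetical names and do not reflect the actual KEGG DISEASE entry format.

```python
from dataclasses import dataclass, field


@dataclass
class DiseaseEntry:
    """Reduced, assumed representation of a gene/molecule list entry."""
    entry_id: str
    name: str
    genes: set = field(default_factory=set)
    markers: set = field(default_factory=set)
    drugs: set = field(default_factory=set)


def rank_by_gene_overlap(entries, query_genes):
    """Rank entries by the number of disease genes shared with the query."""
    query = set(query_genes)
    ranked = []
    for entry in entries:
        shared = entry.genes & query
        if shared:
            ranked.append((entry.entry_id, sorted(shared)))
    # Largest overlap first; ties are broken by entry identifier.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


# Invented entries and an invented query (for example, genes found in an experiment).
entries = [
    DiseaseEntry("D_0001", "hypothetical disease 1", genes={"G1", "G2", "G3"}, drugs={"drugX"}),
    DiseaseEntry("D_0002", "hypothetical disease 2", genes={"G3", "G4"}),
    DiseaseEntry("D_0003", "hypothetical disease 3", genes={"G9"}),
]
ranking = rank_by_gene_overlap(entries, ["G2", "G3", "G4", "G7"])
```

The ranking uses only list membership, which mirrors the point that no network detail is required for this kind of analysis.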
KEGG MEDICUS for disease and drug information
The KEGG DRUG database is a chemical structure-based information resource for all prescription and OTC drugs in Japan including crude drugs and TCM formulas, as well as most prescription drugs in the USA and many prescription drugs from Europe. In addition to chemical structures (or chemical compositions for multi-component drugs) and therapeutic efficacy of about 9000 drugs (as of September 2009), different drug classification systems are maintained as part of the KEGG BRITE functional hierarchies. Some are based on the established classification systems to which KEGG DRUG entries are assigned, including the ATC (Anatomical Therapeutic Chemical) classification by WHO, therapeutic category of prescription drugs in Japan and classification of OTC drugs in Japan. There are additional classification systems developed by KEGG, including those for crude drugs and TCM formulas.
Furthermore, KEGG DRUG contains information about two types of molecular networks. The first network is a molecular interaction network representing interactions and/or relations with target molecules (often in the context of pathway maps), drug metabolizing enzymes, drug transporters and other drugs (especially those causing adverse effects). The second network is a network of chemical structure changes in small molecules, which includes series of chemical modifications introduced by medicinal chemists in the history of drug development (in the KEGG drug structure maps), secondary metabolic pathways for biosynthesis of druggable natural products and drug metabolism (both in the KEGG pathway maps). We have analyzed the chemical architecture of marketed drugs and the patterns of chemical structure transformations in the history of drug development (15), in a similar spirit to the RDM patterns of chemical structure transformations in enzyme-catalyzed reactions. As illustrated in the accompanying figure, the second network may be used to analyze the chemical architecture of natural products and the chemical architecture of marketed drugs towards drug discovery from the genomes of plants and microorganisms. Furthermore, the second network may have relevance in understanding drug metabolism and biodegradation of environmental substances by considering not only the human genome but also the metagenome of the human body.
KEGG accumulates knowledge about the networks of chemical structure transformations for linking genomes to chemical structures.