As the number of fully sequenced genomes grows rapidly thanks to advances in next-generation sequencing technology, genome science now faces the need to analyse huge amounts of genomic data. For example, 3402 organisms have been fully sequenced and 13 796 additional organisms are currently being sequenced according to the Genomes OnLine Database (GOLD) (1) at the time of writing. It is crucial to identify orthologous genes (orthologs), that is, genes in different species that have diverged from a single gene of their last common ancestor by speciation. The concept of orthologs plays a key role in the functional annotation of newly sequenced genomes, because orthologs tend to have equivalent functions. In fact, functional annotation in many public databases is usually performed based on the sequence similarities of genes across different organisms. Such similar genes are often grouped together in the same ortholog cluster (OC), which naturally correlates with functional classification. In practice, functional ontology classes such as Gene Ontology (GO) (2) are assigned to each gene. However, the reliability of similarity-based functional annotation depends heavily on the similarity threshold, which should vary from one gene family to another. An OC provides an appropriate boundary for each sequence family, which can greatly improve the quality and scalability of functional annotation.
From the viewpoint of systems biology, automatic pathway reconstruction is also important, because higher-level biological functions can be understood in terms of pathways, or molecular interaction networks of gene products (e.g. metabolic pathways, regulatory pathways). KEGG PATHWAY is a typical pathway database and has a pathway-based assignment of orthologs named KEGG Orthology (KO), where each KO entry represents an ortholog group that is linked to a gene product in the KEGG pathway diagram (3). Once the KO identifiers (IDs) are assigned to genes in a genome, organism-specific pathways can be computationally generated, linking genomes to biological systems. However, the KO entries are manually defined in KEGG, and only a limited number of genes have been assigned to them. As the number of organisms stored in the KEGG database is growing exponentially, manual assignment of KO entries may lag behind. Automatically constructed OCs are expected to assist automatic pathway reconstruction in KEGG.
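The step from KO assignments to organism-specific pathways can be illustrated with a minimal sketch. It assumes two hypothetical inputs that are not taken from KEGG: a mapping from the genes of one genome to KO IDs, and a mapping from each reference pathway to the KO IDs that appear in it. All gene, KO and pathway names below are placeholders, and the function name `reconstruct_pathways` is our own.

```python
def reconstruct_pathways(gene_to_ko, pathway_to_kos):
    """Return, for each reference pathway, the KO IDs found in the genome.

    gene_to_ko:     dict mapping gene ID -> KO ID (hypothetical input)
    pathway_to_kos: dict mapping pathway ID -> set of KO IDs (hypothetical input)
    Pathways with no matching KO ID are omitted from the result.
    """
    kos_in_genome = set(gene_to_ko.values())
    organism_pathways = {}
    for pathway, kos in pathway_to_kos.items():
        found = sorted(kos & kos_in_genome)
        if found:
            organism_pathways[pathway] = found
    return organism_pathways


# Placeholder data, for illustration only.
gene_to_ko = {"geneA": "KO_1", "geneB": "KO_2", "geneC": "KO_9"}
pathway_to_kos = {
    "pathway_X": {"KO_1", "KO_2", "KO_3"},
    "pathway_Y": {"KO_7"},
}
print(reconstruct_pathways(gene_to_ko, pathway_to_kos))
```

In this toy setting, a gene without a KO ID contributes nothing to the reconstruction, which is why the coverage of KO assignment matters for newly added genomes.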
Computational identification of orthologs has been a longstanding problem in computational biology. The pioneering work is COG/KOG, which is based on best-hit triangles between genes (4). COG/KOG provides high-quality reference clusters, but it requires manual curation and lacks reproducibility. Given the rapidly increasing number of fully sequenced genomes, it is necessary to construct and update OCs automatically. A serious problem in automatic OC construction is the difficulty of clustering a huge number of genes at once because of the prohibitive computational cost. Recently, a variety of computational methods and databases have been developed to construct OCs from gene sequence similarity, and these methods can be categorized into multiple genome comparison and pairwise genome comparison. The multiple genome comparison approach is based on the clustering of genes across more than two organisms, as in COG/KOG. Examples include EGO/TOGA (5), MultiParanoid (6), OrthoMCL (7), OMAbrowser (8), MBGD (Microbial Genome Database) (9) and eggNOG (10), where taxonomic information is also used in OMAbrowser and eggNOG. The pairwise genome comparison approach is based on the matching of genes between only two organisms. Examples include InParanoid (11) and Roundup (12), which are based on the bidirectional best hit and the reciprocal smallest distance, respectively. However, it is difficult to use OCs previously constructed by other groups in the KEGG database because of data incompatibility problems and insufficient coverage of organisms. There is, therefore, a strong incentive to develop methods that identify orthologous genes and construct OCs every time complete genomes are newly sequenced.
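The bidirectional-best-hit criterion used in pairwise genome comparison can be sketched as follows. This is only an illustration of the general idea, not the procedure of any of the tools cited above. It assumes that similarity scores between the genes of two genomes are already available as dictionaries. The function names, gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    scores: dict mapping (query_gene, target_gene) -> similarity score.
    On ties, the pair encountered first is kept.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genomes A and B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90,
          ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here gene a2 has b2 as its best hit, but b2 prefers a3, so the pair (a2, b2) is not reported. Extending such pairwise relations consistently to many genomes is what makes the clustering problem costly.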
In this article, we present KEGG OC (KEGG Ortholog Cluster), a novel database of OCs based on whole genome comparison. The OCs in KEGG OC were constructed by applying a novel clustering method to all possible protein-coding genes in all complete genomes, based on their amino acid sequence similarities. The originality of our clustering algorithm lies in the use of a quasi-clique search (a variant of clique search) and the incorporation of phylogenetic information into the clustering. The method computes OCs efficiently, which makes it possible to update the contents regularly. KEGG OC has the following advantages over existing databases in terms of organism coverage and compatibility with KEGG. (i) It covers all fully sequenced genomes registered in KEGG, from a wide variety of organisms in the three domains of life (eukaryotes, bacteria and archaea), and the number of organisms is the largest among existing databases. (ii) It is compatible with KEGG by sharing the same set of genes and IDs, which leads to seamless integration of OCs with KEGG PATHWAY (biological pathways), KEGG MODULE (functional modules), KEGG BRITE (functional hierarchy), KEGG MEDICUS (diseases and drugs) and many more (3