The ongoing sequencing of multiple genomes creates a growing need for functional annotation. Comparative approaches based on ortholog identification have been particularly useful, enabling protein function to be inferred based on information available from other species, and providing the raw material for evolutionary analysis (1). Homologous proteins share a common ancestry, and may be characterized as orthologs (which diverged from a common ancestral gene owing to speciation) or paralogs (which derive from a gene duplication event) (2). In general, orthologous genes are expected to retain similar (if not identical) function, while paralogs may more readily acquire novel functional roles.
OrthoMCL is a graph-clustering algorithm designed to identify homologous proteins based on sequence similarity, and to distinguish orthologous from paralogous relationships without computationally intensive phylogenetic analysis (3). The algorithm first flags probable orthologous pairs, identified by BLAST analysis as reciprocal best hits across two genomes (1), creating a graph in which the edge weight connecting each protein pair is based on BLAST similarity scores. In addition, probable in-paralogs arising from duplication events subsequent to species divergence (2) are identified as sequences within the same genome that are (reciprocally) more similar to each other than either is to any sequence from other genomes, i.e. reciprocal better hits (3). Attaching these in-paralogous relationships, and incorporating edges connecting the resulting co-orthologs, overcomes the inability of simple reciprocal best hit approaches to detect many-to-many relationships (3). Edge weights are then adjusted to account for genome-to-genome similarity averages, and the resulting graph is clustered using the MCL algorithm (5), which breaks large clusters held together by weak single linkages into smaller clusters that more robustly represent truly orthologous relationships (3). In contrast to TribeMCL (7), which clusters proteins based on all BLAST similarities, producing large protein families, OrthoMCL focuses on the identification of proteins whose similarity suggests true orthology. As a fully automated method, OrthoMCL can be applied to multi-species datasets, bypassing the labor-intensive manual curation involved in the construction of the NCBI KOG (euKaryotic Ortholog Groups) database (6). Preliminary results indicate that OrthoMCL groups exhibit higher levels of functional consistency than groups produced by other ortholog identification algorithms (data not shown).
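The pair-detection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OrthoMCL implementation: the function name `build_graph`, the input layout (a dictionary of directed all-against-all similarity scores plus a protein-to-genome map), the use of the mean of the two directed scores as the edge weight, and the normalization by the mean edge weight of each genome pair are all hypothetical choices made here for clarity. Scores are assumed to be positive, and the final MCL clustering step is not shown.

```python
from collections import defaultdict


def build_graph(hits, genome_of):
    """Return {frozenset({p, q}): (kind, normalized_weight)} (illustrative only)."""
    # Assumption: keep only pairs that hit each other in both directions,
    # and use the mean of the two directed scores as the pair's similarity.
    sym = {}
    for (a, b), score in hits.items():
        if a != b and (b, a) in hits:
            sym[frozenset((a, b))] = (score + hits[(b, a)]) / 2.0

    # best[p][g] = (score, partner): highest-scoring partner of p in genome g.
    best = defaultdict(dict)
    for pair, score in sym.items():
        a, b = tuple(pair)
        for p, q in ((a, b), (b, a)):
            g = genome_of[q]
            if g not in best[p] or score > best[p][g][0]:
                best[p][g] = (score, q)

    def best_cross_genome(p):
        # Best score of p against any protein from a different genome.
        return max((s for g, (s, _) in best[p].items() if g != genome_of[p]),
                   default=float("-inf"))

    edges = {}
    inparalogs = defaultdict(set)
    for pair, score in sym.items():
        a, b = tuple(pair)
        ga, gb = genome_of[a], genome_of[b]
        if ga != gb:
            # Reciprocal best hit across two genomes: probable orthologs.
            if best[a][gb][1] == b and best[b][ga][1] == a:
                edges[pair] = ("ortholog", score)
        elif score > best_cross_genome(a) and score > best_cross_genome(b):
            # Reciprocal better hit within one genome: probable in-paralogs.
            edges[pair] = ("in-paralog", score)
            inparalogs[a].add(b)
            inparalogs[b].add(a)

    # Co-orthologs: link each ortholog (and its in-paralogs) to the partner
    # (and its in-paralogs) wherever a reciprocal similarity exists.
    for pair in [p for p, (kind, _) in edges.items() if kind == "ortholog"]:
        a, b = tuple(pair)
        for x in {a} | inparalogs[a]:
            for y in {b} | inparalogs[b]:
                key = frozenset((x, y))
                if key in sym and key not in edges:
                    edges[key] = ("co-ortholog", sym[key])

    # Assumption: normalize each weight by the mean weight of all edges
    # joining the same genome pair (or lying within the same genome).
    groups = defaultdict(list)
    for pair, (_, w) in edges.items():
        groups[frozenset(genome_of[p] for p in pair)].append(w)
    mean = {g: sum(ws) / len(ws) for g, ws in groups.items()}
    return {pair: (kind, w / mean[frozenset(genome_of[p] for p in pair)])
            for pair, (kind, w) in edges.items()}


# Toy input: genome A holds a1 and a2, genome B holds b1 (hypothetical scores).
genome_of = {"a1": "A", "a2": "A", "b1": "B"}
hits = {("a1", "b1"): 100, ("b1", "a1"): 100,
        ("a1", "a2"): 150, ("a2", "a1"): 150,
        ("a2", "b1"): 80, ("b1", "a2"): 80}
graph = build_graph(hits, genome_of)
```

In this toy input, a1 and b1 are reciprocal best hits, and a1 and a2 are more similar to each other than either is to anything in genome B. The a2-b1 edge is therefore added as a co-ortholog relationship, giving a many-to-one grouping that reciprocal best hits alone would miss.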
OrthoMCL was designed to address the difficulties inherent in identifying eukaryotic orthologs, focusing on the recognition of recent duplications, and the use of Markov clustering to separate groups linked by protein fusions (7). An initial report described clustering of six eukaryotic genomes and one reference prokaryotic species (Escherichia coli). Many additional genome sequences have been released in the last 2 years, however, stimulating considerable demand for the identification of ortholog groups. This report describes clustering of the predicted proteomes for 35 eukaryotic and 20 diverse prokaryotic species (both bacteria and archaea), spanning the tree of life (), and an online database for browsing, querying and retrieving these clusters.