|Home | About | Journals | Submit | Contact Us | Français|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. A total of 511797 proteins (81.6% of the total dataset) were clustered into 70388 ortholog groups. The ortholog database may be queried based on protein or group accession numbers, keyword descriptions or BLAST similarity. Ortholog groups exhibiting specific phyletic patterns may also be identified, using either a graphical interface or a text-based Phyletic Pattern Expression grammar. Information for ortholog groups includes the phyletic profile, the list of member proteins and a multiple sequence alignment, a statistical summary and graphical view of similarities, and a graphical representation of domain architecture. OrthoMCL software, the entire FASTA dataset employed and clustering results are available for download. OrthoMCL-DB provides a centralized warehouse for orthology prediction among multiple species, and will be updated and expanded as additional genome sequence data become available.
The ongoing sequencing of multiple genomes creates a growing need for functional annotation. Comparative approaches based on ortholog identification have been particularly useful, enabling protein function to be inferred based on information available from other species, and providing the raw material for evolutionary analysis (1). Homologous proteins share a common ancestry, and may be characterized as orthologs (which diverged from a common ancestral gene owing to speciation) or paralogs (which derive from a gene duplication event) (2). In general, orthologous genes are expected to retain similar (if not identical) function, while paralogs may more readily acquire novel functional roles.
OrthoMCL is a graph-clustering algorithm designed to identify homologous proteins based on sequence similarity, and distinguish orthologous from paralogous relationships without computationally intensive phylogenetic analysis (3). The algorithm first flags probable orthologous pairs identified by BLAST analysis as reciprocal best hits across two genomes (1), creating a graph in which edge weights connecting each protein pair are based on BLAST similarity scores. In addition, probable in-paralogs arising from duplication events subsequent to species divergence (2) are identified as sequences within the same genome that are (reciprocally) more similar to each other than either is to any sequence from other genomes, i.e. reciprocal better hits (3). Attaching these in-paralogous relationships, and incorporating edges connecting the resulting co-orthologs, overcomes the inability of simple reciprocal best hit approaches to detect many-to-many relationships (3,4). Edge weights are then adjusted to account for genome-to-genome similarity averages, and the resulting graph is clustered using the MCL algorithm (5), reducing large clusters containing weak single linkages into smaller clusters that are more robust in their representation of truly orthologous relationships (3). In contrast to TribeMCL (7), which clusters proteins based on all BLAST similarities, producing large protein families, OrthoMCL focuses on the identification of proteins whose similarity suggests true orthology. As a fully automated method, OrthoMCL is applicable to multiple species datasets by bypassing the labor-intensive manual curation involved in the construction of the NCBI KOG (euKaryotic Ortholog Groups) database (6). Preliminary results indicate that OrthoMCL groups exhibit higher levels of functional consistency than other ortholog identification algorithms (data not shown).
OrthoMCL was designed to address the difficulties inherent in identifying eukaryotic orthologs, focusing on the recognition of recent duplications, and the use of Markov clustering to separate groups linked by protein fusions (7). An initial report described clustering of six eukaryotic genomes and one reference prokaryotic species (Escherichia coli K12) (3). Many additional genome sequences have been released in the last 2 years, however, stimulating considerable demand for the identification of ortholog groups. This report describes clustering of the predicted proteomes for 35 eukaryotic and 20 diverse prokaryotic species (both bacteria and archaea), spanning the tree of life (Figure 1), and an online database for perusing, querying and retrieving of these clusters.
Translated protein sequences for all eukaryotic genomes considered complete as on July 2005 were obtained from the following sources: bacterial and archaeal sequences from GenBank (8); many eukaryotic sequences (e.g. Drosophila melanogaster and Homo sapiens) from Ensembl (9); other sequences from the relevant sequencing centers or organism-specific databases (see Table 1). In some cases, this resulted in the inclusion of proteins derived from differentially spliced transcripts. Because various naming systems are used for protein identification at the different source sites, a unified sequence accession format (consisting of the genome abbreviation followed by a number) was used to provide each protein with a unique identifier. Original sequence identifiers were incorporated into the sequence description. A total of 627098 protein sequences was obtained from 55 genomes (see Table 1).
OrthoMCL was originally designed as a pipeline integrated with a GUS (Genomic Unified Schema) relational database (http://www.gusdb.org). In response to multiple requests from users, a stand-alone Perl script version of OrthoMCL is now available from the website, allowing this ortholog clustering algorithm to be run without a relational database. OrthoMCL accepts as input a tab-delimited summary of all-against-all sequence similarity search data, including estimates of statistical significance in the form of expectation values. For this dataset, a single FASTA file was compiled from all genomes, and a WU-BLASTP (10) search was performed using the following parameters: E = 1 × 10−5 wordmask = seg + xnu W = 3 T = 1000. BLAST results were fed into the stand-alone OrthoMCL program using a default MCL inflation parameter of 1.5.
Results from the OrthoMCL clustering were loaded into a custom MySQL relational database, along with additional computational analysis made available via the web interface. Pfam 17.0 domain assignments were generated for each sequence based on hmmpfam (http://hmmer.wustl.edu/), using the gathering cut-off (11). Summary statistics on sequence similarity for each group include percentage match pairs (fraction of protein pairs aligned in the initial all-against-all WU-BLASTP search), average E-value (based on log [E-value]), average percent coverage (fraction of aligned regions, based on the shorter sequence) and average percent identity. In addition, MUSCLE multiple sequence alignment (12) and BioLayout graphical visualization of sequence similarities (13) are provided for groups with ≤100 proteins. The OrthoMCL-DB web interface is run by Perl CGI scripts that implement a simple MVC (Model View Controller) architecture provided by the CGI::Application Perl module. The relational database schema and associated Perl scripts for data loading are available from the authors.
The unrooted species tree shown in Figure 1 was calculated using the PHYLIP program ‘neighbor’ for neighbor joining (14), where the distances between two species (dij) are calculated based on the number of ortholog groups shared between two species (nij), normalized to account for the number of ortholog groups observed in the two species considered separately (ni, nj):
Note that only ortholog groups containing proteins from at least two species were considered for this analysis.
In this implementation of OrthoMCL, 511797 of 627098 protein sequences (81.6%) were clustered into 70388 ortholog groups, as summarized for each species in Table 1. In some species—particularly those eukaryotes showing extensive gene duplications—the number of protein sequences is much higher than the number of ortholog groups identified. For example, while 3295 out of 4242 Escherichia coli sequences (78%) were clustered into 2536 groups (average of 1.3 E.coli sequences/group), 78731 of 88149 sequences from the Oryza sativa (rice) genome (89%) were represented by just 18933 groups (average of 4.2 O.sativa sequences/group). An average of 7.3 sequences were identified per ortholog group (min. 2, max. 822), representing an average of 4.3 species (min. 1, max. 55). As a consequence of the conservative approach used for ortholog identification, OrthoMCL groups tend to be small, containing only a handful of sequences from a limited number of species. In some cases, ancient out-paralogs of these genes may be represented by other groups, and protein family clustering methods such as TribeMCL (7) could be helpful in identifying such relationships.
A relatively non-stringent E-value threshold (1 × 10−5) was used for inclusion of BLAST hits in the OrthoMCL graph, in order to ensure identification of distantly diverged orthologs. Although this might be expected to include many false positives, rules applied during group identification (reciprocal best/better hits, Markov clustering) eliminate most poorly alignable sequences. Considering the entire clustered dataset, 79% of all pairs within OrthoMCL groups were recognized in the initial BLAST search, and display an average E-value = 1 × 10−114, average percent identity = 53% and average percent coverage = 85%. The performance of this algorithm has been validated by comparison with other ortholog identification algorithms, and assessing consistency with EC number annotations (3).
Only six ortholog groups, representing ribosomal proteins and tRNA synthetases, contain proteins from all 55 genomes. It is not surprising that so few universal ortholog groups can be identified by similarity-based clustering alone, given the reduced gene content of some minimalist genomes, and the high degree of horizontal transfer and gene displacement observed in bacterial and archaeal species. A total of 20583 ortholog groups contain only in-paralogs from a single species lineage, representing both organism-specific inventions, and ancient duplications retained in one lineage only (among those in the dataset).
The total number of shared ortholog groups for all pairwise species comparisons (available from the OrthoMCL-DB website as an Excel spreadsheet) can be used as an indication of phylogenetic distance between species (15), providing the basis for evolutionary reconstruction based on total proteomic evidence. The number of shared ortholog groups ranges from a low of 54 groups representing sequences from both Nanoarchaeum equitans and Chlamydophila pneumoniae, to a high of 15954 groups with members from both Mus musculus and Rattus norvegicus. A tree of life constructed from these data closely reflects current understanding of organismal evolution (Figure 1), clustering the Archaea, Bacteria and Eukaryota in distinct groups, and clearly defining known eukaryotic assemblages, including the Plants/Algae, Apicomplexa and Unikonts [animals, fungi, microsporidia, slime molds (Dictyostelium) and amoebae (Entamoeba)] (16).
This total evidence tree reflects the evolutionary history of complete genomes, and it is interesting to note the relatively uniform branch lengths for all taxa, in contrast to the extreme variations in branch length often observed for trees based on individual genes. Differences between the topology of this tree and individual gene phylogenies, such as rRNA trees (17), include the grouping of Dictyostelium, Entamoeba and microsporidia with animals, and the deeper branching of Plants/Algae than Apicomplexa within the eukaryotic world. Some of these differences may be explained by events producing significant changes in gene content: gene loss, evolutionary convergence (especially in pathogen species), endosymbiosis and other cases of massive horizontal gene transfer. Despite the low resolution of prokaryotic phylogeny in this analysis (based on a limited taxonomic sampling), the observed topology resembles other analyses of prokaryotes (18).
The OrthoMCL-DB web interface provides a convenient means to search for sequences (and their corresponding ortholog groups) based on protein accession number or text keywords. In addition, a BLAST-based sequence similarity search function is provided, allowing users to find their favorite sequence or identify homologs that have been clustered into ortholog groups. Users are cautioned that identifying a homolog in a given ortholog group does not necessarily imply that the query sequence is in fact an ortholog to members of that group. Ortholog groups themselves can be searched by group accession number, or based on ortholog group summary statistics, including group size, average pairwise BLAST expectation value, average pairwise percent identity/coverage or percentage of matched pairs.
To further assist users in extracting biologically interesting ortholog groups, an interface permits queries based on phyletic patterns of conservation, using either a graphical form or text-based expressions. Both methods allow the user to identify ortholog groups by defining the desired pattern of the species representation. The graphical form lists all 55 species, organized by taxonomic clade, with toggle buttons that the user clicks to change status. A green check mark ‘√’ icon is used to represent required presence of a protein from a given species or clade, a red ‘x’ icon for required absence, or a gray circle icon ‘•’ meaning that the presence or absence of proteins from this species should not affect the result. This query form may be used, for example, to identify all groups containing proteins found in all eukaryotes but completely absent from the bacteria, regardless of their presence or absence in archaea.
For more intricate queries, such as the identification of genes that are specifically amplified in insects, a text-based form allows patterns to be expressed using a custom grammar called phyletic pattern expression (PPE). Individual grammatical units of PPE expressions are composed of two parts:
Multiple expression units may be combined using ‘AND’ or ‘OR', and may use parentheses to provide explicit execution ordering.
OrthoMCL-DB also provides a query history page, detailing all of the queries executed in the current session. Previous query results may be retrieved, and separate results can be further merged via intersection, union or subtraction operations, permitting very complicated queries to be generated by combining different ortholog group query methods. For example, the user may wish to identify ortholog groups that are well conserved (percent identity ≥ 70%), entirely absent in bacteria and archaea, present in at least five eukaryotic genomes, and expanded in Homo sapiens to include at least 10 recent paralogs.
Ortholog groups are displayed to reflect patterns of phyletic conservation using a concise tabular form, along with summary statistics for the ortholog group and hyperlinks to view or download related sequence data (Figure 2). Precomputed information available for most ortholog groups includes Pfam domain architecture, visualizations of OrthoMCL similarity graphs generated using BioLayout software and multiple sequence alignments generated using MUSCLE. These resources provide useful insights into the evolution and organization of proteins within individual ortholog groups.
In summary, OrthoMCL-DB provides flexible web-based access to the results of a powerful algorithm for automated ortholog identification, applied to most of the currently available eukaryotic genomes and a representative selection of prokaryotic genomes. We anticipate reclustering and updating the database at least twice a year, as additional eukaryotic genomes become available; inclusion of additional prokaryotic genomes will also be considered.
In addition to information available for browsing and querying via the web interface, the following data are available for bulk download as flat-files and/or SQL export files: all protein sequences from the current implementation of OrthoMCL-DB (in FASTA format), all clustering data (accession numbers for all proteins in each ortholog group), Pfam domain assignments for all proteins and summary statistics calculated for each group. An Excel spreadsheet lists the number of ortholog groups shared by all possible species pairs (data used to assemble the tree shown in Figure 1). The stand-alone version of OrthoMCL software is also downloadable.
This research was supported by NIH grant R01-AI058515, with website implementation covered by NIAID contract HHSN266200400037C, supporting the ApiDB Bioinformatics Resource Center. We thank Drs Li Li and Shailesh Date for helpful discussions, Lucia Peixoto for running MUSCLE software and Leon Goldovsky (European Bioinformatics Institute) for providing a special version of BioLayout Software. DSR is an Ellison Medical Foundation Scholar in Global Infectious Diseases. Funding to pay the Open Access publication charges for this article was provided by NIH grant R01-AI058515.
Conflict of interest statement. None declared.