The concept of homology drives speculation on a gene’s function in any given species when its biological roles in other species are characterized. With reference to a specific species radiation, homologous relations define orthologs, i.e. descendants from a single gene of the ancestor. The large-scale delineation of gene genealogies is a challenging task, and the numerous approaches to the problem reflect the importance of the concept of orthology as a cornerstone for comparative studies. Here, we present the updated OrthoDB catalog of eukaryotic orthologs, delineated in an explicitly hierarchical manner at each radiation of the species phylogeny for over 100 species of vertebrates, arthropods and fungi (including the metazoa level). New database features include functional annotations, and quantification of evolutionary divergence and relations among orthologous groups. The interface features extended phyletic profile querying and enhanced text-based searches. The ever-increasing sampling of sequenced eukaryotic genomes brings a clearer account of the majority of gene genealogies that will facilitate informed hypotheses of gene function in newly sequenced genomes. Furthermore, uniform analysis across lineages as different as vertebrates, arthropods and fungi with divergence levels varying from several to hundreds of millions of years will provide essential data for uncovering and quantifying long-term trends of gene evolution. OrthoDB is freely accessible from http://cegg.unige.ch/orthodb.
Recognizing similarities as evidence of shared ancestry describes the general biological concept of homology that can be applied specifically to genes encoded in complete genomes to delineate orthologs, the ‘equivalent’ genes in different species, and paralogs, gene duplicates within one genome (1–3). The rapidly increasing number of sequenced genomes presents a remarkable opportunity as well as a formidable challenge to resolve complex gene histories in unprecedented detail. As orthologous relations are defined by speciation events where orthologs arise by vertical descent from a single gene of the last common ancestor, such classifications are inherently hierarchical. Gene duplication events after speciation disrupt the 1:1 correspondence of genes among species and lead to the formation of orthologous groups, comprising all genes descended from a single gene of the last common ancestor. Many algorithms have been developed to apply these principles and meet the challenge of large-scale data analysis (4,5). These can be broadly classified into those that cluster the results from all-against-all pairwise sequence comparisons and those that employ phylogenetic tree-based methods. The growing number of different resources for cluster-based approaches [e.g. (6–12)], as well as for phylogenetic approaches [e.g. (13–20)], reflects the importance of ortholog identification as a cornerstone of comparative genomics that drives evolutionary and molecular biology research.
The preservation of orthologs across many species over long evolutionary periods, especially as single-copy genes, strongly supports hypotheses of conserved functionality (21). By contrast, duplication events may allow for functional divergence (22). Thus, although orthologous relations are not defined by gene function, inferences of common functions remain the most plausible evolutionary scenario and therefore justify one of the major objectives of orthology delineation: the tentative transfer of functional annotations from well-studied organisms to the newly sequenced species (23).
Here, we present the update of OrthoDB (10), the hierarchical catalog of eukaryotic orthologs, featuring an expanded species sampling, extensive functional and evolutionary annotation of the derived orthologous groups, as well as improved text-based searches, querying on the phyletic profile of orthologous gene copy numbers, and searches by sequence homology. OrthoDB is freely accessible from http://cegg.unige.ch/orthodb, and now referenced with link-outs from a number of resources including UniProt (24) and FlyBase (25).
Orthology is defined relative to the last common ancestor of the species being considered, thereby determining the hierarchical nature of orthologous classifications (1–3) (Supplementary Figure S1). This is explicitly addressed in OrthoDB (10) by application of the orthology delineation procedure at each radiation point of the considered phylogeny, empirically computed over the super-alignment of single-copy orthologs using a maximum-likelihood approach corroborated with known taxonomies. The OrthoDB implementation employs a best reciprocal hit (BRH) clustering algorithm based on all-against-all Smith–Waterman (26) protein sequence comparisons computed using PARALIGN (27). Gene set pre-processing selects the longest protein-coding transcript of alternatively spliced genes and of very similar gene copies (>97% identity). The newly optimized procedure triangulates BRHs with an e-value cutoff of 1e-3 to progressively build the clusters (non-triangulated BRHs are considered with an e-value cutoff of 1e-6) and requires an overall minimum sequence alignment overlap of 30 amino acids to avoid domain walking. These core clusters are further expanded to include all more closely related within-species in-paralogs and the previously identified very similar gene copies. Inspections of the OrthoDB orthologous classifications as part of several genome projects (28–31) and other comparative genomic studies (32–35) have confirmed their biological relevance and acceptable accuracy.
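The role of the two e-value cutoffs can be illustrated with a minimal Python sketch. This is not the OrthoDB implementation: the input format (a dictionary of directional hits keyed by gene pairs, each gene a `(species, gene_id)` tuple) and the function names are assumptions made for illustration, and the alignment-overlap filter, progressive cluster building and in-paralog expansion are omitted.

```python
from itertools import combinations

def best_reciprocal_hits(hits, max_evalue):
    """Return {frozenset({a, b}): evalue} for best reciprocal hits between species.

    hits maps (query, target) -> (score, evalue); each gene is a
    (species, gene_id) tuple. This input format is a hypothetical one.
    """
    best = {}  # (query gene, target species) -> (score, evalue, target gene)
    for (query, target), (score, evalue) in hits.items():
        if query[0] == target[0] or evalue > max_evalue:
            continue  # skip within-species and non-significant hits
        key = (query, target[0])
        if key not in best or score > best[key][0]:
            best[key] = (score, evalue, target)
    brhs = {}
    for (query, _species), (_score, evalue, target) in best.items():
        back = best.get((target, query[0]))
        if back is not None and back[2] == query:
            # Record the weaker (larger) e-value of the two directions.
            brhs[frozenset((query, target))] = max(evalue, back[1])
    return brhs

def triangulated_pairs(brhs):
    """Return the BRH pairs that are part of a triangle of mutual BRHs."""
    neighbours = {}
    for pair in brhs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    supported = set()
    for a, nbrs in neighbours.items():
        for b, c in combinations(sorted(nbrs), 2):
            if c in neighbours[b]:
                supported.update(
                    {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))}
                )
    return supported

def select_brhs(hits, triangle_cutoff=1e-3, pair_cutoff=1e-6):
    """Keep triangulated BRHs at the relaxed cutoff, others at the strict one."""
    brhs = best_reciprocal_hits(hits, triangle_cutoff)
    supported = triangulated_pairs(brhs)
    return {
        pair for pair, evalue in brhs.items()
        if pair in supported or evalue <= pair_cutoff
    }
```

In this sketch a BRH supported by a third species is accepted at the relaxed cutoff, whereas an isolated pair must meet the stricter one, reflecting the idea that corroboration from an additional genome compensates for weaker pairwise similarity.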
The complete predicted protein-coding gene sets were retrieved from publicly available genomic resources including 44 vertebrates from Ensembl (36) (Release 58, May 2010), 25 arthropods from AphidBase (37), BeetleBase (38), FlyBase (25), Hymenoptera Genome Database, SilkDB (39), VectorBase (40) and wFleaBase (41) (current releases in July 2010), and 46 fungi from UniProt (24) (August 2010 release). Gene sets for an additional five animal species were retrieved for orthology delineation across metazoa: lancelet, polyp, sea anemone, sea urchin, and worm (current releases in July 2010). For full details of the genome assembly and gene set releases used for each species, please see Supplementary Table S1.
Annotations describing putative functional attributes were sourced from UniProt (24), as well as from species-specific resources including Mouse Genome Informatics (MGI) (42), FlyBase (25) and Saccharomyces Genome Database (SGD) (43). UniProt identifier cross-referencing allowed mapping of gene annotations to the gene sets retrieved from Ensembl and other sources. The UniProt data were also employed to comprehensively map gene names and synonyms, as well as secondary gene identifiers and cross-referenced database gene identifiers, e.g. RefSeq, Entrez GeneID, GenBank, Protein Data Bank and Mendelian Inheritance in Man, as well as assigned Gene Ontology (GO) (44) attributes. The species-specific model organism databases (MGI, FlyBase and SGD) provided mapping to additional gene synonyms and identifiers as well as selected controlled-vocabulary gene phenotypes from relevant experimental data (Supplementary Table S2). Protein domain signatures were retrieved from InterPro (45) matches to the UniProt Archive (UniParc) of non-redundant protein sequences.
Analysis of the selected eukaryotic species focused on resolving orthologous relations at each radiation of the three sampled lineages, as well as delineating metazoan orthologous groups by analyzing the vertebrates and arthropods with five additional animal species. For the complete sets of vertebrates, arthropods and fungi, 87% of a total of 1611843 genes were classified into 18474, 20428 and 14088 orthologous groups, respectively (Table 1). The greater spans and faster evolutionary rates across the arthropod and fungal phylogenies (33,46,47) may limit the detection of very distant homology, leading to the observed lower proportions of classified genes compared to the vertebrates. Additional factors that may influence the proportions of classified genes include the completeness and coverage of genome sequencing as well as quality and consistency of gene repertoire predictions (e.g. variable strategies applied to arthropods).
Orthology delineation aims to identify groups of genes descended from a common ancestor, thereby enabling tentative functional attributes ascribed to one or more members to be generally extrapolated to describe the group as a whole. Protein-coding genes from model organisms are by far the best studied and therefore provide the most comprehensive annotations and insights into biological functions; however, ill-informed extrapolation of annotation across species can lead to error propagation. Leaving this to the expert, we merely summarize the available functional evidence of orthologous genes that is indicative of their common functional role.
OrthoDB orthologous group functional annotations are summarized from associated GO and InterPro attributes of individual genes, supplemented by data from representative model organisms. Of the just over 1.4 million orthologous group member genes, almost 95% are classified in orthologous groups that can be described by either GO terms (molecular function, biological process or cellular component) or InterPro domains, and more than 85% by both attributes (Figure 1, Supplementary Figure S2). For each orthologous group, summarizing the member gene GO (molecular function, biological process and cellular component) and InterPro annotations highlights the functional attributes that describe the orthologous group as a whole (Figure 2). These descriptions identify the frequencies of associated GO terms together with InterPro domains of member genes, and list succinct term and domain descriptions.
Mapping of selected model organism phenotype data identifies a significant proportion of orthologous groups with genes from model organisms that exhibit experimental phenotypes. This approach therefore facilitates querying of OrthoDB with key function-related terms from the respective phenotype ontologies, e.g. sterile, or cell cycle defective (Supplementary Table S2). For the representative model organisms in each lineage (Mus musculus, Drosophila melanogaster or Saccharomyces cerevisiae), gene synonyms and secondary identifiers, as well as selected associated phenotypes, are indicated with distinct icons linked to their respective database sources.
In addition, for each orthologous group member gene, concise UniProt functional descriptors are provided with links to the mapped entries. InterPro matches are displayed with domains ordered sequentially from the N- to C-terminus, describing the complete domain architecture of multidomain genes. The orthologous group summary annotations together with the attributes of individual gene members provide a snapshot of the available functional information, with extensive links to respective source databases, allowing further investigation of their putative biological roles.
Protein sequence divergence rate among orthologous group member genes, their phyletic gene copy-number profiles and their homology to genes in other orthologous groups are indicators of the level of confidence with which functional annotations from genes of well-studied model organisms may be transferred to other species. Evolutionary annotations of orthologous groups therefore complement the functional annotations by presenting these quantifiable evolutionary properties (Figure 2).
Orthologous groups that exhibit appreciably higher or lower levels of sequence divergence are highlighted through quantification of the relative divergence among their member genes. This relative divergence is computed for each orthologous group as the average of interspecies identities normalized to the average identity of all interspecies BRHs, with identities taken from pairwise Smith–Waterman alignments of protein sequences (Supplementary Figure S3).
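As a rough illustration of this normalization, the following Python sketch divides the mean interspecies identity within one group by the mean identity over all interspecies BRHs. The function name and the use of a single global average (rather than, say, one average per species pair) are assumptions; the text does not specify these details.

```python
from statistics import mean

def relative_identity(group_identities, all_brh_identities):
    """Mean identity within a group relative to the mean over all BRHs.

    Both arguments are sequences of percent identities from interspecies
    pairwise alignments. Values below 1 suggest a group that is more
    divergent than average; values above 1 suggest a more conserved one.
    """
    return mean(group_identities) / mean(all_brh_identities)
```

For example, a group whose interspecies identities average 70% against a background BRH average of 50% would score 1.4 under this sketch.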
Orthologous group phyletic profiles indicate the species coverage for the selected species radiation point and contrast the number of species with single-copy members and with multi-copy members.
Homologous relations among genes from different orthologous groups identify sets of related orthologous groups delineated for the specific level of the phylogeny. These relations are defined from pairwise Smith–Waterman comparisons between all members of an orthologous group to all members of any related groups with a cutoff of 1e-3. Related groups are identified at each level of the phylogeny-defined hierarchy, linking to ‘sibling orthologous groups’ as opposed to parent or child groups that would correspond to moving up or down the phylogeny.
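One possible way to derive such sibling relations is sketched below in Python. The data structures (`group_of`, a gene-to-group mapping for one level of the phylogeny, and `hits`, a dictionary of pairwise e-values) are hypothetical; only the 1e-3 cutoff comes from the description above.

```python
def related_groups(group_of, hits, cutoff=1e-3):
    """Link orthologous groups whose members share a significant hit.

    group_of maps gene id -> group id at one level of the phylogeny;
    hits maps (gene_a, gene_b) -> e-value. Both are assumed formats.
    """
    related = {}
    for (gene_a, gene_b), evalue in hits.items():
        group_a, group_b = group_of.get(gene_a), group_of.get(gene_b)
        if group_a is None or group_b is None:
            continue  # gene not classified at this level
        if group_a == group_b or evalue > cutoff:
            continue  # same group, or similarity not significant
        related.setdefault(group_a, set()).add(group_b)
        related.setdefault(group_b, set()).add(group_a)
    return related
```

Because the mapping is built separately for each level, the links always connect groups at the same level of the hierarchy, never a group to its parent or child.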
The phylogeny-defined hierarchy of orthologous groups in OrthoDB allows searches to be performed at specific radiation points by selecting a node of the interactive species trees. Selecting a node encompassing only a few closely related species will focus the search results on more fine-grained orthologous groups of mostly one-to-one relations. Moving toward the root of the tree will include more distantly related species and will generally retrieve more inclusive orthologous groups that contain all the descendants of the ancestral gene.
OrthoDB enables relevant data retrieval through specific queries using protein, gene, InterPro or GO identifiers, or more general searches with keywords, names or synonyms, descriptor terms or phrases. Gene annotations sourced from UniProt, supplemented with data from specific resources for representative model organisms from each lineage, provide rich annotations that facilitate comprehensive database searching. The text search feature provides additional flexibility using simple logical operator syntax to build complex queries; e.g. to optionally include variations of a term, or to exclude terms (Supplementary Table S3). In addition, specific protein domain architectures may be queried with a comma-separated N- to C-terminus ordered list of InterPro identifiers.
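The domain architecture query can be pictured as an ordered match against each protein's N- to C-terminus domain list. The Python sketch below assumes that the queried domains must appear in the given order but need not be adjacent; the actual matching rule used by the OrthoDB interface is not stated here, and the function names and InterPro identifiers in the usage note are illustrative only.

```python
def parse_architecture(query):
    """Split a comma-separated query into a list of InterPro identifiers."""
    return [token.strip() for token in query.split(",") if token.strip()]

def has_architecture(domains, query):
    """True if the queried domains occur in order within `domains`.

    domains is the protein's domain list ordered from N- to C-terminus.
    Other domains may be interspersed (an assumption of this sketch).
    """
    remaining = iter(domains)
    # `wanted in remaining` consumes the iterator up to the first match,
    # so each later domain must be found after the previous one.
    return all(wanted in remaining for wanted in parse_architecture(query))
```

With a protein annotated as `["IPR001452", "IPR000980", "IPR000719"]`, the query `"IPR000980, IPR000719"` matches, whereas the reversed query `"IPR000719, IPR000980"` does not.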
OrthoDB features the ability to search for orthologous groups with specific phyletic profiles to retrieve groups matching specific copy-number criteria such as all single-copy or all multi-copy orthologs. Combining the criteria of absent, present, single-copy, multi-copy or no restriction, for each species within a selected clade can generate numerous variations of user-defined phyletic profiles for database querying. These profile query options are extended through a selection of predefined common profiles with more relaxed search criteria, e.g. single-copy orthologs but allowing for a gene loss or duplication event in one species.
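The logic of such profile queries can be sketched in a few lines of Python. The criterion names follow the options listed above; the `max_exceptions` parameter is a hypothetical way to model the relaxed predefined profiles, and the species codes in the usage note are placeholders.

```python
# Per-species criteria, each a test on the gene copy number in that species.
CRITERIA = {
    "absent": lambda n: n == 0,
    "present": lambda n: n >= 1,
    "single-copy": lambda n: n == 1,
    "multi-copy": lambda n: n > 1,
    "any": lambda n: True,  # no restriction
}

def matches_profile(copy_numbers, profile, max_exceptions=0):
    """Check an orthologous group against a user-defined phyletic profile.

    copy_numbers maps species -> number of member genes (missing means 0);
    profile maps species -> one of the CRITERIA keys. Up to max_exceptions
    species may violate their criterion (assumed relaxation mechanism).
    """
    failures = sum(
        1 for species, criterion in profile.items()
        if not CRITERIA[criterion](copy_numbers.get(species, 0))
    )
    return failures <= max_exceptions
```

For instance, a group with copy numbers `{"hs": 1, "mm": 2}` fails the strict profile `{"hs": "single-copy", "mm": "single-copy"}` but passes it with `max_exceptions=1`, which corresponds to tolerating a duplication in one species.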
The BLAST search facility ensures that data interrogation is not limited by the coverage of detailed gene annotations. The relevant data of the orthologous group closest to the root-level are returned if a protein sequence match with a significant BLAST hit is identified. Such sequence-based queries help to circumvent potential ambiguities arising from multiple gene identifiers or synonyms from alternative resources or database releases.
Queries are stored during each user’s web browser session to enable reviewing and re-running of their recently executed queries. The type of search and the level in the species phylogeny at which it was performed, together with the number of orthologous groups returned, are displayed for each query. The user may re-run or delete individual queries or clear their complete query history.
The data for each orthologous group may be exported as either a Fasta-formatted file of protein sequences or a tab-delimited text file of members with their InterPro annotations. In addition, data for the complete set of groups retrieved from any OrthoDB query may be exported as both Fasta-formatted sequence and tab-delimited annotation files. The ‘Print Tables’ option exports the data tables for all retrieved groups to a printer-friendly HTML-formatted document that may be printed or saved as required.
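For readers who wish to post-process such exports, the two file formats can be mimicked with the short Python sketch below. The record fields (`gene_id`, `species`, `sequence`, `interpro`), the header layout and the column order are assumptions for illustration and may differ from the files OrthoDB actually produces.

```python
import csv
import io

def to_fasta(members, width=60):
    """Format group members as Fasta text with fixed-width sequence lines."""
    lines = []
    for member in members:
        lines.append(f">{member['gene_id']} {member['species']}")
        sequence = member["sequence"]
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n"

def to_tsv(members):
    """Format group members and their InterPro annotations as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", "species", "interpro"])
    for member in members:
        writer.writerow(
            [member["gene_id"], member["species"], ";".join(member["interpro"])]
        )
    return buffer.getvalue()
```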
OrthoDB data are cross-referenced with numerous biological databases, linking retrieved orthologous group gene members to their respective sources and allowing direct access to additional information. In addition, OrthoDB groups are referenced through link-outs from major community resources including UniProt and FlyBase.
Supplementary Data are available at NAR Online.
Funding: The Swiss National Science Foundation (31003A-125350). Funding for open access charge: Swiss Institute of Bioinformatics.
Conflict of interest statement. None declared.
The authors would like to thank Dr Ivo Pedruzzi and all members of the Computational Evolutionary Genomics Group for useful suggestions and discussions.