The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224 847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2 242 035 proteins (built from 2 590 259 proteins) and provides a broad functional description for at least 1 966 709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de.
Next-generation sequencing technologies are now generating a vast amount of sequence data. This leads to a dramatic increase in the number of predicted protein sequences, which serve as a starting point for structural, functional and phylogenomic studies. In such studies, high-throughput comparative analyses are often required to transfer information between organisms, for which the concept of orthology is crucial. The original definition by Fitch (1) describes orthologs as genes that diverged through a speciation event, as opposed to paralogs, which diverged after a duplication event. This has been extended and refined by introducing the concepts of orthologous groups (OGs) (2), in-paralogs and out-paralogs (3,4). In practice, however, the identification and classification of homologous genes remain very difficult and rely on operational definitions. An enormous effort is being put into the development of different approaches to establish orthologous relationships between genes from different genomes. These include simple graph-based methods, such as the reciprocal-best-hit approach (5), identification of best-hit triangles (2,6–8) and clustering-based approaches (9–11), as well as tree-based methods (12–16).
In addition to the quality of the grouping of genes, the practical usability of OGs is determined by the ability to provide a robust functional annotation. Thus, newer projects not only aggregate orthology information from various sources to allow comparison between methods but also aim to provide annotation tools (17,18). Nevertheless, eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) (19) and the COG/KOG/arCOG resources (2,6,7) are still the only databases providing explicit functional annotations for OGs at different hierarchical levels. The COG/KOG resource is based on robust manual expert annotation, which eggNOG uses and automatically extends (19).
Here, we describe the new features of the second version of eggNOG, a resource that provides OGs from the three domains of life at several levels of resolution. eggNOG v2 contains twice as many species and proteins as the previous version, additional hierarchical levels allowing higher resolution for a number of taxonomic groups, new annotation sources and an extended interface for an in-depth analysis of orthologous relationships.
The automated procedure described previously (19) has been used to assemble proteins into OGs from 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes). Complete proteomes were downloaded from the RefSeq (20), Ensembl (21), GiardiaDB (22) or TAIR (23) databases. This particular data set also forms the basis for STRING v8 (24) and STITCH v2 (25), allowing for easy integration across these databases.
Altogether, the protein data set covers 2 590 259 proteins of which 2 242 035 (87%) were included in at least one of 224 847 OGs generated by eggNOG. The growing number of species and proteins included in this release drastically increased the computational time. All-against-all similarity searches have therefore been performed using Basic Local Alignment Search Tool (BLAST) (26) instead of the Smith–Waterman algorithm (27).
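The graph-based idea behind this kind of grouping (reciprocal best matches that are linked into triangles) can be illustrated with a toy sketch. It is not the eggNOG pipeline itself, which is described in (19) and additionally handles in-paralogs and the merging of triangles into groups. The input structure `best_hit`, the `"<genome>|<protein>"` identifier convention and all function names are assumptions made for this illustration only.

```python
from itertools import combinations

def genome_of(protein_id):
    # Assumed id convention for this sketch only: "<genome>|<protein>".
    return protein_id.split("|", 1)[0]

def reciprocal_best_hits(best_hit):
    """Return unordered protein pairs that are each other's best hit.

    best_hit maps (query_protein, target_genome) to the best-scoring
    protein found in target_genome (for example, taken from BLAST output).
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        # The pair is reciprocal if the hit's best match in the
        # query's genome is the query itself.
        if best_hit.get((hit, genome_of(query))) == query:
            pairs.add(frozenset((query, hit)))
    return pairs

def best_hit_triangles(rbh_pairs):
    """Return protein triples in which all three pairs are reciprocal best hits."""
    proteins = sorted({p for pair in rbh_pairs for p in pair})
    return [
        trio for trio in combinations(proteins, 3)
        if all(frozenset(edge) in rbh_pairs for edge in combinations(trio, 2))
    ]

# Toy data: three genomes (A, B, C) with one conserved protein "x".
# A|y also hits B|x, but B|x prefers A|x, so that match is not reciprocal.
toy_best_hit = {
    ("A|x", "B"): "B|x", ("A|x", "C"): "C|x",
    ("B|x", "A"): "A|x", ("B|x", "C"): "C|x",
    ("C|x", "A"): "A|x", ("C|x", "B"): "B|x",
    ("A|y", "B"): "B|x",
}
toy_pairs = reciprocal_best_hits(toy_best_hit)
toy_triangles = best_hit_triangles(toy_pairs)
```

In this toy case the three `x` proteins form a single triangle, while `A|y` remains ungrouped. The exhaustive triple enumeration is only practical for small examples.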
Compared to the 4873 COGs and the 4850 KOGs that are constructed across all three domains of life and for all eukaryotes, respectively, this procedure assembles additional proteins into NOGs (440 359 proteins into 59 497 NOGs and 181 427 into 17 845 euNOGs). These complement the published COGs and KOGs built respectively for 66 and seven species (6), which are extended in eggNOG to cover 630 species encompassing, respectively, 1 547 381 and 483 043 proteins.
To provide a higher resolution of OGs in frequently used taxonomic groupings, we applied our procedure to several subsets of organisms separately. We updated the previously computed more fine-grained NOGs at the level of fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs) and added groups for archaea (arNOGs), fishes (fiNOGs), rodents (roNOGs) and primates (prNOGs).
An important feature of eggNOG is the functional annotation of the OGs. Our original pipeline, which provides functional descriptions for the NOGs, is now complemented by an automatic inference of functional categories (FCs), which were taken from the COG database (2). The 25 FCs available from the COG resource have been widely used in comparative genomics studies and will enable higher-order analyses of OGs identified in any data set.
We use two complementary methods to infer FCs of OGs based on the 4617 COGs (used for NOGs and arNOGs) and 4381 KOGs (used for all other OGs). The first method uses Support Vector Machines (SVM) trained on the COGs and KOGs to classify NOGs into the 25 FCs based on feature vectors. Two feature vectors were created for each OG. One was built from functional information mapped onto the eggNOG protein data set, including KEGG pathways and modules (28), GO terms (29), SMART domains (30), PFAM domains (31), UniProt keywords (32) and words from UniProt/RefSeq (20) description lines. The second feature vector includes also words from MEDLINE abstracts referring to a particular protein (24). Each attribute in the feature vector encodes the fraction of proteins in the group having the feature in question.
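The encoding of an OG as a feature vector, in which each attribute is the fraction of member proteins carrying a given feature, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the input layout (one set of feature labels per protein) and the feature labels themselves are hypothetical, and the SVM training step is not shown.

```python
def fraction_feature_vector(protein_features, vocabulary):
    """Encode one orthologous group as a vector of per-feature fractions.

    protein_features: list with one set of feature labels per protein
                      in the group (hypothetical input layout).
    vocabulary:       ordered list of all feature labels; it fixes the
                      position of each attribute in the vector.
    """
    n_proteins = len(protein_features)
    if n_proteins == 0:
        return [0.0] * len(vocabulary)
    return [
        sum(1 for feats in protein_features if label in feats) / n_proteins
        for label in vocabulary
    ]

# Hypothetical group of four proteins with made-up feature labels.
toy_group = [
    {"domain_A", "keyword_B"},
    {"domain_A", "keyword_B"},
    {"domain_A"},
    set(),
]
toy_vocabulary = ["domain_A", "keyword_B", "pathway_C"]
toy_vector = fraction_feature_vector(toy_group, toy_vocabulary)
```

Here `domain_A` occurs in three of four proteins, `keyword_B` in two and `pathway_C` in none, so the vector is `[0.75, 0.5, 0.0]`. Vectors of this form, computed for the annotated COGs and KOGs, would serve as training input for a classifier.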
The second method for assigning FCs makes use of the hierarchical structure of eggNOG, namely that the same proteins can be assigned to OGs at several levels in the tree of life (e.g. a KOG and a meNOG). If an FC could not be assigned to a NOG by the SVM method, we check whether most of the proteins in the NOG belong to a common functionally annotated COG or KOG, in which case we transfer the FCs from the coarse-grained level (COGs or KOGs) to the more fine-grained one (e.g. arNOGs or meNOGs). The assignment of an FC to a single NOG is based on a coverage value, defined as the number of proteins in the NOG that carry that FC (via the proteins shared with the reference level) relative to the total number of proteins in that NOG.
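The coverage-based transfer can be sketched as below. The function name, the dictionaries used as input and in particular the `min_coverage` cutoff of 0.5 are assumptions for illustration; the text only states that most proteins should share an annotated COG or KOG, without giving an exact threshold.

```python
from collections import Counter

def transfer_categories(nog_proteins, protein_to_ref_og, ref_og_to_fcs,
                        min_coverage=0.5):
    """Transfer functional categories (FCs) from a reference level to a NOG.

    nog_proteins:      proteins in the fine-grained NOG.
    protein_to_ref_og: protein -> COG/KOG it belongs to (may be missing).
    ref_og_to_fcs:     COG/KOG -> iterable of its FC letters.
    min_coverage:      assumed cutoff; not specified in the source text.

    Returns {FC: coverage} for every FC whose coverage reaches the cutoff.
    """
    total = len(nog_proteins)
    if total == 0:
        return {}
    counts = Counter()
    for protein in nog_proteins:
        ref_og = protein_to_ref_og.get(protein)
        # set() avoids counting one protein twice for the same FC.
        for fc in set(ref_og_to_fcs.get(ref_og, ())):
            counts[fc] += 1
    return {fc: n / total for fc, n in counts.items() if n / total >= min_coverage}

# Hypothetical NOG with four proteins; three share one annotated KOG.
toy_nog = ["p1", "p2", "p3", "p4"]
toy_protein_to_kog = {"p1": "KOG_a", "p2": "KOG_a", "p3": "KOG_a", "p4": "KOG_b"}
toy_kog_to_fcs = {"KOG_a": ["K"], "KOG_b": ["T"]}
toy_transferred = transfer_categories(toy_nog, toy_protein_to_kog, toy_kog_to_fcs)
```

In this example category `K` reaches a coverage of 0.75 and is transferred, whereas `T` (coverage 0.25) is not.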
In addition to providing functional annotations via description lines for many NOGs (19), we are now able to predict functional categories as well. At the universal level, our functional annotation pipeline provides description lines for 14 956 (25%) and an FC for 6262 (11%) of the 59 497 coarse-grained NOGs. At the eukaryotic level, 7566 (42%) of the 17 845 euNOGs have a description line and 4120 (23%) have an FC. In addition, eggNOG contains 137 782 more fine-grained OGs of which 100 750 (73%) and 89 232 (65%) have been annotated with a description line and an FC, respectively (Table 1).
This enables us to assign 2 242 035 of the 2 590 259 genes (87% of the genes in the analyzed genomes) to an OG and to provide at least a broad functional description or FC for 1 966 709 of them (88% of the genes that could be assigned to an OG). The corresponding numbers for each set of OGs as well as for each individual genome are summarized in Figure 1.
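The coverage figures follow directly from the three protein counts reported for this release; a short check of the arithmetic (variable names are chosen for this illustration):

```python
# Counts as reported for eggNOG v2.
total_proteins = 2_590_259      # proteins in the 630 genomes
proteins_in_ogs = 2_242_035     # proteins assigned to at least one OG
annotated_proteins = 1_966_709  # proteins in OGs with a description or FC

# Percentage of all proteins that fall into an OG.
pct_in_ogs = round(100 * proteins_in_ogs / total_proteins)
# Percentage of OG members with at least a broad functional annotation.
pct_annotated = round(100 * annotated_proteins / proteins_in_ogs)

print(pct_in_ogs, pct_annotated)  # prints: 87 88
```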
To facilitate the in-depth analysis of the orthologous relationships within the groups of proteins, we now provide precomputed high-quality Multiple Sequence Alignments (MSAs) and maximum-likelihood trees via the web interface (Figure 2).
Numerous methods are available to build MSAs [e.g. ClustalW (33), Muscle (34), MAFFT (35) and PRANK (36)] but some programs appear to be more suitable for particular protein families than others (37). Thus, we applied a new approach, named Automated QUality improvement for multiple sequence Alignments (AQUA) (Muller et al., submitted for publication), which combines existing tools to deliver high-quality MSAs.
The construction of the different phylogenetic trees was carried out using the following steps. One hundred bootstrap replicates were created from the MSA using the SEQBOOT program from the Phylip package (38). Following this, PhyML (39) was used to find the maximum-likelihood tree for each of the 100 bootstrap replicates and for the original alignment using default parameters. Finally, a consensus tree was constructed, using the CONSENSE program from the Phylip package. We used ReadSeq (40) to convert between the different sequence file formats used by those programs.
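The bootstrap step resamples alignment columns with replacement, so that each replicate has the same dimensions as the original MSA. The sketch below illustrates that resampling principle only; it does not call or reproduce SEQBOOT, PhyML or CONSENSE, and the function names, the dictionary-based alignment representation and the fixed seed are assumptions for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence string;
               all sequences must have the same length.
    rng:       random.Random instance used to draw the columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # Draw column indices with replacement; the same columns are used
    # for every sequence so that each resampled column stays intact.
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of n_replicates resampled alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

# Toy alignment of three sequences and six columns.
toy_alignment = {"seq1": "MKT-LV", "seq2": "MRT-LI", "seq3": "MKSALV"}
toy_replicates = bootstrap_replicates(toy_alignment)
```

Each replicate would then be passed to a tree-building program, and the resulting trees summarized in a consensus tree whose branch support reflects how often a split recurs across replicates.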
The eggNOG resource can be queried via a web interface; data can be downloaded under the Creative Commons Attribution 3.0 License at: http://eggnog.embl.de or via FTP at: ftp://eggnog.embl.de/eggNOG/2.0/. Gene and protein names, database identifiers, amino acid sequences, or OG names can be used to query the database. As a default, the most fine-grained OGs available are displayed for maximal resolution. The user can navigate among the different levels of orthology using an available guide-tree of organisms to find the desired balance between phylogenetic coverage and functional specificity within our hierarchy of OGs. Through the new interface, users can access different information panels encompassing the detailed list of proteins belonging to a particular OG as well as the corresponding MSA and phylogenetic tree. The MSA can be interactively displayed using the Jalview applet (41) or downloaded in FASTA format. The phylogenetic trees are accessed through a dedicated iTOL (42) viewer together with mapped PFAM and SMART domains, via the ATV program applet (43), or can be downloaded in Newick format.
With 630 genomes covered, an extended OG hierarchy, and a high coverage of newly categorized functional annotation, the new version of eggNOG is one of the most comprehensive and complete resources for deciphering the orthologous relationships between proteins from various species. The changes and improvements in the interface and the availability of the OGs for download will facilitate not only the daily use of the database but also the integration of eggNOG in high-throughput comparative genomics studies. Our future plans include the addition of more complete genomes and development of a more scalable and flexible pipeline for generating the groups.
EMBL, the European Commission Programme, Eurasnet EU [Grant LSHG-CT-2005-518238 (FP6), IMPACT 213037 (FP7)]; the Novo Nordisk Foundation Center for Protein Research, the Swiss Institute of Bioinformatics; and the University of Zurich (partial, through its Research Priority Program ‘Systems Biology and Functional Genomics’). This work was supported in part by the Bundesministerium fuer Bildung und Forschung (Nationales Genomforschungsnetz Foerderkennzeichen 01GS08169). Funding for open access charge: European Molecular Biology Laboratory.
Conflict of interest statement. None declared.