We assemble proteins into orthologous groups using an automated procedure similar to the original COG/KOG approach (2,8). When constructing coarse-grained orthologous groups across all three domains of life or for all eukaryotes, we first assign the proteins encoded by the genomes in eggNOG to the respective COGs or KOGs based on best hits to the manually assigned sequences in the COG/KOG database. When multiple hits cover the same part of a sequence, only the best hit is considered. The many proteins that cannot be assigned to existing COGs or KOGs are subsequently assembled into non-supervised orthologous groups using the procedure described below. When constructing more fine-grained orthologous groups, this initial step is skipped.
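The best-hit assignment step can be illustrated with a minimal sketch. It assumes that hits are available as simple records and that "the same part of the sequence" means any overlap between two hit regions; the `Hit` record, the function names and the overlap rule are illustrative assumptions, not the actual eggNOG implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one hit of a protein against a COG/KOG member."""
    query: str    # protein being assigned
    start: int    # first aligned residue on the query (inclusive)
    end: int      # last aligned residue on the query (inclusive)
    cog: str      # COG/KOG of the manually assigned target sequence
    score: float  # similarity score of the hit


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits cover a common stretch of the query (assumed rule)."""
    return a.start <= b.end and b.start <= a.end


def assign_best_hits(hits):
    """Keep, for each region of each query, only the best-scoring hit."""
    accepted = defaultdict(list)
    # Visit hits from best to worst so that a region is claimed by its best hit.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if any(overlaps(hit, kept) for kept in accepted[hit.query]):
            continue  # a better hit already covers this part of the sequence
        accepted[hit.query].append(hit)
    # Report the surviving hits in sequence order for each query.
    return {q: sorted(hs, key=lambda h: h.start) for q, hs in accepted.items()}
```

In this sketch a protein can receive more than one assignment when its best hits fall on non-overlapping regions, which is one plausible reading of the rule stated above.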
Briefly, we first compute all-against-all Smith–Waterman similarities among all proteins in eggNOG. We then group recently duplicated sequences into in-paralogous groups, which are subsequently treated as single units to ensure that they will be assigned to the same orthologous groups. To form the in-paralogous groups, we first assemble highly related genomes into clades, usually encompassing all sequenced strains of a particular species in a single clade, but also other close pairs such as human and chimpanzee. In these clades, we join into in-paralogous groups all proteins that are more similar to each other (within the clade) than to any protein outside the clade. We do not apply a fixed similarity cutoff; instead, we start with a stringent cutoff and relax it in a stepwise fashion until all in-paralogous proteins are joined, requiring that all members of a group align to each other over at least 20 residues.
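The in-paralog grouping can be sketched as follows. This is a simplified illustration under stated assumptions: similarity scores and alignment lengths are supplied as dictionaries keyed by sorted protein pairs, the stepwise relaxation of the cutoff is approximated by visiting candidate pairs from the highest score downwards, and all function and parameter names are hypothetical.

```python
from itertools import product


def pair_key(a, b):
    """Canonical (sorted) key for an unordered protein pair."""
    return (a, b) if a <= b else (b, a)


def inparalog_groups(clade, scores, aln_len, min_aln=20):
    """Group proteins of one clade into in-paralogous groups.

    clade   : set of protein ids belonging to the clade
    scores  : dict pair_key(a, b) -> similarity score
    aln_len : dict pair_key(a, b) -> number of aligned residues
    """
    # Best score of every clade protein to any protein outside the clade
    # (0.0 if the protein has no outside hit at all).
    best_out = {p: 0.0 for p in clade}
    for (a, b), s in scores.items():
        if (a in clade) != (b in clade):
            inside = a if a in clade else b
            best_out[inside] = max(best_out[inside], s)

    # Candidate pairs: both inside the clade and more similar to each other
    # than either is to any protein outside the clade.
    candidates = [
        (s, a, b)
        for (a, b), s in scores.items()
        if a in clade and b in clade and a != b
        and s > max(best_out[a], best_out[b])
    ]
    # Most similar pairs first: stands in for relaxing the cutoff stepwise.
    candidates.sort(key=lambda t: t[0], reverse=True)

    group_of = {p: {p} for p in clade}
    for _, a, b in candidates:
        ga, gb = group_of[a], group_of[b]
        if ga is gb:
            continue
        # Merge only if every member of one group aligns to every member
        # of the other over at least min_aln residues.
        if all(aln_len.get(pair_key(x, y), 0) >= min_aln
               for x, y in product(ga, gb)):
            merged = ga | gb
            for p in merged:
                group_of[p] = merged
    # Singletons are not in-paralogous groups.
    return {frozenset(g) for g in group_of.values() if len(g) > 1}
```

Checking the alignment-length requirement at merge time, rather than afterwards, keeps each intermediate group valid; whether the original procedure enforces the requirement in this order is not stated in the text.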
After grouping in-paralogous proteins, we assign orthology between proteins by joining triangles of reciprocal best hits involving three different species (here, in-paralogous groups are represented by their best-matching member). Again, we start with a stringent similarity cutoff and relax it to identify groups of proteins that all align to each other over at least 20 residues. This procedure occasionally causes an orthologous group to be split in two; such cases are identified by an abundance of reciprocal best hits between groups, which are then joined. Next, we relax the triangle criterion and allow remaining unassigned proteins to join a group by simple bidirectional best hits. Finally, we automatically identify gene fusion events by searching for proteins that bridge otherwise unrelated orthologous groups. In these cases, the different parts of the fusion protein are assigned to their respective orthologous groups. This step is a distinguishing feature of our approach and is crucial for the analysis of eukaryotic multi-domain proteins, as these would otherwise cause unrelated orthologous groups to be fused.
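The core triangle step can be sketched as below. The sketch covers only the detection of reciprocal-best-hit triangles and their merging; it omits the stepwise cutoff, the 20-residue requirement, the rejoining of split groups, the bidirectional-best-hit extension and the fusion analysis. Merging triangles that share a side is an assumption borrowed from the original COG procedure, and all function names and the input layout (directed scores, one species label per protein) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def best_hits(scores, species):
    """Best target of each protein in each other species.

    scores  : dict (query, target) -> similarity score (directed)
    species : dict protein -> species label
    """
    best = {}
    for (q, t), s in scores.items():
        if species[q] == species[t]:
            continue  # only between-species hits count here
        key = (q, species[t])
        if key not in best or s > best[key][1]:
            best[key] = (t, s)
    return {key: target for key, (target, _) in best.items()}


def rbh_triangles(scores, species):
    """All triangles of reciprocal best hits, as sorted protein triples."""
    best = best_hits(scores, species)
    neighbors = defaultdict(set)
    for (q, _), t in best.items():
        if best.get((t, species[q])) == q:  # the hit is reciprocal
            neighbors[q].add(t)
            neighbors[t].add(q)
    triangles = []
    for a in sorted(neighbors):
        higher = sorted(n for n in neighbors[a] if n > a)
        for b, c in combinations(higher, 2):
            # Reciprocal best hits only link different species, so a closed
            # triangle necessarily involves three different species.
            if c in neighbors[b]:
                triangles.append((a, b, c))
    return triangles


def merge_triangles(triangles):
    """Merge triangles sharing a side (assumed rule) into groups."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edge_owner = {}
    for i, tri in enumerate(triangles):
        for edge in combinations(tri, 2):
            if edge in edge_owner:
                parent[find(i)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = i
    groups = defaultdict(set)
    for i, tri in enumerate(triangles):
        groups[find(i)].update(tri)
    return [frozenset(g) for g in groups.values()]


def orthologous_groups(scores, species):
    """Triangle-based groups for the given scores and species labels."""
    return merge_triangles(rbh_triangles(scores, species))
```

In-paralogous groups would enter this sketch as single entries, each represented by its best-matching member, so that recently duplicated genes cannot break a triangle.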
To construct a hierarchy of orthologous groups, the procedure described above was applied to several subsets of organisms. To make a set of coarse-grained orthologous groups across all three domains of life, we constructed non-supervised orthologous groups (NOGs) from the genes that could not be mapped to a COG or KOG. Focusing on eukaryotic genes, we constructed more fine-grained eukaryotic NOGs (euNOGs) from the genes that could not be mapped to a KOG. Finally, we built sets of NOGs of increasing resolution for five eukaryotic clades, namely fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs).