Gene function prediction relies heavily on proper orthology prediction [1]. High-quality orthology is essential not only for reliable annotation transfer, but also for predicting protein function from the co-occurrence of genes [2], predicting the effect of mutations [3], and detecting subtle functional signals in the DNA [4]. Broadly speaking, there are two approaches to orthology prediction: best hit-based clustering methods and tree-based methods. Best hit-based methods cluster the most similar genes into orthologous groups and are generally fast. They differ in their specific clustering rules, but may allow genomes to be added after the orthologous groups have been established, without completely reprocessing the sequences. Examples of group orthology are COG [5], KOG [6], and Markov Chain Clustering [7]. These methods tend to produce rather inclusive groups that may hold many paralogous genes within the same cluster. One specific cause of overly inclusive orthologous groups in best-hit methods is the loss of outparalogs in two species, which causes the remaining outparalogs to become best bidirectional hits. Dessimoz et al. [8] have introduced a method to address this issue; it uses the relative levels of sequence identity to so-called witness genes from a third species to detect cases of wrongly assigned orthology. Another best-hit based method, InParanoid [8], is much less inclusive, as it is only defined for pair-wise comparison of genomes. In general, gene duplication followed by differential gene loss and/or varying rates of evolution can easily lead to wrong or overly inclusive orthologous groups in best-hit based methods.
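The gene-loss pitfall described above can be illustrated with a minimal sketch of a bidirectional best hit search. The gene names, the similarity scores, and the function names `best_hits` and `bidirectional_best_hits` are hypothetical and chosen only for illustration; none of the cited methods is claimed to work exactly this way.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    `scores` maps (query, target) pairs to a similarity score
    (hypothetical values; any alignment score would do).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) in which each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scenario: an ancient duplication produced copies x and y.
# Species A lost its y copy and species B lost its x copy, so A_x and B_y
# are outparalogs, yet they end up as each other's best hit.
a_vs_b = {("A_x", "B_y"): 120.0, ("A_x", "B_z"): 40.0,
          ("A_z", "B_z"): 300.0, ("A_z", "B_y"): 35.0}
b_vs_a = {("B_y", "A_x"): 118.0, ("B_y", "A_z"): 35.0,
          ("B_z", "A_z"): 300.0, ("B_z", "A_x"): 40.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input the pair (A_x, B_y) is reported alongside the genuine pair (A_z, B_z), because nothing in the pairwise scores reveals that the true orthologs of A_x and B_y were lost.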
Tree-based methods [10] suffer less from differential gene loss and varying rates of evolution than best-hit methods and offer, in principle, the highest resolution of orthology. In tree-based methods, one first has to establish the root of the tree. This is preferably done by using a known outgroup. Yet outgroups must be selected carefully [14], which makes this criterion less useful in automated large-scale analysis. For example, the outgroup species may not be present in some of the gene families, or, when several outgroup species are used, their genes may not always cluster together. In addition, when analyzing species that cover all kingdoms, an outgroup species does not exist [16]. In those cases, one can, for example, use the longest branch as the root [17], midpoint rooting, gene tree parsimony [19], or a combination of methods. After the root has been chosen, it must be established for each node whether it represents a speciation event or a duplication event. To discriminate speciation from duplication events, species phylogenies can be mapped onto phylogenetic gene trees. Several automatic tree analysis methods have been described [13]. Mismatches between the trusted species tree and sections of the gene tree are interpreted as duplication events followed by gene losses. Optionally, one can require that mismatches are supported by bootstrapping techniques [13].
Instead of performing trusted species tree reconciliation, one can also use a simple "species-overlap" rule to decide whether nodes represent gene duplication or speciation events: a node is considered to represent a speciation event if its branches have mutually exclusive sets of species. Using an orthology benchmark, we will show that this species overlap rule performs remarkably well, especially considering its simplicity.
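The species-overlap rule can be sketched in a few lines. The sketch below assumes a rooted, strictly binary gene tree written as nested tuples, with leaves named "Species|gene"; this representation, the leaf names, and the function names `species_of` and `classify` are assumptions made for illustration and do not reflect the LOFT implementation.

```python
def species_of(leaf):
    """Extract the species from a leaf label of the assumed form 'Species|gene'."""
    return leaf.split("|", 1)[0]


def classify(node, events):
    """Return (species set, leaf list) below `node`.

    For every internal node, append (event, leaves) to `events` in
    post-order, where event is 'speciation' or 'duplication'.
    """
    if isinstance(node, str):  # a leaf
        return {species_of(node)}, [node]
    left_species, left_leaves = classify(node[0], events)
    right_species, right_leaves = classify(node[1], events)
    # Species-overlap rule: mutually exclusive species sets mean speciation,
    # any shared species means duplication.
    if left_species.isdisjoint(right_species):
        event = "speciation"
    else:
        event = "duplication"
    leaves = left_leaves + right_leaves
    events.append((event, leaves))
    return left_species | right_species, leaves


# Toy tree: two paralogous clades (g1 and g2), each present in two species,
# joined to a single gene from a third species.
tree = ((("Ecoli|g1", "Vchol|g1"), ("Ecoli|g2", "Vchol|g2")), "Bsub|g1")
events = []
classify(tree, events)
for event, leaves in events:
    print(event, leaves)
```

Only the node that joins the g1 and g2 clades has the same species on both sides, so it is the only node labelled as a duplication; no species tree is needed for this decision.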
Irrespective of how it is decided whether nodes in a tree represent speciation or gene duplication events, the phylogenetic relations between genes can be quite complicated. The terms ortholog, paralog, and even inparalog, outparalog, and co-ortholog [1], defined to describe gene relations in pair-wise genome comparisons, are hardly sufficient to describe them adequately in comparisons of multiple species. An example is shown in Figure 1, a section of the tree for COG4565 that contains the genes from orthologous group 3 of this COG. Genes in orthologous group 3.1 are paralogous to genes in group 3.2, and genes in paralogous groups 3.1.1 and 3.1.2 are both orthologous to genes in group 3.1. Genes in groups 3.1.1 and 3.1.2 are outparalogous to each other because the duplication that separates these groups precedes the speciation events. Not only genes from group 3.1.2 are outparalogs of the genes in 3.1.1, but so are genes from group 3.2. It is hard to specify in words that 3.1.2 is more closely related to 3.1.1 than 3.2 is. Deeper nesting makes these relations even more difficult to describe, as paralogous genes may split off at different levels and one ends up with different degrees of in- and outparalogy. This discussion demonstrates that an accurate verbal description of gene relations can be quite difficult and confusing. This has been recognized by others [13], but a solution has not yet been provided.
Figure 1. Levels of orthology from trees. Genes in a subsection of the tree for COG4565 (transcription regulatory protein Dpia) have been numbered according to their levels of relatedness. The tree has been analyzed for gene duplication (red squares) and speciation ...
To describe and understand complicated phylogenetic situations such as the one above, one generally has to resort to drawing the phylogenetic tree. However, for describing phylogenetic relations and for automatic, large-scale analysis, the tree may not be an appropriate format. We therefore introduce the levels of orthology concept: a numbering scheme for describing relations between genes that can, for example, be used for automated phylogenomics. These LOFT numbers also capture the non-transitive nature of orthology (Figure 1): although genes from groups 3.1.1 and 3.1.2 are both orthologous to extant genes in group 3.1, they are paralogous to each other.
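How such hierarchical numbers could be compared automatically is sketched below. The rule used here is an assumption inferred only from the example in the text: two numbers are treated as orthologous when one is a prefix of the other (3.1 and 3.1.1), and as paralogous otherwise, with the number of shared leading levels indicating how closely the groups are related. The function names are hypothetical. Because the text does not say how numbers from different top-level groups relate, the sketch rejects such pairs.

```python
def parse_levels(number):
    """Turn a dotted group number such as '3.1.2' into a tuple (3, 1, 2)."""
    return tuple(int(part) for part in number.split("."))


def relation(a, b):
    """Return (kind, shared) for two hierarchical group numbers.

    kind   -- 'orthologous' if one number is a prefix of the other
              (or both are equal), otherwise 'paralogous'
    shared -- number of leading levels the two numbers have in common
    """
    levels_a, levels_b = parse_levels(a), parse_levels(b)
    if levels_a[0] != levels_b[0]:
        raise ValueError("numbers belong to different top-level groups")
    shared = 0
    for x, y in zip(levels_a, levels_b):
        if x != y:
            break
        shared += 1
    if shared == min(len(levels_a), len(levels_b)):
        kind = "orthologous"
    else:
        kind = "paralogous"
    return kind, shared


for pair in [("3.1.1", "3.1"), ("3.1.2", "3.1"),
             ("3.1.1", "3.1.2"), ("3.1.1", "3.2")]:
    print(pair, relation(*pair))
```

Under these assumptions the non-transitivity is visible directly in the numbers: 3.1.1 and 3.1.2 are each orthologous to 3.1 but paralogous to each other, and they share two levels, whereas 3.1.1 and 3.2 share only one.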
Note that in Figure 1 the genes Photob.prof._CAG20687 and Vibrio chol._Q9KTU7 in cluster 3.1.1 could easily have been misplaced. Given that the 3.1.1/3.1.2 split relies on a very short branch (the 3.1.1 root), they might well belong to cluster 3.1.2 instead. Such errors in tree topology may result in erroneous orthology assignments, underscoring the sensitivity of tree-based orthology to errors in the tree. However, because the LOFT numbers maintain the relative relations between orthologous groups, in this case indicating a close relation between orthologous groups 3.1.1 and 3.1.2, the situation is less troublesome than if these relations had not been maintained at all and 3.1.1 had been considered as different from 3.1.2 as from 3.2.
The "levels of orthology" concept, in combination with the simple species overlap rule, is implemented in a software tool, LOFT (Levels of Orthology From Trees). LOFT colors the various orthologous groups in a phylogenetic tree, which greatly facilitates their recognition, especially in large trees. Some additional features improve the practicality of the tool, for example the option to highlight a certain gene or group of genes, which helps to locate them rapidly in large trees. To assess the value of these high-resolution multi-level orthology assignments, we develop a benchmark for orthology prediction based on gene-order conservation. This benchmark is also sensitive to errors in reconstructed tree topologies that result in erroneous placement of duplication events. The results show a high correlation between phylogeny-based orthology as implemented in LOFT and gene-order conservation.