The algorithm of OMA takes as input a set of complete genomes and outputs pairs of orthologous genes that are optionally clustered into orthologous groups. The algorithm follows four steps (see Figure ):
Step 1: To find homology, we compute pairwise alignments between all pairs of sequences for all genes in all genomes. Pairs with significant alignment scores are retained as candidate pairs.
Step 2: Orthologs are usually the closest genes in two genomes, because they started diverging at speciation, whereas paralogs started diverging at a duplication prior to speciation. Genes across genomes that are mutually the most closely related sequences, taking into account inference uncertainty, are upgraded to stable pairs.
Step 3: In cases where an ortholog is missing, we seek to avoid erroneous classification of paralogs as orthologs (pseudo-orthologs) by verifying stable pairs with sequences in a third genome that can act as witness of evolution. Pairs that pass the verification step are upgraded to verified pairs, and pairs that do not pass are paralogs and referred to as broken pairs.
Step 4: For some applications, such as species tree reconstruction, it is advantageous to cluster orthologs into orthologous groups. Pairs of sequences in such groups are termed group pairs.
In the following, we describe each of the four steps, and motivate all parameter choices.
All-against-all alignments
The goal of the first step of the process is homology detection. All pairs of protein sequences from complete genomes are aligned using full dynamic programming. There are several advantages of using protein sequences rather than using DNA sequences. Very distant homologies are difficult to find at the DNA level, and protein sequences suffer less from convergence due to mutational biases. Also, the length of a protein is one third of that of the corresponding DNA sequence, a considerable advantage given that the time complexity of aligning sequences is quadratic with respect to length. The disadvantage to using protein sequences instead of DNA is that complications arising from multiple gene products must be handled explicitly by selecting the longest splice variant as well as isoforms with at least 10% non-redundant positions. The sequences used by OMA are from public databases (mainly Genbank [
23] for Prokaryotes and Ensembl [
8] for Eukaryotes) and all data are checked for consistency and quality.
Homology is established in two sub-steps. First, alignments between all sequences are performed using full, local dynamic programming with a fixed PAM matrix to find all homologous sequences [
20,
24]. Second, significant alignments (score > 85) are refined by searching among all PAM distances the scoring matrix that maximizes the alignment score. Since scores are the log of odd ratios, the PAM number of this matrix corresponds to the maximum likelihood of evolutionary distance. Empirically, we observed that with a mixture of homologous and non-homologous pairs of sequences as input, the PAM-224 matrix yields alignment scores that are on average closest to the ones obtained in the refinement part. Thus, this is the fixed matrix that we use in the first part. Refined alignments with scores above 181 (which roughly corresponds to an E-value of 10
-14) are considered significant. With scores below this value, the proportion of candidate pairs that end up being predicted as orthologs decreases rapidly (data not shown).
The all-against-all step is computationally expensive, and the run time increases quadratically with the total number of amino acids in a protein sequence. The use of a heuristic-based algorithm such as BLAST could potentially increase the speed of the homology search, but modern implementations of Smith-Waterman using SIMD instructions are almost as fast as BLAST [
25]. Moreover, most of the time is consumed by estimating evolutionary distances.
Since we consider entire proteins as the basic evolutionary unit, why then not use global alignments? Protein ends are often variable, and thus, it is reasonable to ignore them by using local alignments. To guarantee that a significant fraction of a sequence is aligned, we use a
length tolerance criterion. The length of the shorter aligned sequence must be at least the fraction
![[ell]](/corehtml/pmc/pmcents/x2113.gif)
of the longest sequence. That is
where a1 and a2 are the lengths of the aligned subsequences of s1 and s2. Alignments that pass both the length and score criteria are upgraded to candidate pairs (CP).
Parameter selection and validation
The parameter
![[ell]](/corehtml/pmc/pmcents/x2113.gif)
is determined by two tests. The first test, the
triangle inequality test, is performed over all candidate pairs. Under a time-reversible Markovian model, the evolutionary distances between homologous sequences should obey the triangle inequality condition which requires that in a triplet of sequences, any distance between two sequences be less than or equal to the sum of the other two distances. Because these distances are estimates, this property is expected only to hold within a confidence interval.
For example, this condition may not be met if the sequences x and y share one domain and sequences y and z share another domain, but sequences x and z are not related. The triangle inequality test detects such violations, which are likely to arise when inconsistent sequences parts are matched. With increasing (i.e. stricter) length criteria, a larger fraction of candidate pairs pass the triangle inequality test (Figure ).
Many proteins consists of several domains originating from gene fusions, deletions, and internal repetitions. The majority of multi-domain proteins have evolved by the stepwise insertions of single domains [
26]. In the second test, candidate pairs are verified by testing the assumption that the number of domains for orthologous sequences are in agreement, including identical domains (e.g. repetitions). Domain information is obtained from the Pfam database and consists of conserved protein regions and domains [
27]. The amount of proteins with the same number of domains increase with stricter length tolerance, but a "plateau" is observed for 0.6 <
![[ell]](/corehtml/pmc/pmcents/x2113.gif)
< 0.9 (Figure ).
Figure shows the results of the two validation tests and also that the number of orthologous relations (i.e. VP) decreases with increasing length criteria. A trade-off exists between sensitivity and selectivity. A value of
![[ell]](/corehtml/pmc/pmcents/x2113.gif)
= 0.61 is a good compromise between minimizing triangle inequality violations and numbers of different domains while still including enough ambiguous alignments.
Formation of stable pairs
In the second step of the algorithm, potential orthologs are detected by the identification of sequences in two genomes that are more closely related to each other than to any other sequence in the other sequence. We term these sequences stable pairs (SP). This name was chosen due to its close association with the stable marriage problem in computer science.
To measure the relatedness of sequences, either similarity scores or evolutionary distances can be used. Most methods employ the similarity score ("best hit"), because it is directly obtained by the alignment process and the highest scoring sequence is usually the most closely related sequence. However, scores do not constitute a direct measure of relatedness. In particular, they depend on protein lengths. Evolutionary distances such as PAM units, though more expensive to compute, constitute a sounder measure of relatedness, because distances are additive in their expected value (i.e. they are expected to equal the sums of branch lengths between the species) and have well characterized statistical properties.
A tolerance interval is used to allow the inclusion of more than one potential ortholog, as this becomes necessary when a gene duplication event occurred after speciation. The tolerance threshold can be defined by including similarity scores in an interval below the top score, or by using variance of distance estimates to compute a confidence interval.
Consequently, orthology assignment methods based on pairwise sequence comparison can be classified in four categories (see Figure ). Bidirectional best hits (BBH) is the most common approach and uses scores with no tolerance (e.g. [
12]). Reciprocal-best-BLAST-hits (RBH) is based on BLAST scores and uses a tolerance by including all hits within a p-value [
28]. The reciprocal smallest distance algorithm (RSD) use evolutionary distances without a tolerance [
14,
16]. The stable pair method of OMA use distance to measure relatedness between genes and their variances as a tolerance criterion.
As mentioned above, the use of confidence intervals is necessary to account for many-to-many orthologous relations, which arise when duplications occurred after speciation. Additionally, distance estimation is subject to inference uncertainty and, thus, true orthologs may not have the shortest estimated distance.
Formally, a pair of sequences (x, y) from genomes X and Y is considered a stable pair if and only if, for all x
i ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
X, x
i ≠ x, and for all y
j ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
Y, y
j ≠ y:
and
where
d is a pairwise maximum likelihood distance estimate and
k, the tolerance parameter of the standard deviation between the two distances, where

. An estimate of the variance is obtained by the distance estimation, while efficient estimation of covariance for this case was previously shown [
21].
Parameter selection and validation
The tolerance parameter k controls the balance between sensitivity (more true orthologs as stable pairs) and selectivity (few out-paralogs as stable pairs). The optimal value of k for our purpose is determined using the out-paralog test.
The out-paralog test is designed to discriminate cases of one-to-many orthology from cases of out-paralogy. More precisely, it determines whether the divergence of sequences x, y1 and y2, illustrated in Figure , is due to a speciation or a duplication event. This is evaluated by finding on which branch to place the root. If the root is located on the branches leading to y1 or y2, this suggests that the divergence is a speciation and that the sequence y2 is an out-paralog (Figure ). In contrast, if the root is on the branch leading to x, the divergence is a duplication and both sequences in Y are orthologous to x (Figure ). To find a suitable out-group z to place the root, the information of a trusted phylogenetic topology is used (i.e. a representative phylogeny that is indisputable). The sequence z is selected to be the gene closest to x in the out-group genome Z that is closest to the divergence of X and Y. Figure shows the quartet that is the result of y1 and y2 being in paralogs. If the length of the internal branch d for the given topology (i.e. the least square fit) is greater than zero, the sequences are accepted.
To evaluate the parameter k, the fraction of SP that passes the test is measured. Figure depicts the decreasing fraction of passing stable pairs with increasing stable pair tolerance at different length criteria. Again, the problem is to reduce the amount of conflicting out-paralogs while not discarding interesting many-to-many relationships. In this implementation, the required distance for a more distant stable pair must be within k = 1.81 of the closest stable pair.
Verification of stable pairs
Although the construction of stable pairs is likely to identify the corresponding ortholog of each sequence, at least one special case exists in which systematic failure will occur: differential gene loss. This problem affects all pairwise approaches, and is shown in Figure . An ancient duplication event is followed by two speciation events resulting in three species X, Y, and Z. In two of these species, each of the duplicates is lost (e.g. x2 and y1), and as a result, when comparing species X and Y, x1 and y2 are the highest scoring match. In such a case, (x1, y2), although paralogs, form a stable pair.
The purpose of the third step is to detect such stable pairs corresponding to non-orthology. The presence of a third genome Z, which has retained both copies z
1 and z
2 of the duplication event, acts as a
witness of non-orthology. We previously described the details of this procedure [
29], and the idea is illustrated in Figure . If

is significantly shorter than

and

is significantly shorter than

, there is evidence that x
1 and y
2 may not be orthologs. Figure depicts the most likely quartet predicted from the data provided in Figure . This approach can also be viewed as a tree-reconciliation that is based on quartets without assuming any species tree topology.
Each stable pair is verified by comparison to all other genomes. Stable pairs for which no witness of non-orthology could be found are termed verified pairs (VP) and are likely to be orthologs. Furthermore, stable pairs that are not verified were defined as broken pairs (BP) and are likely to correspond to paralogs.
Such cases of differential gene losses are not uncommon in nature. Among yeasts for instance, approximately 5% of stable pairs are detected as non-orthologous using the procedure described above.
Parameter selection and validation
The procedure again uses a tolerance parameter to tune the width of the confidence intervals required to detect non-orthologs. To optimize the parameter, we use the out-paralog test (described in the previous section). We select the VP tolerance value such that the fraction of verified pairs that pass the test is maximized. In figure , the fraction of passing VPs as a function of the VP tolerance is charted. In terms of the trade-off between VP- and the SP-tolerance, increasing the VP-tolerance has little effect when the SP-tolerance is low (i.e. only the closest stable pairs were chosen). The decrease in the amount of VPs with stricter VP-tolerance is much less than with stricter SP-tolerance. Hence, it is reasonable to have a stricter VP-tolerance than SP-tolerance to maximize coverage. From the plot, a VP-tolerance of k = 1.53 visibly constitutes a reasonable trade-off.
Note that although both the verification of the SP step and the out-paralogy test detect non-orthologs, the test requires knowledge of the species tree. To keep the orthology prediction independent from such (often uncertain) knowledge, we only used the out-paralogy test for parameter fitting, and only in cases where the species topology is undisputed.
Ortholog clustering
The final step of the algorithm creates groups of orthologs. Such grouping is non-trivial, because orthology is defined over pairs of sequences and is not necessarily a transitive relation. For instance, a sequence in one genome may form several verified pairs with sequences in another genomes, corresponding to several orthologous relations (co-orthologs). These in turn cannot be orthologous to each other. In OMA, we address this problem by making available both pairwise orthologous relations (the verified pairs) and groups of genes in which all pairs are orthologs. Though the OMA groups leave out orthologous relations, they are useful for some applications, such as species tree inference.
We use a clique algorithm to search for maximal, completely connected subgraphs in a graph, where the vertices are genes and the edges are verified pairs. To compute cliques, algorithms exist to maximize either the size of the clique (number of vertices) or the total weight of cliques (sum of edge weights). Figure shows a graph with edges between all vertices except (z
1, z
2) and (z
1, y
2), which are paralogous relations. The highest scoring partition is {w
1, x
1, z
1}, {y
2, z
2}, with the total sum of edge weights of 700 + 800 + 900 + 1000 = 3600. The score is higher than the highest scoring maximum size clique {w
1, x
1, y
2, z
2}, {z
1}, where the sum of the scores is 200+300+400+500+800+1000 = 3200. Hence, a smaller clique is chosen due to higher edge weights, which correctly assigns orthologs according to the hypothetical evolutionary scenario in Figure , where the duplication give subscripts that correspond to functionality. Finding cliques is a NP-complete problem and the implementations used here are based on an approximation of the vertex cover problem [
30]. Each clique constitutes an
orthologous group, where the sequence pairs in an orthologous group are denoted
group pairs (GP), corresponding to close orthologs.
Parameter selection and validation
To validate our methods and to compare different algorithms for clique construction, a species tree is built from the orthologous groups produced by each algorithm, and the fit of the data to the tree is measured using the dimensionless index [
31]. This technique assumes that if the groups inferred by the clique algorithms correctly predict orthology, the data will have a better fit to the species tree.
For verification, 100 trials using various genomes and different taxa are computed using four different clique versions. Maximum size clique chose the largest clique in the graph starting with the highest scoring edge (but does not use any other edge information). Maximum size score clique is an extension that uses the sum of the edge weights and selects the higher scoring clique from several maximum cliques of same size. The above described algorithm, maximum edge weight clique, is used twice, first with the scores and then with the distances complement as edge weights.
In general, the maximum edge weight clique algorithms perform better than the maximum size cliques (in 82 % of the trials), as shown in Table . This result supports the argument of the hypothetical example in Figure . To build maximum edge weight clique we chose scores as edge weights, over distances, because scores provides better fits of the data to the constructed species trees twice as often.