2.1 Theory

A graph whose vertices are genes and whose edges denote a SymBet relationship between a pair of genes (we call these adjacent if such an edge exists) is called a

*SymBets graph*,

*G*. A subgraph of

*G* is a

*triangle* if it is a set of three vertices such that each pair of them is adjacent to one another. As described elsewhere (Koonin,

2005; Tatusov

*et al.*,

1997,

2000), each COG is a subgraph of

*G* that is constructed by using a triangle as a seed and iteratively adding triangles that share a common side, until no new members can be added. This description holds both for the earlier described approach [as implemented, for instance, in the NCBI programs YOG and COGtriangles used for the construction of COGs and publicly available since the year 2007—hereinafter COGtriangles method (

ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/)] and the new approach proposed here, despite their radically different ways of following these guidelines.

Suppose *G* has *n* vertices and *m* edges. Because *G* is built of only SymBets between genes in different genomes, it is a *g-partite graph*, i.e. it has *g* groups of vertices, each corresponding to genes from the same genome, and edges are only allowed between different groups but not between the vertices in the same group (note that the collection of in-paralogs due to lineage-specific expansions is done separately prior to the main algorithm, and in the graph these are represented by a single vertex). Also, let *p* be the maximum number of genes in any given genome, which is a constant for a given set of genomes and does not depend on the number of genomes, even though it is possible that the value of *p* increases as larger genomes are sequenced and added to *G*.

The number of vertices

*n* grows linearly with the number of genomes

*g*, with the addition of each genome adding at most

*p* new genes to

*n*, so

*n* ≤

*gp* and the upper bound of the number of vertices

*O*(

*n*) is simply

*O*(

*g*). Since

*G* is a

*g*-partite graph with no edges between genes from the same organism, the number of edges

*m* is:

Thus, the upper bound of

*m* grows with the quadratic term of

*g* that dominates the behavior of this equation, and

*O*(

*m*) =

*O*(

*g*^{2}). Similarly, the number of triangles

*t* grows with the cube of the number of genomes as:

Thus, the upper bound of the number of triangles

*O*(

*t*) =

*O*(

*g*^{3}).

COGs are constructed as subgraphs of

*G*, starting as triangles and growing by iteratively merging triangles if they share a common side, until no more triangles can be added. For instance, the COGtriangles algorithm used to build the most recent available release of NCBI COGs (

ftp.ncbi.nih.gov/pub/COG/COG/) proceeds in two stages, by finding all possible triangles in

*G* and then iterating through each pair of them to merge those that share a common side, as follows:

*COGtriangles algorithm*:

- For each triangle
*T*_{a} taken from the list of unprocessed triangles
- (2) Initialize a ‘seed’ COG
*C* = *T*_{a} - (3) For each triangle
*T*_{b} not already part of an existing COG
- (4) If
*T*_{b} shares a side with a triangle in *C*, merge *T*_{b} into *C*

- (5) Print
*C*

Since there are at most

*O*(

*g*^{3}) triangles and thus

*O*(

*g*^{3}) initial COGs before the merging step, and this algorithm iterates over pairs of triangles (or pairs of a COG and a triangle), it scales with the high polynomial

*O*(

*g*^{3})*

*O*(

*g*^{3}) =

*O*(

*g*^{6}). Though the space of unused triangles is iteratively reduced as the algorithm progresses, this does not affect the algorithm complexity: if the number of COGs is

*c*, the algorithm complexity is

*O*(

*c*)*

*O*(

*g*^{3}), assuming a

*O*(1) lookup for a common edge between a triangle and a COG, and since

*O*(

*c*) =

*O*(

*g*^{3}) in the worst case, the overall complexity is

*O*(

*g*^{6}).

In practical terms, the COGs were first implemented on seven fully sequenced genomes treated as five lineages (

*g* = 5; see Tatusov

*et al.*,

1997), so at most

*C*_{5,3}*

*p* = 10

*p* triangles could exist, and iterating through each pair only cost

*C*_{10,2}*

*p*^{2} = 45

*p*^{2} computations. With 10 genomes, these numbers become 120

*p* and 7140

*p*^{2}, respectively, at 20 genomes they rise to 1140

*p* and 6*10

^{5} *p*^{2}, at 50 genomes they are 19 600

*p* and 2*10

^{8} *p*^{2} and at 100 they became 162 000

*p* and 10

^{10} *p*^{2}. Building a set of COGs with only bacteria (

*g* ≈ 10

^{3} in 2010) produces 10

^{8} *p* triangles, and iterating over each pair would generate 10

^{16} *p*^{2} computations.

In this work, we present an algorithm that builds triangles and COGs simultaneously, with the worst-case complexity of only

*O*(

*g*^{3} log

*g*). The main idea in our approach is to find a specific class of subgraphs, recently called

*triangularly connected subgraphs* (Fan

*et al.*,

2008), by iterating over edges instead of triangles, as explained in more detail below.

A *triangle-path* in *G* is a sequence of triangles *T*_{1}, *T*_{2},…, *T*_{k} in *G* such that for 1 ≤ *i* ≤ *k* − 1, the triangles *T*_{i} and *T*_{i+1} share a single edge and for *j* > *i* + 1, the triangles *T*_{i} and *T*_{j} do not share any edges. A connected subgraph *C* in G is *triangularly connected* if for any distinct edges *e* and *e*′ in *C*, there is a triangle path *T*_{1}, *T*_{2},…, *T*_{k} in *C* such that *e* is an edge in *T*_{1} and *e*′ is an edge in *T*_{k}. Under these definitions, a subgraph of *G* is a COG if it is a triangularly connected graph and is not a single edge. Note that such subgraphs of *G* are edge-disjoint with one another, i.e. no two subgraphs share an edge.

The EdgeSearch algorithm developed in this work is illustrated in . Like the COGtriangles algorithm, EdgeSearch starts by initializing a ‘seed’ COG

*C*, but instead of searching the space of all triangles to merge those with common sides, it searches for the pairs of edges that satisfy the following condition:

- If the vertices
*u* and *v* and edge (*u*, *v*) are in *C*, and a third vertex *w* not in *C*, and (*u*, *v*), (*v*, *w*) and (*w*, *u*) are edges in *G* (i.e. vertices *u*, *v* and *w* form a triangle in *G*), then add the vertex *w* and edges (*v*, *w*) and (*w*, *u*) to C.

The above process stops when no more vertices and edges can be added to

*C*. At this point, the algorithm proceeds with another seed edge not contained in any of the previously identified COGs, and this sequence of steps is repeated until all triangularly connected subgraphs (i.e. COGs) with the maximal number of vertices are found. Each such subgraph that is not a single edge is then declared an individual COG and output. In more detail, the EdgeSearch algorithm is:

*EdgeSearch algorithm*:

- For each edge taken from the list of unprocessed edges
- (2) Initialize a ‘seed’ COG
*C* with this edge and its vertices - (3) For each unprocessed edge
*e*(*u*, *v*) in *C*
- (4) For each vertex
*w* such that (*u*, *w*) and (*v*, *w*) are unprocessed edges in *G* (at least one of which is not already in *C*), add this vertex *w* and edges (*u*, *w*) and (*w*, *v*) to *C* (if they are not already part of *C*) - (5) Mark
*e* as processed

- (6) Print
*C* if it contains three or more vertices

T

heorem:

EdgeSearch finds all maximal triangularly connected subgraphs in *G*.

P

roof.

This result derives from the following:

*The subgraphs output by the algorithm are edge-disjoint*—i.e. no two subgraphs share an edge. This is because the iteration through all unprocessed edges in the subgraph *C* (Step 3) ensures that when triangles that share a common side are merged into *C* (Step 4), the result is that all triangles that contain a single edge *e* will become part of that same subgraph *C*. Then, since the algorithm does not proceed to build another subgraph until all edges of *C* are processed (Step 3), the subgraphs output in Step 6 are edge-disjoint.*Each triangle in G belongs to a unique output subgraph*. Since all edges are processed (introduced in either Step 1 or 3 and triangles formed in Step 4), all triangles in *G* will eventually be found and assigned to at least one subgraph. Also, since the subgraphs output in Step 6 are edge-disjoint (Statement 1), they are also triangle-disjoint—i.e. no two subgraphs share a triangle, and thus triangles can belong to at most one subgraph. Thus, each triangle in *G* belongs to a single subgraph.*The output is a set of triangularly connected subgraphs in G*. This is because, after forming the first triangle, each subgraph *C* is expanded by iteratively adding all triangles that share a common edge with an existing triangle in *C* (Step 4).*These triangularly connected subgraphs are maximal*. Since each subgraph is edge-disjoint (Statement 1) and each triangle belongs to a unique subgraph (Statement 2), there cannot be a triangle-path between two different subgraphs. Therefore, each triangularly connected subgraph found by the algorithm (Statement 3) is also maximal.*EdgeSearch finds all maximal triangularly connected subgraphs in G*. This follows since all edges are processed (introduced in either Step 1 or 3), and the result is maximal triangularly connected subgraphs (Statements 3 and 4).Q.E.D.

Analysis of the worst-case complexity of the EdgeSearch algorithm gives

*O*(

*g*^{2})*

*O*(

*g*)*

*O*(log

*g*) =

*O*(

*g*^{3} log

*g*). This is because the algorithm must: (i) iterate over all edges

*e*(

*u*,

*v*) in

*C* (Step 3), with the worst-case complexity

*O*(

*m*) =

*O*(

*g*^{2}); and for each, (ii) look for a vertex

*w* and edge

*f* (

*u*,

*w*) in

*G* (Step 4), which is at worst

*O*(

*g*) if it must look through all other genomes in the

*g*-partite graph; and finally for each of these, (iii) check whether

*u* and

*w* are adjacent in

*G*, which is an efficient

*O*(log

*g*) lookup from the list of all adjacent vertices of

*w* (or

*v*). The worst-case complexity of EdgeSearch is comparable to the

*O*(V

^{3}) (V = number of vertices) of another heuristic method described in Vashist

*et al.* (

2007), but uses different topological information, i.e. triangles in a SymBets graph rather than dense clusters (quasi-cliques) in a graph that may include all edges and does not require a species tree.

Our implementation of EdgeSearch, in addition to iterating in the lower-dimensional space of edges rather than in the space of triangles, also takes advantage of optimized data structures. The most important of these are: (i) a list of all edges to iterate through, (ii) a hash of all edges to quickly test the existence of an edge, (iii) a hash of all processed edges to quickly test whether an edge has been processed already and (iv) a list of all adjacent vertices for each vertex in the graph, to avoid searching the entire space of edges (also, as shown in , the vertex with the smaller of the two lists can be chosen to be *u* in Step 4 of the algorithm). The minimal extra cost of producing these data structures (compared to storing all possible triangles) often gives a large payoff: for example, the knowledge of which edges have already been processed allows EdgeSearch to iterate through edges in *O*(*m*)**O*(*g*) = *O*(*g*^{3}) time rather than a full pair-wise *O*(*m*^{2}) = *O*(*g*^{4}). Furthermore, many of these optimizations substantially reduce execution time in realistically sparse graphs, for example by using a list of edges that each gene is adjacent to in order to avoid the search through the entire set of genes to find a third vertex *w*.

2.2 Additional considerations

The EdgeSearch algorithm deterministically finds triangularly connected subgraphs in

*G*. Iteration over edges reduces its worst-case behavior compared with the COGtriangles algorithm, and several optimizations take advantage of the sparseness of the graph, further reducing its run-time on realistic datasets (see examples below). The question, however, is whether triangularly connected subgraphs are too strict a definition of orthologous groups. For instance, the incompleteness of the SymBets list or an artifact of domain fusion might cause a rare case where a triangle shares a side with a subgraph, but not necessarily with a triangle structure within the subgraph (

Supplementary Fig. 1; note that such subgraphs could no longer be called triangularly connected). Expanding the definition of COGs to include such subgraphs can be done by altering COG

*C* to become the subgraph

*induced* by the vertices in

*C* (i.e.

*C* contains all edges connecting its member genes), but this introduces an element of non-determinism to the process of building COGs, where the order of data processing affects the results. In the current implementation of EdgeSearch, we choose to avoid this non-determinism rather than extend the definition to handle these rare events. The alternatives are discussed further in the

Supplementary Material.

Another concern is whether genes should be allowed to belong to multiple COGs. The first approach is to assign each gene to a single COG and then disallow it from belonging to another COG i.e. COGs are defined as being vertex-disjoint), even though reasons such as differential combination of protein domains or improperly resolved paralogous relationships can make it appear to belong to multiple COGs [the former case can be dealt with by splitting proteins into their component domains (Koonin,

2005; Tatusov

*et al.*,

1997)]. illustrates a hypothetical example: in this scenario, the middle two proteins contain both domains, and thus each could be arbitrarily assigned to either the top or bottom COG depending on the order the input is processed in. This phenomenon can affect the overall number of COGs: for instance, in this example, if both proteins are assigned to the top COG, the remaining bottom two proteins are short of the 3-protein requirement to form a full COG. In the current implementation of EdgeSearch, we chose to allow genes to belong to multiple COGs, which is a natural consequence of the SymBets graph structure and preserves this information for future use, such as in domain dissection.

2.3 Performance in construction of phage orthologous groups

We implemented the algorithm in C++ and tested its ability to make COGs in a real-world dataset of protein-coding genes from 323 double-stranded DNA bacteriophages [these COGs in phages are called Phage Orthologous Groups or POGs (Kristensen

*et al.*,

2009; Liu

*et al.*,

2006)]. For a more direct comparison of the two algorithms, we integrated the new approach into the old framework by starting with the COGtriangles program, eliminating the makeTriangles() function, and replacing makeCOGs() with an implementation of EdgeSearch that makes triangles and subgraphs simultaneously. a demonstrates that, as the number of randomly chosen genomes from this set of phages increases, the time required by the original COGtriangles method (measured on a dual-processor Pentium 3 GHz with 2 GB of RAM) increases from a few seconds to several minutes, whereas EdgeSearch holds steady at <12 s throughout the entire tested range. More important than the actual performance is the shape of the curves, with EdgeSearch handling the increase in number of genomes much more easily than the original approach, indicating that as more genomes become available, the advantage conferred by the new algorithm at handling larger input sizes will become ever stronger.

Not only is EdgeSearch faster than the original COGtriangles algorithm, but it also outcompetes the newest version (2.0 beta 6) of the popular OrthoMCL approach (a). OrthoMCL is based on a generic clustering MCL algorithm (Enright

*et al.*,

2002; Van Dongen,

2000), which uses simulation of stochastic flow on the edges of the graph, with Markov matrices determining the transition probabilities among nodes of the graph. Since the worst-case complexity depends on the input parameters, and further complications arise from the fact that the MCL algorithm is applied iteratively until convergence and that its output is non-deterministic, here we content ourselves with a comparison of its run-time performance rather than an in-depth look into its complexity. Indeed, in the most direct comparison, OrthoMCL (Li

*et al.*,

2003) required 44 s to form groups of orthologs from three or more genomes in the 323 dsDNA phage genomes, whereas EdgeSearch required only 12 s and COGtriangles almost 5 min. These numbers (and those in ) represent only the clustering step of the respective approaches (performed by the separate MCL program (

http://www.micans.org/mcl/) in the case of OrthoMCL), but when the entire pipeline is considered, starting from the BLAST results and continuing all the way to the end point of groups of orthologs, then OrthoMCL (starting with an empty MySQL database to minimize run-time) required 4 min 25 s, EdgeSearch only 41 s and COGtriangles 5 min 9 s (

Supplementary Table 1).

In theory, the output of the EdgeSearch algorithm should be identical to that of the COGtriangles algorithm. In practice, differences were observed due to the non-deterministic resolution of several problems by COGtriangles, and to the changes made in EdgeSearch to make its output deterministic ( and

Supplementary Fig. 1). For instance, with COGs defined strictly as triangularly connected subgraphs in EdgeSearch but not in the COGtriangles approach, six pairs of POGs (0.6% of the total 2058) are seen to be split in the former but are merged together in the latter. In addition, 1.2% of the proteins were observed to belong to multiple POGs produced by EdgeSearch (regardless of whether proteins were first split into domains prior to POG construction or not, indicating that the major cause is unresolved paralogy), which affected the membership of 93 (5%) of the groups and caused 37 (2%) additional groups to be formed compared to the original approach where proteins are only allowed to belong to a single group.

To further confirm that the EdgeSearch and COGtriangles approaches produce identical results across a wider range of input graphs, 320 additional randomly chosen test sets were constructed from the 323 phage genomes (corresponding to the data points shown in ) by randomly sampling an increasing number of genomes, from 5 to 320 in steps of +5, with five independent replicates of each sample. In each case, the two outputs were again the same after accounting for the issues of merging a triangle with a non-triangle within a POG and multiply represented genes.

The results of EdgeSearch and OrthoMCL were less similar. This comes as no surprise given their different underlying rationales [reviewed in (Chen

*et al.*,

2007)]. As a result of analysis of the 323 phage genomes, EdgeSearch produces 2058 POGs, whereas OrthoMCL produces 2265 clusters with default parameters. OrthoMCL uses similarity score cutoffs to define SymBets (default

*e*-value of 1e-5), whereas the COG approach requires no cutoffs, but in practice discards matches with

*e*-value greater than the BLAST default of 10 (Tatusov

*et al.*,

1997). When these parameters were adjusted, OrthoMCL produces 2250 clusters with an

*e*-value cutoff of 10, and EdgeSearch produces 2062 with an

*e*-value cutoff of 1e-5, indicating that the difference is due to the underlying algorithms rather than the choice of

*e*-value threshold. The majority of the groups produced by the two programs were similar, with 86% of EdgeSearch groups overlapping (sharing three or more genes) with 68% of OrthoMCL's (using default parameters in each program), and 97% of the genes in POGs appearing in an OrthoMCL cluster. However, nearly a third of the clusters and an additional 5800 genes (44% of the shared total) in OrthoMCL's results are not found in POGs, and OrthoMCL's clusters are significantly larger, with a maximum size of 287 genes and average of 8.3 genes per group, compared to a maximum of only 141 genes and an average of 6.5 genes in POGs. OrthoMCL clusters contain a much higher number of paralogs, with the maximum of 13 paralogs in OrthoMCL versus only 4 in POGs, and 796 OrthoMCL clusters (35%) containing at least one paralog versus only 256 (12%) in POGs. The OrthoMCL method has been reported to produce smaller, tighter clusters whereas KOGs (essentially eukaryotic COGs) produces larger, more inclusive groupings. Conceivably, the differences between our results and those of Chen

*et al.* (

2007) could stem from different structures of the analyzed datasets, with a considerably greater extent of gene paralogy in the eukaryotic genomes analyzed in their study compared to the bacteriophage genomes that underlie the POGs. For further comparison between the COGs and OrthoMCL approaches (as well as other methods of ortholog identification), see Altenhoff and Dessimoz (

2009) and Chen

*et al.* (

2007).

2.4 Genomes of cellular organisms: construction of MOGs

To assess the performance of the EdgeSearch approach on larger genomes of cellular organisms, we derived the orthologous groups of 16 726 protein-coding genes from the completely sequenced genomes of 23 bacteria from the Mollicutes class (MOGs). The genome sizes of Mollicutes start at 475 genes in the small parasite

*Mycoplasma genitalium* and reach 1380 genes in the more metabolically complex

*Acholeplasma laidlawii* (Pollack

*et al.*,

1996). In addition to encoding larger protein sets than in viruses, these cellular organisms also contain considerably higher levels of paralogy than phages, with a maximum of 39 paralogs in MOGs (a large group of transposases in

*Mycoplasma mycoides*), compared to only 6 in POGs. Other relatively large groups of in-paralogs in MOGs include 13 DNA-binding protein HU in-paralogs of

*Candidatus Phytoplasma australiense*, and 13 ABC transporter ATP-binding protein in-paralogs in

*Mycoplasma hyopneumoniae*. Despite this greater genomic complexity, there were even fewer unresolved paralogy cases in MOGs than in POGs, with only three pairs of MOGs (0.7%) sharing a side with a non-triangle and only 0.5% of the genes belonging to multiple MOGs, which affected the membership of 18 (2%) of the groups and caused 16 (2%) additional groups to be formed.

In this set of cellular genomes, the performance of EdgeSearch is even more striking in comparison to OrthoMCL and COGtriangles (b). While OrthoMCL required >30 min to cluster orthologs, EdgeSearch completed the task in only 28 s (a >60-fold speedup), whereas COGtriangles took nearly 3 min. In the full pipeline starting from the BLAST results, OrthoMCL required nearly an hour (>55 min), whereas EdgeSearch required <3 min and COGtriangles >5 min (

Supplementary Table 1). Again, the number of clusters differed, with EdgeSearch producing 833 and OrthoMCL 930 with

*e*-value cutoff of 10.