2.1. Preparation of the ortholog dataset

We examined and selected the microbial genome database (MBGD,

http://mbgd.genome.ad.jp/) (

Uchiyama, 2003) as our primary data resource, because it was found to contain fewer paralogs than the clusters of orthologous groups of proteins (COG,

http://www.ncbi.nlm.nih.gov/COG/) (

Tatusov et al., 1997). The orthologous proteins (orthologs) in MBGD are obtained by constructing a phylogenetic tree of the possible orthologs in question. However, there is one problem in MBGD that the tree is constructed by UPGMA (

Michener and Sokal, 1957), which is known to be less reliable than other methods such as the neighbor-joining (NJ) method (

Saitou and Nei, 1987;

Saitou and Imanishi, 1989). The use of UPGMA would thus wrongly sort out out-paralogs that seriously disturb the construction of the correct tree topology. Therefore, we instead applied NJ method to the source data of MBGD for sorting out out-paralogs and obtain more reliable orthologs than the original ones. The procedure of identifying and excluding out-paralogs in the present study is as follows.

Let us suppose that the unrooted tree in was constructed by NJ method for eight possible orthologs from eubacteria (e) and archaea (a). Let us then assume that the tree is divided at the arrow or root into two clusters, one containing (e1, e2, e3, e4, a1, a2) and the other containing (e5, a3). In this case (e5, a3) is excluded, and (e1, e2, e3, e4, a1, a2) is kept in the refined database. If one of the two clusters contains only eubacteria that share at least one species in the other, we compare the number of species of eubacteria in both clusters and choose the one with a larger number and the archaea discarding the eubacterial cluster with a smaller number of species. We discard the case in which the eubacteria in the two clusters do not share a species, because we cannot place the root in the tree. The same is true for the case where the one cluster contains only archaea. Of course, there are many cases in which a constructed tree contains only one cluster, suggesting that no out-paralogs are included in the tree.

2.2. Phylogeny construction

Let us assume that we have

*N* ortholog groups with

*m*_{i} (

*i* =1, 2, 3, …,

*N*) species each, where

*m*_{i} is the number of species for the

*i*-th group, and max

*m*_{i} =

*M*. The orthologs of every

*N* cluster are aligned by MAFFT (

Katoh et al., 2002) and edited by Gblocks (

Castresana, 2000). The latter includes a procedure to retain gap(s) in the aligned sequences when half or more of

*M* species have amino acid(s) at the corresponding residue(s), or discard them from all species when less than half have amino acid(s). The aligned amino acid sequences are used for constructing a phylogenetic tree as follows.

An initial tree is constructed for one of the

*N* groups by NJ method in BioNJ (

Saitou and Nei, 1987,

Gascuel, 1997), and it is used in the maximum likelihood method in PhyML (

Felsenstein, 1981,

Guindon and Gascuel, 2003) as the initial tree to produce the final tree. (The parameters used for PhyML are: “JTT” for substitution model, “estimated” for proportion of invariable sites, “estimated” for gamma distribution parameters, “4” for the number of substitution categories, “yes” to optimize topology, “yes” to optimize branch length and rate parameter, and “BIONJ” for starting tree(s).) This is repeated

*N* times for

*N* ortholog groups and

*N* individual trees are constructed. We next concatenate the aligned sequences of

*N* ortholog groups for

*M* species. If a species does not have all

*N* orthologs, the missing orthologs are filled with gaps. As a result, we obtain

*M* concatenated sequences that are then used for constructing a concatenated tree by the same procedure as above.

2.3. Evaluation of the constructed phylogenetic tree

The concatenated tree is expected to be more accurate than each of the *N* individual trees, as the former is more resistant to the existence of false orthologs and the variation of the number of amino acid substitutions than the latter. The question then is how much accuracy the former has. To evaluate the accuracy of the concatenated tree, we compare the concatenated tree to every *N* individual tree by using the method we newly developed. The new method, ComTree, allows us to compare the topology of a pair of trees even for a pair having different numbers of OTUs and no roots.

ComTree works as follows. In , one concatenated and three individual trees are given. The question here is how to evaluate the internal node shown in the closed circle in the concatenated tree by comparing it to the three individual trees. There are 3 branches extending from the node which reach out to three sets of OTUs, (A, B, C), (D, E, F) and (G, H), respectively. Now, every individual tree is examined if it has a node with three branches extending each to one of the OTUs in the three OTU sets of the concatenated tree. Let us denote the number of such individual trees as Np. If such a tree further satisfies the condition that the three OTU sets are not intermingled with those in the concatenated tree, the tree is qualified to support the node in the concatenated tree. Let us also denote the number of the qualified individual trees as Nq. Then, we define the Positive Ortholog Ratio (POR) as,

POR indicates the relative number of orthologs that support a node of the concatenated tree. Note that there are *N*–Np individual trees that do not contribute to POR. It is noted that ComTree is independent from the number of OTUs. In the present case, each of trees 1and 2 has an OTU in one of the three OTU sets of the concatenated tree. Tree 2 further satisfies the condition mentioned above, while tree 1 does not. Tree 3 does not have G or H in one OTU set of the concatenated tree, and does not contribute to POR. Therefore, Np=2, Nq=1 and N–Np=1. The value of POR in the present case is thus 0.5, which means that 50% of the orthologs used for the POR computation support the node in question.