Once a genome is chosen for analysis, our algorithm consists of the following steps:
a) For each gene pair (as defined below) in the genome, a tree-based conservation score is computed;
b) For each gene in the genome, two neighbourhood conservation scores are computed, measuring the strength of proximity constraint of the gene with its upstream or downstream neighboring genes;
c) Adjacent genes with conservation scores between them exceeding some threshold are joined into conserved clusters.
The details are described in the following sections.
Definition of conserved neighboring gene pairs
For simplicity let us consider the problem of identification of a conserved gene pair, the building block of our gene clusters. We define a gene pair as two genes with no more than k
open reading frames separating them along the chromosome (k
= 1 in this paper). In contrast to the operon identification procedure [12
], the two genes in a pair do not need to reside on the same strand of the chromosome. When orthologs of a gene pair form a pair in other genomes, the pair is considered conserved. Orthologous genes are often detected as reciprocal best hits (BLAST E-value < 1E-5) between the two genomes by sequence comparison software, such as BLAST [17
]. However, evolutionary events such as gene duplication followed by diversification could obscure the reciprocal relationship, resulting in relatively high false negative rates in the identification of orthologous genes. To relieve this concern, we loosen the criteria to include genes with high similarity (BLAST E-value ≤ 1E-5) (see Results). However, unless specifically pointed out, the results presented are calculated on the basis of orthology data.
Computation of conservation scores using phylogenetic information
We assign a score to a conserved gene pair by computing the probability a particular pattern of conservation is observed in analyzed genomes based on a stochastic model of evolution. Before introducing the detailed implementation, we first lay out the theoretical foundation of the method and point out our assumptions.
As an example, consider the rooted bifurcating phylogenetic tree shown in Figure . The leaf nodes Q, A, B, C and D represent extant genomes, and internal nodes X0
are inferred ancestor genomes. Let us assume a gene pair in Q is conserved in A, B and C but not D. We model evolution as a stochastic process represented as a probabilistic graphical model in the form of a tree [30
]. We associate a binary random variable with each node in the tree, assigning it a value 1 if the specific gene pair is conserved in the genome and 0 if absent. The values of leaves in the tree are determined by our initial gene cluster identification procedure.
A simple genome phylogenetic tree.
We assume that the probability of a genome acquiring a gene cluster given its absence in the ancestor is negligible, i.e., P(child = 1|parent = 0) = 0. This is a simplifying assumption which considers the predominance of vertical inheritance and omits the negligible probability of independent formation of identical clusters. Under this assumption the most recent common ancestor of all the leaves that are assigned to 1 is also set to 1. Accordingly, in the Figure tree model, X0 is set to 1. We compute the significance of the conservation based on the probability of observing the specific gene cluster given that the closest common ancestor has it, that is,
P(conservation) = P(Q = 1, A = 1, B = 1, C = 1, D = 0|X0 = 1)
In our initial model we do not take into account the genomes that do not possess the cluster, although it is not difficult to do with the appropriate assumption on the conditional probability of loss of a cluster in a descendant of an organism that has it. Thus a leaf node D that lacks the cluster is dropped from further calculation. Now we have
P(conservation) = P(Q = 1, A = 1, B = 1, C = 1|X0 = 1)
The tree in Figure can also be interpreted as a Bayesian network in a tree form [31
]. In particular, the vertical inheritance along any path in the tree is a generative probabilistic process, and the probability that a child inherits a gene pair is only dependent on its immediate evolutionary ancestor. More specifically, we associate a conditional probability table with each edge of the tree enumerating the probability of the presence or absence of a gene pair in a genome given the state of its immediate ancestor. According to the tree model, we have
Assuming independence between the siblings, the above can be rewritten as
Note that this derivation is a simple generalization of the forward algorithm [30
]. According to our vertical inheritance assumption, the probability for a child to have a gene pair is approximately zero if the immediate ancestor does not have the gene pair. After applying this assumption and evaluating the equation recursively, the above formula reduces to
P(Q = 1, A = 1, B = 1, C = 1|X0 = 1) = P(Q = 1|X2 = 1)·P(A = 1|X2 = 1)·P(B = 1|X3 = 1)·P(C = 1|X3 = 1)·P(X2 = 1|X0 = 1)·P(X3 = 1|X0 = 1)
More generally, for a gene pair found in a set of genomes and a given genome phylogeny tree T, the following simple relation holds
where Y is an immediate ancestor of X in T.
Taking the negative logarithm of both sides, we have
Thus the probability of conservation of a gene cluster in a given tree is simply the sum of log conditional probabilities of all the branches on paths leading to the genomes where the gene pair is present.
We further assume that log(P(X = 1|Y = 1) is proportional to the length of the branch that connects X and its parent Y, that is, log(P(X = 1|Y = 1)) ~ d(X,Y). Thus,
where Y is the parent of X. The summation of all the tree branches in the phylogenetic tree is proportional to the logarithm of the overall probability.
In our implementation, the genome phylogenetic tree is built by using the genome distance metric based on the shared gene content suggested by Snel and coworkers [34
]. We first calculate the pairwise distance between genomes by d
), where s = (number of shared orthologs)/(average of total gene numbers in two genomes). In this implementation, Escherichia coli
and Salmonella typhi
, have a distance of 0.35, while Escherichia coli
and Bacillus subtilis
, which are much more distantly related, have a larger distance of 1.18. The genome phylogenetic tree is then constructed from the pairwise distance matrix using the neighbor joining algorithm [35
For each gene pair, a genome phylogenetic tree is built on the genomes that have the pair, the conservation score of the gene pair is the summation of all the branch lengths in the tree. Notice that another nice property of the score is that it is independent of the query genome.
Detection of long conserved gene clusters
We now extend the methodology by using the gene pair conservation to detect longer conserved gene clusters. For the gth gene on the query chromosome, we estimate its upstream conservation score (Cu) and downstream conservation score (Cd) by:
where each s(g-i, g) is the conservation score assigned to gene pair (g-i, g) calculated using the method above; k is the maximum number of intervening genes allowed in a conserved gene pair (k = 1 in this paper). As a result, each gene will be associated with two numerical scores, measuring the extent of conservation between itself and its upstream or downstream neighboring genes respectively.
The statistical significance of the conservation scores is inferred from a bootstrap simulation. For each genome, the null distribution is computed by calculating the conservation scores of the randomly shuffled genome. The P-value cutoff is set at about 1E-4 in this paper, which corresponds to a conservation score of about 5.0 for most genomes. Notice that P-values are related to genome size since genes in very small genomes may have higher chance of forming conserved gene clusters. For instance, conservation score of 5.0 corresponds to a larger P-value (3.5E-3) in a small genome Mycoplasma genitalium than in E. coli (2.4E-4).
For genes that are at the boundaries of the cluster, only one of the conservation scores will exceed the threshold, which provides a convenient way of detecting the boundaries. We detect the maximal conserved gene clusters by scanning the genomes sequentially. The gene with Cd over the threshold, but not for its upstream genes, marks the start of a new cluster. The gene whose downstream genes have Cu scores below the threshold mark the end of the cluster. All the genes between are considered as part of the cluster.
More sophisticated dynamic programming procedures, or single-linkage clustering algorithm to identify maximal conserved gene clusters are also possible.
The main program was written in C and used the LEDA library for the manipulation of trees. Scripts for generating genome BLAST data and for analyzing data were written in Perl. GeneChords was built with PostGreSQL. All scripts are available upon request from the authors.