Our method predicts new genes according to the known genes annotated on a related DNA sequence by employing a probabilistic pair hidden Markov model (pairHMM) (
19). The states and transitions of the pairHMM underlying Projector are the same as for Doublescan (
11) and can model the most prevalent configurations which can arise through the alignment of pairs of related genes which appear in colinearity in two DNA sequences, including exon fusion or splitting events. However, instead of taking two similar DNA sequences of unknown gene contents as the input information and predicting the genes within the two sequences as well as an alignment between them as done by the comparative
ab initio gene prediction method Doublescan, Projector takes the known genes of one of the two sequences as additional input information and predicts the genes of the other DNA sequence according to the known genes of the related DNA sequence. In order to implement the constraint imposed by the known genes into Projector, the algorithm by which the state path with the highest overall probability (the optimal state path) is derived is altered.
Let
X be the input DNA sequence with known genes and
Y the input DNA sequence whose genes are unknown. We think of a sequence of time steps, in each of which the current state
s reads a fixed, state-dependent number of letters Δ
x(
s) from sequence
X and a potentially different fixed number of letters Δ
y(
s) from sequence
Y. The information on the known genes of sequence
X are translated into annotation labels
Xiann for each sequence position
i in sequence
X. In contrast to Doublescan, which considers all possible annotations and alignments of the two input sequences in the calculation of the optimal state path, Projector considers only those annotations and alignments which are compatible with the known genes of input sequence
X. This constraint is implemented in the following way into the recursion step of the Viterbi algorithm (
20):

where v(s, i, j) denotes the element of the Viterbi matrix which corresponds to the probability of the state path with highest probability which ends in state s and which so far has read i letters from sequence X and j letters from sequence Y, ts′(s) is the transition probability to go from state s′ to state s. es(i, j) is the emission probability of state s to read Δx(s) letters from sequence X (letters Xi–Δx(s), ..., Xi–1) and Δy(s) letters from sequence Y (letters Yj–Δy(s), ..., Yj–1). δ(Xkann, X(s)kpre) is 1 if the label X(s)kpre predicted by state s for sequence position k coincides with the annotated label Xkann, and 0 otherwise. It is thus this extra factor Πi–1k=i–Δx(s) δ(Xkann, X(s)kpre) in the above formula which implements the constraint into the state path calculation, because all state paths which do not reproduce the known annotation of sequence X are assigned zero probability as soon as a discrepancy occurs between the annotated and the predicted label of a position in sequence X. Using the above formula for the recursion step, the Viterbi algorithm calculates the state path with the highest probability which simultaneously satisfies the following conditions: (i) it reproduces the known genes of sequence X, (ii) it predicts genes in sequence Y which correspond to the known genes of sequence X and (iii) it predicts an alignment between the two sequences.
Special emission probabilities
Technically, the extra factor in the Viterbi recursion which constrains the state paths to those that reproduce the known annotation of sequence X can be interpreted as a modification of the nominal emission probabilities,
es(
i,
j), of the pairHMM. We call these emission probabilities which now also depend on the position within the input sequences rather than only the letters at that position special emission probabilities and denote them by
e′
s(
i,
j). For Projector, they have the form
e′
s(
i,
j) =
es(
i,
j) Π
i–1k=i–Δ(s) δ(
Xkann,
X(
s)
kpre) for every state
s which reads letters from sequence
X [see also Yeh
et al. (
9) and Korf
et al. (
15)].
Special transition probabilities
A novel feature of the pairHMM underlying Doublescan and Projector is the concept of position dependent transition probabilities which are used to implement the sequence signal scores provided by external programs into the pairHMM framework in a way which fully preserves the pairHMM’s probabilistic interpretation. We call these position dependent transition probabilities special transition probabilities. Both Doublescan and Projector use an external program called StrataSplice (
21) similar to that in Burge and Karlin (
22) to gain more detailed information on every potential splice and translation start site within the two input sequences. The sequence signals of these states are too complex and contained in a sequence interval which is too wide to be adequately captured directly within the states and transitions of our HMM. Before the state path calculation is started within Doublescan and Projector, StrataSplice goes along each sequence separately and assigns a log-odds score (in bits, i.e. log base 2) to each possible splice site and translation start site. This score is a measure of how likely the potential splice site is to be true. Once these scores have been assigned, they are used within the state path calculation to modify the nominal values of some transition probabilities. Every transition leading into a translation start or a splice site state is assigned the full nominal probability if the corresponding sequence signal scores are high, and it is reduced to a lower value if the corresponding scores are low. We choose to calculate the special transition probability for such a transition from state
s′ to state
s at sequence positions
i in
X and
j in
Y as follows:
t′s′(s, i, j) = ts′(s) (prior(i, j) · 2score(i, j)) / (prior(i, j) · 2score(i, j) + 1 – prior(i, j))
where prior(
i,
j) =

and score(
i,
j) = score
x,i + score
y,i if the state
s is a match state, or (prior(
i,
j) = prior
x,i and score(
i,
j) = score
x,i) or (prior(
i,
j) = prior
y,j and score(
i,
j) = score
y,j) if the state reads only letters from sequence
X or only sequence
Y, respectively. The priors prior
x,i and prior
y,j are the respective prior probabilities of seeing the sequence signal (in the general case their values may depend on the sequence position, as the prior probability of seeing a GC 5′ splice site may for example be different from seeing a consensus GT 5′ splice site) and the scores score
x,i and score
y,j are the respective log-odds scores of the sequence signals at sequence position
i in
X and
j in
Y.
Empirical studies led us to take the geometric mean for merging both the two individual priors as well as the two individual probabilities underlying the scores. The natural way to combine two probabilities which are not mutually exclusive might appear to take their product, but if the splice site is fully conserved this would effectively score it twice, which is wrong. The geometric mean scores it once. An arithmetic mean does not work because it allows a very poor site, e.g. with probability 0, to be accepted if paired with a good site. In practice the geometric mean worked best of several methods we tried (with the Doublescan not the Projector test set). A more complex approach could make use of the level of similarity, but this would make the combination depend on the local sequences as well as the scores, adding significantly to complexity and compute time. It remains an option for future exploration.
The probability of such a transition thus becomes dependent on the value of the sequence signal scores and priors at the given pair of sequence positions, and the corresponding transition within the pairHMM is called special. In order to retain the probabilistic interpretation of the transition probabilities, we have to ensure that the sum of transition probabilities emerging from each state at any pair of sequence positions remains equal to one. Once the values of the special transition probabilities emerging from one state have been calculated, the remaining non-special transition probabilities are rescaled by a common factor which ensures that the sum of all transition probabilities emerging from that state add up to one. The rescaling factor for a non-special transition from state s′ to state s″ at sequence positions i in X and j in Y is:
Hence Σs ts′(s, i, j) = 1 for all possible triplets of state s′, sequence positions i in X and j in Y.
With the aid of special transition and special emission probabilities, the recursion step within the Viterbi algorithm of Projector can then be written as:
v(s, i, j) = maxs′ {v(s′, i – Δx(s), j – Δy(s)) t′s′(s, i – Δx(s), j – Δy(s)) e′s(i, j)}
where
t′
s′(
s,
i – Δ
x(
s),
j – Δ
y(
s)) and
e′
s(
i,
j) correspond to the non-special terms whenever a transition or emission is not special. The baseline transition probabilities
ts′(
s) and non-special emission probabilities
es(
i,
j) of Projector are the same as for Doublescan. The time and memory requirements of the Viterbi algorithm both scale with the product of the sequence lengths
LX and
LY of the two input sequences
X and
Y. As for Doublescan, we use Projector with the Stepping Stone algorithm (
11,
23) which heuristically restricts the search space by constraining the alignment to an envelope around a set of mutually compatible BLAST matches. The Stepping Stone algorithm implementation also employs the linear memory implementation of the Viterbi algorithm due to Hirschberg (
24) to reduce both time and memory requirements to effectively linear behavior.