To perform gene prediction, FrameD relies on a weighted directed acyclic graph (DAG) designed in such a way that every path in the graph represents a possible gene prediction (consistent with START and STOP codons use, including possible partial genes on the sequence border). The graph in Figure is an example of the graph used for a short sequence. It has seven parallel tracks that respectively correspond, from top to bottom, to the assumption that a region is coding in frame 3, 2, 1, non-coding, or coding in frame −1, −2 and −3. Each nucleotide is represented by seven edges in the graph, one on each track, represented below each nucleotide in the figure. These are called ‘content’ edges. Other edges connect vertices from the nucleotide at position i
to the nucleotide at position i
+1. These are called ‘signal’ edges: the occurrence of a possible translation START in a given frame p
induces the creation of a signal edge that allows to go from the non-coding track to the track coding in frame p
. Conversely, a STOP codon in frame p
induces the creation of a signal edge that goes from the frame p
coding track to the non-coding track and also prevents any path (
prediction) from crossing a STOP codon by removing the corresponding horizontal signal edge (Fig. ). When dealing with unfinished sequences, degenerated START and STOP codons are also detected.
Figure 1 A simplified view of the directed acyclic graph built for analyzing the sequence CATGAGTACNGA. This view ignores the additional complexity induced by gene overlapping regions and frameshift modeling. The occurrence of a START codon at position 2 to 4 (more ...)
A source-sink path in this graph represents a possible prediction. However, genes may not overlap. To allow gene overlapping, frequent in bacterial genomes, we further added six so-called ‘bi-coding’ tracks to the graph that represent regions coding in two different frames on the same strand. Each occurrence of a START codon (resp. STOP) is represented by an additional edge that allows to reach (resp. quit) the corresponding ‘bi-coding’ track.
Finally, to represent possible frameshifts, additional edges have been added that allow, for example, to reach a given coding track from another one (for a deletion, the edge jumps over one nucleotide).
In this final graph, every path from the source to the sink represents a possible prediction. In order to choose a prediction among the exponential number of paths in the graph, each edge receives a weight. If we interpret these weights as the opposite of the logarithm of the probabilities that the edge can be used in a path, a shortest path in this weighted graph represents a most reliable path (prediction).
Weighting the graph
The probability associated with content edges is defined by the emission probability of the corresponding nucleotide in a track-specific interpolated Markov model (IMM) (4
). Each track is associated with a specific model (0th order Markov model for the non-coding track, three-periodic IMM for coding tracks and a mixture of two coding models for bi-coding tracks). In order to deal with noisy sequences, basic IMMs have been extended to deal with arbitrary sequences over the alphabet A,T,G,C,N and are therefore able to provide the probability of emission of a given nucleotide A,T,G,C or N after any k
mer that possibly contains Ns. The basic statistics needed for the estimation of an IMM are the number of occurrences of arbitrary k
ranging from 0 to 8 in practice) in a learning set of sequences over the alphabet A,T,G,C. For k
+1mers that do not contain a degenerated N this count is obtained by enumeration of occurrences in the learning set. For k
+1mers containing Ns, the number of occurrences can be computed recursively from counts associated with k
+1mers containing one less N.
For signal edges, we consider that only one of all the signal edges that leave a vertex may be used, acting as a switch between tracks. The following weights are used:
- A constant STOP penalty is associated with each signal edge representing a STOP occurrence.
- A constant frameshift penalty is associated with each signal edge representing a frameshift (deletion and insertion are considered as equally probable).
- For START codons, a RBS hybridization energy E is estimated from an approximate thermodynamic model using elementary energies (5) and a free end-gap like alignment algorithm between an RBS motif and a short region before each START codon. By analogy with thermodynamics, the probability that the START is used is then defined as α/(1+β exp(E)) where α and β are constant to be estimated.
- The remaining horizontal signal edge is weighted by the logarithm of 1 minus the probability of all other signals that occur at the same position (if any, frameshift probability being considered as negligible).
When protein similarities are available, the coding score of each nucleotide is enhanced in the corresponding frame using a simple scheme. We define the mean (bit) score S
of a hit as the (bit) score divided by the hit length. For each nucleotide in the sequence, the strongest mean (bit) score is used to enhance the corresponding content edge using a pseudo-count approach: if p
is the original probability of the content edge for the nucleotide, the updated probability is p
) where γ is a user-defined parameter (BlastX hits confidence). When γ is not null, the similarity information can also enhance frameshift prediction (see 6
for a related similarity-based approach to frameshift detection).
Gene and frameshift prediction
Once all parameters are fixed, the problem of finding a most reliable path in the previous directed graph can be solved using any DAG shortest path algorithm. This is done in linear time and space in the sequence length.
This algorithm will only compute one optimal prediction. In practice, it may be the case that several sub-optimal predictions differ significantly from the optimal one. This may be the case for the choice of a START codon or the use (or not) of a frameshift. To make such uncertainty in the optimal prediction visible, FrameD computes for every signal edge in the graph the probability that this edge is used over all possible predictions. This information can be interpreted as a ‘mean’ prediction that summarizes all possible predictions, taking into account their respective reliability. This computation is done using a forward-backward like dynamic programming algorithm, in linear time and space.
Fixing parameter values
In order to compute all edge weights, FrameD requires the value of several parameters.
The frameshift penalty being set to a very large value, the STOP penalty, α and β parameters have been estimated by optimizing the quality of the prediction on curated annotated sequences. The quality of the prediction was measured at the nucleotide level and at the gene level by counting correctly predicted STOP and START codons. For a given type of predicted item, sensitivity (Sn) is defined as the ratio of the number of correctly predicted items to the number of annotated items; specificity (Sp) as the ratio of the number of correctly predicted items to the number of predicted items.
The criteria optimized to estimate the parameters is the sum of these six ratios. It has been optimized using a dedicated genetic algorithm on a region of the GC-rich Sinorhizobium meliloti genome. The parameters are quite robust and provide good prediction performances on genomes with rich and medium GC content (see Results). For low GC%, another parameter set has been tuned using Rickettsia prowazekii sequences.