The basic strategy of seeding alignments used here is the same as for BLAST, in that alignments seeds are generated, and then extended to form High-scoring Segment Pairs (HSPs), which are then joined together to form the alignments. The alignments are seeded using an Aho and Corasick [20
] type finite state machine (FSM) built using the word neighbourhood of the query sequence. This generates the seeds which are extended to form the HSPs. For large scale analyses, the FSM is multiplexed using word neighbourhoods from multiple sequences. This allows analysis of multiple queries during a single pass of a genomic database, in a manner similar to that used in MPBLAST [21
However the methods for seeding HSPs are independent from those used for building the alignments, and this paper only deals with algorithms involved in the generation of gapped alignments from sets of HSPs, and not in the calculation of the HSPs themselves.
The following strategies are employed to enable alignments to be built efficiently from sets of HSPs:
• To connect the underlying alignment model to the heuristics, a portal describes a set of states in the model which correspond to a set of HSPs, a span refers to a looping state for large alignment features such as introns, and a SAR (Sub-Alignment Region) describes a rectangular region on an HSP to which DP is applied.
• To avoid DP in every SAR, upper bounds are generated for the best alignment score for each SAR, and BSDP (Bounded Sparse Dynamic Programming) exploits these bounds to yield alignments in an efficient manner.
• To perform various types of DP in these SARs, the required models are generated automati cally, including C code to produce an efficient viterbi implementation for each model.
Bounded sparse dynamic programming
Dynamic programming (for any alignment model which can be represented by a regular grammar) requires quadratic time, and hence is the most computationally expensive part of building an alignment. For pairs of sequences more than a few kilobases long, DP becomes prohibitively slow. The approach used here is similar to sparse dynamic programming
], and the fragment chaining approaches used in the program sim2 [23
], in that DP is applied to rectangular regions surrounding alignment seeds. However, there are two major differences in our approach. Firstly, DP is only applied to two small discrete regions on each HSP, as it is assumed that most of the HSP itself should appear in the alignment. These sub-alignments
improve the quality of the overall alignments, and they allow complex alignment models, and large gaps such as introns to be integrated into a sparse DP framework. Secondly, as it would take too long to apply DP to every sub-alignment region (SAR), upper bounds are calculated for the DP scores for each of these SARs. This allows the sub-alignment DP to be avoided in cases where it joins HSPs which cannot feature in the final alignment, so that alignments can still be generated very rapidly even when large numbers of HSPs and SARs are involved.
Before the BSDP can be performed, a single point on each HSP is selected which will feature in any alignment generated using that HSP. This point corresponds to a pair of equivalenced symbols which must feature in any alignment to include that HSP. A point is chosen where half the HSP score is generated by equivalenced symbols on either side of it, as this is likely to be in the highest quality portion of the HSP. As shown in the example in Figure , this strategy is particularly beneficial where one end of the HSP has a much lower quality than the other.
Figure 1 Example of joining two HSPs. A single point on each HSP is chosen to be be included in the alignment. This is the centre of the HSP according to the scores, such that equivalenced bases on either side contribute to half of the HSP score. As shown in this (more ...)
The five types of region used for sub-alignments are classified in Figure . Each of these require a slightly different alignment algorithm. The alignment path must meet corners of the SARs that contain an HSP, so that the sub-alignments can be integrated with the HSPs to produce the final alignment. This approach has been primarily designed for local models, but BSDP may also be used for global and semi-global models, in which case constraints are added such that both the terminal regions (cases A and E in Figure ) and the resulting sub-alignments must contain the relevant sequence ends. In the case of C and D, the two HSPs and their SARs are separated by a span, allowing large gaps or introns in the alignment. In these cases DP must be applied to the SAR before the span, and the end state scores must be integrated in an intermediate matrix before being provided as start state scores for the DP in the SAR after the span.
Figure 2 The five types of sub-alignment region around HSPs in which dynamic programming may be applied. The start of an HSP (A), handling small gaps by joining two adjacent ends of HSPs (B), handling large gaps by joining two distant HSPs via a span (C,D) and (more ...)
Regions for the sub-alignments are selected within the area between the centre points of the two HSPs to be joined, or in the case of terminal HSPs, between the HSP and the ends of the sequences. In addition, the positions of the SARs must be constrained to limit their size, and so that the HSPs correctly intersect with the corners of the SAR. In the case of overlapping HSPs, where there is a choice of positions for placement of the SAR, the position is chosen such that the highest scoring parts of the HSPs are outside the SARs.
Once the SARs have been selected, an upper bound is placed on the score for each sub-alignment. The calculation of these bounds is described in a later section describing automation of this method. The BSDP approach can then utilise these upper bounds to avoid application of DP in some SARs, as demonstrated by the example in Figure . In the case of a real alignment, a much greater number of HSPs will be involved, so the amount of DP avoided will be larger.
Figure 3 A toy example of bounded sparse dynamic programming. In this example the bounds indicate that if the overall alignment threshold is greater than 280, no sub-alignment DP is required. Otherwise, the region between A and B is confirmed first, and the bounds (more ...)
The BSDP is mediated through a set of priority queues, one of which is associated with each HSP, and will contain an entry for each partial alignment that ends at the HSP. The key for these priority queues is the highest score for any partial alignment ending in that HSP. Additionally, there is one global priority queue containing the highest scores from each of the other priority queues. The upper bound scores are confirmed by DP in the SARs in the current highest scoring putative alignment path. The highest scoring path will change as the scores are updated during this process. DP is applied to SARs in this way until the highest scoring path does not contain any bound scores, and then the alignment may be extracted. Upper bounds dictate that there can be no better alignment using these HSPs.
This algorithm is similar to A* search, (but differs in that many different points may be the start or end of the search), and retains the admissibility property of A* search, such that the result of the BSDP computation is guaranteed to be the same as if DP had been performed on every candidate SAR prior to calculation of the alignment. This is because no alignment can be extracted until all the alternative sub-alignments (which have upper bounds that indicate they could contribute to a higher-score) have been eliminated.
The BSDP alignment process can be iterated to generate sub-optimal alignments similar to those generated by the Waterman-Eggert algorithm [24
], with only minimal recalculation of the partial alignments in the SARs. Each HSP may only appear in a single alignment, but further constraints are required to prevent overlapping alignments arising from overlapping SARs. The likelihood of this is occurring is greatly increased during translated alignments when HSPs in different reading frames may overlap each other.
After the first alignment has been reported, the scores for any SARs which have already been confirmed by DP, but which are not yet included in a reported alignment, are then considered as an upper bound. SARs are disallowed before recalculation when the region between the centre of the HSP and its SARs overlaps with a previously reported HSP, in which case, the SARs are disallowed. Otherwise the DP is recomputed for SARs which contain part of an alignment which has been reported since the DP for the SAR was last calculated.
Automated model generation
As illustrated in the previous section, BSDP becomes quite complex and requires a large number of DP algorithms for computation of the alignment through the SARs. We have build a system to facilitate implementation of these models and the integration of the sub-alignments which they produce. To allow generalisation of the BSDP, everything must be defined in terms of the underlying alignment models. The alignment models are described as finite state machines, consisting of states and transitions, similar to those used in Dynamite [6
Briefly, to convert these models into DP implementations, each state must correspond to a score in each cell of the DP matrix, and the scores for each cell are calculated by taking the maximum of the score from transitions arriving at each cell. In addition, a topological sort is required to satisfy dependency ordering for silent states.
However, in addition to automated generation of code from alignment models, the generation of the models required for DP in the SARs is also automated, as described below.
Building simple models
Table shows a few example alignment models which are generated by this system. The models are built in a modular fashion, allowing reuse of common components such as intron models and gap models. These models may be used for exhaustive alignment in quadratic time, but in order to use them for heuristic alignment, manipulations of the models are necessary to perform DP in the SARs, as detailed in the following sections.
Examples of hierarchical model building in C4. Instead of building each model from scratch, most models are created in a modular fashion by making adaptions and additions to other models. This facilitates the building of complex models.
Building the heuristic model
To enable DP in the SARs for the BSDP, a heuristic model is generated from the original alignment model. This model is not used directly for calculation of alignments, but a derived model is generated corresponding to each transition in the heuristic model. Each of these derived models correspond to a type of SAR used in the BSDP.
The model is first annotated, labelling certain states as either portal states or span states. A portal defines a group of states which can share a set of HSPs (High-scoring Segment Pairs); these are the match states. A span is a state which has sequence independent looping transitions (e.g. states for introns, or non-equivalenced regions).
The heuristic model is build using states corresponding to each portal and span state, with transitions between these states in cases where there is a valid path between the corresponding states in the original model. An example model is shown in Figure for alignment of ESTs to genomic DNA. In this example, there is a portal which corresponds to the match forward and match reverse states, and the intron forward and intron reverse states are span states.
Figure 4 Models for alignment of an EST to genomic sequnce. The exhaustive model for aligning ESTs to genome DNA, showing portal states and span states is shown in Figure 4 (a). The portal states are match forward and match reverse, and these share a set of HSPs. (more ...)
Building derived models
Derived models are produced for the DP in SARs automatically from each transition in the heuristic model. The source and destination states from each transition in the heuristic model become the start and end states in each derived model. All reachable states and transitions from these states in the original model are recursively copied to the derived model. An example of this process is shown in Figure . The relationships between the states and transitions in the derived model and the original model are tracked to allow conversion of the partial alignments from the derived models back to complete alignment in the original model. Terminal models (case A and E in Figure ) are generated between portal states and a start or end state. Join models (case B in Figure ) are generated between portal states, including from a portal state to itself. Span models (case C and D in Figure ) are generated from a portal state to a span state, and vice versa. These allow incorporation of a large feature such as an intron into an alignment. The span models require a special end state in the model at the start of the span, and a special start start in the model at the end of a span, so that scores can be transferred from one DP matrix to the other via an intermediary score matrix.
Figure 5 Example of generation of a derived model. The original model on the left (for Smith-Waterman-Gotoh) is used to generate the derived model on the right for joining two adjacent HSPs. Extra transitions are allowed from the START state to the INSERT and (more ...)
For some cases, such as between the match forward and match reverse states, shown in Figure , there is no possible path, and no corresponding transition in the heuristic model, in which case, a derived model is not produced.
Ten different derived models are generated from the model shown for cDNA to genomic alignment in Figure , because there are two portal states and two span states in this model, and therefore, a derived model is generated for each of the five cases in Figure for both the forward and reversed genes.
Building models for calculation of upper bounds
For each derived model, an additional model is created which is used for the pre-calculation of upper bounds for all possible sizes of SARs. For each transition in the model which has a position dependent score associated with it, the upper bound is also supplied. For example, in a match transition, the upper bound is the maximum value from the substitution matrix. A special model is created using these transition upper bounds, instead of the normal sequence-dependent transition weights. As this removes the sequence-dependent components of the algorithm, it allows pre-calculation of the upper bound for alignment score of any sequences up to the maximum permitted size of SARs. These bounds are then stored in a table for retrieval during the BSDP.