In order to solve the LSE minimization problem efficiently, we model it as a shortest path network optimization problem. The key to constructing a network representation is to decompose a protein sequence of length
n into an ordered sequence of nodes in a network. The label for each node corresponds to a subsequence of four consecutive amino acids from the protein sequence (a tetramer), with directional arcs connecting the nodes. Thus, in this representation each sequence of length
n is represented as a path of length
n – 3. We also prefix each node label with its ‘stage’, which is the position of its first amino acid in the protein sequence. For example, the protein sequence MERLTG is represented as the following sequence of nodes and arcs:
Note the overlap in the node labels—the last three amino acids in each node label match the first three amino acids in the label corresponding to the next node. We can define an entropy measure for the full sequence by summing the entropy values for each of the three arcs in the above representation. The values for each arc are the entropy values derived by Chan
et al. that correspond to the tetramer at the destination node (Chan
et al.,
2004). Note that our method takes advantage of a property of the LSE scoring function, namely that the structural entropy of each tetramer can be calculated independently of other tetramers. This property is crucial to the application of a shortest path network approach.
Given a set of sequences that give different possibilities for the amino acids at some positions of the sequence, we can construct a network in such a way that each possible path through the network corresponds to a possible hybrid of the sequences in this set. Formally, node i in this network has the label (ni, ai1, ai2, ai3, ai4), where ni is the stage number corresponding to this tetramer and (ai1, ai2, ai3, ai4) is the tetramer itself. There is a directed arc from node i to node j if the following four conditions hold: ni+1=nj, ai2= aj1, ai3=aj2, and ai4=aj3. (Node i is said to be a predecessor of node j.) The arc connecting nodes i and j is assigned a weight equal to the entropy value for the tetramer (aj1, aj2, aj3, aj4) corresponding to node j. There are two additional nodes in the network: a source node from which arcs emanate to all the first-stage nodes, with weights equal to the entropies for the tetramers at those nodes; and a sink node that is the destination for arcs emanating from all nodes at stage n – 3, with all weights zero.
Now consider the two protein sequence example in . The figure illustrates a network corresponding to a small protein sequence fragment with two variable positions. The two homologous six amino acid sequences give rise to a network with three stages, in addition to the source and sink nodes. The LSE value of the tetramer in each position is indicated on the incoming arcs. Note that the numeric node labels within parentheses in are aliases for the original node labels. For example, 2 is the numeric node alias for the original node label ‘1.M.E.R.L’. These labels will be used interchangeably in the ensuing sections for the sake of simplicity.
Any connected path through the network from source to sink corresponds to a protein sequence of the same length as the original sequence. The amino acid at each position is chosen from amino acids observed at that position in the multiple sequence alignment. The entropy corresponding to this path is obtained by summing the arc weights along the path. In this way, the problem of finding the optimal sequence can be reduced to finding the shortest path through this network from source to sink. Fortunately, a standard and highly efficient algorithm from network optimization—Dijkstra's algorithm—can be used to solve problems of this type. Dijkstra's algorithm is a ‘greedy’ algorithm that determines the cost of the shortest path from the source node to every node in the graph, including the sink node.
As with any dynamic programming approach, Dijkstra's algorithm starts by solving a simple subproblem, then expands this subproblem incrementally, modifying the solution at each step, until the solution of original problem is obtained. In the case of Dijkstra's algorithm, each subproblem consists of a subset of nodes whose shortest path to the source node is known. (The trivial subset for the first subproblem is the source node itself.) At each step, this subset is expanded by a single node, by adding the closest node to this source from outside the set of nodes with known distances. The algorithm terminates when the subset encompasses the sink node. The solution is recovered by backtracking along the arcs from sink node to the source node.
Since the algorithm is simple to describe, we now outline details of Dijkstra's approach and demonstrate its application to the given example. Given a directed graph
G=(
V,
E) where
V is the set of nodes,
E is the set of arcs, and
C[
i,
j] is the cost of the edge going from node
i to node
j, the algorithm maintains a set of node
S for which the shortest distance from the source is known.
- We initialize S to contain only the source node.
- All nodes in the set (V−S) that have a predecessor in S are associated with a ‘special path’ starting at the source node that passes only through nodes in S, whose total length is as short as possible. The special path for every node is stored in an array D, while the predecessor node (in S) on this special path is stored in an array P.
- At each step, we add to S a single node from the set (V−S), whose distance from the source (D-value) is as short as possible.
- The algorithm terminates once all the nodes have been moved to S.
- The shortest path for any given node can be determined by tracing the predecessors in P.
Further details on Dijkstra's algorithm have previously been published (Aho
et al.,
1983).
Details of how the algorithm operates on the example are provided in . Note that we use the parenthesized numeric node labels in the ensuing explanation below.
Initially, the set S={1} and the D-values for the neighboring nodes 2, 3, 4 and 5 are D [2]=20, D [3]=10, D [4]=30, and D [5]=40. Since nodes 6, 7, 8 and 9 cannot be directly reached from node 1, their distances are set to infinity.
At every iteration of the algorithm, w represents the node that is selected to enter the set S. For example, in the first iteration, we select the node w=3 to enter S, since it has the smallest D-value. Proceeding to the second iteration, the distances to the neighbors of node 3 in the set (V−S) are updated. Since node 6 is the only such neighbor of node 3, we get D [6]=min (infinity, D[3]+C [3,6])=min (infinity, 10+20)=30. The other distances do not change because there is no way to reach them as part of a path containing node 3. The P-number for node 6 i.e. P[6]=3 indicates that node 3 is the predecessor to node 6 on the shortest path from node 1 to node 6. The sequence of D-numbers at the end of each iteration is indicated in and the final P-numbers are given the predecessor array table (). By tracing the predecessor array backwards from the sink (node 9), one can clearly see that the nodes on the shortest path tracing backwards from sink (node 9) to source (node 1) are {9,8,6,3,1}. The shortest path is shown in .
| Table 1.The D (distance)-values for every iteration of Dijkstra's algorithm when applied to the example given in |
| Table 2.The P (predecessor)-numbers for the nodes given in |
The optimal string is retrieved from the shortest path by stripping out the node stage numbers and merging the tetramers along the path, eliminating overlaps. In the given example, we thus obtain the string IERLTG.
Our implementation of Dijkstra's algorithm in C++uses the adjacency list representation to store the vertices in V−S. The overall running time for an efficient implementation of Dijkstra's algorithm on a general graph is O(e log v) where v is the number of nodes and e is the number of arcs. For our graph, we calculate the value of v in terms of the number of candidate amino acids pi at each stage i. The number of nodes at stage i equals the total number of possible tetramers starting at position i, which is the product (pi, pi+1, pi+2, pi+3). By defining pn+1=pn+2=pn+3=1, we obtain the total number of nodes v by summing (pi, pi+1, pi+2, pi+3) over i=1,2,…, n. Note that this quantity is bounded by a constant multiple of sequence length n. We obtain a bound on the number of arcs e by noting that the number of outgoing arcs from any node at stage i is exactly pi+3. Hence, the total number of arcs e is at most a factor maxi=1,2,… ,n, pi greater than v, so that e is also bounded by a constant multiple of n. The general complexity analysis thus implies that Dijkstra's algorithm would have a running time of O(n log n) on our graph. In practice, we will usually have only an O(n) running time. This is because, since our graph has a linear structure with edge weights that are not too diverse, the two main operations at each iteration—choosing the minimum D-value from among the nodes in V−S to enter the set S, and updating the D-values of the neighbors of the latest node added to S—can usually be performed in time that is bounded independently of n.