Traditionally, near-optimal alignments are presented together in a single display with a path graph or dot plot (Zuker, 1991; Naor and Brutlag, 1994). This representation effectively highlights sections of high variability between alignments, but it loses information by combining all alignments into one display. Figure 1 provides an example of a partial path graph of a subset of near-optimal alignments of two serine proteases, 1TGSZ (NCBI GI: 230350) and 1UVTH (NCBI GI: 2781297). The graph shows variability at the beginning and end of the alignment by drawing the multiple paths that the alignment could follow. In contrast, in the middle of the alignment there is only one path and hence only one way for the amino acids to align (this is expected, as the ‘H’ in the active site is aligned). The primary difficulty with a path graph display is determining which of the alignment paths are more likely. Additionally, path graphs are difficult to use for residue-level analysis, because the actual sequences are placed perpendicular to, and hence distant from, one another. Readers of path graphs may have difficulty mentally mapping horizontal, vertical and diagonal edges into insertions and deletions (e.g. with which amino acid in 1UVTH does the 25th amino acid in 1TGSZ align?). Options for annotating a path graph with additional information are also limited.
Fig. 1 A partial path graph of multiple near-optimal alignments of 1TGSZ (vertical) and 1UVTH (horizontal) generated with the SUBOPT program (Naor and Brutlag, 1994) using the PAM250 matrix and a gap penalty of −3 per residue. While variable regions (more ...)
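The mental mapping that a path graph demands of the reader can be made explicit with a short sketch. It is an illustration only, not part of SUBOPT or of the viewer: the function names, the 'D'/'V'/'H' move encoding and the convention that a vertical edge consumes a residue of the vertical sequence alone are assumptions chosen for the example.

```python
def path_to_pairs(moves):
    """Translate one path through a path graph into aligned positions.

    moves: a string of 'D' (diagonal edge), 'V' (vertical edge) and
    'H' (horizontal edge). This encoding is hypothetical.
    Returns a list of (i, j) tuples of 1-based residue numbers, where i
    indexes the vertical sequence and j the horizontal sequence; None
    marks a gap.
    """
    i = j = 0  # residues consumed so far in each sequence
    pairs = []
    for move in moves:
        if move == "D":      # both residues are aligned with each other
            i += 1
            j += 1
            pairs.append((i, j))
        elif move == "V":    # vertical residue faces a gap
            i += 1
            pairs.append((i, None))
        elif move == "H":    # horizontal residue faces a gap
            j += 1
            pairs.append((None, j))
        else:
            raise ValueError(f"unknown move: {move!r}")
    return pairs


def partner_of(pairs, i):
    """Return the horizontal residue aligned with vertical residue i,
    or None if residue i is aligned with a gap."""
    for a, b in pairs:
        if a == i:
            return b
    raise KeyError(i)


example = path_to_pairs("DDHVD")
print(partner_of(example, 4))
```

Answering the question "with which residue does residue 25 align?" requires this bookkeeping for every path, which is what a reader of a path graph must do by eye.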
Traditional text-based alignments, with one sequence placed above the other [as seen in BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988) output], are ideal for displaying the precise residue-to-residue mapping between the two sequences. However, it is difficult to show alternative text-based alignments and highlight the differences between the alignments. One strategy puts alternate alignments above and below the optimal alignment in some regions. This becomes more difficult when the number of gaps differs among the alignments, as this changes the overall alignment length. As the number of alignments increases, the difficulty increases.
Near-optimal alignment viewer
To address the problems with displaying alternative alignments in a text-based format, we have developed a program to display the alignments sequentially, each for a short period of time, like a movie. Our movie presentation exploits the pattern recognition capabilities of the human eye so that the user will be able to discern changes or areas of stability that occur between alignments as they are displayed.
Although it is relatively straightforward to display a large set of alignments as successive frames in a movie, the naïve approach does not make it easy to distinguish constant from variable alignment regions. We seek to highlight the residues that consistently align with one another and distinguish them from those positions that are more variable. To do this, the sections of an alignment that are most consistent should remain steady on the screen while more variable regions should move around. This is not possible with the conventional constant-spaced character placement used by BLAST and FASTA.
To address this difficulty, we developed an algorithm for placing pairs of residues (the two residues that align—an edge in a path graph) according to the frequency with which the pairs occur in the set of all alignments and the relative position of that pair within an alignment. The result is a display where residues that consistently align with one another remain stationary in the display, while those that align with many different residues appear to move about.
The first step is to determine the relative position of each aligned pair of residues in relation to the overall length of the alignment. To calculate this, we divide the index (position in the alignment) of the pair by the length of the alignment. Because a pair can occur in many different alignments within a set of solutions, this relative position is then combined with all the other relative positions for the given pair to produce an average position for the pair. The frequency at which each pair occurs with respect to the number of alternative alignments is also calculated.
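The bookkeeping described in this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's actual code: each alignment is taken to be a list of hashable pairs in column order, indices are counted from zero, the average is taken over the alignments in which the pair occurs, and the name `pair_statistics` is hypothetical.

```python
from collections import defaultdict


def pair_statistics(alignments):
    """Compute the average relative position and frequency of each pair.

    alignments: list of alignments, each a list of pairs in column order.
    A pair can be any hashable value, e.g. (i, j) residue numbers with
    None standing for a gap (an assumed representation).
    Returns (average_position, frequency), two dicts keyed by pair.
    """
    position_sum = defaultdict(float)  # sum of relative positions per pair
    occurrences = defaultdict(int)     # number of alignments containing the pair

    for alignment in alignments:
        length = len(alignment)
        for index, pair in enumerate(alignment):
            # Relative position: index of the pair divided by alignment length.
            position_sum[pair] += index / length
            occurrences[pair] += 1

    total = len(alignments)
    average_position = {p: position_sum[p] / occurrences[p] for p in occurrences}
    frequency = {p: occurrences[p] / total for p in occurrences}
    return average_position, frequency


# Two toy alignments of the same three-residue sequences.
optimal = [(1, 1), (2, 2), (3, 3)]
alternative = [(1, 1), (2, None), (None, 2), (3, 3)]
average_position, frequency = pair_statistics([optimal, alternative])
```

In this toy set the pairs (1, 1) and (3, 3) occur in both alignments and so have a frequency of 1, while (2, 2) occurs in only one of the two.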
When we are rendering the text on the screen, we use the cumulative position of given pairs to determine the placement of the particular pair. To do this we access a lookup table to get the position of a pair, which is expressed as a percentage of the length of the alignment. This percentage is multiplied by the width of the display to get the exact position on the screen where the pair will be rendered. The width of the display is determined by the longest alignment.
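The rendering step amounts to a table lookup followed by a multiplication. The sketch below assumes a precomputed lookup table of fractional positions, a hypothetical fixed character width in pixels and illustrative function names; none of these details are taken from the program itself.

```python
CHAR_WIDTH = 10  # hypothetical width of one character cell, in pixels


def display_width(alignments, char_width=CHAR_WIDTH):
    """Width of the display, determined by the longest alignment."""
    return max(len(alignment) for alignment in alignments) * char_width


def layout_frame(alignment, position_table, width):
    """Return the horizontal screen coordinate of every pair in one frame.

    position_table maps each pair to its position expressed as a fraction
    of the alignment length; multiplying by the display width gives the
    location at which the pair is rendered.
    """
    return {pair: position_table[pair] * width for pair in alignment}


# Toy data: two alternative alignments and an assumed lookup table.
optimal = [(1, 1), (2, 2), (3, 3)]
alternative = [(1, 1), (2, None), (None, 2), (3, 3)]
position_table = {
    (1, 1): 0.0,
    (2, 2): 0.25,
    (2, None): 0.25,
    (None, 2): 0.5,
    (3, 3): 0.75,
}

width = display_width([optimal, alternative])
frame_a = layout_frame(optimal, position_table, width)
frame_b = layout_frame(alternative, position_table, width)
```

Because the coordinate depends only on the pair and not on the frame, a pair such as (3, 3) is drawn at the same location in `frame_a` and `frame_b`.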
Recall that this cumulative position is an average value for this pair of residues across all alignments in the near-optimal set. Each pair has only one cumulative position and is therefore rendered in the same location on the screen every time. The visual effect of stability is an emergent property of the data. Pairs that appear in a large number of alignments appear stationary on the screen, since they are always rendered in the same location. Residues that are part of pairs that appear infrequently move around, since the different pairs have different cumulative positions.
Alone, the Steady Display algorithm provides powerful visual evidence of reliably aligned regions of a set of alignments. However, we can also use the recorded pair-frequency information to color the background of each pair according to the frequency of the pair (Figs 2 and 3). The most frequent pairs are colored a saturated orange, with the color gradually decreasing in saturation in proportion to the frequency of the pairings. Thus, in Figure 2, the regions around each of the three residues in the serine protease catalytic triad of 1TGSZ and 1UVTH are the most saturated. Figure 3A shows an alternative alignment, where residues 48 and 49 (YK) in 1TGSZ are moved to the right, and a second gap is opened after position 54. The least frequent pairings have a white background. In the specific alignment shown, the lightest colors correspond to regions that are not present in 1TGSZ. This coloring provides further visual indication of the consistency of certain sections of alignments.
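One simple way to realize such a coloring is to hold hue and brightness fixed and let the saturation follow the pair frequency. The sketch below does this with Python's standard `colorsys` module; the hue value chosen for orange, the linear mapping and the function name are assumptions for illustration, not the applet's actual color scheme.

```python
import colorsys

ORANGE_HUE = 30 / 360  # assumed hue for "orange" on the HSV color wheel


def background_color(frequency):
    """Map a pair frequency in [0, 1] to an (r, g, b) background color.

    Saturation is set equal to the frequency, so a frequency of 0 gives
    white and a frequency of 1 gives fully saturated orange.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    r, g, b = colorsys.hsv_to_rgb(ORANGE_HUE, frequency, 1.0)
    # Convert from floats in [0, 1] to 8-bit channel values.
    return tuple(round(channel * 255) for channel in (r, g, b))


rare = background_color(0.0)      # white background
frequent = background_color(1.0)  # saturated orange background
```

With this mapping the blue channel falls steadily as the frequency rises, so intermediate frequencies produce progressively paler shades of the same orange.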
Fig. 2 A screen shot of an optimal alignment of 1TGSZ and 1UVTH. Eighty-one alignments within 2% of optimal were calculated. The Steady Display Conserved, numbering and active site options are selected. The three active site residues of the serine protease catalytic (more ...)
Fig. 3 (A) A subsection of an alternative alignment of 1TGSZ and 1UVTH. This section shows a light region following the active site ‘H’ in the two proteins. While there is variation near the active site, the rest of the alignment is substantially (more ...)
The display applet can also highlight matching (red) and similar (pink) residues. If the user specifies a protein by the GID/accession number, then we also have access to the annotation information available in the NCBI database. If any alpha helices or beta strands are specified, we provide an option for those features to be displayed as well. The icons for both the alpha helices and beta strands are designed to combine into a more meaningful icon when both residues in a given pair share the same secondary structure. For alpha helices, an arch beneath the upper sequence combines with an X above the lower sequence to create a small loop that symbolizes a helix. Similarly, the icons for beta strands combine to produce an arrow. In addition to these automated features, users can specify their own features to be highlighted on the alignments. All display features may be turned on and off by the user to suit individual preferences.