We developed a C++ program to perform gapless local alignment of multiple sequences (GLAM), which can be downloaded from: http://zlab.bu.edu/glam/
. The only required input is a file of DNA sequences in FASTA format, although there are many options for modifying the program’s behavior. The program performs r
(default: 10) alignment ‘runs’ on the sequences, and outputs the best alignment found, the score and E
-value of this alignment, and the number of runs which converged to this alignment. It also prints the marginal score of each sequence’s aligned segment, i.e. the score difference between the alignment including this segment and the alignment excluding this segment. The marginal scores are useful indicators of how well each segment matches the rest of the alignment. If some of the runs converged to lower scoring alignments, details of these alignments are also printed. The program can align sequences in either OOPS (one occurrence per sequence) or ZOOPS (zero or one occurrence per sequence) modes (14
), and it can perform single- or double-stranded alignment. Upper and lower bounds on the alignment width can be set if desired. All results described here were obtained with default options (r
= 10, ZOOPS, no bounds on width) unless otherwise specified. The n
parameter, defining how many iterations to persist for without improving the alignment, was generally made large enough so that at least three runs converged to the best alignment found. For the sake of comparison, MEME alignments were performed with default options, except that the width bounds were set to the maximum allowed extent (2–300 bp). Data sets used in this paper are available at http://zlab.bu.edu/glam/sup/
Tests of alignment width optimization
We tested the ability of our algorithm to determine suitable widths for alignments by analyzing sequences containing known transcription factor-binding sites. We obtained 50 bp genomic sequences surrounding 25 mammalian estrogen response elements (EREs), 19 vertebrate LSF-binding sites, 27 mammalian E2F-binding sites, and 35 bicoid- and 27 Krüppel-binding sites from Drosophila
. Regulation by E2F, bicoid and Krüppel has been extensively analyzed using experimental approaches, and their binding sites have also been studied in previous computational analyses (22
). Known binding sites for LSF and ERE were collected sites from experimental literature (24
) (R.B.O’Lone, M.C.Frith and U.Hansen, submitted). The consensus sequences of all these motifs (Fig. , see below) are based not only on alignments of known binding sites, but also on mutational analyses to determine both the critical positions and nucleotide preferences for binding of the transcription factors, on chemical and enzymatic ‘footprinting’ experiments to determine base pairs in contact with the proteins and, in the case of the estrogen receptor, on the structure of the protein–DNA complex determined by X-ray crystallography (27
). These various methods agree on determination of the critical DNA–protein contact regions, and in general on the extents of the DNA-binding sites.
Figure 3 Alignments of transcription factor-binding sites. (A) Alignment of 25 EREs by GLAM. The remaining panels show pictogram representations (C. Burge and F. White, http://genes.mit.edu/pictogram.html) of GLAM and MEME alignments, compared with independently (more ...)
In every case, the GLAM alignment is very similar to consensus sequences of the motif established by experimental data, differing in width by a few base pairs at most (Fig. ) (22
). For instance, in the case of the ERE, mutagenesis studies have confirmed that the extent of sequence important for binding affinity is indeed the 13–15 bp region indicated (25
). The MEME program aligns three of the motifs (E2F, bicoid and Krüppel) just as successfully as GLAM, but it returns excessively wide alignments for ERE and LSF.
Alignment of Alu elements
It proved surprisingly instructive to apply GLAM to a random set of human DNA sequences, where it aligns the ubiquitous Alu element (Fig. ). The initial alignment (gray shapes above the lines) covered half an Alu sequence, which is the greatest extent possible without exceeding the bounds of two elements that are truncated. We then masked the nucleotides within this alignment (replaced them with ‘n’), and applied GLAM a second time, recovering the other half Alu sequence. Surprisingly, we found no other program that can align these elements as cleanly as GLAM. MEME also identifies the Alu elements, but its alignments do not cover the full width of the element (Fig. ). Dialign aligns the full extent of the sequences rather than picking out the Alu elements (29
). Although BLAST can align the Alu elements in a pair of sequences, it is not straightforward to combine pairwise alignments into a multiple alignment. This example demonstrates that GLAM works for a range of multiple alignment problems. In addition, these alignments were fast (tens of seconds on an 1.1 GHz Pentium III CPU), demonstrating that GLAM is fast not only when the data set is small, but also when the signal to be aligned is strong (as is the case for Alu elements).
Figure 4 Alignment of Alu elements within 20 human DNA sequences. The horizontal lines represent the sequences (2000 bp each) and the black shapes indicate the positions of the Alu elements within them. The gray and hollow shapes above the lines indicate the (more ...)
Tests of the statistical significance calculation
Since GLAM will always return an alignment even for unrelated sequences, it is extremely useful to know whether an alignment score is statistically significant, i.e. greater than would be expected by chance for random sequences. To this end, we implemented a multiple sequence generalization of the BLAST statistical calculation, and tested its accuracy by aligning sets of random DNA sequences (in OOPS mode) and observing the empirical score distribution (Fig. ). For alignments of five sequences (Fig. A), the calculated and empirical score distributions generally agree. When aligning greater numbers of sequences, however, the calculated score distribution becomes increasingly conservative. This means that if an alignment score is calculated to be statistically significant, then it certainly is. However, if it is not calculated to be significant, we do not know whether or not it really is. We performed a similar test for double-stranded alignments of five sequences, obtaining an even better agreement between theory and observation than for the single-stranded case (30
). In conclusion, this method is quite useful because it is accurate for small numbers of sequences, and it provides upper bounds of E
-values for large numbers of sequences. Presumably, a correction for correlated diagonals would make the calculation more accurate. Plots of one minus the cumulative distribution function zoomed into the upper tail are provided as Supplementary Material available at NAR Online.
Figure 5 Tests of the statistical significance calculation. (A) One thousand sets of five random DNA sequences, each 500 bp long, were generated. Each set was aligned using GLAM, and the alignment scores are plotted as a histogram. The solid curve indicates the (more ...)
The above analysis has a potential weakness: the BLAST statistical theory applies to optimal alignments, but we have tested it using a heuristic algorithm which is not guaranteed to find them. We present two pieces of evidence arguing that GLAM has, in fact, found the optimal alignment most of the time. If GLAM is robustly finding optimal alignments, we would predict that if we repeat an alignment several times, we will obtain the same result on each (or most) of these attempts (in spite of the algorithm being stochastic). We tested this prediction by performing 10 runs of each of the 1000 alignments and recording how many of the runs converged to the same best alignment in each case (Fig. ). In most cases, all 10 runs gave the same result. This agreement of runs is slightly less good for the alignments of 20 sequences, which may explain the score distribution’s appearance of having eroded from the Gumbel shape.
We also considered the possibility that the global optimum alignment has a very narrow ‘basin of attraction’, somewhat like the hole on a golf course. The following alignment problem was constructed to mimic this scenario: the sequence TATTAATTTAAA was inserted once into each of 20 random sequences of 500 bp consisting only of G and C. In this problem, there are no sites outside the embedded motifs, which make it favorable for the algorithm to select the motifs. Nevertheless, GLAM was able to identify the embedded sequences rapidly (tens of seconds on an 1.1 GHz Pentium III CPU). In general, GLAM seems to have most difficulty when the global optimum lies within a very broad and shallow basin where many alignments have almost the same score. This is typically the case when there is no significant motif to be found.
Failure testing GLAM
We applied GLAM to a range of increasingly difficult problem scenarios, to learn the limits of its applicability and to study the reasons for eventual failure. In particular, we were interested to learn whether failure is caused by the search scheme failing to find the highest scoring alignment, or by the biologically meaningful alignment failing to have the highest score. Various numbers of ERE sites were selected, and embedded in randomly generated DNA sequences of various lengths. The sites were embedded in synthetic rather than real sequences because real sequences are likely to contain many unknown biological signals, making it hard to tell whether an alignment is successful or not. Some tests were made harder, and probably more realistic, by including ‘decoy’ sequences (randomly generated sequences lacking EREs). To measure GLAM’s accuracy in identifying the embedded sites, we use the correlation coefficient (equation 7).
TP (true positives) is the number of nucleotides contained in the sites and also included in the alignment returned by GLAM; FP (false positives) is the number of nucleotides not in a site but in the alignment; FN (false negatives) is the number of nucleotides in a site but not in the alignment; TN (true negatives) is the number of nucleotides neither in a site nor in the alignment. The correlation coefficient varies between +1 and –1, with +1 indicating perfect performance, and 0 indicating a performance no better than random.
GLAM’s ability to find the sites decreases as the sequences become longer, and improves when there are more EREs (Fig. ). Interestingly, the accuracy deteriorates in a gradual rather than a catastrophic manner as the sequence length increases, especially for larger numbers of EREs. We studied the alignments in detail to understand this behavior. Many of the EREs are so weak that as the sequence length increases, stronger motifs appear by chance in some of the sequences and get selected by GLAM. The resulting alignments still resemble an ERE, cover the full extent of the motif, and often have the correct width, though they sometimes are wider by several base pairs. It is worth emphasizing that alignments with correlation coefficients as low as ~0.3 still very much resemble EREs. For example, the alignment of 20 EREs in 2000 bp sequences with five decoys has a correlation coefficient of only 0.341, but eight of the EREs are perfectly recovered by this alignment. Moreover, the two highest marginal scores in this alignment belong to embedded EREs.
Figure 6 Tests of GLAM’s ability to find EREs embedded in random DNA sequences of varying lengths. In all panels, red lines indicate the results when five EREs were present; green, 10 EREs; dark blue, 15 EREs; purple, 20 EREs; light blue, 25 EREs. Open (more ...)
Decoy sequences almost always get included in the alignments, the exceptions occurring when the sequences are very short. We do not think this indicates a fundamental flaw in the method, since our alignment of Alu repeats excluded all sequences that lack this element (Fig. ). The ERE motif, unlike the Alu, is simply so weak that sequences that resemble it more closely than they resemble random DNA are likely to occur by chance. The segments of decoy sequences that appear in the alignments tend to have low marginal scores, which might provide a criterion for filtering them out.
While the correlation coefficient decreases, the alignment score relentlessly increases as the sequences become longer (Fig. ). This observation strongly suggests that the accuracy is not suffering because the search algorithm fails to find the highest scoring alignment, but because the embedded sites increasingly do not constitute the optimal alignment. Finally, we observe that alignments which fall short of the global optimum can still be useful. We performed 10 runs of GLAM on the set of 25 EREs in 5000 bp sequences with five decoys, obtaining eight different alignments, six of which have correlation coefficients >0.4.
We investigated whether GLAM’s calculated E-value can distinguish biologically meaningful from random alignments, by plotting the E-values of all the alignments shown in Figure against their correlation coefficients (Fig. ). All of the alignments with E-values <0.01 have correlation coefficients >0.5, indicating that they recover the embedded EREs to a considerable extent. On the other hand, there is a ‘twilight zone’ of alignments that have good correlation coefficients but insignificant E-values. These cases can be partly explained by the conservative nature of the E-value calculation, and presumably some of these alignments really are on the verge of becoming statistically insignificant. The failure tests were repeated using E2F instead of ERE motifs, which confirmed the observations made above (data not shown).
Figure 7 Relationship between statistical significance and accuracy of GLAM alignments. The E-value versus correlation coefficient is plotted for each alignment in Figure . The open boxes indicate cases where fewer than three out of 10 GLAM runs (more ...)
Assessment of various alignment strategies
We explored whether modifications to the search algorithm might optimize alignments more efficiently. As explained in Methods, the basic algorithm is a stochastic search of the alignment space, where new alignments are selected with probability proportional to exp(S
), where S
is the score of the new alignment. A closely related approach is simulated annealing, where a ‘temperature’ parameter t
is introduced, and new alignments are selected with probability proportional to exp(S
). Low temperatures exaggerate score differences, so that the probability of selecting higher scoring alignments increases at the expense of lower scoring ones. High temperatures have the opposite effect. We experimented with three annealing schedules: constant temperature; geometric cooling, where the temperature is multiplied by a parameter c
< 1 at each iteration; and a more complex technique known as the ‘modified Lam schedule’ (31
). The modified Lam schedule aims for a target accept rate, i.e. rate of changing the alignment versus leaving it unchanged per iteration. During the first 15% of an alignment run, the target accept rate decays geometrically from 100 to 44%, then it remains fixed at 44%, and finally it decays geometrically to 0% over the last 35% of the run. At each iteration, the temperature is either multiplied or divided by c
to force the real accept rate towards the target. For this schedule, n
specifies the total number of iterations, rather than how long to persist without improving the alignment.
We applied these strategies to a difficult case: aligning 40 DNA sequences of length 1000 bp each (one of the double decoy sets described above). Interestingly, the default strategy (t = 1) does not achieve the highest alignment scores (Fig. A). Among the constant temperature strategies, lower temperatures perform better when the run time is short, and higher temperatures perform better when given more time, up to t = 0.9 at 10 000 s on an 1.1 GHz Pentium III CPU. It is conceivable that t = 1 would become the best strategy given even more time. Constant temperatures greater than 1 perform extremely poorly (data not shown). The geometric schedules generally perform worse than the constant temperature strategies (Fig. B). The modified Lam schedule does not obviously perform better than constant t, which is disappointing given its extra complexity (Fig. B). Strong values of the forcing parameter c work better for short run times, and weaker values work better for long time scales.
Figure 8 Alignment score achieved versus run time (on 1.1 GHz Pentium III CPUs) for various alignment strategies. (A) Fixed temperature strategy. GLAM’s t parameter was fixed at certain values, and the running time was varied by adjusting the n parameter. (more ...)
Another strategy is to perform very many very fast searches. We applied this strategy using low temperatures (t
= 0.00001 and t
= 0.5) so that each search quickly reaches a local optimum, with r
= 100, 1000, 10
000 or 100
000 restarts (Fig. C). When t
= 0.00001, it is better to use high r
and low n
(i.e. more and quicker searches), whereas when t
= 0.5, it is better to use low r
and high n
. Overall, the rapid restart approach is not the best, suggesting that the problem has an extremely large number of local maxima.
In conclusion, the use of a constant temperature slightly less than 1 and the modified Lam schedule are the most effective search methods. We also tested the strategies on two further sequence sets, 20 sequences of 5000 bp each and 100 sequences of 1000 bp each, which confirmed all of the observations made above (data not shown).