We assessed the performance of MUSCLE on four sets of reference alignments: BAliBASE (19), SABmark (21), SMART (22) and a new benchmark, PREFAB. We compared MUSCLE with four other methods: CLUSTALW (25), probably the most widely used program at the time of writing; T-Coffee, which has the best BAliBASE score reported to date; and two MAFFT scripts: FFTNS1, the fastest previously published method known to the author (in which diagonal finding by fast Fourier transform is enabled and a progressive alignment is constructed), and NWNSI, the slowest but most accurate of the MAFFT methods (in which the fast Fourier transform is disabled and refinement is enabled). Tested versions were MUSCLE 3.2, CLUSTALW 1.82, T-Coffee 1.37 and MAFFT 3.82. We also evaluated MUSCLE-p, in which the refinement stage is omitted. We also tried Align-m 1.0 (21), but found in many cases that the program either aborted or was impractically slow on the larger alignments found in SMART and PREFAB.
BAliBASE. We used version 2 of the BAliBASE benchmark, reference sets Ref 1–Ref 5. Other reference sets contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed.
SABmark. We used version 1.63 of the SABmark reference alignments, which consists of two subsets: Superfamily and Twilight. All sequences have known structure. The Twilight set contains 1994 domains from the Astral database (26) with pairwise sequence similarity e-values ≤1, divided into 236 folds according to the SCOP classification (27). The Superfamily set contains sequences of pairwise identity ≤50%, divided into 462 SCOP superfamilies. Each pair of structures was aligned with two structural aligners: SOFI (28) and CE (29), producing a sequence alignment from the consensus in which only high-confidence regions are retained. Input sets range from 3 to 25 sequences, with an average of eight sequences and an average sequence length of 179.
SMART. SMART contains multiple alignments refined by experts, focusing primarily on signaling domains. While structures were considered where known, sequence methods were also used to aid construction of the database, so SMART is not suitable as a definitive benchmark. However, conventional wisdom [e.g. Fischer et al.] holds that machine-assisted experts can produce alignments superior to those of automated methods, so performance on this set is of interest for comparison. We used a version of SMART downloaded in July 2000, before the first version of MUSCLE was made available, eliminating the possibility that MUSCLE was used to aid construction. We discarded alignments of more than 100 sequences in order to make the test tractable for T-Coffee, leaving 267 alignments averaging 31 sequences of length 175.
PREFAB. The methods used to create databases such as BAliBASE and SMART are time-consuming and demand significant expertise, making a fully automated protocol desirable. Perhaps the most obvious approach is to generate sequence alignments from automated alignments of multiple structures, but this is fraught with difficulties; see, for example, Eidhammer et al. With this in mind, we constructed a new test set, PREFAB (protein reference alignment benchmark), which exploits methodology (21), test data (13) and statistical methods (19) that have previously been applied to alignment accuracy assessment. The protocol is as follows. Two proteins are aligned by a structural method that does not incorporate sequence similarity. Each sequence is used to query a database, from which high-scoring hits are collected. The queries and their hits are combined and aligned by a multiple sequence method. Accuracy is assessed on the original pair alone, by comparison with their structural alignment. Three test sets selected from the FSSP database (36) were used, as described in Sadreyev and Grishin (34) (data kindly provided by Ruslan Sadreyev) and in Edgar and Sjolander (13); we call these SG, PP1 and PP2, respectively. These three sets vary mainly in their selection criteria. PP1 and PP2 contain pairs with sequence identity ≤30%. PP1 was designed to select pairs that have high structural similarity, requiring a z-score of ≥15 and a root mean square deviation (r.m.s.d.) of ≤2.5 Å. PP2 selected more diverged pairs with a z-score of ≥8 and ≤12 and an r.m.s.d. of ≤3.5 Å. SG contains pairs sampled from three ranges of sequence identity: 0–15, 15–30 and 30–97%, with no z-score or r.m.s.d. limits. We re-aligned each pair of structures using the CE aligner (29) and retained only those pairs for which FSSP and CE agreed on 50 or more positions. This was designed to minimize questionable and ambiguous structural alignments, as is done in SABmark and MaxBench (33). We used the full-chain sequence of each structure to make a PSI-BLAST (37) search of the NCBI non-redundant protein sequence database (39), keeping locally aligned regions of hits with e-values below 0.01. Hits were filtered to a maximum pairwise identity of 80% (including the query), and 24 were selected at random. Finally, each pair of structures and their remaining hits were combined to make sets of ≤50 sequences. The limit of 50 was chosen arbitrarily to make the test tractable on a desktop computer for some of the more resource-intensive methods, in particular T-Coffee (which needed 10 CPU days, as noted in Table ). The final set, PREFAB version 3.0, contains 1932 alignments averaging 49 sequences of length 240, of which on average 178 positions in the structure pair are found in the consensus of FSSP and CE.
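The set-assembly step (identity filtering followed by random subsampling of hits) can be sketched as follows. This is an illustrative greedy pass, not the authors' code: the paper does not specify the filtering algorithm or order, and the `identity` function is assumed to be supplied by the caller (e.g. computed from the PSI-BLAST local alignments).

```python
import random

def assemble_prefab_set(query, hits, identity, max_id=0.80, sample=24, seed=0):
    """Sketch of PREFAB hit selection: greedily keep hits whose pairwise
    identity to the query and to all previously kept hits is <= max_id
    ("including the query"), then choose up to `sample` at random.
    `identity(a, b)` is assumed to return fractional identity in [0, 1]."""
    kept = [query]                      # the query itself counts toward the filter
    for h in hits:
        if all(identity(h, s) <= max_id for s in kept):
            kept.append(h)
    pool = kept[1:]                     # candidates, excluding the query itself
    rng = random.Random(seed)
    return rng.sample(pool, min(sample, len(pool)))
```

Combining the two structure sequences with the two sampled hit lists (2 + 24 + 24) yields the sets of ≤50 sequences described above.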
We used three accuracy measures: Q, TC and APDB. Q (quality) is the number of correctly aligned residue pairs divided by the number of residue pairs in the reference alignment. This has previously been termed the developer score (32) and SPS (40). TC (total column score) is the number of correctly aligned columns divided by the number of columns in the reference alignment; this is Thompson et al.'s CS, and is equivalent to Q in the case of two sequences (as in PREFAB). APDB (41) is derived from structures alone; no reference alignment of the sequences or structures is needed. For BAliBASE, we use Q and TC, measured only on core blocks as annotated in the database. For PREFAB, we use Q, including only those positions on which CE and FSSP agree, and also APDB. For SMART, we use Q and TC computed over all columns. For SABmark, we average the Q score over each pair of sequences. The TC score is not applicable to SABmark, as the reference alignments are pairwise.
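The Q and TC definitions above translate directly into code. The sketch below is an illustrative implementation assuming '-' as the gap character; it scores all reference columns and pairs, whereas the benchmark protocols above first mask to core blocks (BAliBASE) or CE/FSSP consensus positions (PREFAB).

```python
def aligned_pairs(seqs):
    """All residue pairs implied by an alignment: for each column, every
    tuple (s1, r1, s2, r2) of sequence and 0-based residue indices whose
    residues share that column. Gaps are '-'."""
    pos = [0] * len(seqs)
    pairs = set()
    for c in range(len(seqs[0])):
        col = [(s, pos[s]) for s in range(len(seqs)) if seqs[s][c] != '-']
        for a in range(len(col)):
            for b in range(a + 1, len(col)):
                pairs.add(col[a] + col[b])
        for s, _ in col:
            pos[s] += 1
    return pairs

def columns(seqs):
    """Each column as a tuple of per-sequence residue indices (None = gap)."""
    pos = [0] * len(seqs)
    cols = []
    for c in range(len(seqs[0])):
        col = []
        for s in range(len(seqs)):
            if seqs[s][c] == '-':
                col.append(None)
            else:
                col.append(pos[s])
                pos[s] += 1
        cols.append(tuple(col))
    return cols

def q_score(ref, test):
    """Q: correctly aligned residue pairs / residue pairs in the reference."""
    rp = aligned_pairs(ref)
    return len(rp & aligned_pairs(test)) / len(rp)

def tc_score(ref, test):
    """TC: reference columns reproduced exactly in the test alignment."""
    tcols = set(columns(test))
    rcols = columns(ref)
    return sum(1 for col in rcols if col in tcols) / len(rcols)
```

For example, with reference ["ACGT", "AC-T"] and test ["ACGT", "A-CT"], two of the three reference residue pairs are recovered (Q = 2/3) and two of the four reference columns match exactly (TC = 0.5).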
Following Thompson et al., statistical significance is measured by a Friedman rank test (42), which is more conservative than the Wilcoxon test that has also been used for alignment accuracy discrimination (5), as fewer assumptions are made about the population distribution. In particular, the Wilcoxon test assumes that the differences between two methods are symmetrically distributed, but in practice we sometimes observe significant skew. PREFAB and SABmark use automated structure alignment methods, which sometimes produce questionable results. Many low-quality regions are eliminated by taking the consensus between two independent aligners, but some may remain. In PREFAB, assessment of a multiple alignment is made on a single pair of sequences, which may be more or less accurately aligned than the average over all pairs. In SABmark, the upper bound on Q is less than 1, to a varying degree, because the pairwise reference alignments may not be mutually consistent. These effects can be viewed as introducing noise into the experiment, and a single accuracy measurement may be subject to error. However, as the structural aligners do not use primary sequence, these errors are unbiased with respect to sequence methods. A difference in accuracy between two sequence alignment methods can therefore be established by the Friedman test, and the measured difference in average accuracy will be approximately correct when measured over a sufficient number of samples.
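The Friedman statistic itself is straightforward to compute from per-alignment scores. The sketch below is a generic implementation of the standard test (ranking methods within each alignment, with average ranks for ties), not the authors' analysis code; the resulting statistic is compared against a chi-square distribution with k − 1 degrees of freedom.

```python
def friedman_statistic(scores):
    """Friedman rank test statistic for k methods scored on N alignments.
    `scores` is a list of N rows, each holding k per-method accuracy
    values. Ties within a row receive average ranks."""
    N, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # ascending by score
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                  # extend the run of tied scores
            avg = (i + j) / 2 + 1       # average 1-based rank for the tie run
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # Chi-square-distributed statistic with k - 1 degrees of freedom
    return (12.0 / (N * k * (k + 1))) * sum(R * R for R in rank_sums) \
        - 3.0 * N * (k + 1)
```

For instance, three methods scored on four alignments, with one method consistently worst, already yield a statistic of 6.5 (P ≈ 0.04 at 2 d.f.), illustrating why per-alignment ranks rather than raw score differences drive the test.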