We tested the performance of PROMALS3D and other multiple alignment programs on two alignment benchmark databases, SABmark (27) and PREFAB (23), using reference-dependent and reference-independent evaluation methods (see Materials and methods section). PROMALS3D, 3DCoffee and Expresso use both sequence and 3D structure information. PROMALS3D and MUSTANG can align multiple structures using only 3D structural information. The other programs (PROMALS, SPEM, MUMMALS, ProbCons, MAFFT, MUSCLE, T-Coffee and ClustalW) do not use 3D structural information.
Tests on SABmark database
The SABmark database (version 1.65) has two benchmark sets for testing multiple alignment programs: the ‘twilight zone’ set contains 209 groups of SCOP (version 1.65) fold-level domains with very low similarity, and the ‘superfamilies’ set contains 425 groups of SCOP superfamily-level domains with low to intermediate similarity. For each group, the SABmark database provides a set of pairwise reference alignments for evaluation of alignment quality, instead of a single reference multiple sequence alignment. Each pairwise reference alignment was derived from the consensus of two structural comparison programs, SOFI (28) and CE (29).
Since PROMALS3D uses an ASTRAL domain structural database that is based on a later version of SCOP (1.69) than the one used in SABmark, exact matches or close homologs with structures (homolog3Ds) can be identified for most of the SABmark sequences. Combining structural constraints derived from DaliLite alignments with the profile–profile alignment constraints of the original PROMALS, PROMALS3D achieves average Q-scores of 0.603 and 0.805 for the ‘twilight zone’ set and the ‘superfamilies’ set, respectively [, PROMALS3D (D + S)]. The Q-score improvements over the original PROMALS program without 3D structural information are 0.21 and 0.14, respectively. Such prominent increases in alignment quality are also evident from the reference-independent evaluation using GDT-TS score () and other structure-based scores (Table S1 in Supplementary Data). Using DaliLite structural alignments in PROMALS3D gives slightly but significantly better results than using FAST alignments or TM-align alignments (measured by Q-score or GDT-TS), suggesting that DaliLite produces more accurate structural alignments on average. Combining structural alignments made by DaliLite, FAST and TM-align did not yield much improvement over using only DaliLite structural alignments.
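The reference-dependent Q-score used above can be illustrated with a minimal sketch. It assumes the common definition (the number of residue pairs aligned identically in the test and reference alignments, divided by the number of aligned residue pairs in the reference) and applies it to one pairwise reference alignment. The function names `aligned_pairs` and `q_score` are hypothetical and are not part of PROMALS3D or SABmark.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        has_a = char_a != gap
        has_b = char_b != gap
        if has_a and has_b:
            pairs.add((i, j))
        if has_a:
            i += 1
        if has_b:
            j += 1
    return pairs


def q_score(test_a, test_b, ref_a, ref_b):
    """Fraction of reference residue pairs that are reproduced in the test alignment."""
    ref = aligned_pairs(ref_a, ref_b)
    if not ref:
        return 0.0
    test = aligned_pairs(test_a, test_b)
    return len(ref & test) / len(ref)


# Toy example: the test alignment misplaces one of four reference pairs.
score = q_score("AC-DE", "ACD-E", "ACDE", "ACDE")
```

For a benchmark group with several pairwise references, the scores of the individual pairs would then be averaged. How the averaging is weighted is not specified in this section, so it is left out of the sketch.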
Another program that can incorporate 3D structural information is 3DCoffee (7), which we implemented by feeding structural alignment constraints into the T-Coffee program. The default 3DCoffee program using SAP structural alignments (12) yields significantly worse results than PROMALS3D using DaliLite structural alignments (). 3DCoffee with DaliLite structural alignment constraints also gives better results than 3DCoffee with SAP alignments. These results validate the high quality of DaliLite alignments. Automatic incorporation of SAP structural alignments by 3DCoffee is also available in the Expresso web server. We manually submitted the 209 SABmark twilight-zone set alignments to the Expresso web server and obtained worse results than running 3DCoffee on our local computers ().
PROMALS3D and 3DCoffee capture 3D structural information in a similar way through a consistency measure. Results on SABmark benchmarks suggest they perform similarly when every sequence has a close homolog3D and the same structural constraints are given. In real-life alignment cases, however, we might not find close homolog3Ds for every sequence, and sequence constraints play a more important role in aligning sequences without 3D structural information. To test this effect, we forced PROMALS3D and 3DCoffee to use 3D structural information for only half of the sequences in each alignment in the SABmark database [, PROMALS3D (D/2 + S), 3DCoffee (D/2 + S) and 3DCoffee (SAP/2 + S)]. In these tests, PROMALS3D performs significantly better than 3DCoffee. The average Q-score differences between PROMALS3D (D/2 + S) and 3DCoffee (D/2 + S) are about 0.21 and 0.14 on the ‘twilight zone’ set and the ‘superfamilies’ set, respectively. These results reflect the superiority of PROMALS3D sequence constraints, which are based on profile–profile comparisons with predicted secondary structures. The sequence constraints of 3DCoffee, in contrast, are based on pairwise sequence alignments in the T-Coffee program.
PROMALS3D can also be used to construct alignments for multiple structures by using only DaliLite pairwise structural alignments as constraints [, PROMALS3D (D), last line]. PROMALS3D using only structural constraints yields performance slightly worse than combining constraints of structural alignments and profile–profile alignments with predicted secondary structures. We also tested the performance of MUSTANG, a multiple structural alignment program that is based on the consistency of pairwise structural alignments and does not use sequence information. PROMALS3D is significantly better than MUSTANG according to reference-dependent and reference-independent evaluations. These results suggest that PROMALS3D offers a good solution to the multiple structural alignment problem by combining DaliLite structural constraints and sequence constraints of profile–profile comparisons.
As a positive control, we also used pairwise SABmark reference alignments as the sole structural constraints to assemble multiple alignments by the PROMALS3D consistency strategy. SABmark pairwise reference alignments are noted as not entirely consistent with each other (27). Therefore, no method that assembles multiple sequence alignments can achieve a perfect average Q-score of 1.0 when tested on these pairwise reference alignments. The average accuracies for the multiple alignments assembled from pairwise reference alignments are about 0.71 and 0.87 for the ‘twilight zone’ set and the ‘superfamilies’ set, respectively, suggesting that inconsistency among pairwise reference alignments is more prominent in the more distant ‘twilight zone’ set. The lack of consistency (transitivity) between structural alignments is an intrinsic and unavoidable feature of structure superposition strategies. Superpositions are based on structural closeness, either in Cartesian space or in contact space, and between three structures with residues A, B and C, closeness of (A, B) and (B, C) does not imply that (A, C) are necessarily very close. Sequence alignments, in contrast, are frequently viewed as evolutionary alignments where transitivity applies. Since we consider each aligned site to correspond to a single ancestral site, alignment of (A, B) and (B, C) implies that A and C have the same ancestral site and should be aligned together. Evolutionary alignments are always hypothetical. Structural alignments are purely geometric but are essential benchmarks for accurate structure modeling. This difference is addressed by PROMALS3D, which finds the best compromise, consistent with all available sequences and pairwise structural alignments.
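The transitivity argument can be made concrete with a small sketch that detects inconsistent residue triples among three pairwise alignments. It assumes each pairwise alignment is represented as a dictionary from residue indices of one structure to residue indices of the other. This representation and the function name `inconsistent_triples` are illustrative assumptions, not part of PROMALS3D or SABmark.

```python
def inconsistent_triples(a_to_b, b_to_c, a_to_c):
    """List residues of A whose direct partner in C differs from the partner reached via B.

    Each argument maps residue indices of the first structure to residue
    indices of the second. Residues that are unaligned in any of the three
    alignments are skipped, since they cannot violate transitivity.
    """
    conflicts = []
    for res_a, res_b in sorted(a_to_b.items()):
        c_via_b = b_to_c.get(res_b)
        c_direct = a_to_c.get(res_a)
        if c_via_b is None or c_direct is None:
            continue
        if c_via_b != c_direct:
            conflicts.append((res_a, res_b, c_via_b, c_direct))
    return conflicts


# Toy example: residue 2 of A reaches residue 3 of C through B,
# but is paired directly with residue 2 of C.
conflicts = inconsistent_triples(
    {0: 0, 1: 1, 2: 2},
    {0: 0, 1: 1, 2: 3},
    {0: 0, 1: 1, 2: 2},
)
```

Any such conflict means that the three pairwise references cannot all be reproduced by a single multiple alignment, which is why the positive control stays below a Q-score of 1.0.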
The effect of using distant homolog3Ds
The structural constraints between target sequences are deduced from sequence-based target-to-homolog3D alignments and structure-based homolog3D-to-homolog3D alignments. Distant homolog3Ds could affect the quality of these constraints since the quality of target-to-homolog3D alignments could be poor. For this reason, the Expresso web server of 3DCoffee restricts the use of homolog3Ds to those that show above 60% sequence identity to the targets.
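The deduction of target-to-target constraints through homolog3Ds can be sketched as a composition of residue mappings: target 1 to its homolog3D (sequence-based), homolog3D to homolog3D (structure-based), and back to target 2 (sequence-based). The dictionary representation and the function names below are assumptions made for illustration. The actual PROMALS3D implementation is not described at this level of detail here.

```python
def invert(mapping):
    """Reverse a one-to-one residue mapping."""
    return {value: key for key, value in mapping.items()}


def compose(first, *rest):
    """Chain residue mappings, dropping residues that fall into a gap at any step."""
    result = dict(first)
    for mapping in rest:
        result = {key: mapping[value] for key, value in result.items() if value in mapping}
    return result


def target_constraints(t1_to_h1, h1_to_h2, t2_to_h2):
    """Residue pairs between target 1 and target 2 implied by their homolog3Ds."""
    return compose(t1_to_h1, h1_to_h2, invert(t2_to_h2))


# Toy example: residue 2 of target 1 is lost because its homolog3D residue (7)
# has no structural partner in the second homolog3D.
constraints = target_constraints(
    {0: 5, 1: 6, 2: 7},      # target 1 -> homolog3D 1 (sequence alignment)
    {5: 10, 6: 11, 8: 13},   # homolog3D 1 -> homolog3D 2 (structural alignment)
    {3: 10, 4: 11, 5: 12},   # target 2 -> homolog3D 2 (sequence alignment)
)
```

The sketch also shows why distant homolog3Ds are risky: an error in either sequence-based mapping propagates unchanged into the final target-to-target constraint.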
We studied the effect of using distant structural templates by restricting the selected homolog3Ds to certain similarity ranges and examining the alignment quality obtained with only these homolog3Ds on SABmark alignments. When using only distant homolog3Ds with sequence identity <20% to targets and PSI-BLAST target-to-homolog3D alignments, the average alignment quality score deteriorates compared to not using 3D structural information (see ). On the other hand, using structural templates with identity between 20% and 60% results in a 3–4% increase in alignment quality scores compared to not using 3D information. These results suggest that homolog3Ds with moderate similarity to targets are still valuable for improving alignment quality.
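Restricting homolog3Ds to a similarity range can be sketched as follows. The identity measure used here (identical residues divided by aligned, non-gap residue pairs) is an assumption, since this section does not state how identity is computed, and the names `sequence_identity` and `select_templates` are hypothetical.

```python
def sequence_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues among columns where both rows have a residue."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    return sum(a == b for a, b in aligned) / len(aligned)


def select_templates(candidates, low, high):
    """Keep template names whose identity to the target lies in [low, high).

    `candidates` maps a template name to a (target_row, template_row) pair
    of gapped strings taken from a target-to-homolog3D alignment.
    """
    selected = []
    for name, (target_row, template_row) in sorted(candidates.items()):
        identity = sequence_identity(target_row, template_row)
        if low <= identity < high:
            selected.append(name)
    return selected


# Toy candidates with hypothetical names; identities are 0.75 and 0.25.
candidates = {
    "template_close": ("ACDE", "ACDF"),
    "template_far": ("ACDE", "AGHI"),
}
moderate = select_templates(candidates, 0.20, 0.60)
```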
The effect of using distant homolog3Ds on SABmark ‘superfamilies’ set.
We reasoned that increasing the quality of target-to-homolog3D alignments (by default taken from PSI-BLAST output) can lead to improved quality of structural constraints and the resulting multiple alignments. To test this point, we made alignments between targets and their homolog3Ds using the pairwise profile-to-profile HMMs with predicted secondary structures (the same technique used for deriving sequence constraints in PROMALS). These profile–profile target-to-homolog3D alignments indeed yield better quality of multiple sequence alignments than the PSI-BLAST target-to-homolog3D alignments ( and S2) when using distant homolog3Ds.
Tests on PREFAB database
The PREFAB database (23) consists of 1682 alignments (version 4.0), each of which has two sequences with known structures and up to 24 homologous sequences added from database searches for each structure. The reference-dependent evaluation is based on the consensus of the FSSP (16) structural alignment and the CE alignment of the two sequences with known structures. The average difficulty of PREFAB alignments is lower than that of the SABmark database, as the original PROMALS has an average Q-score of 0.790 on the PREFAB set of alignments [best among programs not using 3D structural information (10)], as compared to 0.393 and 0.665 on the SABmark ‘twilight zone’ set and ‘superfamilies’ set, respectively. With the addition of 3D structural constraints from DaliLite alignments, the average Q-score of PROMALS3D on all PREFAB alignments increases to 0.893 (), which is significantly better than the average Q-score of PROMALS (0.790).
We also sorted the PREFAB alignments according to sequence identity and divided them into four subsets of approximately equal size (). The average sequence identities for the four subsets are 0.121, 0.185, 0.248 and 0.527, respectively. The subset with the lowest average sequence identity is the most difficult, and for it we observed the most prominent increase in alignment quality from using structural information (an increase of about 0.25 in average Q-score). For subsets with higher average identity, the improvements of PROMALS3D over PROMALS are less prominent (). These results suggest that 3D structural information is most valuable for improving alignments of distantly related sequences.
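The binning by sequence identity can be illustrated with a minimal sketch that sorts a list of identities and splits it into four subsets of approximately equal size. The exact subset boundaries used for PREFAB are not given here, so the splitting rule below (earlier subsets receive any leftover items) is an assumption, and `split_into_subsets` is a hypothetical name.

```python
def split_into_subsets(values, k=4):
    """Sort values and split them into k contiguous subsets of near-equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    subsets = []
    start = 0
    for index in range(k):
        # The first `extra` subsets take one additional item each.
        size = base + (1 if index < extra else 0)
        subsets.append(ordered[start:start + size])
        start += size
    return subsets


# Toy identities, not PREFAB data.
identities = [0.5, 0.1, 0.3, 0.2, 0.4, 0.6, 0.15, 0.25, 0.35, 0.45]
subsets = split_into_subsets(identities)
sizes = [len(subset) for subset in subsets]
```

Average Q-scores of two programs can then be compared within each subset to see where structural information helps most.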