We introduce web interfaces for two recent extensions of the multiple-alignment program DIALIGN. DIALIGN-TX combines the greedy heuristic previously used in DIALIGN with a more traditional ‘progressive’ approach for improved performance on locally and globally related sequence sets. In addition, we offer a version of DIALIGN that uses predicted protein secondary structures together with primary sequence information to construct multiple protein alignments. Both programs are available through the ‘Göttingen Bioinformatics Compute Server’ (GOBICS).
Multiple sequence alignment (MSA) is the basis of almost all methods for sequence analysis in bioinformatics. Thus, the results of these methods crucially depend on the underlying alignments. A striking example is a recent study by Wong et al. (1). These authors demonstrated that uncertainties in multiple alignments drastically influence the output of standard phylogeny programs. Development and evaluation of MSA methods has therefore been a central field of research in bioinformatics since the mid-1980s. Recent reviews on MSA methods are given, for example, by Edgar and Batzoglou (2), Morrison (3) or Kemena and Notredame (4).
Since its first release in 1996, DIALIGN has been a widely used software tool for multiple alignment of DNA, RNA and protein sequences (5,6). It differs in various aspects from other MSA algorithms. DIALIGN tries to align only those parts of the sequences to each other that exhibit some statistically relevant degree of similarity. Non-related parts of the sequences remain unaligned. This way, the method combines local and global alignment features. It returns global alignments where sequences are homologous over their entire length, but local alignments where only local homologies are detectable. DIALIGN constructs alignments from gap-free local alignments, so-called fragments, for which a scoring function is defined based on the probability of their random occurrence. Multiple alignments are constructed in a greedy way by incorporating fragments that are mutually consistent, i.e. fragments that fit into one single output MSA (7).
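The greedy incorporation of mutually consistent fragments can be illustrated with a minimal sketch. It is deliberately restricted to two sequences, where consistency reduces to 'no overlap and no crossing'; in a true multiple alignment the consistency test is more involved, and this is not the DIALIGN implementation. The `Fragment` type, its fields and the numeric weights are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A gap-free local alignment between two sequences A and B (hypothetical type)."""
    start_a: int   # start position in sequence A
    start_b: int   # start position in sequence B
    length: int    # gap-free, so the same length in both sequences
    weight: float  # illustrative score; higher means less likely to occur by chance


def consistent(f: Fragment, g: Fragment) -> bool:
    """True if f and g fit into one pairwise alignment (no overlap, no crossing)."""
    f_before_g = (f.start_a + f.length <= g.start_a
                  and f.start_b + f.length <= g.start_b)
    g_before_f = (g.start_a + g.length <= f.start_a
                  and g.start_b + g.length <= f.start_b)
    return f_before_g or g_before_f


def greedy_select(fragments: list[Fragment]) -> list[Fragment]:
    """Accept fragments in order of decreasing weight if consistent with all accepted ones."""
    accepted: list[Fragment] = []
    for frag in sorted(fragments, key=lambda f: f.weight, reverse=True):
        if all(consistent(frag, other) for other in accepted):
            accepted.append(frag)
    # Report the accepted fragments in sequence order.
    return sorted(accepted, key=lambda f: f.start_a)


if __name__ == "__main__":
    candidates = [
        Fragment(0, 0, 5, 10.0),
        Fragment(3, 8, 4, 7.0),    # overlaps the first fragment in sequence A
        Fragment(10, 12, 3, 5.0),
        Fragment(6, 2, 3, 4.0),    # overlaps the first fragment in sequence B
    ]
    for frag in greedy_select(candidates):
        print(frag)
```

In this toy input, the two fragments that conflict with the highest-scoring one are rejected, while the remaining fragment is kept because it lies after it in both sequences.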
Like most MSA methods, the standard version of DIALIGN is fully automated and works without human intervention. In addition, however, DIALIGN has an option for ‘anchored alignment’ where MSAs are produced in a ‘semi-automatic’ way (8,9). With this option, the program can be ‘forced’ to align user-defined positions of the sequences to each other, and the remainder of the sequences is aligned automatically. Anchored alignment can also be used to speed up the alignment procedure where long genomic sequences are to be aligned (10,11) or to study the behaviour of alignment methods in detail (12).
Numerous studies have shown that DIALIGN is superior to other MSA tools if locally related sequence sets are aligned, but on globally related sequences with weak primary-sequence similarity, it is often outperformed by global methods such as ‘CLUSTAL W’ (13), ‘MUSCLE’ (14,15), ‘MAFFT’ (16) or ‘PROBCONS’ (17). Since the first release of DIALIGN, various alternative optimization algorithms have been applied to the fragment-based alignment approach in order to improve its performance (18,19), but recent results indicate that the relative weakness of DIALIGN on global homologies is due to the underlying objective function rather than to the greedy optimization algorithm (12).
DIALIGN-T is a complete re-implementation of DIALIGN developed by the first author of this article (20). In the first step, it performs all possible pairwise alignments of the input sequences in the sense of DIALIGN (21,22). For multiple alignment, however, DIALIGN-T uses a number of heuristics to prevent the algorithm from aligning spurious, isolated random similarities that might destroy a biologically more meaningful global alignment. For example, in the greedy algorithm for MSA, DIALIGN-T considers not only the local degree of similarity in a fragment, but also its context. Fragments that are part of a high-scoring pairwise alignment are preferred compared to isolated fragments. Also, low-scoring regions are removed from long fragments to counterbalance the bias of DIALIGN in favour of high-scoring fragments and to support groups of lower scoring fragments. Together with some other heuristics, this led to a considerable improvement of the performance compared with the original implementation of DIALIGN.
These ideas were taken a step further in the latest release of the program, ‘DIALIGN-TX’ (23). Here, the traditional progressive approach to multiple alignment (24–26) is adapted to the fragment-based alignment as used in DIALIGN. First a guide tree is calculated based on pairwise fragment alignments. Then pairwise alignments of sequences and groups of previously aligned sequences are performed going from the leaves to the root of the guide tree. In traditional progressive alignment methods, such groups of already aligned sequences are represented as ‘profiles’ and aligned by ‘profile alignment’. This is not possible in DIALIGN, where an alignment is seen as a consistent set of fragments and only parts of the sequences may be aligned. To align two groups G1 and G2 of previously aligned sequences to each other, DIALIGN-TX selects a set of fragments each of which aligns a sequence from G1 with a sequence from G2. A vertex-cover algorithm by Clarkson (27) is used to remove inconsistencies and to select high-scoring sets of consistent fragments.
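The vertex-cover step can be sketched as follows. The sketch assumes a 'conflict graph' whose vertices are candidate fragments weighted by their scores and whose edges join pairs of mutually inconsistent fragments; removing a low-weight vertex cover then leaves a set of fragments with no remaining pairwise conflict. This graph formulation, the function names and the example weights are assumptions made for illustration and are not taken from the DIALIGN-TX source code. The selection rule follows the greedy scheme usually attributed to Clarkson: repeatedly take the vertex with the smallest residual weight per uncovered edge and charge that ratio to its neighbours.

```python
def clarkson_vertex_cover(weights: dict[str, float],
                          edges: list[tuple[str, str]]) -> set[str]:
    """Greedy approximation of a minimum-weight vertex cover (Clarkson-style)."""
    residual = dict(weights)
    adjacency: dict[str, set[str]] = {v: set() for v in weights}
    for u, v in edges:
        if u != v:  # ignore self-loops in this sketch
            adjacency[u].add(v)
            adjacency[v].add(u)

    cover: set[str] = set()
    while any(adjacency.values()):
        # Vertex with the smallest residual weight per still-uncovered edge.
        v = min((x for x in adjacency if adjacency[x]),
                key=lambda x: residual[x] / len(adjacency[x]))
        ratio = residual[v] / len(adjacency[v])
        for u in list(adjacency[v]):
            residual[u] -= ratio       # charge the neighbour
            adjacency[u].discard(v)    # the edge (u, v) is now covered
        adjacency[v] = set()
        cover.add(v)
    return cover


def consistent_subset(weights: dict[str, float],
                      conflicts: list[tuple[str, str]]) -> set[str]:
    """Fragments that remain after removing the vertex cover of the conflict graph."""
    return set(weights) - clarkson_vertex_cover(weights, conflicts)


if __name__ == "__main__":
    # Hypothetical fragment scores and pairwise inconsistencies.
    scores = {"f1": 10.0, "f2": 3.0, "f3": 8.0, "f4": 2.0}
    conflicts = [("f1", "f2"), ("f2", "f3"), ("f3", "f4")]
    print(sorted(consistent_subset(scores, conflicts)))
```

For this toy conflict graph, the low-scoring fragments `f2` and `f4` are removed and the two high-scoring fragments are retained.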
Like most methods for multiple protein alignment, DIALIGN and DIALIGN-TX are based on primary structure information alone. However, attempts have been made in the past to use predicted secondary structures in alignment algorithms (28,29). We implemented a software pipeline that takes predicted protein secondary structures as additional input information for DIALIGN.
where w( f ) is the original, primary sequence-based fragment weight as used in DIALIGN (6). s( f ) is a measure of similarity at the secondary-structure level for fragments and is defined as
Here, mx is the proportion of matching states x, and px the proportion of predicted states x, where x can be H, E or C, as predicted by the PSIPRED program. Optimal values for the parameters α, β, γ and δ have been identified using a least squares support vector machine (32).
We evaluated our secondary structure-based MSA approach using the current release of ‘BAliBASE 3’ (34). Table 1 shows that, ‘on average’, the performance of DIALIGN using secondary structure information is similar to the performance of the program with primary-sequence information alone. For many data sets, however, we observed great differences in the resulting alignments. In some cases, the structure-based alignments were far better than the original ones, while in other cases it was the other way around. For some sequence sets, our secondary structure approach achieved an improvement of 29.7 percentage points in the sum-of-pairs (SP) score (a relative improvement of 62%) compared to the purely sequence-based alignment. We therefore believe that our secondary structure-based alignments may contain valuable information that is not available in sequence-based MSAs and could be a useful addition to them.
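For readers unfamiliar with the SP score, the following minimal sketch computes it under the common definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. It assumes alignments are given as equal-length strings with `-` as the gap character and scores all columns; benchmark suites may restrict scoring to selected regions of the reference, which this sketch does not model. The function names and the toy alignments are illustrative only.

```python
from itertools import combinations


def aligned_pairs(alignment: list[str], gap: str = "-") -> set:
    """Set of residue pairs ((row_i, pos_i), (row_j, pos_j)) placed in the same column."""
    counters = [0] * len(alignment)  # ungapped position reached in each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != gap:
                residues.append((row, counters[row]))
                counters[row] += 1
        # Every pair of residues sharing this column counts as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def sp_score(test: list[str], reference: list[str]) -> float:
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sp_score(test, reference))
```

Here the test alignment misplaces a single residue of the first sequence, so three of the four reference pairs are recovered.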
To make the new versions of DIALIGN easily available to the research community, we set up WWW interfaces for them at the ‘Göttingen Bioinformatics Compute Server’ (GOBICS). DIALIGN-TX is available at http://dialign-tx.gobics.de/submission.
Various parameter values can be selected by the user. For exclusion of low-scoring regions in long fragments, the minimum fragment length T from which low-scoring sub-fragments are excluded can be specified, as well as the length L of low-scoring regions that are excluded from alignment. That is, if a fragment f of length ≥T contains a low-scoring sub-fragment of length L, this sub-fragment is removed and f is split into the two remaining sub-fragments. Also, there are options to increase the program speed, possibly at the expense of sensitivity. For DNA alignment, there are several options to translate DNA fragments into peptide fragments according to the genetic code and to consider open reading frames for alignment.
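The role of T and L can be illustrated with a short sketch. It assumes that a fragment is represented by its per-column similarity scores and that a region counts as 'low-scoring' when the sum of its scores falls below a cut-off; both the representation and the `threshold` parameter are hypothetical, since the criterion used by the program is not specified here. Only the single lowest-scoring window of length L is removed.

```python
def split_fragment(scores: list[float], min_length: int, region_length: int,
                   threshold: float = 0.0) -> list[tuple[int, int]]:
    """Split a fragment around its lowest-scoring window.

    min_length corresponds to T and region_length to L in the text.
    Returns half-open (start, end) index ranges of the pieces that are kept.
    """
    n = len(scores)
    # Fragments shorter than T (or too short to hold a window) are left untouched.
    if n < min_length or region_length <= 0 or region_length > n:
        return [(0, n)]

    # Sliding-window sums over all windows of length L.
    window = sum(scores[:region_length])
    best_sum, best_start = window, 0
    for start in range(1, n - region_length + 1):
        window += scores[start + region_length - 1] - scores[start - 1]
        if window < best_sum:
            best_sum, best_start = window, start

    # Assumed criterion: the window is low-scoring if its sum is below the threshold.
    if best_sum >= threshold:
        return [(0, n)]

    pieces = [(0, best_start), (best_start + region_length, n)]
    return [(s, e) for s, e in pieces if e > s]  # drop empty pieces at the ends


if __name__ == "__main__":
    column_scores = [4, 5, 3, -2, -3, -1, 4, 6, 5, 4]
    print(split_fragment(column_scores, min_length=8, region_length=3))
```

With these toy scores, the three negative columns in the middle are excluded and the fragment is split into the two flanking pieces; raising `min_length` above the fragment length leaves the fragment intact.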
The downloadable program versions contain more options and adjustable parameters, which are explained in the user guide. Also, the downloadable program now comes with an ‘anchored-alignment’ option.
DIALIGN with secondary-structure information is available at: http://dialign-sec.gobics.de/submission.
Deutsche Forschungsgemeinschaft (grants MO 1048/1-1 and MO 1048/6-1 to B.M., in part). Funding for open access charge: Annual budget of department of bioinformatics, University of Göttingen.
Conflict of interest statement. None declared.