Multiple alignment is an important tool for computational analysis of nucleotide or amino acid sequences. MAFFT (
1) is one of the fastest methods among the currently available multiple alignment tools (
2), and used in several projects, such as Pfam (
3), ASTRAL (
4) and MEROPS (
5). In MAFFT, an initial alignment is constructed by the progressive method (
6,
7) and then refined by the iterative refinement method (
8,
9). The outline of procedure of the previous version of MAFFT is briefly explained below and in the lower part of . A user can select an appropriate strategy from the fastest one (FFT-NS-1) to the most accurate one (FFT-NS-i).
| Table 1Options of version 5.3 (upper) and the previous version (lower) of MAFFT |
Progressive alignment (
1). A rough distance between every pair of input sequences is rapidly calculated based on the number of 6-tuples shared by the two sequences (
1,
10,
11). A guide tree is constructed from the distances with the UPGMA method (
12) with modified linkage (see supplementary material on our web page,
http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/suppl/). Input sequences are progressively aligned (
6,
7) following the branching order of the guide tree. This procedure is referred to as FFT-NS-1.
Progressive alignment (
2). The initial distance matrix is less reliable than that based on all pairwise alignments. We can obtain more reliable distance matrix by using the FFT-NS-1 alignment (
1,
11,
13). Progressive alignment is re-performed based on the new tree calculated from the new distance matrix. This method is referred to as FFT-NS-2.
Iterative refinement. The FFT-NS-2 alignment is further improved by the iterative refinement method (
8,
9) that optimizes the weighted sum-of-pairs (WSP) score proposed by Gotoh (
14), using an approximate group-to-group alignment algorithm (
1) and the tree-dependent restricted partitioning technique (
15). This procedure is referred to as FFT-NS-i.
For the progressive alignment processes, a fast Fourier transform (FFT) approximation (
1) is used in the FFT-NS-2, FFT-NS-1 and FFT-NS-i options (collectively denoted as FFT-NS-[12i] hereafter). When the sequences under consideration are highly conserved, these options require CPU times effectively proportional to average sequence length
L for amino acid or nucleotide sequence alignments consisting homologues of a single gene. Note that it is not
L log
L, although FFT takes
L log
L operations. This is because CPU time required by the FFT phase is much smaller than that by the dynamic programming (DP) phase, unless extremely long sequences like genomic sequences are input. The slopes of the FFT-NS-[2i] lines are indeed near to 1 in .
MAFFT also offers three options NW-NS-[12i] with full DP (
16) for the same part. In contrast to the case of FFT-NS-[12i], the time complexity of full DP options is proportional to
L2, independently to the similarity among input sequences. In most cases, an FFT-based strategy results in the same alignment as that by the corresponding option with full DP. The difference in accuracy was not statistically significant between the alignments generated with and without the FFT approximation in all cases we tested.
The parameters of MAFFT, such as gap penalties and scoring matrix, have not been minutely investigated because of limitation in the number of reference alignments available when it had been developed. Recently, large reference alignment databases, such as HOMSTRAD (
17), SABmark (
18) and PREFAB (
11), have been independently established. They are valuable resources to select good parameters for a sequence alignment program as well as to evaluate the performance of a program.
CLUSTAL W is a widely utilized program of multiple sequence alignment. Other algorithms have tried to improve on the accuracy of CLUSTAL W. Gotoh (
19) developed PRRP/N and found significant improvement by the iterative refinement method that uses the WSP score as objective function. TCoffee (
20) employs progressive strategy but achieved the highest accuracy (
1,
21,
22). This is because TCoffee constructs a multiple sequence alignment by combining information derived from heterogeneous sources, such as a global multiple alignment and local alignments. Although this ability is of great value, TCoffee requires a large CPU time proportional to
N3. Thus, it is hard to apply TCoffee to a large alignment consisting of dozens of sequences. The accuracy of TCoffee has recently been further improved in version 2 when compared with version 1. Recently, Edgar (
11,
23) implemented progressive and iterative refinement alignment strategies in MUSCLE. Although the algorithms of major options of MUSCLE seem similar to NW-NS-[2i] explained above, MUSCLE has an original option, MUSCLE-fast, which is faster and less accurate than other options of MUSCLE in most cases. Do
et al. (manuscript submitted) proposed a new method, PROBCONS, whose accuracy is comparable or slightly higher than TCoffee version 2.
In MAFFT version 5.3, (i) parameters were optimized based on a number of reference alignments and (ii) three new strategies (G-INS-i, H-INS-i and F-INS-i; collectively denoted as [GHF]-INS-i hereafter) were introduced. In an attempt to improve alignment accuracy, [GHF]-INS-i uses a TCoffee-like approach (
20) in incorporating all pairwise alignment information into an objective function. Iterative refinement for the objective function can be performed in reasonable time, as [GHF] INS-i was designed to have low computational complexity.
It was suggested that the accuracy of a multiple alignment of distantly related sequences is improved if they are aligned together with a number of their homologues (
24), because the information from many sequences is expected to reduce the ‘noise’ (
19). However, this possibility has not been quantitatively examined for recently developed methods. We also evaluated the effect of the number of homologues involved in an alignment.