Group-to-group alignments by FFT
The frequency of amino acid substitutions strongly depends on the difference of physico-chemical properties, particularly volume and polarity, between the amino acid pair involved in the substitution (
16). Substitutions between physico-chemically similar amino acids tend to preserve the structure of proteins, and such neutral substitutions have been accumulated in molecules during evolution (
17). It is therefore reasonable to consider that an amino acid
a is assigned to a vector whose components are the volume value
v(
a) and the polarity value
p(
a) introduced by Grantham (
18). We use the normalized forms of these values:

and

, where an overbar denotes the average over 20 amino acids, and σ
v and σ
p denote the standard deviations of volume and polarity, respectively. An amino acid sequence is converted to a sequence of such vectors.
Calculation of the correlation between two amino acid sequences. We define the correlation c(k) between two sequences of such vectors as
c(k) = cv(k) + cp(k),1
where cv(k) and cp(k) are, as defined below, the correlations of volume component and polarity component, respectively, between two amino acid sequences to be aligned. The correlation c(k) represents the degree of similarity of two sequences with the positional lag of k sites. The high value of c(k) indicates that the sequences may have homologous regions.
The correlation cv(k) of volume component between sequence 1 and sequence 2 with the positional lag of k sites is defined as
where

and

are the volume component of the
nth site of sequence 1 with the length of
N and that of sequence 2 with the length of
M, respectively. Considering
N
M in many cases, equation
2 takes
O(
N2) operations. The FFT reduces the CPU time of this calculation to
O(
N log
N) (
19). If
V1(
m) and
V2(
m) are the Fourier transform of

and

, i.e.
V1(
m)
3
V2(
m),
4it is known that the correlation cv(k) is expressed as
cv(
k)
V1*(
m) ·
V2(
m),
5 where
![[left and right double arrow ]](/corehtml/pmc/pmcents/x21D4.gif)
represents transform pairs, and the asterisk denotes complex conjugation.
The correlation cp(k) of polarity component between two sequences
is calculated in the same manner.
Finding homologous segments. If two sequences compared have homologous regions, the correlation c(k) has some peaks corresponding to these regions (Fig. A). By the FFT analysis, however, we can know only the positional lag k of a homologous region in two sequences but not the position of the region. As shown in Figure B, to determine the positions of the homologous region in each sequence, a sliding window analysis with the window size of 30 sites is carried out, in which the degree of local homologies is calculated for each of the highest 20 peaks in the correlation c(k). We identify a segment of 30 sites with score value exceeding a given threshold (0.7 per site in our program, see below for details of the scoring system) as a homologous segment. If two or more successive segments are identified as homologous segments, they are combined into one segment of larger length. If the length of the combined segment is longer than 150 sites, the segment is divided into several segments with 150 sites each.
Dividing a homology matrix. To obtain an alignment between two sequences, the homologous segments must be arranged consistently in both sequences. A matrix Sij(1 ≤ i, j ≤ n, n is the number of homologous segments) is constructed in the following manner. If the ith homologous segment on sequence 1 corresponds to the jth homologous segment on sequence 2, Sij has the score value of the homologous segment calculated above; otherwise Sij is set to 0. By applying the standard DP procedure to matrix Sij, we obtain the optimal path, which corresponds to the optimal arrangement of homologous segments. Figure A shows an example in which five homologous segments exist. The order of segments in sequence 1 differs from that in sequence 2. The optimal path depends on S23 and S32; if S23 > S32, the path with bold arrows is optimal.
Overall homology matrix is divided into some sub-matrices at the boundary corresponding to the center of homologous segments as illustrated in Figure B. As a result, the shaded area in Figure B is excluded from the calculation. As many homologous segments are detected by FFT, the CPU time is reduced.
Extension to group-to-group alignments. The procedure described above can be easily extended to group-to-group alignment by considering equations
2 and
6 as a special case with one sequence in each group. These equations are extended to group-to-group alignment by replacing

with

, which is the linear combination of the volume components of the members belonging to group 1:
where
wi is the weighting factor for sequence
i, which is calculated in the same manner as CLUSTALW (
7) for the progressive method, or in the same manner as Gotoh’s (
20) weighting system for the iterative refinement method. Similarly, polarity component is calculated as:
This method is applicable to nucleotide sequences by converting a sequence to a sequence of four-dimensional vectors whose components are the frequencies of A, T, G and C at each column, instead of volume and polarity values. In this case, correlation between two nucleotide sequences is:
c(k) = cA(k) + cT(k) + cG(k) + cC(k).
Scoring system
Similarity matrix. In order to increase the efficiency of alignment, the scoring system (similarity matrix and gap penalties) was also modified. Vogt
et al. (
21) suggested that the Needleman–Wunsch (NW) algorithm performs well with all-positive matrices, in which all elements have positive values. CLUSTALW (
7) and other methods use such all-positive matrices by default. Since Vogt
et al. (
21) examined only the cases in which members of each protein familiy are similar in length, it is not clear whether such all-positive matrices are suitable to various alignment problems, particularly to those of different length. Accordingly, contrary to existing methods, we adopted a normalized similarity matrix

(
a and
b are amino acids) that has both positive and negative values:

where
average1 = Σ
afaMaa,
average2 = Σ
a,bfafbMab,
Mab is raw similarity matrix,
fa is the frequency of occurrence of amino acid
a, and
Sa is a parameter that functions as a gap extension penalty. Under this similarity matrix

, the score per site between two random sequences is
Sa, and the score per site between two identical sequences is 1.0 +
Sa. If
Sa is much smaller than unity, gaps are scored virtually equivalent to random amino acid sequences.
The default parameters of our program are:
Mab is the 200 PAM log-odds matrix by Jones
et al. (
22),
fa is the frequency of occurrence for amino acid
a calculated by Jones
et al. (
22),
Sop (gap opening penalty, defined below) is 2.4 and
Sa is 0.06, for amino acid sequences. For nucleic acid sequences,
Mab is the 200 PAM log-odds matrix calculated from Kimura’s two parameter model (
23) with transition/transversion ratio of 2.0,
fa is 0.25,
Sop is 2.4 and
Sa is 0.06.
Gap penalty. Homology matrix
H(
i,
j) between two amino acid sequences
A(
i) and
B(
j) is constructed from the similarity matrix as

, where
i and
j are positions in sequences. When two groups of sequences are aligned, homology matrix between group 1 and group 2 is calculated as:
where A(n, i) indicates the ith site of the nth sequence in group 1, B(m, j) is the jth site of the mth sequence in group 2, and wn is the weighting factor, defined previously, for nth sequences.
In the NW algorithm (1), the optimal alignment between two groups of sequences is calculated as:
where P(i,j) is the accumulated score for the optimal path from (1,1) to (i, j), and G1(i, x) and G2(j, y) are gap penalties defined below.
Each group of sequences may contain the gaps already introduced at previous steps. If a gap is newly introduced at the same position as one of such existing gaps, the new gap should not be penalized, because these new and existing gaps are probably resulting from a single insertion or deletion event. Gotoh (
6) and Thompson
et al. (
7) developed position-specific gap penalties depending on the pattern of existing gaps. Our method used in this report is rather simpler than theirs:
G1(i, x) = Sop · {1 – [g1start(x) + g1end(i)]/2},
where Sop corresponds to a gap opening penalty, g1start(x) is the number of the gaps that start at the xth site, and g1end(i) is the number of the gaps that end at the ith site. That is,
where
zm(
i) = 1 and
am(
i) = 0, if the
ith site of sequence
m is a gap; otherwise
zm(
i) = 0 and
am(
i) = 1;
wm is the weighting factor for sequence
m. The other penalty
G2(
j,
y) is calculated in the same manner. Because this formulation is simpler than existing ones (
6,
7), the CPU time is considerably reduced, but the accuracy of resulting alignments is comparable with that by existing scoring systems (see Results).
Computer programs
We have developed a program package MAFFT, which incorporates new techniques described above. The source code for the FFT algorithm has been taken from Press
et al. (
19). In MAFFT, the progressive method (
3,
7) (FFT-NS-1, FFT-NS-2) and the iterative refinement method (
4–
6) (FFT-NS-i) are implemented with some slight modifications described below.
FFT-NS-1. Using the FFT algorithm and the normalized similarity matrix described above, input sequences are progressively aligned following the branching order of sequences in the guide tree. This method is hereafter referred to as FFT-NS-1. This method requires a guide tree based on the all-pairwise comparison, whose CPU time is
O(
K2), where
K is the number of sequences. Rapid calculation of a distance matrix is important for the case of large
K. Thus we adopted the method of Jones
et al. (
22) with two modifications; 20 amino acids are grouped into six physico-chemical groups (
24), and the number
Tij of 6-tuples shared by sequence
i and sequence
j is counted. This value is converted to a distance
Dij between sequence
i and sequence
j as
Dij = 1 – [Tij/min(Tii, Tjj)].
The guide tree is constructed from this distance matrix using the UPGMA method (
25).
FFT-NS-2. Input sequences are realigned along the guide tree inferred from the alignment by FFT-NS-1. It is expected that more reliable alignments are obtained on the basis of more reliable guide trees (
26). This method is referred to as FFT-NS-2.
FFT-NS-i. An alignment obtained by FFT-NS-2 is subjected to further improvement, in which the alignment is divided into two groups and realigned (
4–
6). We employ a technique called tree-dependent restricted partitioning (
27). This process is repeated until no better scoring alignment is obtained in respect of the score described above. This method is referred to as FFT-NS-i.
To test the effect of the FFT algorithm or the normalized similarity matrix described above, we compared these three methods with several methods in which these newly developed techniques are not used.
NW-NS-1/NW-NS-2. We examined a method that uses the standard NW algorithm, instead of the FFT algorithm, with the normalized similarity matrix described above. This method is referred to as NW-NS-1 or NW-NS-2. Concerning the guide trees, NW-NS-1 and NW-NS-2 are identical to FFT-NS-1 or FFT-NS-2, respectively.
NW-AP-2. To test the effect of the normalized similarity matrix described above, we examined a method with conventional all-positive similarity matrix (
21), which is made positive by subtracting the smallest number in the matrix from all elements. This is equivalent to setting
Sa in equation
7 to 0.82 for the similarity matrix we use. This method is referred to as NW-AP-2. Except for the similarity matrix, the procedure of NW-AP-2 is identical to that of NW-NS-2.