We proposed a novel method for predicting subcellular location that is based on sequence alignment and amino acid composition. In fivefold cross-validation tests on the TargetP plant data sets, we obtained an overall accuracy of 0.9096 and an average MCC of 0.8655. These values are higher than those of existing predictors that use only sequence information. We believe that the high accuracy attained by our method indicates that our alignment algorithm automatically detects signal sequences. Localization signals, such as the mitochondrial transit signal, are highly diverse at the amino acid level but share some features, such as regions of positive charge or an amphiphilic nature. Thus, by aligning the amino acid compositions of small blocks instead of individual amino acids, our technique may capture some features of localization signals such as transit signals.
As mentioned in the previous section, our method is effective for "cTP", "mTP", and "SP", but less effective for "other". The reason is as follows. In general, proteins destined for the chloroplast, mitochondria, and secretory pathway have signal sequences at their N-termini. In contrast, proteins destined for the nucleus and cytosol have one or more signal sequences in the middle of their sequences. As explained later, in our method global alignment is applied to the left ends of sequences and local alignment is applied to the right ends, as shown in Fig. 3. Consequently, the method detects signal sequences at N-termini with higher probability than those in the middle of sequences. This is why our method cannot substantially improve the accuracy for "other". Furthermore, conventional local alignment, in which local alignment is applied to both ends of sequences, did not improve the prediction accuracy in our preliminary experiment. This result is reasonable, since conventional local alignment is more likely than our method to ignore signal sequences at N-termini.
Figure 3 Global and local alignment. Global alignment is applied to the left ends of block sequences and local alignment is applied to the right ends of block sequences. (A) The alignment detects the signal sequence which is in the left part of sequences. (B) The alignment ...
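The combination of global alignment at the left end and local alignment at the right end can be illustrated with a small dynamic-programming sketch. This is not the implementation used in this paper: the function name, the linear gap cost, and the toy character-level score are assumptions made for illustration, and in the actual method the aligned elements would be blocks scored by their amino acid compositions.

```python
def left_anchored_alignment_score(a, b, score, gap=0.4):
    """Best score of an alignment that must start at the left ends of a and b
    but may stop anywhere (global on the left, local on the right)."""
    # Row 0: aligning nothing from `a` against a prefix of `b` costs one gap each.
    prev = [-gap * j for j in range(len(b) + 1)]
    best = 0.0  # the empty alignment is always allowed
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # leading gaps are penalized: the left end is anchored
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # pair a[i-1] with b[j-1]
                prev[j] - gap,                            # a[i-1] aligned to a gap
                curr[j - 1] - gap,                        # b[j-1] aligned to a gap
            ))
        best = max(best, max(curr))  # free right end: stop at the best cell
        prev = curr
    return best


def match_score(x, y):
    # Toy pairing score on characters (assumption), standing in for a block score.
    return 1.0 if x == y else -1.0


prefix_score = left_anchored_alignment_score("ABCXX", "ABCYY", match_score)
suffix_score = left_anchored_alignment_score("XXABC", "YYABC", match_score)
```

In this toy example the shared region "ABC" scores higher when it lies at the left end than when it lies at the right end, because in the latter case the mismatched left ends must be paid for first. A conventional local alignment would score both cases equally, which is consistent with the N-terminal bias discussed above.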
Although the above discussion also applies to SP and the "other" subset of the non-plant datasets of TargetP, the result for non-plant mTP appears to contradict it. We think the reason is either the influence of not using distance frequency [22] or the fact that mTP sequences account for only 13.6% of the non-plant dataset. Alternatively, there may exist unknown signal sequences related to subcellular location in the N-termini of non-plant mTP proteins. However, it should be noted that the difference in accuracy between our method and [22] is very small (0.25%).
The parameters in Table were determined by trial and error, because assigning appropriate values to them is very important. Since our feature vector consists only of amino acid composition, the information on subcellular location signals may be lost if w is too large or too small. Even when an appropriate w is given, the information of a signal may be split and ignored if the edge of a window falls in the middle of the signal sequence. However, such signals can still be found if c is set appropriately. We consider c = w/2 to be one of the best relationships between c and w.
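The role of w and c can be sketched as follows, assuming that w is the window (block) length and c is the shift between consecutive windows. The function name, the handling of sequences shorter than w, and the dropping of trailing residues that do not fill a window are illustrative assumptions, not details taken from this paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_compositions(seq, w=20, c=10):
    """Return one 20-dimensional composition vector per window of length w,
    with consecutive windows shifted by c residues."""
    if not seq:
        return []
    blocks = []
    # Sequences shorter than w yield a single, shorter block (assumption).
    last_start = max(len(seq) - w, 0)
    for start in range(0, last_start + 1, c):
        window = seq[start:start + w]
        counts = Counter(window)
        # Fraction of each standard amino acid within the window.
        blocks.append([counts[a] / len(window) for a in AMINO_ACIDS])
    return blocks


# A toy sequence whose composition changes exactly at a window edge (position 20).
blocks = block_compositions("A" * 20 + "K" * 20, w=20, c=10)
```

With c = w/2, the middle window (residues 10 to 29) straddles the boundary at position 20, so a signal that is cut by the edge of one window still falls entirely inside a neighbouring window.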
Assigning appropriate values to γ and the gap penalty is also important. As explained later, the pairing score in the alignment of our method takes a positive value when two blocks are similar to each other and a negative value when they are not. However, if γ is too large, the pairing score takes a negative value in most cases; if γ is too small, it takes a positive value in most cases. Therefore, it is very important to set γ appropriately, since it determines the threshold for whether blocks are regarded as similar to each other. The value of the gap penalty also strongly influences the result of the alignment. The detailed method of determining these parameters is described in the following.
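The threshold behaviour of γ can be illustrated with a deliberately simple stand-in score. The form 1 − γ·d used below, with d the Euclidean distance between two composition vectors, is a hypothetical choice made only to reproduce the qualitative behaviour described above; it is not the pairing score defined in this paper.

```python
import math


def pairing_score(x, y, gamma):
    """Hypothetical stand-in score: 1 - gamma * Euclidean distance.
    Positive for similar blocks, negative for dissimilar ones."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 - gamma * dist


# Two toy composition vectors over a two-letter alphabet.
block_x = [1.0, 0.0]
block_y = [0.5, 0.5]

lenient = pairing_score(block_x, block_y, gamma=0.3)  # small gamma
strict = pairing_score(block_x, block_y, gamma=2.5)   # large gamma
```

For the same pair of blocks, the small γ gives a positive score and the large γ gives a negative one, so γ effectively sets how close two compositions must be before they count as similar.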
Let p represent the gap penalty. We examined p = 0, 0.2, 0.4, 0.6, 0.8, 1.0; γ = 0.3, 0.6, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5; and (w, c) = (40, 20), (30, 15), (20, 10). Among these values, (p, γ, w, c) = (0.4, 2.5, 20, 10) yielded the best accuracy. Since we examined only a limited number of values, the accuracy may be improved by further optimization of these values. We believe that further optimization of (w, c) is more promising than that of p and γ for improving the accuracy. Note that the same values of (p, γ, w, c) were used in all the experiments in this paper.
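The search over these values amounts to an exhaustive scan of 6 × 8 × 3 = 144 combinations, which can be sketched as below. The callback `evaluate` is a hypothetical placeholder for whatever procedure returns the accuracy for one parameter setting (for example, a cross-validation run); it is not part of this paper.

```python
from itertools import product

GAP_PENALTIES = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
GAMMAS = [0.3, 0.6, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
WINDOWS = [(40, 20), (30, 15), (20, 10)]  # (w, c) pairs, each with c = w / 2


def grid_search(evaluate):
    """Return the (p, gamma, w, c) tuple with the highest accuracy and that accuracy.
    `evaluate(p, gamma, w, c)` is a caller-supplied function (hypothetical here)."""
    best_params, best_acc = None, float("-inf")
    for p, gamma, (w, c) in product(GAP_PENALTIES, GAMMAS, WINDOWS):
        acc = evaluate(p, gamma, w, c)
        if acc > best_acc:
            best_params, best_acc = (p, gamma, w, c), acc
    return best_params, best_acc


n_combinations = len(GAP_PENALTIES) * len(GAMMAS) * len(WINDOWS)
```

Because each call to `evaluate` is independent, a finer grid mainly costs more evaluation time rather than more code.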
In terms of "posconstraint" and "negconstraint", the first and second methods of Table show the comparison of
• (1) our method with "posconstraint" and "negconstraint" specified,
• (2) our method without "posconstraint" and "negconstraint" specified.
Although the average MCC and overall accuracy of (1) are better than those of (2) by 0.013 and 0.0107, respectively, (2) is still better than the other predictors shown in Table . For SP, the MCC of (2) is better than that of (1) by 0.0093, although (1) is better than (2) for the other locations. Thus, our predictor can yield good accuracies even when "posconstraint" and "negconstraint" are not specified.
To optimize the parameters used in (1), we set posconstraint = exp(-0.07·i) and negconstraint = exp(-0.07·j) and scanned i and j from 0 to 99. The values that yielded the best overall accuracy were used for (1) and are shown in Table . We believe that further optimization of "posconstraint" and "negconstraint" would not substantially improve the accuracies.
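The scan described above covers a geometrically spaced grid of candidate values, which a short sketch makes explicit. It assumes that the range 0 to 99 is inclusive; the function names and the callback `evaluate`, which stands for the procedure returning the overall accuracy for one pair of values, are hypothetical.

```python
import math
from itertools import product


def constraint_grid(n=100, rate=0.07):
    """Candidate values exp(-rate * k) for k = 0 .. n-1 (inclusive range assumed)."""
    return [math.exp(-rate * k) for k in range(n)]


def scan_constraints(evaluate, n=100, rate=0.07):
    """Return the (posconstraint, negconstraint) pair that maximizes
    the caller-supplied `evaluate(posconstraint, negconstraint)`."""
    values = constraint_grid(n, rate)
    return max(product(values, values), key=lambda pair: evaluate(*pair))


values = constraint_grid()
```

The grid starts at 1.0 and decreases by a constant factor at each step, so it spans roughly three orders of magnitude with 100 values per parameter.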
We also developed a web-based prediction system for plant proteins based on our proposed method. It is available at [29]. When amino acid sequences in FASTA format are given as shown in Fig. 4(A), the web system returns the first and second candidates for the location and their scores, as shown in Fig. 4(B). Although an overall accuracy of 90.96% was achieved in our fivefold cross-validation tests, it takes about 10 seconds to predict a location. The web system is slow because it calculates similarity scores between every input sequence and all training sequences. However, these calculations can be parallelized. With enough CPUs, the calculation time would be within a second, although our web system has not been parallelized so far.
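Since each similarity score depends only on one input sequence and one training sequence, the computation can be distributed as independent tasks. The following is a minimal sketch of that idea, not a description of the web system: the function names and the toy similarity are assumptions, and a thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def score_against_training(query, training, similarity, executor):
    """Score one query against every training sequence using the given executor."""
    # Each comparison is independent, so it can run as a separate task.
    # executor.map returns results in the same order as `training`.
    return list(executor.map(partial(similarity, query), training))


def shared_residues(a, b):
    # Toy similarity (hypothetical): number of distinct residues both sequences contain.
    return len(set(a) & set(b))


training = ["MKTAYIAK", "MLSLRQSI", "GGGGSGGG"]
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = score_against_training("MKLSI", training, shared_residues, pool)
```

In CPython, threads do not speed up CPU-bound pure-Python code, so a real speed-up would require passing a process pool (with a picklable, top-level similarity function) or a native implementation of the alignment.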
Figure 4 Web-based system. Our proposed method is implemented as a web system and is available online [29]. (A) Amino acid sequences in the FASTA format are pasted into our web system as input. (B) Two candidate locations are shown for each input sequence along with ...