The prediction of signal peptides has become an important application in genomics and proteomics investigations. SignalP is currently the most efficient and widely used tool for this task. A comparison of most of the available software in this field underscored the unique performance of this program (7). However, non-commercial use of SignalP via the Internet is limited to 10 requests of up to 2000 sequences per day, and the response to such a request takes several minutes. SignalP is therefore not suited to fast whole-proteome analysis. Moreover, the program is not available as public domain software for integration into other software projects. We therefore decided to implement an alternative, efficient prediction tool that meets these criteria. The algorithm employed also represents an alternative approach to the neural network and hidden Markov model solutions implemented by SignalP. The fidelity of the method was significantly improved by introducing a frequency correction that adjusts for amino acid bias, as described by Schneider and Brown (9).
To check the accuracy of PrediSi, we performed a self-consistency test. For this purpose we constructed three test datasets of proteins carrying signal peptides: one each for eukaryotes, Gram-negative bacteria and Gram-positive bacteria. Each test dataset consists of all the amino acid sequences from a training dataset extracted from SwissProt and the same number of randomly chosen amino acid sequences without signal peptides from a corresponding control set. We compared the results obtained with the accuracy of SignalP (Table). A prediction was considered correct only if both the existence and the cleavage position of the signal peptide were predicted correctly. PrediSi was slightly less accurate in the prediction of eukaryotic and Gram-negative signal sequences. For eukaryotes it reached 85.49%, versus 90.66% for the SignalP neural network (NN) and 88.24% for the SignalP hidden Markov model (HMM); for Gram-negative bacteria it reached 91.12%, versus 91.39% (NN) and 93.09% (HMM). It was slightly better at predicting Gram-positive signal peptides (88.14% versus 85.61% NN and 87.29% HMM) (Table). Interestingly, when we allowed a tolerance of two positions between the predicted and the annotated cleavage position, the accuracy of returning the correct cleavage position increased significantly. Some of these falsely predicted cleavage positions are probably due to database errors, as mentioned before (10). PrediSi provides a normalized score on a scale between 0 and 1. A score greater than 0.5 means that the examined sequence very likely contains a signal peptide. The advantage of this user-friendly score is that it is comparable between different weight matrices.
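The correctness criterion used in this comparison can be written down compactly. The following is a minimal sketch, not the evaluation code used for the paper: the `Prediction` class, the function names and the convention that `None` marks a sequence without a signal peptide are all assumptions made for illustration. It counts a prediction as correct only if the presence call (score above 0.5) and the cleavage position both agree with the annotation, with an optional tolerance on the position.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Prediction:
    """Hypothetical container for one prediction (not a PrediSi data type)."""
    score: float    # normalized score in [0, 1]
    cleavage: int   # predicted length of the signal peptide, in residues


def is_correct(pred: Prediction, true_cleavage: Optional[int],
               threshold: float = 0.5, tolerance: int = 0) -> bool:
    """Return True if presence and cleavage position both match the annotation.

    true_cleavage is None for control sequences without a signal peptide.
    """
    predicted_present = pred.score > threshold
    if true_cleavage is None:
        # Control sequence: correct only if no signal peptide is called.
        return not predicted_present
    if not predicted_present:
        # Annotated signal peptide was missed.
        return False
    return abs(pred.cleavage - true_cleavage) <= tolerance


def accuracy(preds: Sequence[Prediction], truths: Sequence[Optional[int]],
             threshold: float = 0.5, tolerance: int = 0) -> float:
    """Fraction of sequences for which the prediction is correct."""
    hits = sum(is_correct(p, t, threshold, tolerance)
               for p, t in zip(preds, truths))
    return hits / len(preds)


# Toy data: two signal peptide proteins and two controls (values invented).
toy_preds = [Prediction(0.9, 22), Prediction(0.8, 24),
             Prediction(0.2, 10), Prediction(0.7, 18)]
toy_truths = [22, 22, None, None]
strict = accuracy(toy_preds, toy_truths)                # exact position required
relaxed = accuracy(toy_preds, toy_truths, tolerance=2)  # two positions allowed
```

On this toy data the second prediction misses the annotated site by two residues, so it is counted as wrong under the strict criterion and as correct once a tolerance of two positions is allowed.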
Statistical comparison of the accuracy of the different models provided by SignalP and of the new weight matrix approach
The optimal position weight matrix (PWM) size differs between the three examined groups of organisms. The optimal size for the eukaryotic PWM is −16/+4 (with the cleavage position between positions −1 and +1), for Gram-negatives −16/+2 and for Gram-positives −21/+1. Figure 1 depicts sequence logos (11) of signal peptides for the three different groups. The estimated matrix size correlates well with the information content of the observed sequences. In agreement with earlier analyses, signal peptides of Gram-positives are longer than those of other organisms (12). In summary, the accuracy of prediction with PrediSi is similar to that with SignalP.
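To make the role of the matrix size concrete, the following is a minimal sketch of a PWM scan, not the PrediSi implementation. Several points are assumptions made for illustration: all function names are hypothetical, the log-odds division by a background frequency stands in for the frequency correction cited above and is not necessarily identical to it, the pseudocount is arbitrary, and the min-max normalization is only one possible way to obtain a score between 0 and 1. A matrix of size −16/+4 corresponds to `upstream=16` and a window of 20 residues.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned_windows, background, pseudocount=1.0):
    """Build a log-odds PWM from equal-length windows aligned on the cleavage site.

    background maps each amino acid to its expected frequency; dividing by it
    is the (assumed) correction for amino acid bias.
    """
    width = len(aligned_windows[0])
    n = len(aligned_windows)
    pwm = []
    for col in range(width):
        counts = Counter(w[col] for w in aligned_windows)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            column[aa] = math.log(freq / background[aa])
        pwm.append(column)
    return pwm


def scan(sequence, pwm, upstream):
    """Slide the PWM along the sequence and return (best_score, cleavage).

    cleavage is the number of residues before the predicted cleavage site.
    Returns None if the sequence is shorter than the matrix.
    """
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, start + upstream)
    return best


def normalize(score, pwm):
    """Min-max scale a raw score to [0, 1] (an assumed normalization)."""
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    return (score - lowest) / (highest - lowest)


# Toy example with a tiny -3/+1 matrix (invented data, uniform background).
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_pwm = build_pwm(["ALAQ", "ALAK", "ALAQ"], uniform)
raw_score, cleavage = scan("MKKALAQDE", toy_pwm, upstream=3)
scaled = normalize(raw_score, toy_pwm)
```

In the toy example the best window is `ALAQ`, so the predicted cleavage site lies after the sixth residue. Because each window costs only one table lookup per matrix column, the scan is linear in sequence length, which is consistent with the short run times reported for this kind of approach.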
Figure 1. Sequence logos based on the aligned amino acid sequences of signal peptides. The signal peptide is cleaved off between position −1 and 0. (A) Gram-negative bacteria, (B) Gram-positive bacteria, (C) eukaryotes. Shaded area represents PWM region.
The use of a very fast algorithm for the prediction of signal peptides enables our web interface to finish the necessary calculations in nearly real time. For example, the analysis of 20 000 eukaryotic sequences takes only about 10 s and is, therefore, limited only by the data transfer via the Internet. To our knowledge, this is the fastest public method available for predicting signal peptides. With PrediSi it is not necessary to deliver the results by email or to set up queues, because the results are presented directly in the web browser (Figure). Other methods, such as Markovian models and neural networks, need much more calculation time to perform such a task.
Screenshot of the PrediSi web interface.