Correctly predicting the location of disulfide bridges helps in solving the protein folding problem. Most previous work on predicting the disulfide connectivity pattern relies on prior knowledge of the bonding state of cysteines. The DBCP web server predicts the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines. The method used in this server improves the accuracy of disulfide connectivity pattern prediction (Qp) over previous studies reported in the literature. The DBCP server can be accessed at http://126.96.36.199/dbcp or http://188.8.131.52/dbcp.
Disulfide bonds play an important structural role in stabilizing protein conformations. In protein folding prediction, a correct prediction of disulfide bridges can greatly reduce the search space (1,2). Predicting the disulfide bonding pattern helps, to a certain degree, to predict the 3D structure of a protein and hence its function, because disulfide bonds impose geometrical constraints on the protein backbone. Recent research has shown a close relation between disulfide bonding patterns and protein structures (3,4).
In the realm of disulfide bond prediction, four problems are addressed. The first is protein chain classification: determining whether the protein contains disulfide bridge(s) or not. The second is residue classification: predicting the bonding state of cysteines. The third is bridge classification, and the last is prediction of the disulfide bonding pattern. Over the past years, significant progress has been made on the prediction of the disulfide bonding states (5–8) and the disulfide bonding pattern (9–17). For disulfide bonding pattern prediction, with the exception of the methods proposed by Ferrè and Clote (11,12) and Cheng et al. (15), the others assume that the bonding states are known. The methods proposed by Ferrè and Clote (11,12) and Cheng et al. (15) can be applied whether the bonding states are known or not.
In this study, the coordinates (X, Y, Z) of the Cα of each amino acid in the protein, as predicted by MODELLER (18), are used as the feature. A support vector machine (SVM) is then trained to compute the connectivity probabilities of cysteine pairs. The Edmonds–Gabow maximum weight perfect matching algorithm (19) is used to find the connectivity pattern.
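The last step, turning pairwise connectivity probabilities into a single connectivity pattern, can be illustrated with a small sketch. It is not the Edmonds–Gabow algorithm used by the server: for readability it enumerates every perfect matching, which is only practical for a handful of cysteines. The function name `best_pattern`, the probability table and the requirement of an even number of cysteines are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def best_pattern(cysteines: List[int],
                 prob: Dict[Pair, float]) -> Tuple[float, List[Pair]]:
    """Return (total weight, pairs) of the maximum weight perfect matching.

    Exhaustive stand-in for a maximum weight perfect matching algorithm.
    `cysteines` holds sequence positions in ascending order; `prob` maps
    (smaller position, larger position) to a connectivity probability.
    Pairs missing from `prob` are given weight 0.0 (an assumption).
    """
    if len(cysteines) % 2 != 0:
        raise ValueError("a perfect matching needs an even number of cysteines")
    if not cysteines:
        return 0.0, []
    first, rest = cysteines[0], cysteines[1:]
    best_score, best_pairs = float("-inf"), []
    # Pair the first cysteine with every possible partner, then recurse.
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        sub_score, sub_pairs = best_pattern(remaining, prob)
        score = prob.get((first, partner), 0.0) + sub_score
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + sub_pairs
    return best_score, best_pairs


# Hypothetical SVM outputs for four cysteines at positions 5, 20, 33 and 48.
example_prob = {
    (5, 20): 0.2, (5, 33): 0.9, (5, 48): 0.1,
    (20, 33): 0.3, (20, 48): 0.8, (33, 48): 0.4,
}
example_score, example_pairs = best_pattern([5, 20, 33, 48], example_prob)
print(example_pairs)
```

For the hypothetical probabilities above, the pattern 5-33, 20-48 has the largest total weight. A polynomial-time matching algorithm such as the one cited in the text returns the same optimum without enumerating all patterns.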
The flowchart of this server illustrated by an example is shown in Figure 1.
With the exception of the protein’s secondary structure, the features used in previous studies on disulfide bonding connectivity prediction are protein sequence features that are not related to the protein structure. In this study, we propose to use a structure-related feature. MODELLER (18) is used to predict the coordinates (X, Y, Z) of the Cα of each amino acid in the protein sequence. Having the coordinates, we can compute the Euclidean distance $D_{i,j}$ between the amino acid at the i-th position and the amino acid at the j-th position. We further extend the definition of Euclidean distance to the pair distance (PD). Let the positions of cysteine i and cysteine j be $P_i$ and $P_j$, respectively. The PD between cysteine i and cysteine j is defined to be the vector $(D_{P_i-\lfloor w/2\rfloor,\,P_j-\lfloor w/2\rfloor}, \ldots, D_{P_i-1,\,P_j-1}, D_{P_i,\,P_j}, D_{P_i+1,\,P_j+1}, \ldots, D_{P_i+\lfloor (w-1)/2\rfloor,\,P_j+\lfloor (w-1)/2\rfloor})$, which contains w Euclidean distances, where w is the window size. If we have k cysteines in the protein, there are as many as $k(k-1)/2$ cysteine pairs. Since most cysteine pairs will not constitute a disulfide bond, we examined $D_{P_i,P_j}$ for the cysteine pairs that do constitute a disulfide bond and set a threshold of 15 for $D_{P_i,P_j}$. In other words, if $D_{P_i,P_j}$ is >15, the pair of cysteines is not considered as a candidate for a disulfide bond. To bring the values into the SVM input range of −1 to 1, each component of the PD vector is normalized as $(D_{i,j} - 7.5)/7.5$. The resultant vector is called the normalized PD (NPD) and is the input to the SVM.
The inputs to the DBCP include three parts:
In Step 1, if the E-value of the template sequence is >10 or the template sequence shares identity <25% to the input sequence, a previously proposed method (20) is used for prediction. In this method, the position-specific scoring matrix, the normalized bond lengths, the predicted secondary structure of protein and the physicochemical properties index of the amino acid were used as features. The multiple trajectory search and the SVM training were tightly integrated to train the predictor. For more details, please refer to (20).
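The switching rule between the two predictors can be written as a single condition. The sketch below only restates the thresholds given in the text; the function name and the returned labels are hypothetical and do not correspond to anything in the DBCP implementation.

```python
def choose_predictor(e_value: float, identity_percent: float) -> str:
    """Pick the predictor from the BLAST hit of the template sequence.

    The sequence-based method (20) is used when the template is unreliable:
    E-value > 10 or sequence identity < 25%. Otherwise the structure-based
    features derived from the predicted coordinates are used.
    """
    if e_value > 10 or identity_percent < 25:
        return "sequence_based"   # hypothetical label for the method in (20)
    return "structure_based"      # hypothetical label for the NPD/SVM method


print(choose_predictor(e_value=1e-30, identity_percent=62.0))
print(choose_predictor(e_value=1e-30, identity_percent=18.0))
```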
The DBCP web server is free and open to all users, and there is no login requirement. The prediction software was implemented in C and the server-side scripting language PHP, and the web pages are served by the Apache web server.
In this subsection, we introduce the results of the DBCP, as listed below:
We found four web sites that provide prediction of the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines (12,14–16). Cheng et al. (15) tested their prediction method by 10-fold cross validation on the data set SPX (15). As a comparison, we also tested our method by 10-fold cross validation on the same data set, and the results are shown in the Supplementary Data. The method proposed by Song et al. (16) can process only protein sequences that have fewer than 12 cysteines. Therefore, we conducted a test to compare our method only with the other three methods. We took 56 protein sequences from the SWISS-PROT database release no. 56.3 that are neither in the SWISS-PROT release no. 39 nor in the data set SPX; this set of sequences is denoted as ‘SP56NS’. The prediction accuracies of our method and the other three methods on this data set are shown in Table 1.
Since the present version of the web server was trained by using the data set SPX, we also took 50 sequences from the SWISS-PROT database release no. 56.3 that are neither in the SWISS-PROT release no. 39 nor in the data set SPX. Furthermore, the pairwise sequence identity of these 50 sequences and the sequences in SPX is <25%. This set of sequences is denoted as ‘SP56NS_25’. The prediction accuracies of our method and the other three methods on this data set are shown in Table 2.
To check the prediction accuracy when the input sequence has low identity to the overall set of PDB proteins, we took 32 sequences from the SWISS-PROT database release no. 56.3 for which either the sequence shares <25% identity with the template sequence found by BLAST or the E-value of the template sequence is >10. This set of sequences is denoted as ‘CHK25’. The prediction accuracy of DBCP for this data set is shown in Table 3.
A web-based application system called the DBCP is provided for the prediction of the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines. To the best of our knowledge, the best accuracies previously reported without prior knowledge of the bonding state of cysteines are 51% for disulfide connectivity pattern prediction (Qp) and 52% for disulfide bridge prediction (Qc), on the data set SPX with 10-fold cross validation. The method used in this server improves the prediction accuracies on the same data set SPX to 84.4% (Qp) and 94.6% (Qc) with 10-fold cross validation. The comparison of the prediction accuracy of the DBCP with that of three other state-of-the-art web services on the data sets SP56NS and SP56NS_25 also shows that the DBCP outperforms the other three methods.
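The two accuracy measures can be made concrete with a short sketch. The text does not spell out their formulas, so the code assumes the definitions commonly used in the disulfide prediction literature: Qp is the fraction of proteins whose entire connectivity pattern is predicted correctly, and Qc is the fraction of true disulfide bridges that are predicted correctly. The function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[int, int]


def qp_qc(true_patterns: Dict[str, Iterable[Pair]],
          predicted_patterns: Dict[str, Iterable[Pair]]) -> Tuple[float, float]:
    """Return (Qp, Qc) under the assumed definitions described above."""

    def as_set(pairs: Iterable[Pair]) -> Set[Pair]:
        # Store each bridge as (smaller position, larger position).
        return {(min(a, b), max(a, b)) for a, b in pairs}

    correct_patterns = 0
    correct_bridges = 0
    total_bridges = 0
    for protein, pairs in true_patterns.items():
        truth = as_set(pairs)
        pred = as_set(predicted_patterns.get(protein, []))
        correct_patterns += int(truth == pred)
        correct_bridges += len(truth & pred)
        total_bridges += len(truth)
    return correct_patterns / len(true_patterns), correct_bridges / total_bridges


# Toy data: protein "A" is fully correct, protein "B" has one of two bridges right.
toy_truth = {"A": [(1, 4), (2, 3)], "B": [(10, 30), (20, 40)]}
toy_pred = {"A": [(2, 3), (4, 1)], "B": [(10, 30), (20, 50)]}
print(qp_qc(toy_truth, toy_pred))
```

Because Qp requires every bridge in a protein to be correct, it is never higher than Qc computed on the same predictions when each protein has the same number of bridges, which is consistent with the ordering of the two figures reported above.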
If the template sequence found by BLAST has an E-value >10 or the identity between the template sequence and the input sequence is <25%, another method previously proposed by us is used for prediction. In this case, the prediction accuracy may degrade slightly. The DBCP is designed to predict the disulfide bonding connectivity pattern of a sequence whose cysteines are not involved in metal binding sites. For protein sequences that contain cysteines involved in metal binding sites, other methods that can predict both the disulfide bonds and the metal binding sites will be more suitable. A high metal binding site score (e.g. >0.5) indicates that there may be cysteines involved in metal binding sites. In this case, users are strongly advised to use other methods in addition to the DBCP and to draw their conclusion from the results of all methods.
Supplementary Data are available at NAR Online.
National Science Council of ROC (contract number NSC98-2221-E005-049-MY3, partial); Ministry of Education, Taiwan, ROC under ATU plan (partial); Central Taiwan University of Science and Technology (grant CTU99-P-33). Funding for open access charge: National Chung Hsing University.
Conflict of interest statement. None declared.
The authors would like to thank the anonymous reviewers for pointing out the problems of proteins lacking homology and of sequences containing cysteines involved in metal binding sites. The comments of all anonymous reviewers have improved the quality of the paper.