We compared the PrISE family of algorithms using the DS188, DS24Carl, DS56bound and DS56unbound datasets. We also assessed the extent to which the quality of predictions is impacted by the presence of structural elements derived from homologs of the query protein in the repository of structural elements used to make the predictions. In addition, the performance of PrISEC was assessed against the performance of several classifiers based on machine learning methods, scoring functions, and local and global structural similarity on different datasets.
Comparison of PrISEL, PrISEG and PrISEC
Recall that PrISEL relies on the similarity between structural elements (i.e. local structural similarity), PrISEG relies on the similarity between protein surfaces (i.e. general structural similarity), and PrISEC combines local and general structural similarity to predict interface residues. The performance of these three predictors was compared using the DS188 dataset. For this experiment, samples were extracted from the ProtInDB repository. Samples extracted from proteins that share more than 95% sequence identity with the query protein and belong to the same species were excluded from the prediction process to avoid overestimating prediction performance. To simulate a random prediction, the interface/non-interface labels associated with the central residue of each sample in the repository were randomly shuffled. The results of this experiment are presented in Figure as precision-recall curves. These results indicate that PrISEL, PrISEG, and PrISEC outperform the random predictor. Furthermore, PrISEC achieves similar or better performance than PrISEG, whereas PrISEG predictions are superior to those of PrISEL. Similar conclusions are supported by experiments using the DS24Carl, DS56bound and DS56unbound datasets. Consequently, PrISEC was selected to perform the experiments presented in the following subsections.
Comparative performances of PrISEL, PrISEG, PrISEC, and randomly generated predictions on the DS188 dataset.
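The random baseline described above can be sketched as follows. This is a minimal illustration, assuming each repository sample is represented as a pair (element identifier, interface label); the function name `shuffle_labels`, the sample identifiers, and the seed are hypothetical and not part of PrISE.

```python
import random


def shuffle_labels(samples, seed=None):
    """Return a copy of `samples` with the interface labels randomly permuted.

    `samples` is a list of (element_id, is_interface) pairs (an assumed
    representation). The structural elements keep their order; only the
    labels are shuffled, so the class balance is preserved.
    """
    rng = random.Random(seed)
    labels = [label for _, label in samples]
    rng.shuffle(labels)
    return [(element, label) for (element, _), label in zip(samples, labels)]


# Hypothetical toy repository: two interface and three non-interface samples.
repository = [("s1", True), ("s2", False), ("s3", False), ("s4", True), ("s5", False)]
shuffled = shuffle_labels(repository, seed=0)
```

Because the labels are permuted rather than redrawn, the proportion of interface residues in the repository is unchanged, which keeps the baseline comparable to the real predictors.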
Impact of homologs of the query protein on the quality of predictions
We assessed the extent to which the predictions are affected by the presence of structural elements derived from sequence homologs of the query protein. The first experiment excludes samples derived from proteins of the same species that share ≥ 95% sequence identity with the query protein (called homologs from the same species). The second experiment excludes samples from all proteins that share ≥ 95% sequence identity with the query protein, regardless of species (referred to as homologs).
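The exclusion schemes can be illustrated with a short sketch. It assumes that each repository protein carries a PDB ID, a species, and a precomputed percent sequence identity to the query; the `RepoProtein` type, the scheme names, and the example entries are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoProtein:
    pdb_id: str
    species: str
    identity_to_query: float  # percent sequence identity (assumed precomputed)


def filter_repository(proteins, query_pdb_id, query_species, scheme, cutoff=95.0):
    """Return the proteins that remain available for sample extraction."""
    kept = []
    for p in proteins:
        is_homolog = p.identity_to_query >= cutoff
        if scheme == "same_pdb":
            excluded = p.pdb_id == query_pdb_id
        elif scheme == "homologs_same_species":
            excluded = is_homolog and p.species == query_species
        elif scheme == "homologs":
            excluded = is_homolog
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if not excluded:
            kept.append(p)
    return kept


# Hypothetical repository entries for a human query protein from PDB entry 1abc.
proteins = [
    RepoProtein("1abc", "human", 100.0),
    RepoProtein("2xyz", "human", 97.0),
    RepoProtein("3mno", "mouse", 96.0),
    RepoProtein("4pqr", "yeast", 40.0),
]
kept_ids = {
    scheme: [p.pdb_id for p in filter_repository(proteins, "1abc", "human", scheme)]
    for scheme in ("same_pdb", "homologs_same_species", "homologs")
}
```

In this toy case the three schemes retain progressively fewer proteins, which mirrors the increasing strictness of the filters.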
Figure compares these two schemes for excluding homologs with a setup in which only the samples derived from proteins with the same PDB ID as the query proteins are excluded. As seen in Figure , prediction performance is better when sequence homologs of the query protein are retained in the set of proteins used to generate the repository. The best performance is achieved when only the proteins with the same PDB ID as the query proteins are excluded.
Comparison of schemes for filtering out similar proteins from the prediction process. This experiment was performed using PrISEC with the DS188 dataset.
Comparison with two prediction methods based on geometric-conserved local surfaces
We compared the three predictors from the PrISE family with the predictors proposed by Carl et al. in [37]. These methods rely on conservation of the geometry and the physico-chemical properties of surface patches to predict interfaces. In [37], the conserved regions were extracted from proteins with similar structures. In [41], similar performance was achieved using conserved regions extracted using local structural alignments. This comparison was performed using the DS24Carl dataset, which is composed of 24 proteins and was generated in [41]. For the PrISE family of methods, samples were retrieved from the ProtInDB repository. Samples extracted from proteins that share more than 95% sequence identity with the query protein and belong to the same species were not used in the prediction process. The results of the experiment, presented in Table , indicate that each of the three predictors from the PrISE family outperforms the predictors described in [37]. The differences in performance may be explained by the differences in the prediction techniques. In particular, the PrISE family of predictors, unlike those of Carl et al., exploits the interface/non-interface labels associated with surface patches that share structural similarity with the surface neighborhood of each surface residue of the query protein.
Performance of different methods on the DS24Carl dataset
Results of a similar experiment excluding samples extracted from homologs of the query proteins, as well as results of experiments using the ProtInDB ∩ PQS repository, are presented in section six of Additional File 1.
Comparison with a prediction method based on protein structural similarity
We compared PrISEC with PredUs [38,39], a method that relies on protein structural similarity, using the DS188, DS56bound and DS56unbound datasets. PredUs is based on the idea that interaction sites are conserved among proteins that are structurally similar to each other. PredUs computes a structural alignment of the query protein with every protein in a set of proteins with known interface residues. The alignments are used to extract a contact frequency map, which indicates, for each residue in the query protein, the number of interface residues that are structurally aligned with it. The contact frequency map is then used to predict whether each residue of the query protein is an interface residue. In [38], the prediction was performed using a logistic regression function that receives as inputs the counts contained in the contact frequency maps. In [39], the logistic regression function was replaced by a support vector machine (SVM) classifier that uses accessible surface areas and the counts contained in the contact frequency maps to perform the prediction.
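The contact frequency map described above can be sketched in a few lines. This is only an illustration of the counting step, not the PredUs implementation: it assumes that each structural alignment is already available as a list of (query residue index, neighbor residue index) pairs and that the interface residues of each neighbor are given as a set of indices. The function name and the toy alignments are hypothetical.

```python
def contact_frequency_map(query_length, alignments):
    """Count, per query residue, the aligned residues that are interface residues.

    `alignments` is a list of (aligned_pairs, interface_residues) tuples, one
    per structural neighbor (an assumed input format). `aligned_pairs` holds
    (query_index, neighbor_index) pairs and `interface_residues` is a set of
    neighbor indices annotated as interface residues.
    """
    counts = [0] * query_length
    for aligned_pairs, interface_residues in alignments:
        for query_index, neighbor_index in aligned_pairs:
            if neighbor_index in interface_residues:
                counts[query_index] += 1
    return counts


# Hypothetical example: a 5-residue query aligned with two structural neighbors.
alignments = [
    ([(0, 10), (1, 11), (2, 12)], {11, 12}),
    ([(1, 3), (2, 4), (4, 6)], {4, 6}),
]
cfm = contact_frequency_map(5, alignments)
print(cfm)  # [0, 1, 2, 0, 1]
```

In the methods described in the text, counts of this kind are the inputs to the logistic regression function or, together with accessible surface areas, to the SVM classifier.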
To perform a fair comparison between PrISE and PredUs, the structural elements used by PrISE and the structural neighbors used by PredUs were extracted from the same dataset of proteins. This dataset corresponds to the subset of proteins that are common to both ProtInDB and PQS, which ensures the largest overlap between the proteins used by PredUs (which relies on the structural neighbors extracted from the PDB and PQS) and PrISE (which relies on the proteins extracted from biological assemblies in the PDB and deposited in ProtInDB). The resulting dataset, used to create the ProtInDB ∩ PQS repository, includes 55,974 protein chains derived from 21,786 protein complexes. PredUs predictions were obtained from the available web server [39]. This server allows the user to choose the set of structural neighbors to be considered in the prediction process. Using this feature, we were able to exclude from the sets of structural neighbors those proteins that were not in the intersection of ProtInDB and PQS, as well as homologs or homologs from the same species.
A first comparison of the PrISE family of predictors and PredUs was carried out using the DS188 dataset. However, since the SVM used by PredUs was trained using this dataset [39], it is likely that the estimated performance of PredUs in this case is overly optimistic, resulting in an unfair comparison with PrISE. We found that in 7 of the 188 cases (corresponding to the PDB IDs and chains 1ghq-A, 1gp2-G, 1t6b-X, 1wq1-G, 1xd3-B, 1z0k-B, and 2ajf-A) PredUs failed to find structural neighbors, and hence failed to predict interfaces. In contrast, the PrISE predictors found the structural elements needed to produce predictions for all 188 cases. Predictions including these seven cases are labeled PrISEC 188 in Figure 4, whereas predictions of PrISEC on the set of 181 proteins are labeled with the suffix 181. The performance of PrISEC is similar in the two cases. PredUs generally outperforms PrISEC, the best performing predictor from the PrISE family. This result is not surprising given that the SVM used by PredUs was trained on this dataset, whereas PrISE did not have this advantage.
Figure 4 Comparison of PredUs and PrISEC using the dataset DS188, derived from the docking benchmark 3.0. (A) Performance of predictions from which homologs from the same species were not used to compute the structural neighbors and the samples used in PredUs ...
A second comparison of PrISEC and PredUs was performed using the DS56bound dataset. PrISEC and PredUs generated predictions for all the proteins in this dataset. The precision-recall curves presented in Figure 5 show that when homologs from the same species are excluded from the collection of similar structures, PrISEC outperforms PredUs, but when homologs are excluded regardless of the species, the performances of PrISEC and PredUs are comparable. These results indicate that the use of local surface structural similarity is a competitive alternative to the use of protein structural similarity for the problem of predicting protein-protein interface residues.
Figure 5 Comparison of PrISEC and PredUs using the dataset DS56bound, derived from CAPRI. The results in (A) correspond to predictions in which homologs from the same species were excluded from the collection of samples and the set of structural neighbors. ...
An evaluation considering additional performance measures is presented in Table . The data in this table indicate that PrISEC outperforms PredUs in terms of F1, correlation coefficient (CC), or area under the ROC curve. The values for precision, recall, F1, accuracy and CC were computed using the default cutoff values for PrISEC and PredUs.
Evaluation of PrISEC and PredUs on DS56bound using different performance measures
A final comparison between PrISEC and PredUs was performed using the DS56unbound dataset. Three of the 56 proteins (corresponding to the PDB IDs and chains 1ken-H, 1ken-L, and 1ohz-B) were not processed by PredUs because no structural neighbors were found. Figure 6 shows the precision-recall curves of PrISEC and PredUs on the 53 cases covered by PredUs, as well as the performance of PrISEC when all 56 proteins are considered. A comparison of both predictors using the set of 53 proteins and excluding homologs from the same species indicates that PrISEC outperforms PredUs for precision values > 0.4. In contrast, when homologs are excluded, the performance of PredUs is better than that of PrISEC for precision values ≥ 0.3. Finally, the performance of PrISEC computed on 56 proteins is, surprisingly, slightly better than the performance computed on 53 proteins. This suggests that interface prediction methods based on local structural similarity can be effective even in the absence of globally similar structures.
Figure 6 Comparison of PrISEC and PredUs using the DS56unbound dataset, derived from CAPRI. (A) shows the performance achieved after removing homologs from the same species from the set of similar structures in PredUs and PrISEC. (B) shows the performances when ...
An evaluation of PrISEC and PredUs using additional performance measures is presented in Table . PrISEC outperforms PredUs in terms of F1, CC and AUC when homologs from the same species are excluded from the set of similar structures. When homologs are excluded, PredUs outperforms PrISEC on the set of 53 proteins predicted by PredUs.
Evaluation of PrISEC and PredUs on DS56unbound using different performance measures
Comparison with other prediction methods
We compared the performance of PrISEC, Promate [25], PINUP [48], Cons-PPISP [49], and Meta-PPISP [50] using all the proteins in the DS56bound and DS56unbound datasets. The predictors used in this comparison were chosen based on the results of a comparative study in which they were reported to achieve the best performance among six different classifiers on two different datasets [8]. Promate uses a scoring function based on features describing evolutionary conservation, chemical character of the atoms, secondary structures, distributions of atoms and amino acids, and distribution of B-factors. Cons-PPISP's predictions are based on a consensus between different artificial neural networks trained on conservation sequence profiles and solvent accessibilities. PINUP uses an empirical scoring function based on side chain energy scores, interface propensity and residue conservation. Meta-PPISP uses linear regression on the scores produced by Cons-PPISP, Promate and PINUP.
In the experiments presented in this subsection, we considered the performance of two PrISEC classifiers according to which proteins were filtered out from the process of extraction of samples: homologs from the same species as the query protein and homologs regardless of the species. The scores used to generate the precision-recall curves of Promate, PINUP, Cons-PPISP and Meta-PPISP were computed using Meta-PPISP's web server.
The precision-recall curves corresponding to the evaluation of the classifiers on the DS56bound and DS56unbound datasets are shown in Figure 7. On both datasets, the PrISEC predictors outperform Meta-PPISP for precision values > 0.35 and achieve performance comparable to that of Meta-PPISP for precision values ≤ 0.35. Furthermore, the PrISEC predictors outperform Promate, PINUP, and Cons-PPISP over the entire range of precision and recall values.
Figure 7 Performance of different classifiers evaluated on the DS56bound (A) and the DS56unbound (B) datasets. For the PrISE classifiers, "spe." and "hom." show predictions in which samples extracted from homologs from the same species and homologs, respectively, ...
An evaluation considering additional performance measures is presented in Table . All the performance measures, with the exception of AUC ROC, were computed using threshold values of 0.56, 0.28, 0.41, 0.34, and 0.34 on the scores generated by Promate, PINUP, Cons-PPISP, Meta-PPISP, and PrISEC, respectively. These threshold values correspond to the default values defined in the Meta-PPISP and PrISEC web servers. The results show that the PrISEC predictors outperform the other predictors on both datasets in terms of F1, correlation coefficient and area under the ROC curve.
Evaluation on the datasets DS56bound and DS56unbound
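For readers who wish to reproduce threshold-based measures of this kind, the following sketch shows how precision, recall, F1, accuracy and a correlation coefficient can be derived from per-residue scores at a fixed cutoff. It rests on two assumptions that the text does not state: that CC denotes the Matthews correlation coefficient, and that a residue is predicted as interface when its score is greater than or equal to the threshold. The scores and labels below are invented toy values, not results from the paper.

```python
import math


def threshold_metrics(scores, labels, threshold):
    """Compute precision, recall, F1, accuracy and MCC at a score cutoff.

    A residue is predicted as interface when score >= threshold (assumed
    convention). `labels` holds 1 for interface and 0 for non-interface.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "cc": mcc}


# Invented toy scores and labels, evaluated at the 0.34 cutoff quoted in the text.
scores = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
metrics = threshold_metrics(scores, labels, threshold=0.34)
```

Unlike these measures, the area under the ROC curve does not depend on a single cutoff, which is why it is reported separately from the thresholded values.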
The results of an experiment using 187 proteins from the DS188 dataset are presented in Figure 8. Protein chain 2vis-C was excluded from the experiment because Promate could not generate a prediction for it. When homologs from the same species are excluded, PrISEC outperforms the other predictors except Meta-PPISP. PrISEC outperforms Meta-PPISP for precision values > 0.4 and achieves performance comparable to that of Meta-PPISP for precision values ≤ 0.4. When homologs are excluded, the performance of PrISEC is superior to that of PINUP and Promate. PrISEC outperforms Meta-PPISP and Cons-PPISP for precision values > 0.5, and is outperformed by Meta-PPISP for precision values ≤ 0.45.
Figure 8 Precision-recall curves of different classifiers evaluated on 187 proteins from the DS188 dataset. For the PrISE classifiers, "spe." and "hom." show predictions in which homologs from the same species and homologs, respectively, have been excluded from ...
An evaluation using different performance measures is presented in Table . According to this table, the performance of both PrISE predictors is superior to that of the other classifiers in terms of F1 and CC. Furthermore, when homologs from the same species are excluded, PrISEC outperforms the other classifiers in terms of AUC.
Evaluation on 187 proteins from DS188
Prediction performances in the absence of similar proteins
To evaluate the extent to which the performances of PrISEC and PredUs depend on the degree of homology between the query proteins and the proteins used to extract samples or structural neighbors, we compared the results obtained using three different sequence identity cutoffs: 95%, 50% and 30%. The results, shown in Figure 9, indicate that PredUs is more sensitive than PrISEC to the lack of similar proteins in the sets used to extract similar structures. The figure also shows that the performance of PrISEC is competitive with that of Meta-PPISP even when the repository used by PrISEC is composed of proteins sharing < 30% sequence identity with the query proteins.
Figure 9 Performance computed in the absence of similar proteins at different similarity levels. Figures (A) and (B) show the precision-recall curves computed after excluding from the sets of similar structures homologs (regardless of species) sharing ≥ ...