Features for SVM
In we list the Pearson’s correlation coefficients among the three features and
is anti-correlated with
: loops with more residues with low probability dihedral angles in the unbound state are more likely to move upon complex formation. In contrast, ZB
are positively correlated with
, since flexible loops show more thermal motion in the crystal structure, and loops exposed to solvent are less constrained and therefore more likely to move. In , we show histograms that represent the ability of the three features to discriminate between stationary and mobile loops. While overlap between the distributions is considerable in each case, the separation is statistically significant. For example, SR
has the best separation with a p-value of 2.2 × 10−35
computed using the Wilcoxon rank-sum test. Most of the stationary loops (70%) have SR
values that are positive, compared with only 30% of mobile loops. The p-value calculated between the classes of loops by comparing histograms of RASA
is 1.63 × 10−17
and the mean for mobile loops is shifted by 10% toward larger values. Similarly, 63% of the mobile loops have ZB
< 0, compared with 36% of stationary loops; the p-value on ZB
distributions for mobile and stationary loops is 1.6 × 10−8
. It is interesting to note that mobile and stationary loops do not show a significant difference in their residue compositions: the p-value is 0.47 ().
Since the features also correlate amongst themselves, the performance of the SVM cannot simply be the sum of the correlations of the features with
. The relative importance of the features in the SVM can be assessed using the f-score27
, which can quantitatively rank the discriminative power: a feature with higher f-score has better discriminative power. SR
has the highest f-score of 0.5, which is followed by ZB
with an f-score of 0.24 and RASA
with 0.16. This indicates that all three features contribute substantially to the performance of the SVM. The order based on f-score is the same as the order based on correlation with
(). Nonetheless, although SR
have similar (absolute) correlation coefficients, their f-scores differ more than a factor 2, reflecting the dependence between these two features.
Flexibility Prediction with SVM
as features, we tested the SVM performance with the four kernel functions in LIBSVM. The prediction accuracies for the BM2 loops, using the 4-fold cross validation as described in the Methods, are 72.1%, 71.1%, 75.3%, and 68.9% for the linear, polynomial, radial basis, and sigmoid kernel functions, respectively. Since the radial basis clearly performed best, we used it for the remaining analyses and predictions in this section. The discriminant score was used to calculate a receiver operating characteristic (ROC) curve and the area under the curve (AUC) value. The average AUC was 0.79 with a standard deviation of 0.03. The averaged ROC curve is shown in (in black), indicating that the SVM predicts the mobility of loops considerably better than random (AUC = 0.5). It is interesting to note that the first 16 predictions were all false positives, as indicated by the horizontal section at the beginning of the ROC curve. These stationary loops all have unfavorable backbone dihedral angles; their SR
are lower than −2 (the average for stationary loops is 0.7), which caused them to be incorrectly predicted as flexible.
Figure 2 SVM loop-level flexibility prediction accuracy for various levels of confidence cutoffs. The AUC of random prediction is 0.5. The AUC is 0.79 with default cutoff of 0.5. Average ROC curves with confidence cutoffs at 0.6 with 154 loops, 0.7 with 126 loops, (more ...)
As a second assessment of the SVM, we define the accuracy of a prediction as the number of correct predictions (true positives + true negatives) divided by the total number of predictions (true negatives + false negatives + true positives + true negatives):
With the default discriminant score cutoff of 0.5 for the SVM, we achieved an accuracy of 75.3%, much higher than the baseline accuracy of 50%.
The SVM also reports a quantitative measure for the confidence of its predictions, referred to as the confidence probability. The confidence probability allows a confidence cutoff to be set such that predictions are accepted only when the confidence probability is higher than the cutoff. Indeed, the accuracy and AUC both increase when the predictions are restricted to higher confidence probability cutoffs. The prediction accuracy improves as a higher confident cutoff is applied, as is shown in . For example, a confidence probability cutoff of 0.7 yields an accuracy of 80.5% and AUC of 0.81.
We also investigated the difference in performance of the SVM between interfacial loops and non-interfacial loops. We separated the 848 surface loops into two classes, 237 interfacial loops and 611 non-interfacial loops, and followed the training and testing procedures described in the methods for these two groups separately. The AUC is 0.71 ± 0.01 for the interfacial loops and 0.80 ± 0.01 for the non-interfacial loops, indicating that the predictability of SVM is affected by the binding partner proteins, which suggests that protein surface loop mobility is also affected by the binding partner proteins. Furthermore, we trained a new SVM using the same three features to distinguish between interfacial and non-interfacial loops. For this test, we randomly selected 75% of interfacial loops and 75% of non-interfacial loops for training and the rest of 25% were used for testing. The AUC was 0.55 ± 0.01 and the average prediction accuracy was 63% ± 2.2%. This is only slightly better than random (AUC = 0.5 and accuracy = 61%), indicating that the features we used are not promising in the prediction of protein-protein binding interface.
In addition to use the SVM to predict the mobility of entire loops, we also attempted to predict the conformational change of individual residues. Such knowledge would allow one to selectively search the backbone conformational space in protein-protein docking applications. We restricted the dataset to the 3482 residues in the 339 mobile loops, classified as mobile (Δi > 30, 2146 residues) or stationary (Δi ≤ 30, 1335 residues). Using the SVM with the residue-specific features SR, RASA and ZB, we performed 10 repeated tests by randomly picking 75% loops in the mobile set and 75% loops in the stationary set for training, and the rest of 25% for testing. Our average accuracy was 68% ± 1.8 % (compared with the baseline 61.3%) and average AUC was of 0.72 ± 0.04 (compared with the baseline 0.5). We also tested hydrophobicity and molecular weight of each residue as one of the SVM features, but their inclusions did not improve the prediction accuracy or AUC.
Validation test on an independent dataset
As an independent test, we applied the SVM to the unbound structures that were newly added to Benchmark 3.028
and Benchmark 4.029
(BM3+BM4 includes 149 unbound structures from which 512 loops were extracted). Training with the loops of BM2 and using the radial basis kernel, the prediction accuracy for the independent set BM3+BM4 is 70.5% compared with 75.3% for the BM2 loops alone using 4-fold cross-validation. This moderate decrease in performance is likely due to the BM3+BM4 loops being harder to predict: when we train and test on BM3+BM4 loops, using 4-fold cross-validation and the radial basis kernel, we obtain an accuracy of 69.7%, which represents a similar decrease in performance.
Whereas we observed a 6% difference in performance between the best and worst performing kernel functions for 4-fold cross-validation on the BM2 loops, this difference is reduced to 3% when testing using the BM3+BM4 loops. We obtained accuracies of 69.1%, 71.1%, 70.5%, and 69.2% for linear, polynomial, radial basis, and sigmoid, respectively (Supplementary table 4
). Note that here the radial basis is the second best performing kernel function. The differences among the four kernels, however, are small.
Application to the Ras superfamily of proteins
The Ras superfamily is a well known family of molecular switches in signal transduction that control cell growth and differentiation by alternating between the activated (GTP bound) and inactivated (GDP-bound) forms.30–32
The docking benchmark version 2.0 contains 11 complexes that involve members from the Ras superfamily, and here we use this set to further illustrate the behavior and performance of our method for loop flexibility prediction. We first give a detailed analysis of the H-Ras protein binding to RasGAP, followed by comparison with other proteins from the Ras superfamily: Ras, Ran, Rac and Cdc42.
H-Ras forms a complex with the GTPase activating protein RasGAP. The Unbound H-Ras protein is the activated form, bound with a GTP analog GDPCP, and the H-Ras protein bound to RasGAP is in the inactivated form with GDP.30
and illustrate the conformational changes of these loops upon binding RasGAP. Switch I (loop 1) and Switch II (loop 2) are involved in the binding process, and indeed show large differences between the bound and unbound form. Loop 4 is not in the interface, but does show modest differences between the bound and unbound forms. Loops 3 and 5 are stationary. shows the SR
, and ZB
features for the unbound structure, as well as the measure for loop mobility, Δ. SR
shows highly unlikely backbone conformations for Switch I and Switch II, with 4 and 6 instances of SR = SRmin
for switch I and switch II respectively, and high ZB
values. These values indicate a high probability of these loops being flexible, and are consistent with the observed Δ’s. Loop 4 is classified as a mobile loop, although the RMSD calculation after structural alignment is low (0.55 Å). The motion of this loop corresponds to the inversion of two consecutive backbone dihedrals, which results in large Δ but small RMSD.
Figure 3 Superimposed unbound form (loops in red, PDB code 6Q21, chain D) and bound form (loops in blue, PDB code 1WQ1 chain R) of H-Ras. Figure generated using PyMOL (www.pymol.org). loop 1 (switch I: residue number 30–38), loop 2 (switch II: residue (more ...)
Conformational change and values of the features used for the SVM of H-Ras complexes with Ras GAP. Loop 1 and loop 2 correspond to Switch I and Switch II, respectively.
We trained an SVM as described in the previous section, using the set of loops extracted from BM2 for training, excluding all loops from the Ras superfamily. We then applied the SVM to the loops of all proteins from the Ras superfamily in our database. The SVM correctly predicts H-Ras loops 1, 2 and 4 to be mobile, and loops 3 and 5 to be stationary.
We investigated the features at the residue level in detail to gain a better understanding of the mobility of switch I and switch II. Switch I is composed of 9 residues (Supplementary table 2
), 4 of which have SR = SRmin
in the unbound protein (shown in bold): D
-I-E-D. Upon binding, the following changes in Ramachandran angles Δ are found: 30D
:175 and 35T
:241. Given the overall loop motion, it is perhaps not surprising to note that some residues adjacent to these 4 residues also show a large degree of motion upon binding. Indeed, the adjacent residues with SR
values 32Y:0.65, 36I:2.82 have Δ values of 32Y:149, 36I:169. However, 34P shows a Δ of only 31 degrees, despite being flanked by two residues with SR = SRmin
. This is caused by the rigidity provided by Proline’s native backbone restriction. The switch II loop is composed of 8 residues, of which 6 residues have SR = SRmin
). Again, most of these residues show large values for Δ, as well as those residues flanked by residues with SR = SRmin
. This observation, that mobile loops may contain some residues of which SR
does not suggest loop motion, is consistent with the superior predictive power of loop-averaged features over residue-level features for loop motion.
We applied the SVM to the remaining Ras proteins from docking benchmark 2.0. presents the results on Switch I and Switch II, since only these loops are involved in the binding process. The SVM predicts the flexibility of 92.8% of the given Ras superfamily loops correctly, which is higher than the 75.3% prediction accuracy using the entire set of loops, as described in the previous sections. Interestingly, the mobility of Switch II of Ran depends on the binding partner: with RCC1 the loop is mobile, while it is stationary with nuclear transport factor 2. The SVM prediction, however, is based solely on the unbound structure of Ran, which therefore results in one wrong prediction. Also for H-Ras we have two complexes. But in those cases, we assigned two different unbound structures for the Ras protein when we originally constructed the docking benchmark 2.0, due to the different GTP analogs in the structures. One unbound H-Ras structure (PDB code 821P) binds GTN. Its Switch I is classified as a stationary loop when compared with both complex structures (PDB code 1WQ1 and 1HE8), and our SVM correctly predicted so, based on the high SR scores (0.94±1.73) and low ZB values (−0.17±0.49) for the residues in this loop. The other unbound Ras structure (6Q21) binds GCP. Its Switch I is classified as a flexible loop when compared with both complex structures, and our SVM also correctly predicted so, based on the much lower SR scores (−1.82±5.86) and higher ZB values (1.47±0.49) for the residues in this loop. The two different unbound Ras structures have the same protein sequence but different cofactors, indicating that cofactors can affect the flexibility of protein loops significantly, and that this is accurately recognized by our SVM.
SVM predictions for Switch I and Switch II flexibility of the Ras superfamily proteins.
Finally, we performed prediction on the binding partners of the Ras superfamily proteins. There are a total of 43 surface loops, including 18 mobile loops and 21 stationary loops (4 loops are classified undefined). 14 out of 18 mobile loops were correctly predicted and 15 out of 21 stationary loops were correctly predicted. This corresponds to a prediction accuracy of 74.4%, which is similar to the prediction accuracy of the entire set of proteins. Supplementary table 3
summarizes the SVM prediction results for Ras protein binding partners.