The aim of this study was to establish structural descriptors for classifying individual peptide bonds within proteins and then to estimate the power of these descriptors in predicting susceptibility to proteolysis. We reasoned that features (structural descriptors) with high predictive power reflect the aspects of protein structure that permit proteolytic cleavage and that these regions have a higher probability of containing regulatory cleavage sites. To accomplish this objective, we performed a statistical analysis on a set of ~315 documented proteolytic events in 123 proteins with known or predicted 3D structures.
The structural descriptors can be segregated into three main categories (exposure, flexibility, and local interactions) based on the features they describe. Based on results from two complementary approaches (statistical hypothesis testing and machine learning–based classification), the descriptors with the highest predictive power are those relating to protein topology: solvent accessibility, protrusion index, molecular surface accessibility, and, to some extent, depth index. Descriptors that reflect the flexibility of polypeptide chains (like B-Factor and disordered regions) had the second-highest predictive power. Finally, descriptors that convey the strength of local interactions in proteins (like hydrogen bonding and secondary structure loops) had the lowest predictive power. The rank order of the predictive power of these features holds both for proteins of known 3D structure and for proteins for which we had only homology-based models. Taken together, these observations suggest that the ability of a protease to physically access a peptide bond is the most critical factor in determining susceptibility to proteolysis. Other structural descriptors from our list did not exhibit significant correlation with cleavage sites (Supplementary Files 6
The observed consistency of all types of estimations between the two protein datasets suggests that structural models [53] can be used for cleavage site prediction almost as efficiently as solved structures. This is particularly important in the context of rapidly improving structural modeling techniques and the expanding structural coverage of protein families in the PDB. On the other hand, structural descriptors deduced solely from amino acid sequences showed relatively poor performance. The only association with cleavage sites at a significance level comparable to that of genuine structural descriptors of moderate predictive power (from the local interactions category) was found for predicted solvent accessibility. This observation indicates that, despite the obvious benefits of utilizing readily available protein sequences, our current ability to deduce useful structural descriptors is limited by relatively scarce 3D structural data.
Overall, the main conclusions of our study are in general agreement with the first systematic survey by Hubbard et al., which implicated exposure, flexibility, and local interactions as important structural determinants of limited proteolysis. At the same time, our statistical analysis made it possible, for the first time, to rank these three categories of structural features by their relative significance. Among the few contradictions between our specific findings and the conclusions of Hubbard et al., we assigned a high predictive power to protrusion index, a likely result of the difference in calculation methods.
Another interesting finding is that B-Factor, a feature reflecting an atom's thermal motion, showed the best correlation with proteolytic susceptibility only when normalized in the context of each protein substrate, whereas for most other good descriptors, both normalized and raw values produced comparable results. This observation suggests that the flexibility observed in the actual 3D structure (which captures only a subset of many possible conformations of the protein molecule) should be considered in relative terms within a given structure rather than between the structures, which might be at least partially due to the inconsistent reporting of B-Factor between structures [54
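The idea of treating B-Factor in relative terms within a structure can be illustrated with a minimal sketch. The exact normalization used in this study is not restated here, so the z-score below (subtract the per-structure mean, divide by the per-structure standard deviation) is an assumption, and the function name and B-Factor values are hypothetical.

```python
from statistics import mean, pstdev


def normalize_b_factors(b_factors):
    """Return per-structure z-scores for a list of per-residue B-Factors.

    Assumption: a z-score is used as the within-structure normalization;
    the study itself may have used a different scheme.
    """
    mu = mean(b_factors)
    sigma = pstdev(b_factors)
    if sigma == 0:
        # All residues equally mobile: no residue stands out.
        return [0.0 for _ in b_factors]
    return [(b - mu) / sigma for b in b_factors]


# Hypothetical B-Factors for two structures reported on different scales.
protein_a = [10.0, 12.0, 14.0, 30.0]
protein_b = [40.0, 48.0, 56.0, 120.0]  # same profile, four times larger

z_a = normalize_b_factors(protein_a)
z_b = normalize_b_factors(protein_b)
```

In this toy case the raw values differ fourfold between the two structures, yet the normalized profiles coincide, so the most mobile residue is flagged equally in both.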
The relative probability of proteolytic processing within different types of secondary structure remains a subject of conflicting reports. While many early studies indicated that proteases cleave mostly in loops, our analysis revealed a lower but substantial probability of cleavage in helices. These conclusions are consistent with some recent reports [12]. Cleavage in β-strands is still commonly perceived as highly unlikely, if not impossible [17]. Nevertheless, a more rigorous statistical analysis, which accounts for differences in the relative content of all types of secondary structure elements in different proteins, revealed an appreciable (albeit lower) frequency of limited proteolysis in β-strands.
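One simple way to account for secondary structure content is to compare the share of cleavage sites found in each class with the share of all peptide bonds in that class. The sketch below is only an illustration of that idea, not the statistical procedure used in this study; the function name and the toy counts are hypothetical.

```python
from collections import Counter


def cleavage_enrichment(bonds):
    """Return, per secondary structure class, the ratio
    (fraction of cleaved bonds in class) / (fraction of all bonds in class).

    `bonds` is a list of (class_label, is_cleaved) pairs, one per peptide bond.
    A ratio above 1 means the class is over-represented among cleavage sites.
    """
    total = Counter(ss for ss, _ in bonds)
    cleaved = Counter(ss for ss, is_cleaved in bonds if is_cleaved)
    n_bonds = len(bonds)
    n_cleaved = sum(cleaved.values())
    if n_cleaved == 0:
        raise ValueError("no cleaved bonds in the input")
    return {
        ss: (cleaved[ss] / n_cleaved) / (total[ss] / n_bonds)
        for ss in total
    }


# Hypothetical counts: 40 bonds, 8 of them cleaved.
toy_bonds = (
    [("loop", True)] * 4 + [("loop", False)] * 6
    + [("helix", True)] * 3 + [("helix", False)] * 17
    + [("strand", True)] * 1 + [("strand", False)] * 9
)

enrichment = cleavage_enrichment(toy_bonds)
```

In these toy numbers helices hold three times as many cleavage sites as strands, but they also hold twice as many bonds, so the gap between the two classes narrows once content is taken into account.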
The relevance of cleavage in β-strands is supported by many well-documented and physiologically important proteolytic events. For example, cleavage inside the edge β-strand of the birch profilin β-sheet was reported for mast cell alpha-chymase in conjunction with the attenuation of the allergic response [55]. Two cleavage sites at the edges of β-strands were reported for lactoferrin as a result of the autoproteolytic activity of this iron-binding protein, which is associated with mammalian non-immune defense against pathogens [56]. Cleavage of an internal strand of a β-sheet, which is also the N-terminus of the protein, was reported for actin with two different proteases, caspase-1 and Granzyme B [57]. The latter protease was also reported to cleave the internal β-strand proximal to the N-terminal strand in alpha-enolase [58
The examples listed above also illustrate the tendency of cleavage sites to occur either close to the edges of β-strands or inside β-strands located at the edge of a β-sheet. This trend, revealed by detailed examination of our 3D structural gallery of visualized proteolytic events (Supplemental Files 3 and 4), is consistent with some earlier suggestions [17]. It likely reflects the tendency of β-sheet perimeter residues to be exposed and to have lower hydrogen bonding energy than internal residues. Interestingly, in all 4 cases (of 26 examined) of truly internal cleavage sites in β-sheets, the respective strands were located very close to the N- or C-termini.
Among the significant structural descriptors that we assessed for the first time, the most interesting behavior was observed for the depth index, which measures the distance from the peptide bond to the surface of the protein. When evaluated by the F10-score metric, depth index was the most important feature for cleavage, although by other metrics it ranked behind accessibility, protrusion, and packing index. Interestingly, depth index appears to be of particular importance for cleavage sites with relatively low solvent accessibility. Visual inspection of representative poorly accessible cleavage sites revealed favorable values of depth index. Remarkably, in most such cases, the respective peptide bonds, located at relatively low depth, appeared to be “shielded” by loops with high B-Factor values. It is tempting to speculate that access of a protease to such a bond might be granted by the mobility of a loop. This interpretation is consistent with the utmost importance of accessibility (exposure), even when it is masked by “freezing” a protein in a particular crystallizable conformation. Using a combination of descriptors (as opposed to any single descriptor) opens the possibility of resolving at least some of these difficult cases, although this would likely require additional rule-based approaches.
However, the conventional machine learning classification methods used in this study provided proof of concept that combining structural descriptors substantially improves the accuracy of cleavage site prediction. Thus, the linear SVM approach, despite the apparent simplicity of its scoring method, allowed us to predict ~90% of cleavable bonds while increasing the number of bonds excluded from consideration by 15% compared to the best individual descriptor.
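The trade-off described above, retaining a target share of cleavable bonds while excluding as many other bonds as possible, can be sketched with a linear score of the kind a linear SVM produces (a weighted sum of descriptors plus a bias). The weights below are hand-picked and hypothetical rather than trained, and the two descriptor columns and all values are invented for illustration; this is not the model fitted in this study.

```python
import math


def linear_score(features, weights, bias=0.0):
    """Weighted sum of descriptor values plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def threshold_for_recall(scores, labels, target_recall=0.9):
    """Return the highest score cutoff that still keeps at least
    `target_recall` of the cleavable (label 1) bonds at or above it."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("no cleavable bonds in the input")
    k = math.ceil(target_recall * len(positives))
    return positives[k - 1]


def excluded_fraction(scores, threshold):
    """Fraction of all bonds scoring below the cutoff."""
    return sum(s < threshold for s in scores) / len(scores)


# Hypothetical bonds: (relative accessibility, relative flexibility), label.
toy_features = [(0.9, 0.8), (0.8, 0.3), (0.6, 0.9), (0.2, 0.4),
                (0.3, 0.2), (0.1, 0.1), (0.5, 0.1), (0.2, 0.1)]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_weights = (2.0, 1.0)  # hypothetical, not fitted

toy_scores = [linear_score(f, toy_weights) for f in toy_features]
cutoff = threshold_for_recall(toy_scores, toy_labels, target_recall=0.75)
excluded = excluded_fraction(toy_scores, cutoff)
```

Comparing `excluded` for a combined score against the same quantity for a single descriptor, at an equal recall target, is one way to express the kind of gain reported for the combined classifier.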
This study provides a statistical foundation for the automated and accurate prediction of regions within proteins that have a high propensity for cleavage by endopeptidases. Our analysis suggests that approximately one-third of all peptide bonds in an average protein have the potential to be proteolytically processed based on their structural properties. By combining structure-based predictions common to many proteases with the sequence-based preferences of a given protease, we expect to achieve more accurate mapping of individual cleavage sites. In a general sense, this combined strategy has shown promise when applied to caspases [59]. The main distinction of the analysis described here is that it provides a solid statistical foundation for the extraction of structural features of general utility, potentially applicable to numerous regulatory proteases implicated in a variety of pathways and syndromes. These findings set the stage for the development of a new generation of software tools for accurate structure-based predictive modeling of regulatory proteolysis and other post-translational modifications. Such computational tools would find numerous applications in proteomics research, for example in the rapidly developing field of degradomics or N-terminomics [61