Protein–peptide specificity is an important biological phenomenon. In many systems, a protein recognizes a specific amino acid sequence, often located on a flexible, solvent-exposed region of another protein. Such systems include the modular scaffolding PDZ domains, which recognize specific sequences on the C-terminal tails of their substrates (Jemth and Gianni, 2007); multifunctional SH3 domains, which recognize a linear motif of the form Pro-Xaa-Xaa-Pro (Kaneko et al.); and class I MHC proteins, which bind nine-residue peptides with specificity varying across different MHC molecules (Sieker et al.).
Here, we focus on the protein–peptide specificity of the pro-apoptotic proteases granzyme B (GrB) and caspases interacting with their respective protein substrates. GrB is a serine protease delivered by natural killer cells into virally infected and tumor cells (Pardo et al.; Russell and Ley, 2002). The caspases are a family of endogenous cysteine proteases activated by extracellular death ligands and environmental stresses (Nicholson and Thornberry, 2003). Both protease types recognize and cleave specific peptide sequences containing an aspartic acid residue on their target substrates, activating different pathways that lead to cell death. Identifying these substrates has led to a wealth of knowledge about how the proteases contribute to apoptosis, how the cleavage events lead to cell death, and which substrates to target for therapeutic purposes.
Substrates of the two protease types have been discovered with a variety of experimental techniques, ranging from low-throughput gel-based methods to proteomic efforts that can identify hundreds of cleaved proteins (Bredemeyer et al.; Casciola-Rosen et al.; Dix et al.; Mahrus et al.). However, different datasets overlap only partially, indicating that many substrates remain to be identified. For example, two proteomics studies reported 261 and 292 caspase cleavage sequences, respectively, yet the high-confidence overlap between the two sets was only 64 [A in Johnson and Kornbluth (2008)].
Flowchart of procedure. Peptides are scored with the SVM trained on sequence and structure features; the peptides that pass the cutoffs derived from benchmarking are the final candidates for experimental validation.
To reduce this gap, accurate computational techniques could be used to predict protein–peptide interactions and guide further focused experiments. Several approaches have been taken for the systems described above. For PDZ interactions, examples of such methods include position-specific scoring matrices (PSSMs; Stiffler et al.) and Bayesian inference (Chen et al.). SH3 binding partners have been predicted with neural networks (Ferraro et al.; Zhang et al.), and MHC class I interactions have been predicted with support vector machines (SVMs; Jacob and Vert, 2008). Finally, molecular docking methods have been developed to analyze both systems (Bui et al.; Hou et al.).
Computational methods have also been applied to predict substrates recognized by GrB and caspases. These methods take advantage of both protease types having a near-absolute requirement for Asp at the P1 position, while allowing degenerate preference for different residue types in the positions immediately surrounding P1. These studies rely on fixed sequence searches (Wilkins et al.), PSSMs based on frequencies of residue types in known cleavage sites (Garay-Malpartida et al.; Lohmüller et al.; Verspurten et al.) and positional-scanning combinatorial substrate libraries (PS-SCLs; Backes et al.; Boyd et al.), SVMs using residue composition around the cleavage site (Wee et al.), and Bayesian neural networks (Yang, 2005).
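As a rough illustration of the sequence-only strategy described above (a hard Asp requirement at P1 combined with a frequency-based PSSM over the surrounding positions), the following minimal Python sketch may help. It is not the implementation used by any of the cited studies. The function names, the uniform background, the pseudocount, the restriction to a P4-P1 window, and the toy training windows are all assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy P4-P1 windows for illustration only; NOT a real training set.
TOY_SITES = ["DEVD", "DEVD", "IETD", "DETD"]


def build_pssm(sites, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length site windows.

    Assumes a uniform background frequency (1/20) for every residue.
    """
    length = len(sites[0])
    background = 1.0 / len(AMINO_ACIDS)
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def scan(sequence, pssm, p1_index=3):
    """Score every window whose P1 residue is Asp (D).

    Returns (1-based P1 position, window, score) tuples, best score first.
    Windows containing non-standard residues are skipped.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if window[p1_index] != "D":
            continue  # hard filter: Asp is required at P1
        if any(aa not in AMINO_ACIDS for aa in window):
            continue
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        hits.append((start + p1_index + 1, window, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    pssm = build_pssm(TOY_SITES)
    for position, window, score in scan("GSDEVDGAKLAD", pssm):
        print(position, window, round(score, 2))
```

A scorer of this kind sees only the local sequence around each Asp; it has no access to the structural context of the candidate site.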
Cleavage sequences for both GrB and caspases are generally thought to occur on flexible, disordered regions of substrates (Hubbard, 1998). However, an earlier analysis of caspase substrate structures showed that many known cleavage sites lie in α-helices and occasionally even on β-strands (Mahrus et al.; Timmer et al.). This observation motivates the choice of a machine learning algorithm that relies on structural as well as sequence information. Here, we describe such a protocol incorporating SVM learning. The method is trained and benchmarked on separate pools of known GrB and caspase cleavage sequences. It is then applied to the human proteome to generate a list of high-confidence predictions for experimental validation. Two such candidates, the proteins AIF-1 and SMN1, are experimentally validated as being cleaved by GrB. The approach has the potential to provide greater coverage of substrates for both GrB and caspases, and can be easily adapted to other protein–peptide systems through our web server, which can learn from any user-supplied protein–peptide training set.