Many different aspects of cellular signalling, trafficking and targeting mechanisms are mediated by interactions between proteins and peptides. In the immune system, for example, the major task of recognizing foreign pathogen proteins is mediated by interactions between Major Histocompatibility Complex (MHC) molecules and short pathogen-derived peptides (see Fig. ). T-cells recognize these peptides only when they are bound to MHC molecules. Understanding the underlying principles of MHC-peptide interactions is therefore a problem of fundamental importance, with applications to vaccine and drug design [3
]. MHC binding is a challenging problem because MHC molecules exhibit high specificity – it is estimated that each molecule only binds to 1% of all existing peptides [4
]. Additionally, MHC molecules are highly polymorphic and polygenic – there are hundreds of different alleles in the entire population while each individual carries a few alleles only (up to 6 MHC class I alleles and up to 12 MHC class II alleles) [5
Schematized drawing of a peptide in the binding groove of MHC class I (a) and MHC class II (b) molecules. The peptide backbone is shown as a string of balls, each of which represents a residue.
Biochemical assays, which empirically test protein-peptide binding affinity, can nowadays provide a rather high throughput rate [6
]. However, note that there are 20L
peptides of length L
(for 9 amino-acid long peptides as in the MHC proteins this amounts to 1012
peptides) and a great number of proteins that need to be considered. Therefore, in recent years, there has been a growing interest in developing computational methods for protein-peptide binding prediction [7
]. Formally, the protein-peptide binding prediction problem can be stated as follows: given a protein and a peptide, predict the binding affinity of the interaction between the two. Stated this way, the protein-peptide binding prediction is essentially a simplified version of the more general protein docking problem.
What should we expect from a "good" binding prediction algorithm? [8
1. Classification: A good binding prediction algorithm should first and foremost correctly predict whether a query peptide (which was not provided during the training stage) binds or does not bind to the given protein.
2. Ranking: An even stronger requirement is that the algorithm could also obtain a relative binding score for each peptide that can be used to rank different peptides according to their specificity.
3. Affinity prediction: Ultimately, the algorithm's score would predict the precise binding affinity values as determined experimentally.
Clearly, current state-of-the-art prediction methods obtain promising classification
results (for a recent comparison between several methods see [14
]). Many of these methods also compute binding scores for each peptide, but these scores are in most cases not even compared to the empirically known affinity values, and have even been shown to have poor correspondences in some cases [15
] (an interesting exception is a recent work on the PDZ domain [16
Most prediction algorithms formalize the protein-peptide binding prediction problem as a binary classification problem: For each protein (i.e. MHC molecule) a classifier is trained to distinguish binding peptides from non-binding peptides. After an initial training stage, the classifier is tested on a set of peptides, which were not presented during the training stage. The training data consists of experimentally determined binders and randomly generated peptides which are assumed to be non-binders. Interestingly enough, only rarely are experimentally determined non-binders used, mainly because a small number of these non-binders have been made publicly available.
In this paper we suggest a novel formulation of the protein-peptide binding prediction problem. Our approach is driven by the following two important observations:
Observation 1 Peptides that bind to the same protein are "similar" to one another, and different from non-binding peptides.
This observation underlies most, if not all, computational prediction methods. Motif based methods [8
] for example, search for a binding motif that captures the similarity of a set of known binding peptides. Prediction is then based on the similarity of a query peptide to the motif, which implicitly measures the similarity of the query peptide to the peptides in the training set. This observation suggests that given a novel peptide, one can predict its binding affinity to a specific protein, by explicitly measuring the average distance (or similarity) of the novel peptide to a list of known binding peptides. Intuitively speaking, if the peptide is close to the known binders, we would classify it as a binder, and if it is far – we would classify it as a non-binder.
Observation 2 Peptides binding to different proteins within the same "family" resemble each other
Proteins from the same family (e.g. MHC class I) are known to have structural and sequential similarity. Therefore they bind to peptides that share common characteristics. Additionally, in MHC class I and MHC class II, many proteins are grouped into supertypes [17
] (such as the HLA-A2 MHC class I supertype). A supertype is a collection of proteins whose binding peptide sets are overlapping. This observation implies that we may benefit from simultaneously
learning a single binding prediction function over an entire family of proteins, instead of independently learning a single
classifier for each of the proteins within the protein family. At a first glance it might appear that one can recast a set of binary classification problems using a single multi-class classification problem. However, a closer look reveals that the protein-peptide binding problem is not
a multi-class classification problem due to the following inherent facts: (1) Some peptides bind to several proteins within a family (indeed this information is used to define MHC supertypes). (2) A peptide that does not bind to a specific protein within a family, does not necessarily bind to a different protein within the family.
Our novel approach is based on the two observations described above. We propose to address the protein-peptide binding prediction problem by learning peptide-peptide distance functions. We do not require that the triangle inequality holds, and thus our distance functions are not necessarily metrics. Moreover, based on observation 2, we suggest to pool together information from an entire protein family and to learn a single peptide-peptide distance function (instead of learning a different distance function for every protein independently). Our peptide-peptide distance function is then used to compute protein-peptide binding affinity – the affinity of a query peptide to a given protein is inversely proportional to its average distance from all of the peptides known to bind to that protein. Our proposed learning scheme is summarized in Fig. and elaborated in the following section.
Learning peptide distance functions
As mentioned above, we propose to address the protein-peptide binding affinity prediction problem by learning a peptide-peptide distance function over an entire
family of proteins. A distance function
assigns a non-negative value for each pair of points. Most algorithms that learn distance functions make use of equivalence constraints [1
]. Equivalence constraints are relations between pairs of data points, which indicate whether the points in the pair belong to the same category or not. We term a constraint positive
when the points are known to be from the same class, and negative
in the opposite case. In this setting the goal of the algorithm is to learn a distance function that attempts to comply with the equivalence constraints provided as input.
In our setting, each protein defines a class. Each pair of peptides (data-points) which are known to bind to a specific protein (that is, belong to the same class) defines a positive constraint, while each pair of peptides in which one binds to the protein and the other does not – defines a negative constraint. Therefore, for each protein, our training data consists of a list of binding and non-binding peptides, and the set of equivalence constraints that they induce.
We collect these sets of peptides and equivalence constraints from several proteins within a protein family into a single dataset. We then use this dataset to learn a peptide-peptide distance function (see Fig. left plots). Using this distance function, we can predict the binding affinity of a novel peptide to a specific protein, by measuring its average distance to all of the peptides which are known to bind to that protein (see Fig. right plots). More formally, let us denote by
the distance between Peptidei
and by Bj
the group of peptides known to bind to Proteinj
. We define the affinity between Peptidei
Figure 3 Left: peptide-peptide distance matrices of MHC class I binding peptides, collected from the MHCBN dataset. Peptides that bind to each of the proteins were grouped together and labeled accordingly. Following Observation 1, a "good" distance matrix should (more ...)
In order to learn peptide-peptide distance functions, we use the DistBoost
], which learns distance functions using data and some equivalence constraints (see Methods for the algorithm's description). DistBoost
requires that the data be represented in some continuous vector feature space. We therefore represent each amino-acid using a 5-dimensional feature vector as suggested by [23
], and each peptide by concatenating its amino-acid feature vectors (for further details see the Data representation section). We compare our method to various protein-peptide affinity prediction methods on several datasets of proteins from MHC class I and MHC class II. The results show that our method significantly outperforms all other methods. We also show that on proteins for which small amounts of binding peptides are available the improvement in performance is even more pronounced. This demonstrates one of the important advantages of learning a single peptide distance function on an entire protein family.
Many different computational approaches have been suggested for the protein-peptide binding prediction problem (see [24
] for a recent review). These methods can be roughly divided into three categories:
Motif based methods
represent important requirements needed for binding, such as the presence and proper spacing of certain amino acids within the peptide sequence. Prediction of protein-peptide binding is usually performed as motif searches [8
]. The position specific scoring matrix (PSSM) approach is a statistical extension of the motif search methods, where a matrix represents the frequency of every amino acid in every position along the peptide. Peptide candidates can be assigned scores by summing up the position specific weights. The RANKPEP
] uses this approach to predict peptide binding to MHC class I and class II molecules.
Structure based methods
These methods predict binding affinity by evaluating the binding energy of the protein-peptide complex [9
]. These methods can be applied only when the three-dimensional structure of the protein-peptide complex is known or when reliable molecular models can be obtained.
Machine learning methods
Many different learning algorithms have been suggested for binding prediction. Among these are artificial neural networks (NetMHC) [12
], Hidden Markov Models (HMM's) [10
] and support vector machines (SVMHC) [11
]. To the best of our knowledge all of these methods are trained separately for each protein (or supertype). Therefore, these methods work well when sufficient amounts of training data (i.e peptides which are known to be binders or non-binders for a given protein) is provided.