Bioinformation. 2009; 3(9): 403–408.
Published online 2009 June 28.
PMCID: PMC2732036

Prediction of MHC class I binding peptides using probability distribution functions


Binding of peptides to specific Major Histocompatibility Complex (MHC) molecules is important for understanding immunity and has applications in vaccine discovery and the design of immunotherapy. Artificial neural networks (ANN) are widely used by prediction tools to classify peptides as binders or non-binders (BNB). However, the number of known binders to a specific MHC molecule is limited in many cases, which poses a computational challenge for prediction of BNB and hence calls for improvement in the learning of ANN. Here, we describe the application of probability distribution functions to initialize the weights and biases of an artificial neural network in order to predict HLA-A*0201 binders and non-binders. 10-fold cross validation has been used to validate the results. It is evident from the results that the AROC for 90% of the test cases for the Weibull, Uniform and Rayleigh distributions is in the range 0.90-1.0. Further, the standard deviation of AROC was minimum for the Weibull distribution, which may therefore be used to train the artificial neural network for prediction of HLA-A*0201 MHC Class I binders and non-binders.

Keywords: T­cell Epitope, ANN, Probability distribution, MHC binder/non­binder


Major Histocompatibility Complex (MHC) molecules play a central role in the development of both humoral and cell-mediated immune responses. While antibodies may react with antigens alone, most T cells recognize antigens only when they are combined with an MHC molecule; thus, MHC molecules play a critical role in antigen recognition by T cells. T cells do not recognize soluble native antigen but rather antigen that has been processed into antigenic peptides, which are presented in combination with MHC molecules. A T cell epitope must therefore be viewed in terms of its ability to interact with both the T-cell receptor and the MHC molecule. The antigen binding cleft on an MHC molecule interacts with various oligomeric peptides that function as T-cell epitopes, and it determines the nature and the size of the peptide(s) that the MHC molecule can bind and consequently the maximal size of the T cell epitope. It has been observed that peptides of nine amino acid residues (9-mers) bind most strongly; peptides of 8-11 residues also bind, but generally with lower affinity than nonamers. Binding of a peptide to an MHC molecule is a prerequisite for recognition by T cells and hence is fundamental to understanding the basis of immunity and to the development of potential vaccines [1,2].

Three types of models that incorporate biological knowledge have been used for prediction of MHC binding peptides: (i) binding motifs [3], which represent the anchoring patterns and the amino acids commonly observed at anchor positions; (ii) quantitative matrices [4], which provide coefficients that quantify the contribution of each amino acid at each position within a peptide to MHC/peptide binding; and (iii) Artificial Neural Networks (ANN) [5,6], in which an arbitrary level of complexity can be encoded by varying the number of nodes in a hidden layer and the number of hidden layers. Artificial Neural Networks [7] are connectionist models commonly used for classification, and are widely used for classification of MHC binders and non-binders. For prediction of T-cell epitopes, ANN has been used with HMM (Hidden Markov Models) [8], GA (Genetic Algorithms) [9] and Evolutionary Algorithms [10]. SVM (Support Vector Machines) have also been used to predict binding peptides [11]. A combined GA-ANN model has also been used to find optimal conditions [12]. The work for the present paper has been motivated by that GA-ANN model. Here, a new approach of using probability distribution functions to initialize the random weights for artificial neural network training is demonstrated.


Data Collection

The data sets used for training and testing of binders and non-binders (BNB) were obtained from the IEDB Beta 2.0 database [] for the HLA-A*0201 MHC Class I allele. The 1609 peptides with 0 ≤ IC50 ≤ 500 have been retrieved as binders and 397 peptides with IC50 ≥ 5000 have been retrieved as non-binders. After removing duplicates, 800 9-mer binders and 256 9-mer non-binders have been obtained for training and prediction, as shown in Table 5. Since the ratio of binders to non-binders has to be kept near 1:1 in order to reduce bias in learning, an additional 544 9-mer non-binders have been generated through the ExPASy server. Further, peptides common to the binders and the newly generated 9-mer non-binders have been deleted. Finally, 800 nonamer binders and 790 nonamer non-binders have been used for training and prediction.
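The filtering and balancing steps above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the example peptides and IC50 values are made up, and only the thresholds (IC50 ≤ 500 for binders, IC50 ≥ 5000 for non-binders), the 9-mer restriction and the duplicate removal come from the text.

```python
def partition_by_ic50(peptides):
    """peptides: list of (sequence, ic50) pairs; returns (binders, nonbinders)."""
    binders = [s for s, ic50 in peptides if 0 <= ic50 <= 500]
    nonbinders = [s for s, ic50 in peptides if ic50 >= 5000]
    # keep only unique 9-mers
    binders = sorted({s for s in binders if len(s) == 9})
    nonbinders = sorted({s for s in nonbinders if len(s) == 9})
    # delete peptides common to both sets
    overlap = set(binders) & set(nonbinders)
    binders = [s for s in binders if s not in overlap]
    nonbinders = [s for s in nonbinders if s not in overlap]
    return binders, nonbinders

# hypothetical example data
data = [("LLFGYPVYV", 12.0), ("GILGFVFTL", 30.0), ("AAAAAAAAA", 9000.0)]
b, nb = partition_by_ic50(data)
```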

Algorithm used for the prediction of MHC binding peptides

Probability distribution based weights and biases initialization

A probability model does not allow prediction of the result of any individual experiment, but the probability that a given outcome will fall inside a specific range of values can be determined using the model. Since the weights of an ANN are small numbers and the variation among them should be small, continuous probability distributions have been used for initialization of the weights and biases of the artificial neural network. The Beta, Exponential, Extreme Value, Gamma, Lognormal, Normal, Rayleigh, Uniform and Weibull continuous distributions have been examined in this work. The following steps have been used to generate the small random numbers using MATLAB []: (1) use the functions given in the second column of Table 1 (see supplementary material) to generate a vector of small random numbers; (2) use the functions given in the third column of Table 1 (see supplementary material) to estimate the parameters and confidence interval for a given distribution; (3) repeat steps 1 and 2 until the parameters correspond to 95% confidence intervals.
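Step (1) can be sketched as below, a minimal Python analogue of the MATLAB sampling functions in Table 1. The shape and scale parameters here are illustrative placeholders, not the values estimated by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(dist, n, scale=0.1):
    """Draw n small random initial weights from a named continuous distribution."""
    if dist == "weibull":
        return scale * rng.weibull(1.5, size=n)   # shape parameter 1.5 (illustrative)
    if dist == "rayleigh":
        return scale * rng.rayleigh(1.0, size=n)
    if dist == "uniform":
        return rng.uniform(-scale, scale, size=n)
    raise ValueError(f"unsupported distribution: {dist}")

# e.g. input-to-hidden weights for a 180-input, 2-hidden-node layer
w = init_weights("weibull", 180 * 2)
```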

Back propagation method for learning of artificial neural network

There are 20 amino acids found in proteins. To code each amino acid, a 20-bit binary code is used: each code has the value 1 at the position corresponding to that amino acid and 0 everywhere else. Since the binder and non-binder sequences are 9-mers, a sequence is represented by a vector of 180 (20x9) binary values. The model is used only for predicting binder or non-binder status for a given 9-mer sequence, hence one output node and two hidden nodes are used. Therefore, 180 input nodes, 2 nodes in a single hidden layer, and 1 output node have been used in the model. If the value at the output for a given epitope is less than the given threshold, it is classified as a non-binder; otherwise the epitope is predicted as a binder. The back propagation method has been used for learning the ANN. For each training sample the weights have been modified so as to minimize the mean squared error between the network's prediction and the actual class. This error has been propagated backwards by updating the weights and biases to reflect the error of the network's prediction. The algorithm is shown in Figure 1.
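The encoding and network topology described above can be sketched as follows. This is a hedged illustration, not the authors' C implementation: the weights are random placeholders rather than trained values, and the sigmoid activation is an assumption (a common choice for back propagation).

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode(peptide):
    """Map a 9-mer to a 180-bit vector: one 20-bit one-hot code per residue."""
    x = np.zeros(len(peptide) * 20)
    for i, aa in enumerate(peptide):
        x[i * 20 + AA.index(aa)] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 180-2-1 topology with small random (untrained) weights
rng = np.random.default_rng(1)
W1, b1 = 0.1 * rng.standard_normal((2, 180)), np.zeros(2)  # hidden layer
W2, b2 = 0.1 * rng.standard_normal(2), 0.0                 # output node

def predict(peptide, threshold=0.5):
    h = sigmoid(W1 @ encode(peptide) + b1)
    out = sigmoid(W2 @ h + b2)
    return "binder" if out > threshold else "non-binder"
```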

Figure 1
The Back propagation algorithm

Evaluation Parameters

The predictive performance for the Beta, Normal, Rayleigh, Uniform, and Weibull distributions was assessed using receiver operating characteristics (ROC) analysis. The area under the ROC curve (AROC) provides a measure of overall prediction accuracy: AROC < 70% indicates poor, AROC > 80% good, and AROC > 90% excellent prediction [13]. The ROC curve is generated by plotting sensitivity (SN) as a function of 1-specificity (SP). The sensitivity, SN = (TP/(TP+FN))*100, and the specificity, SP = (TN/(TN+FP))*100, give the percentage of correctly predicted binders and non-binders respectively. The PPV = (TP/(TP+FP))*100 and NPV = (TN/(FN+TN))*100 give the positive predictive value, i.e. the probability that a predicted binder will actually be a binder, and the negative predictive value, i.e. the probability that a predicted non-binder will actually be a non-binder. The terms are defined in Table 2 (see supplementary material). 10-fold cross validation has been used for training and prediction of the artificial neural network with the various probability distribution functions. 10 data sets of BNB have been designed. Training has been done on 9 of the data sets (the 1st to the 9th) and the 10th data set has been used for prediction, with the results recorded. Then the 2nd to the 10th data sets have been used for training and the 1st for prediction. In general, when prediction has been done for the ith data set, the remaining 9 data sets have been used for training.
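The evaluation measures above can be computed directly from the confusion counts. The TP/FN/TN/FP values in this sketch are made up for illustration; only the formulas come from the text.

```python
def metrics(tp, fn, tn, fp):
    """SN, SP, PPV, NPV and accuracy (all in %) from confusion-matrix counts."""
    sn = tp / (tp + fn) * 100   # sensitivity: % of binders correctly predicted
    sp = tn / (tn + fp) * 100   # specificity: % of non-binders correctly predicted
    ppv = tp / (tp + fp) * 100  # P(actual binder | predicted binder)
    npv = tn / (fn + tn) * 100  # P(actual non-binder | predicted non-binder)
    acc = (tp + tn) / (tp + fn + tn + fp) * 100
    return sn, sp, ppv, npv, acc

# hypothetical counts for one fold of 80 binders and 79 non-binders
sn, sp, ppv, npv, acc = metrics(tp=72, fn=8, tn=71, fp=8)
```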


The programs for training and classification have been implemented in C on a Windows environment. The initial weight and bias matrices using the various probability distribution functions have been created with MATLAB.


The continuous probability distributions (Beta, Exponential, Extreme Value, Gamma, Lognormal, Normal, Rayleigh, Uniform, Weibull) have been used for initializing the weights. The Gamma and Lognormal continuous distributions have been discarded because the variations among the random initial values were too high, and hence they were not found suitable for modeling. The probability distribution functions and the estimated values of their parameters using MLE (Maximum Likelihood Estimation) have been shown in Table 3 (see supplementary material), except for Gamma and Lognormal. The remaining probability distributions have been used for learning the ANN. The Exponential and Extreme Value distributions have then been discarded because their error convergence curves are not smooth, which might lead to wrong predictions, as is evident from the error graph shown in Figure 2.

Figure 2
The error analysis for small number of epoch (to make convergence clear)

10-fold cross validation has been used to validate the results. In 10-fold cross-validation, the data has been divided into 10 subsets of (approximately) equal size. The ANN has been trained 10 times, each time leaving out one of the subsets from training and using only the omitted subset for prediction. The 800 binders and 790 non-binders have been divided into 10 sets of 80 and 79 respectively for prediction; the remaining binders and non-binders have been used for training. The ANN has thus been trained 10 times for every probability distribution function, each time leaving out one subset from training and using it for the prediction of BNB. A web-based tool has been used to calculate the area under the ROC curve [].
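The leave-one-subset-out scheme above can be sketched as follows; a minimal illustration assuming sequential (unshuffled) folds, with placeholder peptide labels rather than real sequences.

```python
def ten_fold_splits(binders, nonbinders, k=10):
    """Yield ((train_binders, train_nonbinders), (test_binders, test_nonbinders))
    for each of k folds, holding one fold out for prediction each time."""
    nb, nn = len(binders) // k, len(nonbinders) // k
    folds = [(binders[i * nb:(i + 1) * nb], nonbinders[i * nn:(i + 1) * nn])
             for i in range(k)]
    for i in range(k):
        test = folds[i]
        train_b = [p for j in range(k) if j != i for p in folds[j][0]]
        train_n = [p for j in range(k) if j != i for p in folds[j][1]]
        yield (train_b, train_n), test

# 800 binders and 790 non-binders, as in the data set described above
binders = [f"B{i}" for i in range(800)]
nonbinders = [f"N{i}" for i in range(790)]
splits = list(ten_fold_splits(binders, nonbinders))
```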

The area under the fitted ROC curve for BNB sequences has been shown in Table 4 (see supplementary material), and the analysis of the area under the ROC curve is shown in Figure 3. The mean and standard deviation have been calculated for the various probability distributions.

Figure 3
Graph of receiver operating characteristics (ROC) analysis.


We assembled a data set of binders and non-binders for HLA-A*0201 MHC Class I to study the impact of the probability distribution function used for initialization of the weights and biases of the artificial neural network, motivated by the GA-ANN model in which a GA has been used to initialize the weights and biases. The high binding affinity peptides with 0 ≤ IC50 ≤ 500 have been retrieved as binders and the low binding affinity peptides with IC50 ≥ 5000 as non-binders from the IEDB Beta 2.0 database. The total number of binders and non-binders was 1609 and 397 respectively. A set of 800 9-mer binders and 256 9-mer non-binders has been prepared after eliminating duplicates. The ratio of binders to non-binders has to be kept near 1:1 in order to reduce bias in learning; hence, an additional 544 9-mer non-binders have been generated from the ExPASy protein database and added to the non-binder set. Finally, 800 9-mer binders and 790 9-mer non-binders have been used for training and prediction after further removing the duplicates introduced by the newly generated non-binders. 10 sets of binders and non-binders of nearly equal size have been made for 10-fold cross validation.

The results for all the probability distribution functions on all the test sets have been shown in Table 4 (see supplementary material). The mean values of the area under the ROC curve for Beta, Normal, Rayleigh, Uniform and Weibull are 0.934, 0.924, 0.9367, 0.937 and 0.9337 respectively. All the distributions have performed well. The standard deviation for each has also been calculated, and is minimum for the Weibull probability distribution. The threshold parameter has been varied from 0.5 to 0.95. Further, the values of sensitivity, specificity, PPV, NPV and accuracy for the Beta, Normal, Rayleigh, Uniform, and Weibull distributions for all sets have been shown in Tables 6, 7, 8, 9, and 10 (see supplementary material), respectively.

From the above results it is evident that weight initialization may have an impact on the performance of an artificial neural network; it essentially adds some prior knowledge to the network. The MHC Class I 9-mer binders and non-binders may have any combination of the 20 amino acids, and the amino acids at positions 1 to 9 may follow, or lie close to, some probability distribution. The results show that in the case of the HLA-A*0201 allele the performance was better when the weights of the artificial neural network were initialized using the Weibull probability distribution. The modules for training, classification, and results have been implemented in C using pointers, in order to improve the efficiency of training and classification. Overall this study shows that the quality of the prediction of binders and non-binders can be substantially improved by using probability distributions for initialization of the weights of an artificial neural network.

Supplementary material

Data 1:


The authors are grateful to Dr D S Yadav, Dept of CSE, I. E. T. Lucknow, and Sri S. P. Singh, Amity University Lucknow, for their kind cooperation, and to Sri Rajeev Kumar, Department of Mechanical Eng., for providing MATLAB.


Citation:Soam et al, Bioinformation 3(9): 403-408 (2009)


1. De Groot AS, et al. Immunology and Cell Biology. 2002;80:255–269. [PubMed]
2. Buus S, et al. Tissue Antigens. 2003;62:378–384. [PubMed]
3. Rammensee HG, et al. Immunogenetics. 1999;50:213–219. [PubMed]
4. Bhasin M, Raghava GPS. Vaccine. 2004;22:3195–3204. [PubMed]
5. Bhasin M, Raghava GPS. J Biosciences. 2007;32(1):31–42. [PubMed]
6. Brusic V, et al. Methods. 2004;34:436–443. [PubMed]
7. Armano G, et al. BMC Bioinformatics. 2005;6(Suppl 4) [PMC free article] [PubMed]
8. Zhang GL, et al. Nucleic Acids Research. 2005;33:172–179. [PMC free article] [PubMed]
9. Prugel-Bennett A, Shapiro JL. Physical Review Letters. 1994;72(9):1305–1309. [PubMed]
10. Brusic V, et al. Bioinformatics. 1998;14(2):121–130. [PubMed]
11. Reidesel H, et al. Genome Informatics. 2004;15(1):198–212. [PubMed]
12. SH M Anijdan, et al. Science Direct. 2006;27(7):605–609.
13. Zhang GL, et al. Immunome Research. 2006;2:3. [PMC free article] [PubMed]
