|Home | About | Journals | Submit | Contact Us | Français|
Binding of peptides to specific Major Histo-compatibility Complex (MHC) molecule is important for understanding immunity and has applications to vaccine discovery and design of immunotherapy. Artificial neural networks (ANN) are widely used by predictions tools to classify the peptides as binders or nonbinders (BNB). However, the number of known binders to a specific MHC molecule is limited in many cases, which poses a computational challenge for prediction of BNB and hence, needs improvement in learning of ANN. Here, we describe, the application of probability distribution functions to initialize the weights and biases of the artificial neural network in order to predict HLAA*0201 binders and nonbinders. The 10fold cross validation has been used to validate the results. It is evident from the results that the AROC for 90% of test cases for Weibull, Uniform and Rayleigh distributions is in the range 0.90-1.0. Further, the standard deviation for AROC was minimum for Weibull distribution, and may be used to train the artificial neural network for HLAA*0201 MHC ClassI binders and nonbinders prediction.
Major Histocompatibility Complex (MHC) plays a central role in the development of both humoral and cellmediated immune responses. While antibodies may react with antigens alone, most T cells recognize antigens only when it is combined with an MHC molecule; thus, MHC molecules play a critical role in antigen recognition by T cells. T cell do not recognize soluble native antigen but rather recognize antigen that has been processed into antigenic peptides, which are presented in combination with MHC molecules. The T cell epitope must be viewed in terms of their ability to interact with both Tcell receptor and MHC molecule. The antigen binding cleft on an MHC molecule interacts with various oligomeric peptides that functions as TCell epitope. The antigen binding cleft on an MHC molecule determines the nature and the size of the peptide(s) that MHC molecule can bind and consequently the maximal size of the T cell epitope. It has been observed that peptides of nine amino residues (9mers) bind most strongly; peptides of 811 residues also bind but generally with lower affinity than nonamers. Binding of a peptide to a MHC molecule is a prerequisite for recognition by T cells and hence is fundamental to understand the basis of immunity and also for the development of potential vaccines [1,2].
Three type of models that incorporate biological knowledge have been used for prediction of MHC binding peptides: (i) binding motif , which represent the anchoring patterns and the amino acids commonly observed at anchor positions, (ii) Quantitative matrices , that provide coefficients that quantify contribution of each amino acid at each position within a peptide to MHC/peptide binding, and (iii) Artificial Neural Networks (ANN) [5,6] an arbitrary level of complexity can be encoded by varying the number of nodes in hidden layer and the number of hidden layers. Artificial Neural Networks  are connectionist models commonly used for classification. ANN is widely used for classification of MHC binder and non-binder. For prediction of Tcell epitope ANN has been used with the HMM (Hidden Markov model) , GA (Genetic Algorithms) , Evolutionary Algorithm . SVM (Support Vector Machine) has also been used to predict the binding peptides . Combined GAANN model has also been used to find the optimal conditions . The work for the present paper has been motivated from the GAANN model. Here, in this paper a new approach of using the probability distribution functions to initialize the random weights for artificial neural network training has been demonstrated.
The data sets used for training and testing for binders and nonbinders (BNB) were obtained from IEDB Beta 2.0 database [www.immuneepitope.org] for HLAA*0201 MHC Class I allele. The 1609 peptides with 0 IC50 500 have been retrieved as binders and 397 peptides with IC50 5000 have been retrieved as nonbinders. After removing the duplicates, 800 9mer binders and 256 9mer non-binders have been used for training and prediction as shown in Table 5. Since the ratio of binders and nonbinders have to be kept nearly 1:1 in order to reduce the biasness in learning, the additional 544 9mer nonbinders have been generated through ExPASy server. Further, the common peptides among binders and newly generated 9mer nonbinders have been deleted. At last 800 nonamer binders and 790 nonamer nonbinders have been used for training and prediction.
A probability model does not allow to predict the result of any individual experiment but the probability that a given outcome will fall inside a specific range of values cab be determined by using the model. Since the weights of the ANN are small numbers and the variation among them should be small, so continuous probability distributions have been used for initialization of weights and biases for artificial neural network. Beta, Exponential, Extreme Value, Gamma, Lognormal, Normal, Rayleigh, Uniform and Weibull continuous distributions have been examined in the studied research work. Following steps have been used to generate the small random numbers using MATLAB [www.mathworks.com]: (1) Use the functions given in second column of Table 1 (see supplementary material) to generate a vector of small random numbers; (2) The functions given in the third column of the Table 1 (see supplementary material) have been used to estimate the parameters and confidence interval for a given distribution; (3) Repeat the steps 1 and 2 till the parameters correspond to 95% confidence intervals.
There are 20 amino acids found in all kinds of proteins. To code each amino acid a 20 bit binary code is used. For each binary code it will have value 1 according to its position and rest of the values is zeros. Since the binder and non binders sequences are 9mer, hence a binder sequence will be represented by a vector of 180 (20x9) binary values. The model is used for only predicting the binder or non binder for a given 9mer sequence, hence one output node and two hidden nodes are used. Therefore, 180 input nodes 2 nodes in a single hidden layer and 1 output node have been used to model. If the value at the output for a given epitope is less then the given threshold it is classified as non-binder otherwise the epitope is predicted as binder. The back propagation method has been used for learning ANN. For each training sample the weights have been modified so as to minimize the mean squared error between the network's prediction and the actual prediction. This error has been propagated backwards by updating the weights and biases to reflect the error of the network's prediction. The algorithm is shown in Figure 1.
The predictive performance for Beta, Normal, Rayleigh, Uniform, and Weibull distributions was accessed using receiver operating characteristics (ROC) analysis. The area under the ROC curve (AROC) provides a measure of overall prediction accuracy, AROC 70 % for poor, AROC 80 % for good, AROC 90 % for excellent prediction . The ROC curve is generated by plotting sensitivity (SN) as a function of 1specificity (SP). The sensitivity, SN=(TP/(TP+FN))*100 and SP=(TN/(TN+FP))*100, gives percentage of correctly predicted binders and nonbinders respectively. The PPV = ((TP)/(TP+FP))*100 and NPV=((TN)/(FN+TN))*100 gives the positive probability value i.e. the probability that a predicted binder will actually be a binder, and negative probability value i.e. the probability that a predicted nonbinder will actually be a non-binder. The terms are defined in Table 2 (see supplementary material). 10fold cross validation has been used for training and prediction of the artificial neural network with various probability distribution functions. 10 data sets of BNB have been designed. The training has been done for 9 test data set (i.e. 1st test data to test data 9th) and the 10th data set has been used for prediction and the results have been recorded. Then the 2nd test data to 10th test data have been used for training and the 1st has been used for prediction. Similarly when the prediction has been done for the ith test data the remaining 9 test data except for ith have been used for training.
The programs for training and classification have been implemented using C on Windows environment. The initial weights and biases matrix using various probability distributions functions have been created by MATLAB.
The continuous (data) probability distributions (Beta, Exponential, Extreme value, Gama, Lognormal, Normal, Rayleigh, Uniform, Weibull) have been used for initialization the weights. Gama and Lognormal continuous distributions have been discarded because the variations among the random initial values were too high, and hence not found suitable for modeling. The probability distribution functions and the estimated values of parameters using MLE (Maximum Likelihood Estimation) have been shown in Table 3 (see supplementary material) except for Gama and Lognormal. The probability distributions except Gama and Lognormal have been used for learning the ANN. Exponential and Extreme value distributions have been discarded because the error convergence curve is not smooth which might lead to wrong predictions as it is evident from the error graph shown in Figure 2.
The 10fold cross validation has been used to validate the results. In 10fold crossvalidation, the data has been divided into 10 subsets of (approximately) equal size. The ANN has been trained 10 times, each time leaving out one of the subsets from training, but using only the omitted subset for prediction results. The 800 binders and 790 non binders have been divided in 10 sets of 80 and 79 respectively for prediction. The remaining binders and nonbinders have been used for training. The ANN has been trained for 10 times for every probability distribution function leaving one out one of the subset from training and uses that for the prediction of BNB. Web based tool have been used to calculate the area under the ROC curve [www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html].
Area under the fitted ROC curve for BNB sequences have been shown in Table 4 (see supplementary material) and the analysis of are under the ROC curve having been shown in Figure 3. The mean and standard deviation have been calculated for various probability distributions.
We assembled a data set of binders and nonbinders for HLAA*0201 MHC Class I to study the impact of the probability distribution function for initialization of weights and biases of artificial neural network, motivated by the GAANN model where the GA have been used to initialize the weights and biases of artificial neural network. The high binding affinity peptides with 0IC50500 have been retrieved as binders and low binding affinity peptides with IC505000 have been retrieved as nonbinders from IEDB Beta 2.0 database. The total number of binders and nonbinders was 1609 and 397 respectively. A set of 800 9mer binders and 256 9mer nonbinders have been prepared after eliminating the duplicates. The ratio of binders and nonbinders have to be kept nearly 1:1 in order to reduce the biasness in learning, hence, additional 544 9mer nonbinders have been generated from a EBIExpasy protein database and added to the nonbinder set. Finally 800 9mer binders and 790 9mer nonbinders have been used for training and prediction after further removing the duplicates caused by newly generated nonbinders. The 10 sets of binders and nonbinders of nearly equal size have been made for 10fold cross validation.
The results have been shown in Table 4 (see supplementary material) for all the probability distribution functions for all the test sets. The mean values of area under ROC curve for Beta, Normal, Rayleigh, Uniform and Weibull is 0.934, 0.924, 0.9367, 0.937 and 0.9337 respectively. All the distributions have performed well. The standard deviation for each has also calculated which shows that the standard deviation is minimum for Weibull probability distribution. The threshold parameter has been varied from 0.5 to 0.95. Further the values for Sensitivity, Specificity, PPV, NPV and accuracy for Beta, Normal, Rayleigh, Uniform, and Weibull distributions for all sets have been shown in Table 6, 7, 8, 9, and 10 (see supplementary material), respectively.
From the above results it is evident that the weight initialization may have an impact on the performance of artificial neural network. This is basically adding some prior knowledge to the artificial neural network. The MHC classI 9mer binders and nonbinders may have any combination of 20 amino acids. The amino acids at the position 1 to 9 may follow a probability distribution or close to any probability distribution. As the results have shown that in case of HLAA*0201 allele the performance was better in case when the weights for artificial neural network have been initialized using Weibull probability distribution. The modules for the training, classification, and results have been implemented in C using pointers, in order to improve the efficiency of training and classification. Overall this study shows that the quality of the prediction of binders and nonbinders can be substantially improved by using the probability distributions for initialization of the weights for artificial neural network.
The authors are grateful to Dr D S Yadav, Dept of CSE, I. E. T. Lucknow, Sri S. P. Singh Amity University Lucknow, for kind cooperation. Sri Rajeev Kumar, Department of Mechanical Eng. for providing MATLAB.
Citation:Soam et al, Bioinformation 3(9): 403-408 (2009)