Bioinformation. 2009; 3(9): 403–408.

Published online 2009 June 28.

PMCID: PMC2732036

Received 2009 January 19; Revised 2009 March 31; Accepted 2009 April 19.

Copyright © 2009 Biomedical Informatics Publishing Group

This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium,
for non-commercial purposes, provided the original author and source are credited.

Binding of peptides to specific Major Histocompatibility Complex (MHC) molecules is important for understanding immunity and has applications in vaccine discovery and the design of immunotherapy. Artificial neural networks (ANN) are widely used by prediction tools to classify peptides as binders or nonbinders (BNB). However, the number of known binders to a specific MHC molecule is limited in many cases, which poses a computational challenge for BNB prediction and calls for improved ANN learning. Here, we describe the application of probability distribution functions to initialize the weights and biases of an artificial neural network in order to predict HLA-A*0201 binders and nonbinders. 10-fold cross-validation has been used to validate the results. The results show that A_{ROC} for 90% of test cases for the Weibull, Uniform and Rayleigh distributions lies in the range 0.90-1.0. Further, the standard deviation of A_{ROC} was lowest for the Weibull distribution, which may therefore be used when training the artificial neural network for prediction of HLA-A*0201 MHC Class I binders and nonbinders.

The Major Histocompatibility Complex (MHC) plays a central role in the development of both humoral and cell-mediated immune responses. While antibodies may react with antigens alone, most T cells recognize an antigen only when it is combined with an MHC molecule; thus, MHC molecules play a critical role in antigen recognition by T cells. T cells do not recognize soluble native antigen but rather antigen that has been processed into antigenic peptides, which are presented in combination with MHC molecules. A T-cell epitope must therefore be viewed in terms of its ability to interact with both the T-cell receptor and the MHC molecule. The antigen-binding cleft on an MHC molecule interacts with various oligomeric peptides that function as T-cell epitopes, and it determines the nature and size of the peptide(s) that the MHC molecule can bind and, consequently, the maximal size of the T-cell epitope. It has been observed that peptides of nine amino acid residues (9-mers) bind most strongly; peptides of 8-11 residues also bind, but generally with lower affinity than nonamers. Binding of a peptide to an MHC molecule is a prerequisite for recognition by T cells and hence is fundamental to understanding the basis of immunity and to the development of potential vaccines [1,2].

Three types of models that incorporate biological knowledge have been used for prediction of MHC-binding peptides: (i) binding motifs [3], which represent the anchoring patterns and the amino acids commonly observed at anchor positions; (ii) quantitative matrices [4], which provide coefficients that quantify the contribution of each amino acid at each position within a peptide to MHC/peptide binding; and (iii) artificial neural networks (ANN) [5,6], in which an arbitrary level of complexity can be encoded by varying the number of hidden layers and the number of nodes in each hidden layer. Artificial neural networks [7] are connectionist models commonly used for classification, and they are widely used to classify MHC binders and nonbinders. For prediction of T-cell epitopes, ANN has been used together with the hidden Markov model (HMM) [8], genetic algorithms (GA) [9] and evolutionary algorithms [10]. Support vector machines (SVM) have also been used to predict binding peptides [11]. A combined GA-ANN model has also been used to find optimal conditions [12]. The present work was motivated by the GA-ANN model. In this paper, a new approach is demonstrated in which probability distribution functions are used to initialize the random weights for artificial neural network training.

The data sets of binders and nonbinders (BNB) used for training and testing were obtained from the IEDB Beta 2.0 database [www.immuneepitope.org] for the HLA-A*0201 MHC Class I allele. The 1609 peptides with 0 ≤ IC_{50} ≤ 500 were retrieved as binders and the 397 peptides with IC_{50} ≥ 5000 were retrieved as nonbinders. After removing duplicates, 800 9-mer binders and 256 9-mer nonbinders remained for training and prediction, as shown in Table 5. Since the ratio of binders to nonbinders has to be kept close to 1:1 in order to reduce bias in learning, an additional 544 9-mer nonbinders were generated through the ExPASy server. Peptides common to the binders and the newly generated 9-mer nonbinders were then deleted. Finally, 800 nonamer binders and 790 nonamer nonbinders were used for training and prediction.
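The data preparation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function name `build_datasets`, the `(peptide, ic50)` record format and the exact comparison operators on the IC_{50} thresholds (inclusive at 500 and 5000) are assumptions.

```python
def build_datasets(records, extra_nonbinders=()):
    """Split (peptide, ic50) records into 9-mer binders and nonbinders.

    Hypothetical helper: thresholds follow the text, but whether the
    bounds are inclusive is an assumption.
    """
    binders, nonbinders = set(), set()
    for peptide, ic50 in records:
        if len(peptide) != 9:
            continue                      # keep nonamers only
        if 0 <= ic50 <= 500:
            binders.add(peptide)          # a set drops duplicate peptides
        elif ic50 >= 5000:
            nonbinders.add(peptide)
    # add generated nonbinders, again restricted to 9-mers
    nonbinders.update(p for p in extra_nonbinders if len(p) == 9)
    # delete peptides that occur in both classes from the nonbinder set
    nonbinders -= binders
    return sorted(binders), sorted(nonbinders)
```

Peptides with intermediate IC_{50} values fall into neither class and are simply skipped in this sketch.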

A probability model does not predict the result of any individual experiment, but it can be used to determine the probability that a given outcome will fall inside a specific range of values. Since the weights of the ANN are small numbers and the variation among them should be small, continuous probability distributions have been used to initialize the weights and biases of the artificial neural network. The Beta, Exponential, Extreme Value, Gamma, Lognormal, Normal, Rayleigh, Uniform and Weibull continuous distributions have been examined in this work. The following steps have been used to generate the small random numbers using MATLAB [www.mathworks.com]: (1) use the functions given in the second column of Table 1 (see supplementary material) to generate a vector of small random numbers; (2) use the functions given in the third column of Table 1 (see supplementary material) to estimate the parameters and confidence intervals for the given distribution; (3) repeat steps 1 and 2 until the parameters correspond to 95% confidence intervals.
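As a rough illustration of step (1), the sketch below draws initial weights and biases for a 180-2-1 network from several of the named distributions using Python's standard `random` module instead of MATLAB. All distribution parameters, the function names and the seed are assumptions chosen only to give small values; they are not the estimates reported in the supplementary tables, and the parameter-checking loop of steps (2) and (3) is not reproduced.

```python
import random

def make_sampler(name, rng):
    """Return a zero-argument function that draws one small random value.

    All parameter values are illustrative assumptions.
    """
    samplers = {
        "uniform": lambda: rng.uniform(0.0, 0.1),
        "normal": lambda: rng.gauss(0.0, 0.05),
        "beta": lambda: 0.1 * rng.betavariate(2.0, 5.0),
        # weibullvariate takes (scale, shape)
        "weibull": lambda: rng.weibullvariate(0.05, 1.5),
        # Rayleigh(sigma) is a Weibull with shape 2 and scale sigma*sqrt(2)
        "rayleigh": lambda: rng.weibullvariate(0.05 * 2 ** 0.5, 2.0),
    }
    return samplers[name]

def init_weights(name, n_inputs=180, n_hidden=2, seed=0):
    """Draw hidden-layer and output-layer weights and biases."""
    rng = random.Random(seed)
    draw = make_sampler(name, rng)
    w_hidden = [[draw() for _ in range(n_inputs)] for _ in range(n_hidden)]
    b_hidden = [draw() for _ in range(n_hidden)]
    w_out = [draw() for _ in range(n_hidden)]
    b_out = draw()
    return w_hidden, b_hidden, w_out, b_out
```

For example, `init_weights("weibull")` returns two rows of 180 non-negative hidden-layer weights, two hidden biases, two output weights and one output bias.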

There are 20 amino acids found in proteins. Each amino acid is coded with a 20-bit binary code in which the bit at the position assigned to that amino acid is 1 and the remaining bits are 0. Since the binder and nonbinder sequences are 9-mers, each sequence is represented by a vector of 180 (20×9) binary values. The model only predicts whether a given 9-mer sequence is a binder or a nonbinder, hence one output node and two hidden nodes are used. The network therefore has 180 input nodes, 2 nodes in a single hidden layer and 1 output node. If the output value for a given epitope is less than the given threshold, it is classified as a nonbinder; otherwise the epitope is predicted to be a binder. The backpropagation method has been used to train the ANN. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. This error is propagated backwards, and the weights and biases are updated to reflect the error of the network's prediction. The algorithm is shown in Figure 1.
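The encoding and the 180-2-1 network described above might look like the following sketch. Several details are assumptions that the text does not specify: the alphabetical ordering of amino acids in `AMINO_ACIDS`, the sigmoid activation, the learning rate, the dictionary layout of `net` and per-sample (online) updates. It is not a transcription of the algorithm in Figure 1.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # bit order is an assumption

def encode(peptide):
    """One-hot encode a 9-mer into 180 binary values (20 per residue)."""
    if len(peptide) != 9:
        raise ValueError("expected a 9-mer")
    vec = []
    for aa in peptide:
        bits = [0] * 20
        bits[AMINO_ACIDS.index(aa)] = 1
        vec.extend(bits)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(net, x):
    """Return (hidden activations, output) for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w_h"], net["b_h"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_o"], hidden)) + net["b_o"])
    return hidden, out

def train_step(net, x, target, lr=0.5):
    """One backpropagation update for E = 0.5 * (out - target)^2."""
    hidden, out = forward(net, x)
    delta_o = (out - target) * out * (1.0 - out)     # error signal at the output
    for j, h in enumerate(hidden):
        # error propagated back to hidden node j, using the old output weight
        delta_h = delta_o * net["w_o"][j] * h * (1.0 - h)
        net["w_o"][j] -= lr * delta_o * h
        net["b_h"][j] -= lr * delta_h
        for i, xi in enumerate(x):
            if xi:                                   # inputs are 0/1
                net["w_h"][j][i] -= lr * delta_h
    net["b_o"] -= lr * delta_o
    return 0.5 * (out - target) ** 2

def classify(net, peptide, threshold=0.5):
    """Below the threshold is a nonbinder, otherwise a binder."""
    _, out = forward(net, encode(peptide))
    return "binder" if out >= threshold else "nonbinder"
```

Here `net` is a plain dictionary with keys `w_h` (2×180 weights), `b_h`, `w_o` and `b_o`, so any initialization scheme can fill it before training starts.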

The predictive performance for the Beta, Normal, Rayleigh, Uniform and Weibull distributions was assessed using receiver operating characteristic (ROC) analysis. The area under the ROC curve (A_{ROC}) provides a measure of overall prediction accuracy: A_{ROC} < 70% for poor, A_{ROC} > 80% for good and A_{ROC} > 90% for excellent prediction [13]. The ROC curve is generated by plotting sensitivity (SN) as a function of 1 − specificity (SP). The sensitivity, SN = (TP/(TP+FN))*100, and the specificity, SP = (TN/(TN+FP))*100, give the percentage of correctly predicted binders and nonbinders, respectively. PPV = (TP/(TP+FP))*100 and NPV = (TN/(FN+TN))*100 give the positive predictive value, i.e. the probability that a predicted binder is actually a binder, and the negative predictive value, i.e. the probability that a predicted nonbinder is actually a nonbinder. The terms are defined in Table 2 (see supplementary material). 10-fold cross-validation has been used for training and prediction of the artificial neural network with the various probability distribution functions. Ten data sets of BNB were prepared. First, training was done on nine of them (the 1^{st} to the 9^{th} data set), the 10^{th} data set was used for prediction, and the results were recorded. Then the 2^{nd} to the 10^{th} data sets were used for training and the 1^{st} for prediction. In general, when prediction was done on the i^{th} data set, the remaining nine data sets were used for training.
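The performance measures defined above can be computed from the confusion counts as in the sketch below. The function names are illustrative, label 1 is taken to mean binder, and the accuracy formula (TP+TN)/(TP+TN+FP+FN) is the standard definition rather than one given in the text.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, FP, TN, FN; label 1 = binder, 0 = nonbinder."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_binder = score >= threshold
        if predicted_binder and label == 1:
            tp += 1
        elif predicted_binder and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Return SN, SP, PPV, NPV and accuracy as percentages."""
    def pct(num, den):
        return 100.0 * num / den if den else float("nan")
    return {
        "SN": pct(tp, tp + fn),
        "SP": pct(tn, tn + fp),
        "PPV": pct(tp, tp + fp),
        "NPV": pct(tn, fn + tn),
        "ACC": pct(tp + tn, tp + fp + tn + fn),   # standard definition, assumed
    }
```

Calling `metrics(*confusion_counts(scores, labels, threshold))` for a range of thresholds gives the sensitivity and specificity pairs from which a ROC curve is drawn.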

The programs for training and classification have been implemented in C in a Windows environment. The initial weight and bias matrices for the various probability distribution functions have been created with MATLAB.

The continuous probability distributions (Beta, Exponential, Extreme Value, Gamma, Lognormal, Normal, Rayleigh, Uniform, Weibull) have been used to initialize the weights. The Gamma and Lognormal distributions have been discarded because the variation among the random initial values was too high, and they were hence not found suitable for modeling. The probability distribution functions and the parameter values estimated using maximum likelihood estimation (MLE) are shown in Table 3 (see supplementary material), except for Gamma and Lognormal. The remaining distributions have been used to train the ANN. The Exponential and Extreme Value distributions have been discarded because their error convergence curves are not smooth, which might lead to wrong predictions, as is evident from the error graph shown in Figure 2.
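To make the maximum likelihood estimation step concrete, the sketch below gives the closed-form estimators for three of the distributions (Normal, Rayleigh and Uniform). These are textbook formulas written with hypothetical function names, not the MATLAB fitting functions of Table 1; the Beta and Weibull estimates require an iterative numerical solution and are omitted here.

```python
import math

def mle_normal(xs):
    """MLE of (mu, sigma); the variance estimate divides by n, not n - 1."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def mle_rayleigh(xs):
    """MLE of the Rayleigh scale: sigma^2 = sum(x^2) / (2n)."""
    return math.sqrt(sum(x * x for x in xs) / (2 * len(xs)))

def mle_uniform(xs):
    """MLE of the Uniform bounds: the sample minimum and maximum."""
    return min(xs), max(xs)
```

Applied to a vector of generated initial weights, these functions return the parameter values that make the observed sample most likely under each distribution.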

10-fold cross-validation has been used to validate the results. In 10-fold cross-validation, the data are divided into 10 subsets of (approximately) equal size. The ANN is trained 10 times, each time leaving out one of the subsets from training and using only the omitted subset for prediction. The 800 binders and 790 nonbinders have been divided into 10 sets of 80 and 79, respectively, for prediction; the remaining binders and nonbinders have been used for training. This procedure has been repeated for every probability distribution function. A web-based tool has been used to calculate the area under the ROC curve [www.rad.jhmi.edu/jeng/javarad/roc/JROCFITi.html].
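The fold construction described above can be sketched as follows. The function names are hypothetical, and the contiguous, unshuffled split is an assumption, since the text does not say how peptides were assigned to the ten sets.

```python
def split_chunks(items, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def cross_validation_folds(binders, nonbinders, k=10):
    """Yield (train, test) lists of (peptide, label) pairs, one per fold."""
    b_chunks = split_chunks(binders, k)
    n_chunks = split_chunks(nonbinders, k)
    for i in range(k):
        # the i-th chunk of each class is held out for prediction
        test = [(p, 1) for p in b_chunks[i]] + [(p, 0) for p in n_chunks[i]]
        train = []
        for j in range(k):
            if j != i:
                train += [(p, 1) for p in b_chunks[j]]
                train += [(p, 0) for p in n_chunks[j]]
        yield train, test
```

With 800 binders and 790 nonbinders, each test fold holds 80 binders and 79 nonbinders, and the remaining 1431 peptides form the training set.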

The area under the fitted ROC curve for the BNB sequences is shown in Table 4 (see supplementary material), and the analysis of the area under the ROC curve is shown in Figure 3. The mean and standard deviation have been calculated for the various probability distributions.
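For readers who want to reproduce this kind of summary without the web tool, the sketch below computes an empirical area under the ROC curve and the mean and standard deviation across folds. Note that this is a different technique from the fitted ROC curve used in the study: the empirical (rank-based) estimate can differ from a fitted one. The function names and the use of the sample standard deviation (n − 1 denominator) are assumptions.

```python
from statistics import mean, stdev

def empirical_auc(scores, labels):
    """Fraction of (binder, nonbinder) pairs ranked correctly; ties count 1/2.

    This equals the area under the empirical ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one nonbinder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def summarize(fold_aucs):
    """Mean and sample standard deviation of per-fold AUC values."""
    return mean(fold_aucs), stdev(fold_aucs)
```

Running `summarize` on the ten per-fold values for each distribution gives one mean and one standard deviation per initialization scheme, which is the comparison the study reports.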

We assembled a data set of binders and nonbinders for HLA-A*0201 MHC Class I to study the impact of the probability distribution function used to initialize the weights and biases of the artificial neural network, motivated by the GA-ANN model, in which a GA is used to initialize the weights and biases of the artificial neural network. High-binding-affinity peptides with 0 ≤ IC_{50} ≤ 500 were retrieved as binders and low-binding-affinity peptides with IC_{50} ≥ 5000 were retrieved as nonbinders from the IEDB Beta 2.0 database. The total numbers of binders and nonbinders were 1609 and 397, respectively. A set of 800 9-mer binders and 256 9-mer nonbinders was prepared after eliminating duplicates. Since the ratio of binders to nonbinders has to be kept close to 1:1 in order to reduce bias in learning, an additional 544 9-mer nonbinders were generated from the EBI-Expasy protein database and added to the nonbinder set. Finally, 800 9-mer binders and 790 9-mer nonbinders were used for training and prediction, after further removing the duplicates introduced by the newly generated nonbinders. Ten sets of binders and nonbinders of nearly equal size were made for 10-fold cross-validation.

The results for all the probability distribution functions and all the test sets are shown in Table 4 (see supplementary material). The mean values of the area under the ROC curve for Beta, Normal, Rayleigh, Uniform and Weibull are 0.934, 0.924, 0.9367, 0.937 and 0.9337, respectively. All the distributions performed well. The standard deviation has also been calculated for each distribution and is lowest for the Weibull probability distribution. The threshold parameter has been varied from 0.5 to 0.95. Further, the values of sensitivity, specificity, PPV, NPV and accuracy for the Beta, Normal, Rayleigh, Uniform and Weibull distributions for all sets are shown in Tables 6, 7, 8, 9 and 10 (see supplementary material), respectively.

From the above results it is evident that weight initialization may have an impact on the performance of an artificial neural network. In effect, it adds some prior knowledge to the network. MHC class I 9-mer binders and nonbinders may contain any combination of the 20 amino acids, and the amino acids at positions 1 to 9 may follow, or be close to, some probability distribution. The results show that, for the HLA-A*0201 allele, performance was most consistent, with the lowest standard deviation of A_{ROC}, when the weights of the artificial neural network were initialized using the Weibull probability distribution. The modules for training, classification and results have been implemented in C using pointers in order to improve the efficiency of training and classification. Overall, this study shows that the quality of the prediction of binders and nonbinders can be substantially improved by using probability distributions to initialize the weights of the artificial neural network.

The authors are grateful to Dr D S Yadav, Dept of CSE, I. E. T. Lucknow, and Sri S. P. Singh, Amity University Lucknow, for their kind cooperation, and to Sri Rajeev Kumar, Department of Mechanical Eng., for providing MATLAB.

**Citation:** Soam *et al*, Bioinformation 3(9): 403-408 (2009)

1. De Groot AS, et al. Immunology and Cell Biology. 2002;80:255–269.

2. Buss S, et al. Tissue Antigens. 2003;62:378–384.

3. Rammensee HG, et al. Immunogenetics. 1999;50:213–219.

4. Bhasin M, Raghava GPS. Vaccine. 2004;22:3195–3204.

5. Bhasin M, Raghava GPS. J Biosciences. 2007;32(1):31–42.

6. Brusic V, et al. Methods. 2004;34:436–443.

7. Armano G, et al. BMC Bioinformatics. 2005;6(Suppl 4)

8. Zhang GL, et al. Nucleic Acids Research. 2005;33:172–179.

9. Prugel-Bennett A, Shapiro JL. Physical Review Letters. 1994;72(9):1305–1309.

10. Brusic V, et al. Bioinformatics. 1998;14(2):121–130.

11. Riedesel H, et al. Genome Informatics. 2004;15(1):198–212.

12. SH M Anijdan, et al. Science Direct. 2006;27(7):605–609.

13. Zhang GL, et al. Immunome Research. 2006;2:3.
