The humoral immune response is based on the ability of antibodies to recognize and bind to epitopes on the surface of antigens with high specificity. It is believed that most protein epitopes are discontinuous, that is, composed of different parts of the polypeptide chain that are brought into spatial proximity by the folding of the protein. However, for approximately 10% of the epitopes, the corresponding antibodies are cross-reactive with a linear peptide fragment of the epitope [1]. These epitopes are termed linear or continuous and are composed of a single stretch of the polypeptide chain.
In many cases it is difficult to obtain a pure preparation of the protein of interest for immunization purposes. Traditional cloning of proteins or experimental peptide scanning is clearly not feasible on a genomic scale. However, to raise antibodies it is not necessary to present the complete protein, only its immunogenic fractions. Specific antibodies can be generated by immunizing animals with a peptide if the peptide is well chosen and presents an effective continuous epitope of the protein. Continuous B-cell epitopes play a vital role in the development of peptide vaccines, in the diagnosis of diseases, and in allergy research. The specific interactions of antibodies generated against continuous epitopes are also exploited extensively in biochemical and high-throughput assays. The ENCODE [2] and modENCODE [3] projects aim to profile protein-DNA interactions for all transcription factors and DNA-associated proteins in human and in model organisms such as Drosophila melanogaster and Caenorhabditis elegans using factor-specific antibodies. This has increased the demand for good antibodies at the whole-genome level.
Computational methods can be cost-effective and reliable for predicting linear B-cell epitopes and can guide a genome-wide search for antigenic B-cell epitopes. Considerable research has therefore been devoted to identifying continuous B-cell epitopes from protein sequences. The classical approach to epitope prediction utilizes amino acid propensity scales describing properties such as hydrophobicity [4], hydrophilicity [5], flexibility/mobility [6], surface accessibility [7], polarity [8], turns [10], and antigenicity [11]. The first propensity scale method for predicting linear B-cell epitopes was introduced by Hopp and Woods [12] and utilized the Levitt hydrophilicity scale [13] to assign a propensity value to each amino acid. PREDITOP [10], PEOPLE [14], BEPITOPE [15], and BcePred [16] predicted linear B-cell epitopes based on combinations of physico-chemical properties, as opposed to propensity measures that rely on individual properties. The BcePred method obtained the best specificity of 56% and sensitivity of 61% [16]. Blythe and Flower assessed 484 amino acid propensity scales in combination with ranges of plotting parameters and found that even the best set of scales and parameters performed only marginally better than random [17]. This led researchers to combine propensity scales with machine learning methods to improve performance. The BepiPred [1] method combined the Parker hydrophilicity scale [5] with a Hidden Markov Model (HMM) and demonstrated a slight but statistically significant improvement in classification performance over the propensity scale based methods. Chen et al. [18] developed an amino acid pair (AAP) antigenicity scale that assigns a propensity value to each possible pair of amino acids. Their support vector machine (SVM) classifiers trained using AAP propensity derived features outperformed SVM classifiers trained using amino acid propensity derived features [18].
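The propensity-scale approach described above can be illustrated with a minimal sketch: each residue is mapped to a scale value, and the values are averaged over a sliding window. The scale below (`TOY_SCALE`), the window length of 7, and the threshold are assumptions made purely for illustration; they are not the published Hopp-Woods, Levitt, or Parker values, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy scale for illustration only. These are NOT published
# propensity values; residues that are not listed score 0.0.
TOY_SCALE: Dict[str, float] = {
    "K": 1.0, "D": 1.0, "E": 1.0, "R": 1.0,
    "L": -1.0, "I": -1.0, "V": -1.0, "F": -1.0,
}


def window_propensity(seq: str, scale: Dict[str, float], window: int = 7) -> List[float]:
    """Average the per-residue scale values over every window of the sequence.

    Returns one score per window start position, i.e. len(seq) - window + 1 values.
    """
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    values = [scale.get(aa, 0.0) for aa in seq.upper()]
    running = sum(values[:window])
    scores = [running / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        scores.append(running / window)
    return scores


def candidate_windows(seq: str, scale: Dict[str, float], window: int = 7,
                      threshold: float = 0.5) -> List[Tuple[int, str, float]]:
    """Return (start, peptide, score) for windows scoring at or above the threshold."""
    return [
        (start, seq[start:start + window], score)
        for start, score in enumerate(window_propensity(seq, scale, window))
        if score >= threshold
    ]
```

Calling `candidate_windows(sequence, TOY_SCALE)` lists the windows whose average value reaches the threshold. Swapping in a real scale only requires passing a different dictionary.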
Recently, several researchers have explored machine learning methods that are trained on learning examples to predict linear B-cell epitopes from amino acid sequence information. The ABCPred [19] method uses recurrent artificial neural networks for predicting linear B-cell epitopes. Söllner and Mayer [20] represented each peptide using features derived from a variety of propensity scales, neighborhood matrices, and respective probability and likelihood values, and attained an accuracy of 72%. The BCPred and FBCPred [21] methods predict linear B-cell epitopes and flexible-length linear B-cell epitopes, respectively, using SVM classifiers with string kernels. The COBEpro [22] method uses a two-step procedure for predicting linear B-cell epitopes. In the first step, an SVM classifier is used to assign scores to fragments of the query antigen. In the second step, a prediction score is associated with each residue in the query antigen based on the SVM scores for the peptide fragments. Many methods utilizing three-dimensional (3D) structure to predict discontinuous epitopes are also available [23]. We refer readers to a recent review by El-Manzalawy and Honavar [25] for a more detailed discussion.
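The two-step idea of scoring fragments and then deriving per-residue scores can be sketched as follows. This is not the COBEpro implementation: the aggregation rule (averaging the scores of all fragments that cover a residue), the fragment length of 9, and the stand-in scorer `toy_fragment_score` are assumptions chosen only to make the data flow concrete. In practice a trained classifier would supply the fragment scores.

```python
from typing import Callable, List


def residue_scores(antigen: str, fragment_score: Callable[[str], float],
                   frag_len: int = 9) -> List[float]:
    """Score every fragment, then average the scores of the fragments covering each residue."""
    n = len(antigen)
    if frag_len < 1 or frag_len > n:
        raise ValueError("frag_len must be between 1 and len(antigen)")
    totals = [0.0] * n
    counts = [0] * n
    # Step 1: score each overlapping fragment of the antigen.
    for start in range(n - frag_len + 1):
        score = fragment_score(antigen[start:start + frag_len])
        # Step 2: credit the fragment score to every residue it covers.
        for pos in range(start, start + frag_len):
            totals[pos] += score
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def toy_fragment_score(fragment: str) -> float:
    """Hypothetical stand-in for a trained classifier: fraction of charged residues."""
    return sum(aa in "DEKR" for aa in fragment) / len(fragment)
```

For example, `residue_scores("KKAA", toy_fragment_score, frag_len=2)` scores the fragments `KK`, `KA` and `AA`, and each residue receives the mean of the fragments that contain it.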
Several problems are common to recently developed machine-learning methods. These methods have utilized only a limited number of positive learning examples. Some of them have utilized negative learning examples derived from random protein fragments. Such negative training examples may harbor genuine B-cell epitopes, which can affect the training procedure and result in poor classification performance. Moreover, no published work has systematically combined and compared the contributions of various structural properties and evolutionary information to classification performance. Finally, most methods have utilized large peptide lengths (e.g. 20) in their benchmarking experiments. Predicting protein epitopes within the length range of 7-15 is important because peptides in this length range are easy to synthesize experimentally and well-chosen peptides could generate specific antibodies. The effect of peptide length on classification performance has not been checked systematically. It is worth noting that these methods have failed to achieve an accuracy above 75% and an AUC above 0.75.
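For reference, the performance measures quoted above (sensitivity, specificity, accuracy, and AUC) can be computed as in the following minimal sketch. It uses the standard definitions, with AUC computed as the fraction of positive-negative pairs ranked correctly (ties counted as one half). The function names and the 0/1 label encoding are assumptions for illustration.

```python
from typing import Sequence, Tuple


def sens_spec_acc(labels: Sequence[int], preds: Sequence[int]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) for binary labels and predictions (1 = epitope)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # fraction of true epitopes recovered
    specificity = tn / (tn + fp)  # fraction of non-epitopes rejected
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A classifier that ranks peptides at random yields an AUC close to 0.5, which is the baseline against which the 0.75 figure should be read.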
To overcome these limitations, we have generated a large non-redundant training set of B-cell epitopes with both positive and negative learning examples. This dataset was prepared by combining data from the Immune Epitope Database [26], the BCIPEP database [27] and the AntiJen database [28], which has information on small peptide epitope antigenicity and specificity in immune response. We have created and leveraged the computational capabilities of the Open Life Science Gateway (OLSGW; Wu et al., submitted; see Methods) to exhaustively predict structural properties for sequences containing epitopes. We have checked the contributions of different protein structural properties, including first and higher order composition, evolutionary conservation information, and compositional and per-residue probabilities for secondary structure, solvent accessibility, disorder, and low complexity. We show that the utilization of negative examples and an increase in the training set size contribute to improved classification performance. However, the set of learning features has a definite impact on the learning process. We have also checked the effect of epitope length on the classification performance.
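As an illustration of composition-based learning features, the sketch below builds a fixed-length vector from single-residue and dipeptide frequencies. It assumes that "first and higher order composition" refers to such k-mer frequencies; the exact feature definitions used for the classifier may differ, and the function names here are hypothetical.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition(peptide: str) -> List[float]:
    """First-order composition: frequency of each of the 20 standard residues."""
    residues = [aa for aa in peptide.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(peptide: str) -> List[float]:
    """Second-order composition: frequency of each of the 400 adjacent residue pairs."""
    seq = peptide.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Ignore pairs containing non-standard residues.
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]


def feature_vector(peptide: str) -> List[float]:
    """Concatenate first- and second-order composition into a 420-dimensional vector."""
    return composition(peptide) + dipeptide_composition(peptide)
```

Because the vector length does not depend on peptide length, peptides of different lengths (for example 7 to 15 residues) can be passed to the same classifier.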
We introduce the B-cell epitope oracle (BEOracle), an SVM classifier that integrates different types of information provided as learning features, and validate the classifier on multiple validation sets. A large test set, a list of 32 chromatin immunoprecipitation (ChIP) grade antibodies, and 10 HDAC antibodies with known epitope sequences were utilized as validation sets. Moreover, we checked whether BEOracle scores for HDAC peptide antibodies correlated with the intensity of fluorescence in immunofluorescence experiments on Drosophila melanogaster embryos. Finally, a second SVM classifier, the B-cell region oracle (BROracle), which utilizes the BEOracle scores as learning features, was trained to predict the specificity of antibodies produced after immunization with large protein domains. To our knowledge, this is the first such attempt. Validation information for immunofluorescence (IF), western blot (WB), immunohistochemistry (IHC) and protein array (PA) data from the Protein Atlas database was used to assess the performance of BROracle.