The ability to predict immunogenic regions in selected proteins by in-silico methods has broad implications, such as allowing a quick selection of potential reagents to be used as diagnostics, vaccines, immunotherapeutics, or research tools in several branches of biological and biotechnological research. However, the prediction of antibody target sites in proteins using computational methodologies has proven to be a highly challenging task, which is likely due to the somewhat elusive nature of B-cell epitopes. This paper proposes a web-based platform for scoring potential immunological reagents based on the structures or 3D models of the proteins of interest. The method scores a protein’s peptides set, which is derived from a sliding window, based on the average solvent exposure, with a filter on the average local model quality for each peptide. The platform was validated on a custom-assembled database of 1336 experimentally determined epitopes from 106 proteins for which a reliable 3D model could be obtained through standard modeling techniques. Despite showing poor sensitivity, this method can achieve a specificity of 0.70 and a positive predictive value of 0.29 by combining these two simple parameters. These values are slightly higher than those obtained with other established sequence-based or structure-based methods that have been evaluated using the same epitopes dataset. This method is implemented in a web server called B-Pred, which is accessible at http://immuno.bio.uniroma2.it/bpred. The server contains a number of original features that allow users to perform personalized reagent searches by manipulating the sliding window’s width and sliding step, changing the exposure and model quality thresholds, and running sequential queries with different parameters. The B-Pred server should assist experimentalists in the rational selection of epitope antigens for a wide range of applications.
Several techniques are currently available for the experimental mapping of B-cell epitopes.1 However, the time and costs involved in comprehensive epitope mapping of even a single target protein make this approach unfeasible on a genomic scale. In contrast, in-silico B-cell epitope prediction methodologies are a manageable alternative that allows for cost-effective, genome-wide virtual scans in the search for molecules with diagnostic, vaccinal, or immunomodulatory potential. They also generally provide options for the selection of useful biotechnological reagents.
The problem of predicting potential epitopes for target proteins has been investigated by several groups using various methodologies.2,3 Early methods, based on the analysis of linear protein sequences, attempted to correlate amino acid propensity scales based on hydrophobicity, hydrophilicity, flexibility, accessibility, or secondary structure features with epitopic regions.4–9 While these methods were widely used to assist in the design of peptides for immunological purposes, their validity has been questioned.10 Some groups have proposed that combining several different linear prediction indexes might produce more reliable results. This is the basis of the Linear Epitope Prediction Database (LEPD) method.11
Among the wealth of parameters considered for optimal epitope prediction,2,12 the role of residue solvent exposure in the determination of antigenic properties is still a matter of debate. Parameters correlated with exposure, such as hydrophilicity, have been used since the early era of in-silico B-cell epitope prediction methods,6 based on the commonsense assumption that epitopic residues should be exposed and available for interaction with the antibody–antigen binding regions. The efficacy of such simple propensity scales has been extensively reviewed using updated datasets, and has proven to be extremely poor, if not close to random.10,13 Despite this, simple propensity scales can still be used in combination with other parameters, as shown by the Conformational B-cell Epitope Prediction (CBTOPE) server,14 which incorporates the Parker scale6 among the amino acid physicochemical properties used in the prediction procedure.
By surveying an extensive epitopes dataset that includes structural information, Lollier et al concluded that “epitopic residues are not distributed among any specific Relative Surface Accessibility and Protrusion index values, and in some cases epitopes cover the entire antigenic sequence.”15
Wang et al12 propose a highly accurate prediction methodology, which is based on an evaluation, from the linear sequence, of a number of structural features, evolutionary information, disorders, and low complexity information. These data are then used in different combinations to identify the best prediction methodology. It is worth noting that the best predictor includes residue accessibility information as one of the parameters, while disorders and low complexity information do not contribute significantly to the method.
More recently, Ponomarenko et al16 proposed a structure-based method called ElliPro that approximates protein surface patches as ellipsoids and then scores the residues belonging to the surface patches based on a protrusion index, which is a sophisticated way of selecting the most exposed residues in the structure.
In a recent paper, Zhang et al17 presented a novel prediction methodology based on the concept of the “thick surface patch,” in which both exposed residues and interior residues concur on the definition of the potential conformational B-cell epitopes.
The present study addresses the possibility of using solvent exposure, directly calculated from 3D data, as a simple parameter to be correlated with the probability that a given peptide sequence would contain a B-cell epitope. We find that the correlation improves if the local quality of the model/structure is also taken into account. The method, named B-Pred, is accessible on the Internet at http://immuno.bio.uniroma2.it/bpred. The method and server focus on the prediction of potential linear epitopes, as opposed to discontinuous epitopes. This makes the server particularly useful in the selection and design of immunological and biotechnological reagents, as the conversion of a linear epitope to a peptidic reagent is relatively straightforward, while the same does not hold true for discontinuous epitopes. A few novel features that are not presently available in other similar B-cell prediction web servers further facilitate the reagent-design aspect. These include the visualization of positive stretches directly on the structure, a customizable output for the sliding window results that also allows the user to quickly download the analysis results in different formats, and the computation of protein epitopic hotspots. The server also has a unique feature not found in other B-cell epitope prediction servers, which is the ability to compute and highlight interface residues when analyzing a molecule within a multi-chain structure. This has interesting implications when the server is used to design peptidic reagents, as it allows the user to select targets with the desired characteristics.
The B-cell epitopes dataset was built using the Immune Epitope Database (IEDB) version 2.018 and the Bcipep database.19 Only positive linear epitopes were used. Duplicated entries and epitopes that could not be mapped univocally on the protein linear sequence were removed. The dataset included 1336 experimentally determined epitopes derived from 106 proteins for which at least two different epitopes were known, and for which a model could be built with an average quality score ≥ 0.2 as determined by the Verify 3D (V3D) software.20–22 This dataset is available at http://immuno.bio.uniroma2.it/bpred/table/table2.pl and provides the following information for each epitope: the epitope id linked to the original epitope entry (ie, IEDB epitopes), the epitope reference ID (ie, Bcipep epitopes), the PubMed ID linked to the corresponding publication record on PubMed, the peptide start and stop residues, sequence, length, and the (NCBI GI) number of the protein linked to the corresponding protein record on the National Center for Biotechnology Information (NCBI) website.
The templates for the protein models were identified by HMM-HMM comparison using the pdb70_7Mar09 Hidden Markov Model (HMM) database23 via the HHpred server.23 PSI-BLAST was used as the multiple sequence alignment (MSA) generation method, with three MSA generation iterations and a local alignment mode. The E-value threshold was set at 1E-3 for MSA generation and the minimum coverage threshold for MSA hits was set at 20%. Structure predictions were performed with the MODELLER software24,25 using the optimal multiple template of the HHpred server.
The Naccess software26,27 was used to calculate the solvent exposure of each residue using the full quaternary structure of the protein (ie, including all protein chains) as input. These solvent exposure values were then compared with the values obtained when using single protein chains as input. The difference in solvent accessibility was calculated in these two cases for each residue. This difference is zero if the residue is not located within a protein-protein interface, and has a negative value otherwise. This simple method allows for the easy identification of the residues located at interfaces, which are subsequently highlighted in the output of the web server’s analysis.
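The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the server's code: the function name `interface_residues`, the dictionary layout (residue number mapped to accessibility), and the small numerical tolerance are all assumptions, and the accessibility values are made up for the example.

```python
from typing import Dict, List


def interface_residues(
    asa_complex: Dict[int, float],
    asa_chain_alone: Dict[int, float],
    tolerance: float = 1e-6,  # assumed guard against rounding noise
) -> List[int]:
    """Return residue numbers that lose accessibility in the full complex."""
    flagged = []
    for resnum, alone in sorted(asa_chain_alone.items()):
        in_complex = asa_complex.get(resnum)
        if in_complex is None:
            continue  # residue absent from the complex calculation
        # The difference is about zero away from interfaces and
        # negative for residues buried by a partner chain.
        if in_complex - alone < -tolerance:
            flagged.append(resnum)
    return flagged


# Hypothetical per-residue accessibility values (not real Naccess output).
chain_alone = {10: 55.0, 11: 80.2, 12: 12.5}
full_complex = {10: 55.0, 11: 31.7, 12: 12.5}
print(interface_residues(full_complex, chain_alone))
```

In this toy input only residue 11 becomes less accessible when the other chains are present, so it is the only one flagged.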
For the ElliPro method, the dataset was analyzed using the protein models with the default ElliPro parameters.16 Of all the predicted epitopes, only the linear ones were considered in the analysis. For the LEPD11 and the CBTOPE14 methods, the analysis was performed using the default parameters.
For the purpose of this analysis, five different randomized datasets were generated from the main dataset, thereby generating five “test set/training set” pairs. In order to minimize possible biases in the randomization process, each pair was evaluated as an independent training and test set. The performance values were determined by the mean values of the five independent determinations.
Data are expressed using mean ± standard deviation of the mean, or the frequency of positives as appropriate. The groups were compared using the t-test for continuous data and the χ2 test for categorical data, applying Yates’ correction when appropriate. A P-value below 0.05 was considered significant. Receiver operating characteristic (ROC) analysis was used for the identification of the optimal B-Pred parameter conditions and cut-off values. All tests were performed using the GraphPad Prism 4.0 software package (GraphPad Software, San Diego, CA).
B-Pred was developed on a Linux Ubuntu 10.04 server running PHP 5.2.3, Perl 5.10.1, Naccess version 2.1.1, and the current release of V3D (http://nihserver.mbi.ucla.edu/Verify_3D/). The web server interface, web forms, and data output were written using HTML4.01/CSS and PHP.
First, the user uploads or selects a model or structure to work with. After uploading the model or retrieving the structure from the Protein Data Bank (PDB), a job folder is created on the server by a PHP script where the model data and all subsequent analysis data are stored. A job password is then created and stored in a dedicated text file. Once the model file is stored, an interface with the analysis options (sliding window step, analysis thresholds) is shown to the user. After the user selects the analysis parameters, the PHP script calls a Perl script (bpredmain.pl) that is in charge of all the analysis steps. The first task executed by the Perl script is to call the Naccess application and retrieve the Naccess solvent exposure data for the selected chain (computed alone) and the Naccess data for the full protein complex. Comparing the data for the chain alone and the data from the full complex identifies the interface residues. The structural data for the selected chain are then analyzed for local model quality by the Perl script, which calls the V3D software. Naccess data, V3D data, and interface determinations are then stored in a text file in CSV format. Next, the PHP script reads this detailed analysis file to format the results and generate the server output. Figure 5 presents the server algorithm. With these data, a visualization of the structure is dynamically provided through the Jmol applet.28 Finally, solvent exposure and model quality charts from the server output are dynamically generated using the JpGraph PHP software.29
The proposed B-cell epitope prediction method, B-Pred, assumes the availability of a 3D structure of the target protein, either as an experimentally determined structure or as a model. This structure is used to assign two scores to each residue, one for solvent exposure and one for local structure quality. The solvent exposure values are computed for the selected chain alone and for the selected chain in the context of the whole protein complex. Comparing these two sets of values identifies and flags the residues involved in the protein–protein interface. The protein is then split into a number of overlapping peptides using a sliding window approach. A solvent exposure score and a quality score are then computed for each peptide, as the average of the scores of its constituent residues. If the values of the solvent exposure and structure quality are above the predefined thresholds, the peptide is flagged as potentially containing a B-cell epitope. The following two sections describe the computation of the default threshold values. Figure 1 provides a diagram of this flowchart, and Figure 5 describes the server algorithm underlying the flowchart.
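The scoring scheme just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the server's Perl/PHP implementation: the names `scan_peptides` and `PeptideScore` are hypothetical, the per-residue values are invented, and the toy call uses a short window so the result can be checked by hand. The `window=20` and `step=3` defaults mirror the default window size and sliding step reported in this paper; the cut-offs passed in the example (43.05 for exposure, 0 for quality) are the values quoted in the results for one training set, and treating them as general defaults is an assumption.

```python
from statistics import mean
from typing import List, NamedTuple, Sequence


class PeptideScore(NamedTuple):
    start: int       # 0-based index of the first residue in the window
    end: int         # 0-based index one past the last residue
    exposure: float  # mean per-residue solvent exposure
    quality: float   # mean per-residue local quality score
    positive: bool   # True when both means exceed their thresholds


def scan_peptides(
    exposure: Sequence[float],
    quality: Sequence[float],
    exposure_cutoff: float,
    quality_cutoff: float,
    window: int = 20,
    step: int = 3,
) -> List[PeptideScore]:
    """Score overlapping peptides by averaging per-residue values."""
    if len(exposure) != len(quality):
        raise ValueError("need one exposure and one quality value per residue")
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    results = []
    # A chain shorter than the window yields no peptides.
    for start in range(0, len(exposure) - window + 1, step):
        end = start + window
        avg_exposure = mean(exposure[start:end])
        avg_quality = mean(quality[start:end])
        flagged = avg_exposure > exposure_cutoff and avg_quality > quality_cutoff
        results.append(PeptideScore(start, end, avg_exposure, avg_quality, flagged))
    return results


# Toy chain of eight residues with made-up scores.
toy_exposure = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
toy_quality = [0.1] * 8
hits = scan_peptides(toy_exposure, toy_quality,
                     exposure_cutoff=43.05, quality_cutoff=0.0,
                     window=4, step=2)
for hit in hits:
    print(hit)
```

With these inputs the three windows have mean exposures of 25, 45, and 65, so the second and third are flagged while the first is not.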
In order to test the predictive power of this method and determine the optimal thresholds for solvent exposure and structure quality, a dataset of 1336 experimentally determined epitopes derived from 106 proteins was built (see methods section). This dataset is available at http://immuno.bio.uniroma2.it/bpred/table/table2.pl.
This study analyzed the distribution of epitope sizes in the experimental dataset. As expected, there is a clear bias toward round numbers, in particular 10 mers (12.4% of total epitopes), 15 mers (23.4%) and 20 mers (11.9%). There is a prevalence of 15 mers, which are probably a good compromise between the willingness of the researcher to include a number of residues higher than the expected size for an average B-cell epitope in the synthesized peptides and the cost of the peptide synthesis. Nevertheless, a significant number of 20 mers are also present in the dataset. Epitopes and peptides longer than 20 are poorly represented (9.6%).
The criteria for scoring a peptide as true positive are based on the inclusion of the experimental epitope in the considered peptide. Therefore, we selected 20 as the default sliding window size in order to include as many epitopes from the experimental dataset as possible, while retaining a good degree of resolution in the results. This allowed us to consider 90.4% of the experimental epitopes in our dataset.
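As a rough illustration of this criterion, the sketch below labels a window as a true positive when an experimental epitope lies entirely inside it. The function names, the inclusive residue numbering, and the example coordinates are assumptions made for the illustration only.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (first residue, last residue), both inclusive


def contains_epitope(peptide: Span, epitope: Span) -> bool:
    """True when the epitope lies entirely within the peptide window."""
    return peptide[0] <= epitope[0] and epitope[1] <= peptide[1]


def is_true_epitope_window(peptide: Span, epitopes: Iterable[Span]) -> bool:
    """True when at least one experimental epitope fits inside the window."""
    return any(contains_epitope(peptide, epitope) for epitope in epitopes)


# Hypothetical coordinates: a 20-residue window and two mapped epitopes.
window = (10, 29)
epitopes = [(12, 26), (25, 34)]
print([contains_epitope(window, e) for e in epitopes])
print(is_true_epitope_window(window, epitopes))
```

Under this rule an epitope longer than the window can never be fully contained, which is why the window size bounds the fraction of experimental epitopes that can be considered.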
In order to avoid the overselection of contiguous peptides in the same protein region that would be obtained by moving the sliding window by one residue each time, we introduced a default sliding step of three residues.
The experimental dataset was randomly partitioned into five different training (80% of the peptides) and test (20%) sets. The training sets were evaluated independently to determine the optimal thresholds for both the solvent exposure and structure quality scores. To this end, a 20-mer scan was performed on the proteins in each set. Each peptide was assigned a solvent exposure score (mean of individual Naccess scores of the 20 residues) and a local model quality score (mean of individual V3D scores of the 20 residues).
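One way to generate such partitions is sketched below. The text does not specify whether the five pairs come from independent random shuffles or from a five-fold scheme; the sketch assumes independent shuffles, and the function name `random_splits`, the fixed seed, and the placeholder identifiers are all illustrative.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_splits(
    items: Sequence[T],
    n_splits: int = 5,
    train_fraction: float = 0.8,
    seed: int = 0,  # fixed only to make the illustration reproducible
) -> List[Tuple[List[T], List[T]]]:
    """Return n_splits independent (training, test) partitions of items."""
    rng = random.Random(seed)
    n_train = round(len(items) * train_fraction)
    splits = []
    for _ in range(n_splits):
        shuffled = list(items)
        rng.shuffle(shuffled)
        # The first 80% becomes the training set, the rest the test set.
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# Placeholder identifiers standing in for dataset entries.
entries = [f"entry_{i}" for i in range(100)]
pairs = random_splits(entries)
print([(len(train), len(test)) for train, test in pairs])
```

Each pair is disjoint and together covers the whole dataset, so thresholds tuned on a training set can be evaluated on data that was not used to choose them.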
First, the ability of the Naccess score to identify the true epitopes at different V3D thresholds was assessed. Figure 2 shows the mean values of the area under the curve obtained with the ROC analysis at the different values of V3D for the five training sets. The training sets performed quite homogeneously, and the best performances were obtained from the V3D values in the range of 0–0.2 (P < 0.05 for all comparisons, χ2 test).
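For readers unfamiliar with the quantity being compared, the area under the ROC curve can be estimated directly from the scores of epitope-containing and non-epitope peptides. The sketch below uses the rank-based (Mann-Whitney) formulation, which is a standard equivalent of the area under the empirical ROC curve; it is not the GraphPad Prism procedure used in this study, and the scores are invented.

```python
from typing import Sequence


def roc_auc(positive_scores: Sequence[float],
            negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up mean exposure scores for peptides with and without an epitope.
with_epitope = [50.0, 60.0, 40.0]
without_epitope = [30.0, 45.0, 20.0]
print(round(roc_auc(with_epitope, without_epitope), 3))
```

An area of 0.5 corresponds to a score that does not separate the two groups, while 1.0 corresponds to perfect separation; repeating the calculation on peptides that pass each quality threshold gives one value per threshold to compare.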
In agreement with the ROC analysis, the sensitivity of the method slightly improved when the model quality, as assessed by V3D, was taken into account (Figure 3). In accordance with the original publication that describes the V3D software,22 the best cutoff values for V3D were in the range of 0–0.2.
Interestingly, the five different training sets determined similar cut-off values at each level of V3D (data not shown). Among the five training sets, the worst performance was a sensitivity of 36.65%, which was associated with a specificity of 70% for the Naccess cut-off value of 43.05, and a V3D above 0.
The 20 mers in the five test sets were also scored using the ElliPro, LEPD, and CBTOPE methods for comparison (see methods section).
Figure 4 presents the result from this comparative analysis in terms of sensitivity, specificity, and predictive values. The mean values obtained for each parameter for the test sets are also reported. Interestingly, B-Pred presented a significant improvement in positive predictive values (P < 0.03 all comparisons, χ2 test) compared to the other three methods used.
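The four measures compared here follow from the counts of true and false positives and negatives. The short sketch below shows the standard definitions; the function name and the counts are hypothetical and are not taken from the study's data.

```python
from typing import Dict


def performance(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true epitopes recovered
        "specificity": tn / (tn + fp),  # fraction of non-epitopes rejected
        "ppv": tp / (tp + fp),          # fraction of positive calls that are right
        "npv": tn / (tn + fn),          # fraction of negative calls that are right
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
scores = performance(tp=30, fp=60, tn=240, fn=70)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the positive predictive value depends on how many peptides are called positive in total, a method with modest sensitivity can still reach a higher positive predictive value than a more sensitive one if it produces fewer false positives.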
Furthermore, the B-Pred thresholds were set up with the aim of achieving a high specificity, and a similar approach is at the basis of the cut-off value used in the ElliPro method; therefore, the specificities of the two systems are comparable (Figure 4). The sensitivity of B-Pred is equivalent to the sensitivity of ElliPro, which supports the notion that B-Pred performs as well as, if not better than, ElliPro. Indeed, the sensitivity of B-Pred was worse than 0.35 for test set 4, which was the same as ElliPro (P > 0.05, χ2 test) (Table 1).
On the other hand, B-Pred presented a significantly lower sensitivity than CBTOPE (0.37 versus 0.54, respectively, P < 0.02 all comparisons, χ2 test) (Table 2). However, B-Pred presented a significantly higher specificity (0.69 versus 0.49, respectively, P < 0.01, χ2 test). Therefore, B-Pred achieved a significantly higher positive predictive value (Figure 4).
After achieving these positive results, we incorporated the method described above in a web server, which is accessible at http://immuno.bio.uniroma2.it/bpred (Figure 5). The server can use a protein structure, a model uploaded by the user, or a PDB id as input.30 Each job is assigned a random number that can be used to retrieve the results at a later time.
After the model file is uploaded, or the structure has been retrieved from the PDB, the website displays details on the structure including a 3D rendering with the Jmol applet. By default, the server proposes the parameter values (length of peptides, sliding window size, solvent exposure, and structure quality thresholds) that have been optimized for the dataset presented in this paper. These values can be changed at any time, as the analysis can be reiterated with different parameters within the same work session.
If the uploaded structure or model comprises more than a single chain, the user can select the chain to be analyzed.
The detailed results of the analysis (Figure 6) are presented as follows:
Figure 3 shows part of the server output screen in which the sequence overview and the peptide scan results are shown for a sample PDB file.
This paper investigates whether the solvent accessibility of a peptide in the context of a full protein structure can be used to identify potential epitopes. Lollier et al15 recently questioned this relationship that is widely used in algorithms for B-cell epitope prediction tools based on linear sequence analysis31 and protein structures.2
This paper presents a simple approach to analyzing 3D models or structures using Naccess and V3D algorithms to obtain values for solvent exposure and local model/structure quality, respectively. Selection occurs by scoring a sequence as “positive” when these values are above the defined threshold.
Several B-cell epitope prediction methods have recently been developed based on the linear protein sequence or on protein structure coordinates.2,3 However, B-cell epitope prediction methods are often largely inaccurate for several reasons.10,13 The characteristics that render a sequence suitable for antibody binding are still poorly defined despite extensive research in this area, making even the prediction of linear epitopes a difficult task. In terms of prediction methods based on 3D structures, a paradigm is emerging that shows that a significant number of proteins (about 40% of all human proteins) contain at least one disordered segment of 30 amino acids or more, while 25% of all human proteins are likely to be entirely disordered and might reach a defined structure only when interacting with a ligand.32,33 Therefore, experimental crystal structures or structure models may not necessarily reflect the real conformation of proteins or protein complexes in solution. Despite these limitations, a number of methods demonstrate significant predictive power when challenged with experimental datasets.
This report presents a rather simple structure-based method that only analyzes two different parameters (local solvent exposure and local structure quality). The software is called B-Pred and can be freely accessed through a web server located at http://immuno.bio.uniroma2.it/bpred. This method is aimed at predicting linear, continuous epitopes (as opposed to conformational/discontinuous epitopes). B-Pred showed a sensitivity of 0.37 at a specificity of 0.69 (Table 1), making the method comparable with other published methods that are based on linear protein sequences (LEPD), 3D coordinates (ElliPro), or SVM models (CBTOPE), with a slightly increased specificity, thereby minimizing the number of false-positive predictions. It should be noted that the conditions used to test LEPD, ElliPro, and CBTOPE are different from the ones reported in their original papers, as the epitopes database is different and assumptions were made in order to score 20 mers with these methods.
This method was implemented and made publicly available by the development of a web server with a number of novel features that are not available in similar servers. This server is biased toward scoring potential immunological reagents (peptides) derived from protein sequences. B-Pred uses a sliding window to scan the sequence and identify potential epitopes. The parameters of the analysis can be modified during subsequent iterations in order to identify the reagents that are most suited to the specific needs of the user.
A unique feature of the B-Pred server is the identification of the residues located in protein-protein interaction surfaces. This information can be relevant in designing peptides for use in the production of antibodies/antisera with specific characteristics. In this context, it could be speculated that antibodies directed at protein-protein interfaces could display neutralizing activity by preventing or competing with the formation of active protein complexes. Conversely, antibodies targeted at areas not involved in complex formation, but still located on solvent exposed regions, could be suitable reagents for the immunoprecipitation of whole protein complexes. Of course, the B-Pred server is just a contribution toward this ambitious goal.
Although the B-Pred server considers conformational and structural information to determine solvent accessibility, it exclusively focuses on the prediction of continuous linear epitopes, as opposed to discontinuous epitopes that can be predicted by other B-cell prediction servers. While this limits the scope of the method, it allows for an immediate translation of the results into peptidic reagents for bench research, which is one of the main purposes of this system.
According to Lollier et al, surface and solvent exposure, as assessed by different methods (Relative Surface Accessibility and Protrusion Index) cannot be reliably correlated to antigenic propensity.15 There are a number of reasons why an experimentally determined epitope can have poor solvent exposure in the context of the 3D structure of the full protein. For instance, a protein or allergen can be denatured or otherwise processed before or after being injected into an animal for immunization. Before the availability of prediction methods based on structure, peptides were selected from protein sequences using propensity scale indexes and were successfully used to raise antisera or monoclonal antibodies. It is common knowledge that monoclonals exist that will only work in western blots, and thereby recognize sequences that become exposed only after protein denaturation during electrophoresis and blotting procedures. However, other monoclonals are suitable for immunoprecipitation of the target protein from undenatured lysates, and thereby recognize solvent accessible surface sequence stretches that are either continuous or discontinuous with respect to the linear sequence. For these and other reasons, many sequences stored in databases as containing B-cell epitopes can have an overall poor surface exposure. The detailed information about the methods and experimental conditions used for their identification would be extremely useful for the development of more targeted prediction methodologies that would be able to take all of the above considerations into account.
It should be noted that, despite the report by Lollier et al,15 in the present study we do observe a correlation between surface exposure and antigenic propensity. This could be due to a number of reasons that are worth investigating. For example, for intrinsic reasons, our epitopes dataset is entirely biased toward proteins for which a 3D structure is either available or for which a model can be reliably computed. Therefore, it is possible that our experimental epitopes dataset is biased toward peptides that were predicted for subsequent synthesis and experimental testing using existing structural prediction methods that most often incorporate surface exposure information in their algorithms. Since the methodology for peptide design is not readily available in epitope databases, it is not easy to verify this kind of hypothesis. Research is underway to address these issues.
Since the current B-Pred implementation is based on a single parameter (solvent exposure) that is filtered on local model quality, it is reasonable to assume that the method could be further improved by the inclusion of additional structural parameters12,34 and/or by combining the solvent exposure, as directly determined from the structure, with classical linear propensity scales. Research is under way to investigate the possible inclusion of additional parameters to improve the prediction accuracy of the current method.
Among the possible applications of this method, the development of diagnostic reagents for serological analysis is worth mentioning. A protein encoded in the genome of a pathogen of interest can be analyzed for potential B-cell epitopes that could be targeted by the humoral host response. Peptides containing these epitopes have potential as diagnostic reagents for serological tests if the selected sequences are specific to the pathogen of interest. In order to optimize the discovery of B-cell epitopes with diagnostic potential by identifying amino acid sequences that are only present in a given pathogen strain (which is related to the concept of “conservancy”35), an interesting development of this work will be to automatically link the B-Pred analysis with a BLAST analysis. Work is currently in progress to achieve this goal in a future version of the B-Pred web server.
In conclusion, this study provides a new, freely accessible online tool for the selection of candidate B-cell epitopes in proteins of interest, and is focused on the design of experimental reagents for a variety of biological and biotechnological applications.
This study was supported by the “Distretto Tecnologico delle Bioscienze del Lazio,” FILAS 2009. We are grateful to Professor Gajendra PS Raghava for allowing us to access the full data of the Bcipep database and to Dr Emanuele Buonomo for providing informatics advice for building the epitopes dataset.
The authors report no conflicts of interest in this work.