The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
Prediction of peptide binding to major histocompatibility complex (MHC) molecules is a basis for anticipating T-cell epitopes, as well as for epitope discovery-driven vaccine development. In humans, MHC molecules are known as human leukocyte antigens (HLAs) and are extremely polymorphic. HLA polymorphism is the basis of differential peptide binding, which has until now limited the practical use of current epitope-prediction tools for vaccine development. Here, we describe a web server, PEPVAC (Promiscuous EPitope-based VACcine), optimized for the formulation of multi-epitope vaccines with broad population coverage. This optimization is accomplished through the prediction of peptides that bind to several HLA molecules with similar peptide-binding specificity (supertypes). Specifically, we offer the possibility of identifying promiscuous peptide binders to five distinct HLA class I supertypes (A2, A3, B7, A24 and B15). We estimated the phenotypic population frequency of these supertypes to be 95%, regardless of ethnicity. Targeting these supertypes for promiscuous peptide-binding predictions results in a limited number of potential epitopes without compromising the population coverage required for practical vaccine design considerations. PEPVAC can also identify conserved MHC ligands, as well as those with a C-terminus resulting from proteasomal cleavage. The combination of these features with the prediction of promiscuous HLA class I ligands further limits the number of potential epitopes. The PEPVAC server is hosted by the Dana-Farber Cancer Institute at the site http://immunax.dfci.harvard.edu/PEPVAC/.
T-cells are the key component of the adaptive immune system, playing a pivotal role in fighting both infectious agents and cancer cells (1). T-cell-based immune responses are driven by antigenic peptides (epitopes), presented in the context of major histocompatibility complex (MHC) molecules (2). Therefore, the prediction of peptides that can bind to MHC molecules has become the basis for the anticipation of T-cell epitopes (3). MHC molecules fall into two major classes, namely MHC class I (MHCI) and MHC class II (MHCII). Antigens presented by MHCI and MHCII are recognized by two distinct sets of T-cells, CD8+ and CD4+ T-cells, respectively. Identification of T-cell epitopes is important both for understanding disease pathogenesis and for vaccine design. Thus, the availability of computational methods that can readily identify potential epitopes from primary protein sequences has fueled a new paradigm in vaccine development that is driven by this epitope discovery.
A major complication to this vaccine development approach is the extreme polymorphism of the MHC molecules. In humans, MHC molecules are known as human leukocyte antigens (HLAs), and there are hundreds of allelic variants of the class I (HLA I) and the class II (HLA II) molecules. These HLA allelic variants bind distinct sets of peptides, as MHC polymorphism is the basis for peptide-binding specificity (4), and are expressed at vastly different frequencies in different ethnic groups (5). This complexity suggests that a large number of HLA molecules will have to be targeted for peptide-binding predictions, and that a broadly protective multi-epitope vaccine would require an impractically large number of peptides. Interestingly, groups of several HLA molecules (supertypes) can bind largely overlapping sets of peptides (6,7). The identification of these HLA supertypes facilitates epitope-based vaccine development for the following two reasons: first, targeting of representative HLA alleles from distinct supertypes allows the immune response to be stimulated in a variety of genetic backgrounds; second, the selection of promiscuous peptide binders to those alleles included within a given supertype limits the number of peptides to be considered without decreasing the spectrum of the immune response.
In this paper, we describe a web server, PEPVAC (Promiscuous EPitope-based VACcine), that allows the prediction of promiscuous epitopes to five HLA I supertypes: A2 (A*0201-07, A*0209 and A*6802), A3 (A*0301, A*1101, A*3101, A*3301, A*6801 and A*6601), A24 (A*2402 and B*3801), B7 (B*0702, B*3501, B*5101-02, B*5301 and B*5401) and B15 (A*0101, B*1501_B62 and B*1502). These supertypes were defined using a method based on the clustering of the predicted peptide-binding repertoire of MHC molecules (8). The combined phenotypic frequency of these supertypes is >95% for five major American ethnicities (Black, Caucasian, Hispanic, Native American and Asian). Thus, targeting these supertypes with epitope predictions would potentially provide a population coverage ≥95%, regardless of ethnicity.
Peptides binding to HLA I molecules are potential CD8+ T-cell epitopes. In vivo, the C-terminus of these antigenic epitopes results from the selective proteolysis of cytosolic proteins mediated by the proteasome (9). The proteasome is thus important for determining these epitopes. Therefore, PEPVAC also implements an algorithm for the identification of those peptides containing a C-terminus that is likely to be the result of proteasomal cleavage. Finally, PEPVAC also allows the prediction of conserved epitopes from sequences in which the variable sites have been masked. The combination of these two features serves both to refine the predictions of T-cell epitopes and to limit the number of potential epitopes.
The peptide-binding mode of MHCI molecules differs from that of MHCII (10–12), and as a result, the prediction of peptide-MHCII binding is less reliable than that of peptide-MHCI binding. Therefore, we have focused here on the prediction of MHCI ligands, a class that is specifically recognized by CD8+ cytotoxic T lymphocytes. Peptides binding to a specific MHCI molecule are related by sequence similarity, and thus we use position-specific scoring matrices (PSSMs) derived from aligned MHCI ligands as the predictors of peptide-MHCI binding, in combination with a dynamic algorithm. PSSMs are also known as profiles or weight matrices and have previously been shown to be adequate tools for the prediction of peptide-MHC binding (13–16). PSSMs are derived from block alignments of MHCI ligands that are of the same length. Such a restriction guarantees proper structural alignment of ligands and subsequent accuracy of the peptide-binding predictions (13,14). Given that MHCI ligands are usually nine residues in length, the PSSMs used in this study are for the prediction of ligands of that same size (nine residues). The accuracy of the prediction of peptide-MHCI binding using PSSMs varies depending on the threshold and the targeted MHCI molecule. On average, however, ROC analyses of the predictions at different thresholds result in AUC values (area under the ROC curve) above 0.8, indicating that these PSSMs are very good predictors of peptide-MHCI binding. Furthermore, >80% of known CD8+ T-cell epitopes can be predicted at a 2% threshold from their protein sources.
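As an illustration of PSSM-based scoring, the following is a minimal Python sketch and not the PEPVAC implementation. It assumes a log-odds matrix built from equal-length ligands with a uniform background and a pseudocount, and it interprets the 2% threshold as keeping the top-scoring 2% of all 9mers in a protein. The function names, the pseudocount and the background model are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(ligands, pseudocount=1.0):
    """Build a log-odds PSSM from aligned ligands of equal length.

    Assumption: uniform background frequencies and an additive
    pseudocount; the published PSSMs may be derived differently.
    """
    length = len(ligands[0])
    if any(len(p) != length for p in ligands):
        raise ValueError("all ligands must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(ligands) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in ligands)
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific scores of a peptide."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


def top_scoring_peptides(pssm, protein, fraction=0.02):
    """Return the top `fraction` of windows in `protein`, best first."""
    width = len(pssm)
    windows = [protein[i:i + width] for i in range(len(protein) - width + 1)]
    # Skip windows containing non-standard symbols.
    windows = [w for w in windows if all(aa in AMINO_ACIDS for aa in w)]
    ranked = sorted(windows, key=lambda w: score_peptide(pssm, w), reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]
```

Calling `top_scoring_peptides(pssm, protein)` with a matrix built from a set of known nine-residue ligands would return the highest-ranked 9mers of `protein` under these assumptions.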
We defined HLA I supertypes through clustering of predicted MHC peptide-binding repertoires (8). In brief, the core of the method consists of the generation of a distance matrix whose coefficients are inversely proportional to the peptide binders shared by any two HLA molecules (Figure 1). Subsequently, this distance matrix is fed to a phylogenetic clustering algorithm to establish the kinship among the distinct HLA peptide-binding repertoires. Figure 2 shows a phylogenetic tree built upon the peptide-binding repertoire of 55 HLA I molecules, using a Fitch and Margoliash clustering algorithm (17). We defined supertypes (Figure 2) as groups of HLA I alleles with ≥20% peptide-binding overlap (pairwise between any pair of alleles). The supertypes identified in this study include the A2, A3, B7, B27 and B44 supertypes previously identified by Sidney et al. (16). Furthermore, we have also identified three new supertypes, BX, B15 and B57 (Figure 2). The cumulative phenotypic frequency (CPF) of these supertypes is shown in Table 1. CPF was calculated using the gene and haplotype frequencies reported for five distinct American ethnic groups including Blacks, Caucasians, Hispanics, North American Natives and Asians (18). CPF represents the population coverage that would be provided by a vaccine composed of epitopes restricted by the alleles included in the supertype. The A2, A3 and B7 supertypes have the largest CPF in the five studied ethnic groups, close to 90%, irrespective of ethnicity. To increase the population coverage to ≥95%, regardless of ethnicity, it is necessary to include at least two more supertypes. Specifically, the supertypes A2, A3, B7, B15 and A24/B44 represent the minimal supertypic combinations with the indicated population coverage. Alleles belonging to each of these supertypes are shown in Figure 2 and Table 1.
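The overlap-based distance and the ≥20% pairwise criterion can be sketched as follows. This is an illustrative sketch only: the overlap is measured here as the shared fraction of the union of two repertoires, and the distance is taken as one minus that overlap, both of which are assumptions rather than the exact coefficients used in (8). The allele labels in the usage example are placeholders, and the tree-building step (Fitch and Margoliash) is not reproduced.

```python
from itertools import combinations


def overlap(binders_a, binders_b):
    """Fraction of binders shared by two repertoires.

    Assumption: shared binders divided by the union of both sets.
    """
    union = binders_a | binders_b
    if not union:
        return 0.0
    return len(binders_a & binders_b) / len(union)


def distance_matrix(repertoires):
    """Pairwise distances that decrease as the shared repertoire grows.

    `repertoires` maps an allele name to its set of predicted binders.
    Assumption: distance = 1 - overlap.
    """
    alleles = sorted(repertoires)
    return {
        (a, b): 0.0 if a == b else 1.0 - overlap(repertoires[a], repertoires[b])
        for a in alleles
        for b in alleles
    }


def is_supertype(group, repertoires, min_overlap=0.20):
    """True if every pair of alleles in `group` meets the overlap cutoff."""
    return all(
        overlap(repertoires[a], repertoires[b]) >= min_overlap
        for a, b in combinations(group, 2)
    )


# Toy repertoires with placeholder allele and peptide labels.
toy = {
    "X1": {"p1", "p2", "p3", "p4"},
    "X2": {"p2", "p3", "p4", "p5"},
    "Y1": {"p9"},
}
distances = distance_matrix(toy)
```

In this toy example, `X1` and `X2` share three of five distinct binders and would satisfy the 20% criterion, whereas `X1` and `Y1` share none. A matrix of this kind is what would then be passed to a clustering program.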
Following the HLA I supertypic analysis discussed above, we have implemented a tool for the prediction of promiscuous peptide binders to a set of supertypes with a CPF >95%, irrespective of ethnicity. We named this tool PEPVAC, and it is online at the site http://immunax.dfci.harvard.edu/PEPVAC/ hosted by the Molecular Immunology Foundation/Dana-Farber Cancer Institute. The web interface to PEPVAC is divided into several sections that facilitate intuitive use (Figure 3A). The main features of the web server are discussed below.
In PEPVAC, the input query for epitope predictions is entered in the GENOME section (Figure 3A). Input consists of one or more protein sequences in FASTA format. Only the 20 standard amino acid residues are considered. Several translated genomes from pathogenic organisms can be selected as inputs. More usefully, a user-provided local file containing a set of protein sequences can be uploaded to the server using the choose/browse button. PEPVAC can also process files with protein sequences in which the variable sites have been masked with a dot ‘.’ symbol. In that case, peptide-binding predictions will be carried out only over consecutive stretches of nine or more residues. Sequences with variable positions masked according to the Shannon entropy variability metric (4,19) can be obtained at the site http://immunax.dfci.harvard.edu/bioinformatics/Tools/sva.html. Currently, there is a limit of 200 sequences and 50000 symbols that can be processed per request. If these limits are exceeded, the server will return an error.
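The input handling described above can be approximated with the following sketch. It is a hypothetical re-implementation for illustration, not the server code: the function names are invented, and it assumes that the symbol limit counts every character of every sequence, including masking dots.

```python
MAX_SEQUENCES = 200    # per-request sequence limit stated in the text
MAX_SYMBOLS = 50000    # per-request symbol limit stated in the text
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_limits(records):
    """Raise ValueError if the request exceeds the stated limits."""
    if len(records) > MAX_SEQUENCES:
        raise ValueError("too many sequences in one request")
    # Assumption: masking dots count towards the symbol limit.
    if sum(len(seq) for _, seq in records) > MAX_SYMBOLS:
        raise ValueError("too many symbols in one request")


def conserved_ninemers(sequence, width=9):
    """List the 9mers found inside unmasked stretches of a sequence.

    Dots mark masked (variable) sites; stretches shorter than `width`
    yield no peptides, and windows with non-standard residues are skipped.
    """
    peptides = []
    for stretch in sequence.split("."):
        for i in range(len(stretch) - width + 1):
            window = stretch[i:i + width]
            if set(window) <= STANDARD:
                peptides.append(window)
    return peptides
```

For a masked sequence such as `ACDEFGHIKL.MNPQ..RSTVWYACDEF`, only the first and last stretches are long enough to contribute 9mers; the four-residue stretch between the dots is ignored.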
The A2, A3, B7, B15 and A24 supertypes (Figure 2 and Table 1) have been chosen for promiscuous peptide-binding predictions in PEPVAC. Only those peptides that are predicted to bind to all the alleles included in a supertype are returned in the output (Figure 3B). The threshold for the prediction of promiscuous peptide binders in PEPVAC has been fixed to provide a reduced and manageable set of promiscuous peptide binders to each supertype. As an example, the promiscuous peptides predicted to bind to the above five supertypes from the genome of Influenza virus A (4160 amino acids distributed in 10 distinct open reading frames) represent only 5.51% (254 9mer peptides) of all possible peptides (4617 9mer peptides).
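The selection of promiscuous binders amounts to intersecting the per-allele predictions within a supertype. The sketch below assumes that each allele's predictions are available as a mapping from peptide to score; the function name, the data layout and the toy scores are hypothetical. The shared peptides are ordered by the score under the first allele of the supertype, mirroring the ranking used on the PEPVAC results page.

```python
def promiscuous_binders(predicted_by_allele, supertype_alleles):
    """Peptides predicted to bind every allele in a supertype.

    `predicted_by_allele` maps allele name -> {peptide: score} for the
    peptides that passed that allele's threshold (hypothetical layout).
    The result is sorted by the score under the first listed allele.
    """
    if not supertype_alleles:
        return []
    first = predicted_by_allele[supertype_alleles[0]]
    shared = set(first)
    for allele in supertype_alleles[1:]:
        shared &= set(predicted_by_allele[allele])
    return sorted(shared, key=lambda pep: first[pep], reverse=True)


# Toy predictions; peptide labels and scores are invented for illustration.
toy_predictions = {
    "A*0301": {"pepA": 5.0, "pepB": 7.5, "pepC": 1.0},
    "A*1101": {"pepB": 2.0, "pepA": 9.0},
    "A*3101": {"pepA": 1.0, "pepB": 0.5, "pepD": 4.0},
}
a3_hits = promiscuous_binders(toy_predictions, ["A*0301", "A*1101", "A*3101"])
```

Requiring binding to all alleles, rather than to any one of them, is what keeps the output small: a peptide predicted for only one or two alleles, such as `pepC` or `pepD` here, is dropped.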
In PEPVAC, predictions of supertypic peptide binders are combined with the prediction of proteasomal cleavage using probabilistic language models derived from HLA I-restricted epitopes (14). Currently, there are three optional models for proteasomal cleavage that differ in the sensitivity/specificity ratio of their predictions, as discussed elsewhere (14). These models are selected within the PROTEASOME CLEAVAGE section. Model 1 has the highest sensitivity (~95%) and the lowest specificity (~60%). Conversely, Model 3 has the lowest sensitivity (65%) and the highest specificity (80%). Model 2 has a sensitivity and specificity of ~70%. Promiscuous peptide binders containing a C-terminal end predicted to be the result of proteasomal cleavage are shown in violet in the results page (Figure 3B). In the previous example with Influenza virus A, the list of promiscuous peptide binders to the five selected supertypes decreases from 254 to 170 peptides (3.7% of all 9mer peptides from the Influenza virus A genome) after considering proteasomal cleavage using Model 1. Furthermore, combining the predictions of peptide-MHCI binding and proteasomal cleavage increases the specificity of the epitope predictions by discarding predicted peptide-MHCI binders that are experimentally unable to elicit CD8+ T-cell responses (20).
The results page returned by PEPVAC is shown in Figure 3B. This page first displays a summary of the predictions, including the chosen selections, the number of predicted peptides and the minimum population coverage provided by the supertypic selection, followed by the predicted peptide binders to each of the selected supertypes (only A3 in the example shown). The listed peptides are predicted to bind to all alleles included in the supertype, and are ranked according to the PSSM of the first allele included in the supertype. Relevant information about each sorted peptide includes its protein source as well as its molecular weight.
This manuscript was supported by NIH grant AI50900 and the Molecular Immunology Foundation. We wish to acknowledge John-Paul Glutting for programming assistance. Funding to pay the Open Access publication charges for this article was provided by NIH grant AI50900.
Conflict of interest statement. None declared.