T-cells are the key component of the adaptive immune system, playing a pivotal role in fighting both infectious agents and cancer cells (1). T-cell-based immune responses are driven by antigenic peptides (epitopes) presented in the context of major histocompatibility complex (MHC) molecules (2). Therefore, the prediction of peptides that can bind to MHC molecules has become the basis for the anticipation of T-cell epitopes (3). MHC molecules fall into two major classes, namely MHC class I (MHCI) and MHC class II (MHCII). Antigens presented by MHCI and MHCII are recognized by two distinct sets of T-cells, CD8+ T-cells and CD4+ T-cells, respectively. Identification of T-cell epitopes is important for both understanding disease pathogenesis and vaccine design. Thus, the availability of computational methods that can readily identify potential epitopes from primary protein sequences has fueled a new paradigm in vaccine development that is driven by this epitope discovery.
A major complication to this vaccine development approach is the extreme polymorphism of the MHC molecules. In humans, MHC molecules are known as human leukocyte antigens (HLAs), and there are hundreds of allelic variants of the class I (HLA I) and the class II (HLA II) molecules. These HLA allelic variants bind distinct sets of peptides, as MHC polymorphism is the basis for peptide-binding specificity (4), and are expressed at vastly variable frequencies in different ethnic groups (5). This complexity suggests that a large number of HLA molecules would have to be targeted for peptide-binding predictions, and that a broadly protective multi-epitope vaccine would require an impractically large number of peptides. Interestingly, groups of several HLA molecules (supertypes) can bind largely overlapping sets of peptides (6). The identification of these HLA supertypes facilitates epitope-based vaccine development for two reasons: first, targeting representative HLA alleles from distinct supertypes allows the immune response to be stimulated in a variety of genetic backgrounds; second, selecting promiscuous peptide binders to the alleles included within a given supertype limits the number of peptides to be considered without decreasing the spectrum of the immune response.
In this paper, we describe a web server, PEPVAC (Promiscuous EPitope-based VACcine), that allows the prediction of promiscuous epitopes to five HLA I supertypes: A2 (A*0201-07, A*0209 and A*6802), A3 (A*0301, A*1101, A*3101, A*3301, A*6801 and A*6601), A24 (A*2402 and B*3801), B7 (B*0702, B*3501, B*5101-02, B*5301 and B*5401) and B15 (A*0101, B*1501_B62 and B*1502). These supertypes were defined using a method based on the clustering of the predicted peptide-binding repertoire of MHC molecules (8). The combined phenotypic frequency of these supertypes is >95% for five major American ethnicities (Black, Caucasian, Hispanic, Native American and Asian). Thus, targeting these supertypes with epitope predictions would potentially provide a population coverage ≥95%, regardless of ethnicity.
Peptides binding to HLA I molecules are potential CD8+ T-cell epitopes. In vivo, the C-terminus of these antigenic epitopes results from the selective proteolysis of cytosolic proteins mediated by the proteasome (9). The proteasome is thus important for determining these epitopes. Therefore, PEPVAC has also been implemented with an algorithm for the identification of those peptides containing a C-terminus that is likely to be the result of proteasomal cleavage. Finally, PEPVAC also allows the prediction of conserved epitopes from sequences in which the variable sites have been masked. The combination of these two features serves both to refine the predictions of T-cell epitopes and to limit the number of potential epitopes.
Prediction of peptide-MHCI binding
The peptide-binding mode of MHCI molecules differs from that of MHCII (10), and as a result, the prediction of peptide-MHCII binding is less reliable than that of peptide-MHCI binding. Therefore, we have focused here on the prediction of MHCI ligands, a class that is specifically recognized by CD8+ cytotoxic T lymphocytes. Peptides binding to a specific MHCI molecule are related by sequence similarity, and thus we use position-specific scoring matrices (PSSMs) derived from aligned MHCI ligands as the predictors of peptide-MHCI binding, in combination with a dynamic algorithm. PSSMs are also known as profiles and weight matrices and have previously been shown to be adequate tools for the prediction of peptide-MHC binding (13). PSSMs are derived from block alignments of MHCI ligands that are of the same length. Such a restriction guarantees proper structural alignment of ligands and subsequent accuracy of the peptide-binding predictions (13). Given that MHCI ligands are usually nine residues in length, the PSSMs used in this study are for the prediction of ligands of that same size (nine residues). The accuracy of the prediction of peptide-MHCI binding using PSSMs varies depending on the threshold and the targeted MHCI molecule. On average, however, ROC analyses of the predictions at different thresholds result in AUC values (Area Under ROC Curve) above 0.8, indicating that these PSSMs are very good predictors of peptide-MHCI binding. Furthermore, >80% of known CD8+ T-cell epitopes can be predicted at a 2% threshold from their protein sources.
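As an illustration of the PSSM approach described above, the following minimal Python sketch builds a log-odds matrix from equal-length ligands and ranks the 9mers of a protein. It is not the PEPVAC implementation: the uniform background, the pseudocount, the function names and the reading of the "2% threshold" as the top-scoring 2% of 9mers are all assumptions made for illustration, and the three ligand strings in the usage note are toy inputs.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PEPTIDE_LENGTH = 9


def build_pssm(ligands, pseudocount=1.0):
    """Build a log-odds PSSM from aligned ligands of identical length.

    Assumption: uniform background frequency (1/20) and a flat pseudocount.
    """
    if any(len(p) != PEPTIDE_LENGTH for p in ligands):
        raise ValueError("all ligands must be 9 residues long")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(ligands) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(PEPTIDE_LENGTH):
        counts = Counter(p[pos] for p in ligands)
        pssm.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific scores of a 9mer."""
    return sum(pssm[i][aa] for i, aa in enumerate(peptide))


def predict_binders(pssm, protein, top_fraction=0.02):
    """Return (score, peptide) pairs for the top-scoring fraction of 9mers.

    Windows containing non-standard symbols are skipped.
    """
    windows = (protein[i:i + PEPTIDE_LENGTH]
               for i in range(len(protein) - PEPTIDE_LENGTH + 1))
    scored = sorted(
        ((score_peptide(pssm, p), p) for p in windows
         if set(p) <= set(AMINO_ACIDS)),
        reverse=True,
    )
    if not scored:
        return []
    n_keep = max(1, math.ceil(top_fraction * len(scored)))
    return scored[:n_keep]
```

For example, `predict_binders(build_pssm(["LLFGYPVYV", "GLSPTVWLS", "ILKEPVHGV"]), protein)` would return the best-scoring 2% of 9mers in `protein`, best first.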
Supertypes: identification and population coverage analysis
We defined HLA I supertypes through clustering of predicted MHC peptide-binding repertoires (8). In brief, the core of the method consists of the generation of a distance matrix whose coefficients are inversely proportional to the number of peptide binders shared by any two HLA molecules (Figure 1). Subsequently, this distance matrix is fed to a phylogenic clustering algorithm to establish the kinship among the distinct HLA peptide-binding repertoires. Figure 2 shows a phylogenic tree built upon the peptide-binding repertoire of 55 HLA I molecules, using a Fitch and Margoliash clustering algorithm (17). We defined supertypes () as groups of HLA I alleles with ≥20% peptide-binding overlap (pairwise between any pair of alleles). The supertypes identified in this study include the A2, A3, B7, B27 and B44 supertypes previously identified by Sidney et al. Furthermore, we have also identified three new supertypes, BX, B15 and B57 (). The cumulative phenotypic frequency (CPF) of these supertypes is shown in the table of cumulative phenotype frequencies. CPF was calculated using the gene and haplotype frequencies reported for five distinct American ethnic groups: Blacks, Caucasians, Hispanics, North American Natives and Asians (18). CPF represents the population coverage that would be provided by a vaccine composed of epitopes restricted by the alleles included in the supertype. The A2, A3 and B7 supertypes have the largest CPF in the five studied ethnic groups, close to 90%, irrespective of ethnicity. To increase the population coverage to ≥95%, regardless of ethnicity, it is necessary to include at least two more supertypes. Specifically, the supertypes A2, A3, B7, B15 and A24/B44 represent the minimal supertypic combinations with the indicated population coverage. Alleles belonging to each of these supertypes are shown in and .
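The overlap criterion and distance matrix described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the exact overlap formula, so the sketch uses the shared fraction of the union of two repertoires (Jaccard index) and takes the distance as one minus that overlap; the function names and the allele labels in the usage note are hypothetical. The Fitch and Margoliash clustering step itself is not reproduced here.

```python
from itertools import combinations


def overlap_fraction(binders_a, binders_b):
    """Fraction of peptides shared by two predicted repertoires.

    Assumption: overlap is measured as |A & B| / |A | B| (Jaccard index).
    """
    union = binders_a | binders_b
    return len(binders_a & binders_b) / len(union) if union else 0.0


def distance_matrix(repertoires):
    """Pairwise distances that decrease as the shared repertoire grows.

    Assumption: distance = 1 - overlap; the source only states that the
    coefficients are inversely related to the number of shared binders.
    """
    alleles = sorted(repertoires)
    return {
        (a, b): 1.0 - overlap_fraction(repertoires[a], repertoires[b])
        for a in alleles
        for b in alleles
    }


def is_supertype(group, repertoires, min_overlap=0.20):
    """True if every pair of alleles in the group overlaps by >= min_overlap."""
    return all(
        overlap_fraction(repertoires[a], repertoires[b]) >= min_overlap
        for a, b in combinations(group, 2)
    )
```

With toy repertoires such as `{"X1": {"p1", "p2", "p3", "p4"}, "X2": {"p1", "p2", "p3", "p5"}, "X3": {"p9"}}`, the pair `X1`/`X2` shares three of five peptides and passes the 20% criterion, whereas any group containing `X3` does not.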
Figure 1 Strategy to define HLA I supertypes. HLA I supertypes are identified by clustering their peptide-binding repertoire (8). The method consists of four basic steps. (i) Predict the peptide-binding repertoire (i,j sets in figure) of each HLA I molecule from ...
Figure 2 HLA I peptide-binding overlap and supertypes. The figure shows an unrooted dendrogram built after clustering the overlap between the peptide-binding repertoires of the indicated HLA I molecules. Peptide-binding repertoires of HLA I molecules were obtained ...
Cumulative phenotype frequency of defined supertypes
PEPVAC web server
Following the HLA I supertypic analysis discussed above, we have implemented a tool for the prediction of promiscuous peptide binders to a set of supertypes with a CPF >95%, irrespective of ethnicity. We named this tool PEPVAC, and it is available online at http://immunax.dfci.harvard.edu/PEPVAC/, hosted by the Molecular Immunology Foundation/Dana-Farber Cancer Institute. The web interface to PEPVAC is divided into several sections that facilitate intuitive use (Figure 3). The main features of the web server are discussed below.
Figure 3 The PEPVAC web server. (A) PEPVAC input page. The page is divided into several sections. E-MAIL, for obtaining the results via e-mail (optional). GENOMES, where a selection of genomes from pathogenic organisms is available, as well as the possibility ...
Input and limitations
In PEPVAC, the input query for epitope predictions is entered in the GENOME section (Figure 3A). Input consists of one or more protein sequences in FASTA format. Only the standard 20 amino acid residues are considered. There are several translated genomes from pathogenic organisms that can be selected as inputs. More usefully, a user-provided local file containing a set of protein sequences can be uploaded to the server using the choose/browse button. PEPVAC can also process files with protein sequences in which the variable sites have been masked with a dot ‘.’ symbol. In that case, peptide-binding predictions will be carried out only over consecutive stretches of nine or more residues. Sequences with variable positions masked according to the Shannon entropy variability metric (4) can be obtained at the site http://immunax.dfci.harvard.edu/bioinformatics/Tools/sva.html. Currently, there is a limit of 200 sequences and 50 000 symbols that can be processed per request. If these limits are exceeded, the server will return an error.
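The input rules above (FASTA format, per-request limits, and predictions restricted to unmasked stretches of nine or more residues) can be illustrated with the minimal sketch below. It is not the server's code: the function names are hypothetical, and it assumes that the 50 000-symbol limit counts every sequence character, including mask symbols.

```python
MAX_SEQUENCES = 200      # per-request limit stated in the text
MAX_SYMBOLS = 50_000     # assumption: counts all sequence characters
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limits(records):
    """Raise ValueError if the request exceeds the stated limits."""
    n_symbols = sum(len(seq) for _, seq in records)
    if len(records) > MAX_SEQUENCES or n_symbols > MAX_SYMBOLS:
        raise ValueError("request exceeds 200 sequences or 50 000 symbols")


def unmasked_ninemers(sequence, length=9):
    """Yield (start, peptide) for 9mers made only of standard residues.

    A window containing a '.' mask (or any non-standard symbol) is skipped,
    so only unmasked stretches of nine or more residues contribute peptides.
    """
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        if set(window) <= STANDARD:
            yield start, window
```

For the masked toy sequence `ACDEFGHIKL.MNPQ`, only the two 9mers starting at positions 0 and 1 are produced, because every other window crosses the masked site.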
Supertypes and thresholds
The A2, A3, B7, B15 and A24 ( and ) supertypes have been chosen for promiscuous peptide-binding predictions in PEPVAC. Only those peptides that are predicted to bind to all the alleles included in a supertype are returned in the output (). The threshold for the prediction of promiscuous peptide binders in PEPVAC has been fixed to provide a reduced and manageable set of promiscuous peptide binders to each supertype. As an example, the predicted promiscuous peptides to the above five supertypes from a genome such as that of Influenza virus A (4160 amino acids distributed in 10 distinct open reading frames) represent only 5.51% (254 9mer peptides) of all possible peptides (4617 9mer peptides).
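The rule that only peptides predicted to bind every allele of a supertype are reported amounts to a set intersection. The sketch below shows this under the assumption that per-allele predictions are already available as collections of peptides; the function names and the allele labels in the usage note are placeholders, not PEPVAC identifiers.

```python
def promiscuous_binders(binders_by_allele, supertype_alleles):
    """Peptides predicted to bind all alleles listed for a supertype."""
    sets = [set(binders_by_allele[allele]) for allele in supertype_alleles]
    return set.intersection(*sets) if sets else set()


def percent_of_all(n_selected, n_total):
    """Share of all candidate 9mers retained, in percent."""
    return 100.0 * n_selected / n_total
```

For instance, with `{"a1": ["P1", "P2", "P3"], "a2": ["P2", "P3"], "a3": ["P3", "P4"]}` and the group `["a1", "a2", "a3"]`, only `P3` is returned. The same helper reproduces the kind of percentage quoted in the text, for example `percent_of_all(170, 4617)` for 170 peptides retained out of 4617.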
In PEPVAC, predictions of supertypic peptide binders are combined with the prediction of proteasomal cleavage using probabilistic language models derived from HLA I-restricted epitopes (14). Currently, there are three optional models for proteasomal cleavage that differ in the sensitivity/specificity ratio of their predictions, as discussed elsewhere (14). These models are selected within the PROTEASOME CLEAVAGE section. Model 1 has the highest sensitivity (~95%) and the lowest specificity (~60%). Conversely, Model 3 has the lowest sensitivity (65%) and the highest specificity (80%). Model 2 has a sensitivity and specificity of ~70%. Promiscuous peptide binders containing a C-terminal end predicted to be the result of proteasomal cleavage are shown in violet in the result page (). In the previous example with Influenza virus A, the list of promiscuous peptide binders to the five selected supertypes decreases from 254 to 170 peptides (3.7% of all 9mer peptides from the Influenza virus A genome) after considering proteasomal cleavage using Model 1. Furthermore, combining the predictions of peptide-MHCI binding and proteasomal cleavage increases the specificity of the epitope predictions by discarding predicted peptide-MHCI binders that are experimentally unable to elicit CD8+ T-cell responses (20).
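The way a cleavage prediction narrows a list of binders can be sketched as a simple filter on the C-terminal residue of each peptide. The cleavage models themselves are not described in this text, so the sketch takes the predictor as a caller-supplied function; `is_cleaved_after` and `split_by_cleavage` are hypothetical names, and the rule used in the usage note is a toy placeholder rather than any of the three PEPVAC models.

```python
def split_by_cleavage(binders, protein, is_cleaved_after, length=9):
    """Split binders by whether their C-terminus is a predicted cleavage site.

    binders: iterable of (start, peptide) pairs located in `protein`.
    is_cleaved_after: hypothetical predictor taking (protein, index) and
    returning True if cleavage is predicted right after that residue.
    Returns (supported, unsupported) lists of (start, peptide) pairs.
    """
    supported, unsupported = [], []
    for start, peptide in binders:
        c_term_index = start + length - 1  # position of the last residue
        if is_cleaved_after(protein, c_term_index):
            supported.append((start, peptide))
        else:
            unsupported.append((start, peptide))
    return supported, unsupported
```

As a usage example, passing `lambda prot, i: prot[i] == "L"` as a stand-in predictor keeps only binders ending in leucine; a real model would replace this placeholder.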
The results page returned by PEPVAC is shown in . This page first displays a summary of the predictions, including the chosen selections, the number of predicted peptides and the minimum population coverage provided by the supertypic selection, followed by the predicted peptide binders to each of the selected supertypes (only A3 in the shown example). Peptides are predicted to bind to all alleles included in the supertype, and appear ranked according to the PSSM of the first allele included in the supertype. Relevant information about each sorted peptide includes its protein source as well as its molecular weight.