MHC molecules are extremely polymorphic, giving rise to many different peptide-binding specificities in the human population. More than 500 different HLA-DR molecules and more than 2000 different HLA-DQ and HLA-DP molecules have been described [2]. The only partially pan-specific HLA-DR prediction algorithm publicly available is the TEPITOPE method [27]. This method describes binding of peptides to 50 HLA-DR molecules. However, as shown in this work, the TEPITOPE method leaves large portions of the HLA-DR allelic polymorphism undescribed.
In the present work, we develop an HLA-DR pan-specific method, NetMHCIIpan, capable of providing quantitative predictions of peptide binding to all HLA-DR molecules with known protein sequence. The method is based on artificial neural networks and is trained on quantitative peptide HLA-DR binding data, with input that includes the peptide-binding core, the peptide flanking residues, and the HLA-DR residues estimated to be within interaction distance of the bound peptide. The natural strength of the method is its ability to predict binding of peptides to any HLA-DR molecule, which makes it truly HLA-DR pan-specific. Further, since the method is based on artificial neural networks, it can capture non-linear relationships defining the binding specificity both within the peptide and between the peptide and the HLA molecule. This is fundamentally different from the methodology underlying the TEPITOPE method, which relies on the approximation that peptide binding specificities can be determined as a summation over independent HLA pocket preferences. The method is validated in terms of prediction of peptide binding to hitherto uncharacterized HLA-DR molecules, large-scale leave-one-out experiments, cross-validation, and identification of endogenously presented peptides and experimentally validated binding cores. In all validation experiments, the NetMHCIIpan method performed better than or comparably to TEPITOPE, the only other partially HLA-DR pan-specific binding prediction method publicly available.
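To make the contrast concrete, the additive approximation described above can be sketched as follows. This is a minimal illustration only, not the TEPITOPE implementation: the function names, the 9-residue core length, the rule of scoring a peptide by its best-scoring window, and all profile values are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One table per core position, mapping residue -> contribution.
# (Hypothetical data structure; real pocket profiles are not reproduced here.)
PocketProfile = Dict[str, float]


def score_core(core: str, profiles: List[PocketProfile]) -> float:
    """Additive score: sum of independent per-position contributions."""
    if len(core) != len(profiles):
        raise ValueError("core length must match the number of pocket profiles")
    # Residues missing from a profile contribute 0.0 (an assumption of this sketch).
    return sum(profile.get(residue, 0.0) for residue, profile in zip(core, profiles))


def best_core(peptide: str, profiles: List[PocketProfile]) -> Tuple[str, float]:
    """Scan all windows of core length and return the highest-scoring one."""
    n = len(profiles)
    if len(peptide) < n:
        raise ValueError("peptide is shorter than the binding core")
    windows = [peptide[i:i + n] for i in range(len(peptide) - n + 1)]
    return max(((w, score_core(w, profiles)) for w in windows), key=lambda pair: pair[1])


# Toy profiles for a 9-residue core (made-up numbers, not TEPITOPE values).
TOY_PROFILES: List[PocketProfile] = [{} for _ in range(9)]
TOY_PROFILES[0] = {"F": 1.0, "Y": 0.8}  # position 1 prefers aromatic residues
TOY_PROFILES[3] = {"A": 0.5}            # position 4 mildly prefers alanine

core, score = best_core("GGFAAAAAAAAGG", TOY_PROFILES)
```

Because each position contributes independently, a scheme of this kind cannot express a preference that depends on the residue at another position or on a combination of HLA residues, which is the kind of non-linear relationship a neural network can in principle capture.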
A powerful application of the HLA-DR pan-specific prediction algorithm would be to search for highly promiscuous peptide sequences that bind to most HLA-DR alleles. Such peptides could be of high value in the development of synthetic and recombinant vaccines, since they would bind in most humans independently of MHC class II genetic background and thus potentially provide universal helper T cell activation. By way of example, we applied the pan-specific method to identify peptides predicted to bind a set of prevalent HLA-DR alleles. Prevalent alleles were defined as HLA-DR alleles with a maximal allelic frequency above 1% in an ethnic population, as reported by Middleton et al. [36]. In doing so, we could identify peptides predicted to bind promiscuously to all prevalent HLA-DR molecules. Earlier efforts have been made to identify such highly promiscuous peptides. The PADRE sequence [37] is one of the most prominent examples. Using the pan-specific method, the PADRE sequence is predicted to bind to fewer than 40% of the prevalent HLA-DR molecules. The analysis shown here demonstrates that exhaustive searches for truly pan-promiscuous HLA-DR binding peptides are indeed feasible using the proposed pan-specific method.
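The promiscuity search described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in this work: `predict_ic50` is a hypothetical user-supplied callable standing in for any binding predictor, the 500 nM binder threshold is an assumed cut-off that the text does not specify, and the peptide names, the rare allele `DRB1*XXXX`, and all affinity values are placeholders. Only the three allele frequencies for DRB1*1407, DRB1*1103 and DRB1*1202 are taken from the text.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical predictor signature: (peptide, allele) -> predicted IC50 in nM.
Predictor = Callable[[str, str], float]


def select_prevalent_alleles(max_frequency: Dict[str, float],
                             min_frequency: float = 0.01) -> List[str]:
    """Keep alleles whose maximal population frequency exceeds the cut-off (1%)."""
    return sorted(a for a, f in max_frequency.items() if f > min_frequency)


def promiscuity(peptide: str, alleles: Iterable[str], predict_ic50: Predictor,
                binder_threshold_nm: float = 500.0) -> float:
    """Fraction of alleles the peptide is predicted to bind (assumed threshold)."""
    alleles = list(alleles)
    if not alleles:
        raise ValueError("at least one allele is required")
    bound = sum(1 for a in alleles if predict_ic50(peptide, a) <= binder_threshold_nm)
    return bound / len(alleles)


def rank_by_promiscuity(peptides: Iterable[str], alleles: Iterable[str],
                        predict_ic50: Predictor,
                        binder_threshold_nm: float = 500.0) -> List[Tuple[str, float]]:
    """Sort peptides from most to least promiscuous."""
    alleles = list(alleles)
    scored = [(p, promiscuity(p, alleles, predict_ic50, binder_threshold_nm))
              for p in peptides]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Maximal allelic frequencies; the last entry is a made-up rare allele.
MAX_FREQUENCY = {"DRB1*1407": 0.125, "DRB1*1103": 0.05,
                 "DRB1*1202": 0.35, "DRB1*XXXX": 0.004}

# Made-up predicted affinities (nM) for two placeholder peptides.
TOY_IC50 = {
    ("PEPTIDE_A", "DRB1*1407"): 100.0, ("PEPTIDE_A", "DRB1*1103"): 200.0,
    ("PEPTIDE_A", "DRB1*1202"): 5000.0,
    ("PEPTIDE_B", "DRB1*1407"): 50.0, ("PEPTIDE_B", "DRB1*1103"): 80.0,
    ("PEPTIDE_B", "DRB1*1202"): 300.0,
}


def toy_predict(peptide: str, allele: str) -> float:
    return TOY_IC50[(peptide, allele)]


prevalent = select_prevalent_alleles(MAX_FREQUENCY)
ranking = rank_by_promiscuity(["PEPTIDE_A", "PEPTIDE_B"], prevalent, toy_predict)
```

A statement such as "predicted to bind to fewer than 40% of the prevalent HLA-DR molecules" corresponds to a `promiscuity` value below 0.4 in this sketch.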
The pan-specific approach relies on the ability of the neural networks to capture general features of the relationship between peptides and HLA sequences and to interpret these in terms of a binding affinity. For this approach to provide reliable predictions, it is essential that the polymorphism of the HLA molecules described by the pan-specific method is to some degree covered by the data included in the training of the method. For the NetMHCIIpan prediction method, we have included binding data covering only 14 of the more than 500 known HLA-DR molecules [2], thus very likely leaving large regions of the HLA specificity space uncovered. On the basis of the specificity clustering shown in the accompanying figure, we can identify HLA-DR alleles with uncharacterized binding specificities, since these alleles are found far from the alleles included in the training of the pan-specific method. Such novel HLA-DR molecules include DRB1*14 molecules such as DRB1*1407 (12.5%), some of the DRB1*11 molecules such as DRB1*1103 (5%), and DRB1*12 alleles such as DRB1*1202 (35%), which are placed close to the center of the tree. The number in parentheses after each allele is the maximal allelic frequency in an ethnic population as reported by Middleton et al. 2003 [36].
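One way to flag alleles that lie far from the training set can be sketched as follows. The distance measure here is an assumption chosen for illustration and is not necessarily the one behind the specificity clustering in this work: each allele is represented by a vector of predicted binding scores over a common peptide panel, and the distance between two alleles is taken as one minus the Pearson correlation of their vectors. All allele names and score values are placeholders.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def nearest_training_distance(profiles: Dict[str, Sequence[float]],
                              training: Iterable[str]) -> List[Tuple[str, float]]:
    """For each non-training allele, distance to its closest training allele.

    Returned list is sorted with the most distant (least covered) allele first.
    """
    training = set(training)
    if not training:
        raise ValueError("at least one training allele is required")
    result = []
    for allele, profile in profiles.items():
        if allele in training:
            continue
        distance = min(1.0 - pearson(profile, profiles[t]) for t in training)
        result.append((allele, distance))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Made-up predicted scores over a four-peptide panel (placeholder alleles).
TOY_SCORE_PROFILES = {
    "trained_1": [1.0, 2.0, 3.0, 4.0],
    "novel_close": [2.0, 4.0, 6.0, 8.0],  # same ranking of peptides as trained_1
    "novel_far": [4.0, 3.0, 2.0, 1.0],    # opposite ranking of peptides
}

candidates = nearest_training_distance(TOY_SCORE_PROFILES, ["trained_1"])
```

Alleles at the top of such a list would be natural candidates for new binding measurements, since their predicted specificity is least similar to anything the method was trained on.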
We have previously shown how integrative approaches that combine bioinformatics and immunoassays to identify and experimentally assay peptides with uncharacterized binding affinity can improve the prediction accuracy of peptide/MHC class I prediction algorithms [38]. Using the pan-specific approach to identify HLA class II molecules with uncharacterized binding specificities, we suggest extending this search strategy into the dimension of MHC polymorphism. A schematic illustration of this search strategy, integrating bioinformatics and high-throughput immunoassays, is shown in the accompanying figure.
Here, we illustrate an iterative cycle that first identifies novel MHC molecules whose predicted binding specificities are dissimilar to the specificities included in the training of the pan-specific method. Next, immunoassays should be developed for these molecules, and their binding specificity should be characterized by identifying peptides with uncharacterized binding affinity and assaying these peptides experimentally. Such an approach should allow for rapid and efficient sampling of both the MHC polymorphism and the diversity of peptide binding.
The current version of NetMHCIIpan and the benchmark data used in this work are available at http://www.cbs.dtu.dk/services/NetMHCIIpan. The service covers all HLA-DR alleles with known protein sequence. The method will be updated as more data become available. In the future, we hope to extend the method to also cover HLA-DQ and HLA-DP molecules.