In this study we have presented a comprehensive dataset for the systematic evaluation of MHC class II peptide binding prediction methods. This dataset consists of three components. The first component is a large set of 10,017 quantitative peptide-binding affinities for 16 MHC class II types that significantly expands the amount of publicly available data. These data were generated under identical experimental conditions and comprise affinities for binders as well as non-binders. The second component is a set of non-redundant structures of MHC class II molecules complexed with peptide ligands compiled from the PDB. This set of structures provides a “gold standard” for evaluating the ability of prediction methods to locate the 9-mer core of epitopes. The last component is a set of 664 peptides that have been tested experimentally to determine their ability to stimulate CD4+ T cells from the widely utilized C57BL/6 (H-2b) strain of laboratory mice. Together, these datasets serve as a benchmark set to facilitate the development and testing of algorithms for predicting peptide binding to MHC as well as T-cell responses.
Several previous studies have compared the performance of various MHC class II binding prediction methods. The Borras-Cuesta study from 2000 compared only a limited number of peptides, alleles and methods. The two more recent studies were published after we finished our initial comparison. Gowthaman et al. compared six commonly used methods with data spanning seven MHC class II alleles. However, their evaluation dataset comprised only 179 peptides, limiting the significance of their results. Rajapakse et al. compared their multi-objective evolutionary algorithm (MOEA) with five other algorithms using two datasets. The first consisted of 1 training and 10 testing datasets for HLA DRB1*0401 assembled from different sources. The second was extracted from the IEDB and comprised more than 5,000 peptides covering 16 MHC class II alleles. We could not include MOEA in our comparison because it is not currently publicly available. Despite the differences in the datasets used, their conclusion is consistent with ours in that SMM-align, TEPITOPE and ARB are the better performing methods.
We have carried out a comprehensive, unbiased evaluation of existing MHC class II epitope prediction algorithms using these datasets. With the exception of the ARB binding predictions, all MHC class II prediction algorithms were evaluated in a completely blinded fashion. In our analysis, the better performing methods proved to be those that are based on quantitative matrices extended by method-specific features. For example, SMM-align is the only method tested that considers the contribution of residues outside of the binding groove, and TEPITOPE is the only method whose matrices are based on experiments designed to determine the contribution of individual amino acids to binding. Quantitative matrices alone are not sufficient to ensure good performance, since methods based purely on position-specific scoring matrices, such as RANKPEP and SYFPEITHI, do not perform as well.
One potential reason for the differences in performance is that the methods were likely trained on different numbers of data points. In this respect, we anticipate that the datasets described herein, now made publicly available, could be used to retrain several of the methods and further increase their performance.
Despite the large number of existing MHC class II epitope prediction methods, the best performance is generally not as good as that of MHC class I methods. Indeed, it is notable that the majority of methods examined in the present study have also been employed to make predictions for MHC class I peptide binding, and almost invariably their performance is appreciably better in the context of class I. For example, when SMM was applied to predict epitopes for several MHC class I molecules, it achieved an average AUC of 0.874, which is substantially higher than that for class II (0.783).
In an attempt to identify what limits the performance of MHC class II binding prediction, we tested the ability of prediction methods to identify the 9-mer peptide cores revealed in crystal structures of MHC-peptide complexes. Except for PROPRED and SYFPEITHI, the methods examined performed poorly, suggesting that difficulties in identifying the correct binding core contribute to the inferior performance of class II binding prediction. It is noteworthy that the two methods with the best core predictions do not take all positions of a peptide into account when making binding predictions, but rather focus on anchor positions in the peptide. This may explain why the ARB method in particular performs much worse in core identification than in binding prediction: it treats all positions in the peptide identically and relies on automated peptide alignments to derive an overall peptide profile. While this inclusion of weakly interacting positions can be an advantage for predicting overall peptide binding, it may lower the accuracy when picking the correct core.
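To make the core-identification task concrete, the following is a minimal sketch of how a matrix-based method might pick a 9-mer core: every 9-residue window of the peptide is scored against a position-specific matrix and the highest-scoring window is reported. The function name `best_core`, the `toy_matrix` values and the example peptide are illustrative assumptions, not the matrices or code of any method evaluated here.

```python
from typing import Dict, List, Tuple

CORE_LEN = 9  # length of the binding core considered in the text


def best_core(peptide: str, matrix: List[Dict[str, float]]) -> Tuple[int, str, float]:
    """Return (offset, core, score) for the highest-scoring 9-mer window.

    `matrix` holds one dict per core position, mapping amino acid -> score.
    Residues missing from a column contribute 0.0 (an assumption of this sketch).
    Ties are resolved in favour of the left-most window.
    """
    if len(matrix) != CORE_LEN:
        raise ValueError("matrix must have one column per core position")
    if len(peptide) < CORE_LEN:
        raise ValueError("peptide is shorter than the core length")

    best = None
    for offset in range(len(peptide) - CORE_LEN + 1):
        core = peptide[offset:offset + CORE_LEN]
        score = sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(core))
        if best is None or score > best[2]:
            best = (offset, core, score)
    return best


# Hypothetical anchor-focused matrix: only positions 1, 4, 6 and 9 carry weight.
toy_matrix: List[Dict[str, float]] = [dict() for _ in range(CORE_LEN)]
toy_matrix[0] = {"F": 2.0, "Y": 1.5, "W": 1.5}
toy_matrix[3] = {"L": 1.0, "M": 1.0}
toy_matrix[5] = {"A": 0.5, "G": 0.5}
toy_matrix[8] = {"L": 1.0, "V": 0.5}

offset, core, score = best_core("PKYVKQNTLKLAT", toy_matrix)
print(offset, core, score)
```

Because only the anchor columns are populated, the choice of window is driven entirely by anchor residues. A matrix with weight at all nine positions would let weakly interacting positions influence which window wins, which is the trade-off discussed above.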
In an attempt to improve upon the prediction performance realized by individual prediction tools, we implemented a consensus approach for class II binding predictions. The consensus approach clearly outperformed each individual prediction approach when measured over the entire dataset, and provided the best predictions for 10 out of 14 molecules. This shows that the consensus approach is just as useful for MHC class II peptide binding prediction as it recently proved to be for MHC class I molecules. In a smaller study addressing 3 different prediction methods in the context of a single DR type, Mallios previously came to a similar conclusion.
Other types of meta approaches have been successfully applied to MHC binding prediction. For example, Mallios has used an iterative stepwise discriminant analysis meta-algorithm to successfully classify binders and non-binders for HLA-DR1. Stern and co-workers effectively used a two-dimensional dot plot to combine the prediction results of SYFPEITHI and TEPITOPE. Finally, Trost et al. have reported achieving greater accuracy in MHC class I binding predictions by combining results from multiple prediction tools. Compared to these methods, our median rank approach does not depend on the absolute values of scores, and it scales well since typical sorting algorithms have running times proportional to n log n, where n is the number of cases to be sorted. Overall, it is astonishing that the systematic use of consensus predictions has come rather late to the problem of MHC peptide binding (see Mallios), since consensus approaches have long proven their superiority in a number of fields, notably protein structure prediction.
In any case, it is also likely that the remarkable increase in performance obtained with the consensus approach hinges on the fact that it combines information from methods trained on large numbers of data points with information from methods that incorporate structural considerations and therefore predict cores effectively. We are currently working on the development of algorithms that specifically combine these two features.
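The median rank idea discussed above can be sketched as follows. This is a minimal illustration and not the implementation used in this study: it assumes that each method's scores have already been oriented so that a higher value means a stronger predicted binder, and the names `ranks_descending`, `median_rank_consensus` and the `method_a`/`method_b`/`method_c` scores are hypothetical.

```python
from statistics import median
from typing import Dict, List


def ranks_descending(scores: List[float]) -> List[float]:
    """Rank peptides within one method: rank 1 = highest score.

    Tied scores share the average of the ranks they occupy.
    The sort is the dominant cost, i.e. O(n log n) per method.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next peptide has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def median_rank_consensus(method_scores: Dict[str, List[float]]) -> List[float]:
    """Consensus value per peptide = median of its ranks across methods."""
    per_method = [ranks_descending(s) for s in method_scores.values()]
    n = len(per_method[0])
    if any(len(r) != n for r in per_method):
        raise ValueError("all methods must score the same peptides")
    return [median(r[i] for r in per_method) for i in range(n)]


# Three hypothetical methods with scores on entirely different scales.
method_scores = {
    "method_a": [0.9, 0.1, 0.5, 0.3],
    "method_b": [120.0, 10.0, 80.0, 95.0],
    "method_c": [3.0, 2.0, 1.0, 4.0],
}
consensus = median_rank_consensus(method_scores)
print(consensus)  # lower consensus rank = stronger predicted binder
```

Only the ordering within each method enters the result, so methods that report scores on different scales can be combined without rescaling. A method that reports IC50-like values, where lower is better, would need its scores negated before ranking under the assumption stated above.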
We also tested the ability of MHC class II binding prediction methods to predict a peptide's ability to activate CD4+ T cells. Most of the methods performed well. This was somewhat surprising since T cell activation is a multi-step process in which multiple signals are needed for successful activation. In addition, a peptide that binds well to MHC molecules is not necessarily a good stimulator of a T-cell response, because different amino acids interact with the T cell receptor. It is important to point out that the performance was based on a set of 664 peptides of which only 9 activated CD4+ T cells. The limited number of positive cases makes the ROC curve jagged and the calculated AUC values less robust. Despite the encouraging AUC values achieved by several methods, it is still necessary to test a large number of peptides to identify most of the T cell activating peptides. In addition, all of these methods still produce high numbers of false positives, that is, peptides that are predicted binders but do not activate T cells. Since experimental efforts to test T cell activation are even more time consuming than testing peptide-MHC binding, significant efforts are needed to develop tools that can identify T cell activating peptides with high sensitivity and specificity.
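The effect of having only 9 positives among 664 peptides on the AUC can be illustrated with a small sketch. The AUC is computed here as the probability that a randomly chosen positive outscores a randomly chosen negative; the scores and label arrangements are synthetic and chosen only to show the arithmetic, not taken from the data in this study.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    Labels are 1 for T cell activating peptides and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


n_total, n_pos = 664, 9  # dataset size and number of positives from the text

# Synthetic, strictly decreasing scores: index 0 is the top-ranked peptide.
scores = [float(n_total - i) for i in range(n_total)]

# Case 1: all 9 positives are ranked at the very top.
labels_ideal = [1] * n_pos + [0] * (n_total - n_pos)

# Case 2: one of the 9 positives is ranked last instead.
labels_one_missed = [1] * (n_pos - 1) + [0] * (n_total - n_pos) + [1]

auc_ideal = auc(scores, labels_ideal)
auc_one_missed = auc(scores, labels_one_missed)
print(auc_ideal, auc_one_missed, auc_ideal - auc_one_missed)
```

With 9 positives, each positive accounts for one ninth of all positive-negative pairs, so the rank of a single peptide can shift the AUC by up to about 0.11. This is the sense in which the AUC values are less robust when positives are scarce.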
In conclusion, we have presented a set of benchmarks to facilitate the evaluation and development of MHC class II binding predictions. While several good methods are available, these do not reach the performance of those for MHC class I molecules. We have shown that a simple and robust consensus approach can improve the prediction performance for the great majority of the MHC class II molecules tested. Finally, we speculate that novel approaches that capture distinct features of MHC class II peptide interactions could lead to more successful predictions than the current approaches, which are commonly developed as extensions of MHC class I predictions.