Antigen presenting cells (APCs) sample the extracellular space and present peptides derived from it to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design, and developing methods for accurate prediction of peptide:MHC interactions plays a central role in epitope discovery. The MHC class II binding groove is open at both ends, making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles.
The predictive performance of the SMM-align method was demonstrated to be superior to that of the Gibbs sampler, TEPITOPE, SVRMHC, and MHCpred methods. Cross-validation between peptide data sets obtained from different sources demonstrated that direct incorporation of peptide length potentially results in over-fitting of the binding prediction method. Focusing on amino terminal peptide flanking residues (PFR), we demonstrate a consistent gain in predictive performance by favoring binding registers with a minimum PFR length of two amino acids. Visualizing the binding motifs obtained by the SMM-align and TEPITOPE methods highlights a series of fundamental discrepancies between the two predicted motifs. For the DRB1*1302 allele, for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the SMM-align method identifies a preference for hydrophobic or neutral amino acids at the anchors.
The SMM-align method was shown to outperform other state-of-the-art MHC class II prediction methods. The method predicts quantitative peptide:MHC binding affinity values, making it ideally suited for rational epitope discovery. The method has been trained and evaluated on what is, to our knowledge, the largest benchmark data set publicly available, and covers the nine HLA-DR supertypes suggested as well as three mouse H2-IA alleles. Both the peptide benchmark data set and the SMM-align prediction method (NetMHCII) are made publicly available.
Major histocompatibility complex molecules (MHCs) play an essential role in the host-pathogen interactions determining the onset of a host immune response. One arm of the cellular immune system is guided by the MHC class I complexes that present peptides derived from intracellular proteins to cytotoxic T cells circulating in the blood periphery. The MHC class II complexes guide the other arm of the cellular immune system. These complexes present peptides derived from endocytosed proteins to CD4+ helper T lymphocytes (HTLs) to stimulate cellular and humoral immunity against the pathogenic microorganism.
Predicting the peptides that bind to MHC class II molecules can effectively reduce the number of experiments required for identifying helper T cell epitopes and plays an important role in rational vaccine design. Large efforts have been invested in deriving such prediction methods. In general terms, the different methods can be classified into two groups: quantitative matrices estimated from experimentally derived position specific binding profiles [1-3], and data driven bioinformatical motif search methods. The number of different bioinformatical methods proposed to predict MHC class II binding is large and growing, and includes Gibbs samplers, ant colony search, artificial neural networks, support vector machines [7,8], hidden Markov models, and other motif search algorithms [10-12]. However, most of these methods have been trained and evaluated on very limited data sets covering only a single or a few different MHC class II alleles. Further, the majority of the methods are trained on MHC ligand data (peptides eluted from MHC complexes present on the cell surface). Qualitative prediction methods of this type are well suited to classify data into binders and non-binders, but they do not allow a direct prediction of the peptide:MHC binding strength.
Recently, a large set of quantitative MHC class II peptide-binding data has been made publicly available in the IEDB database. The data set comprises peptide data with IC50 binding affinities for more than 14 HLA (human MHC) and several mouse MHC class II alleles.
Here, we present a novel method, SMM-align, for quantitative MHC class II binding predictions. The method is an extension of the stabilization matrix method proposed by Peters et al. The SMM-align method seeks to identify a weight matrix that optimally reproduces the measured IC50 values for each peptide in the training set.
The method allows for identification of the MHC class II binding motif in terms of a position specific weight matrix. The output of the SMM-align method is IC50 binding affinity values, enabling direct readout of the peptide:MHC binding affinity.
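As a rough illustration of how a position specific weight matrix can be aligned to a peptide of arbitrary length, the sketch below scores every possible 9-mer register of a peptide and keeps the best one. The matrix values, the function names, and the convention that a higher score means stronger binding are assumptions made for this example; they are not the trained SMM-align parameters, and the training procedure itself is not shown.

```python
from typing import Dict, List, Tuple

CORE_LEN = 9  # length of the MHC class II binding core

# A weight matrix is a list of 9 dicts mapping amino acid -> weight.
WeightMatrix = List[Dict[str, float]]


def score_core(core: str, matrix: WeightMatrix) -> float:
    """Sum the position specific weights for one 9-mer core."""
    return sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(core))


def best_register(peptide: str, matrix: WeightMatrix) -> Tuple[int, str, float]:
    """Return (start offset, core, score) of the highest-scoring register.

    Ties are broken in favor of the register closest to the amino terminus.
    """
    if len(peptide) < CORE_LEN:
        raise ValueError("peptide is shorter than the binding core")
    candidates = [
        (score_core(peptide[i:i + CORE_LEN], matrix), i)
        for i in range(len(peptide) - CORE_LEN + 1)
    ]
    score, start = max(candidates, key=lambda c: (c[0], -c[1]))
    return start, peptide[start:start + CORE_LEN], score


# Toy matrix with hypothetical weights at positions P1, P4 and P6 only.
toy_matrix: WeightMatrix = [{} for _ in range(CORE_LEN)]
toy_matrix[0] = {"I": 1.0, "L": 0.8}
toy_matrix[3] = {"M": 0.7}
toy_matrix[5] = {"S": 0.5}

start, core, score = best_register("WIILGLNKIVRMYSPTSI", toy_matrix)
```

With this toy matrix the selected register starts at offset 8, so the eight residues preceding the core form the amino terminal flank. A real matrix would carry weights for all 20 amino acids at all nine positions.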
To our knowledge, only three other methods are publicly available for quantitative MHC class II prediction, namely the ARB, SVRMHC and MHCpred methods. Other methods such as SVMHC and Propred are implementations of the TEPITOPE method, and provide prediction scores that are not in any direct way proportional to the peptide binding affinity. Both the SVRMHC and MHCpred methods are trained on relatively small sets of quantitative peptide binding data contained within the AntiJen database and could probably improve if retrained on the data used here. The SVRMHC method covers six, and MHCpred three MHC class II alleles, respectively. The ARB method is trained on quantitative peptide binding data available within the IEDB database. TEPITOPE is an experimentally derived virtual matrix based prediction method that covers 50 different HLA-DR alleles, and relies on the approximation that the peptide binding specificity can be determined solely from alignment of MHC pocket amino acids.
In this work, we design a large-scale benchmark calculation covering 14 HLA-DR and three mouse H2-IA alleles. We compare the predictive performance of five methods in terms of their ability to predict binding affinity of more than 4600 peptides. The methods included in the benchmark are: SMM-align, Gibbs sampler, SVRMHC, MHCpred, ARB and TEPITOPE.
The SMM-align method was applied to derive a position specific scoring matrix for prediction of MHC-II binding affinities for each of the 14 HLA-DR and three H2-IA alleles in the benchmark dataset. The predictive performance of the method was estimated using five-fold cross-validation.
Cross-validated predictive performance
The predictive performance of the SMM-align method is compared to that of the Gibbs sampler, TEPITOPE, SVRMHC, MHCpred, and ARB methods. The SMM-align, Gibbs sampler, and ARB methods cover all 14 alleles. TEPITOPE covers 11, SVRMHC five, and MHCpred only three of the 14 alleles. The predictive performance of the different methods was measured in terms of the area under the ROC curve (AUC), the Pearson correlation, and the Spearman's rank correlation. Since the TEPITOPE method does not produce prediction values that are linearly related to the log-transformed IC50 binding affinities, the Pearson correlation coefficient would be an inappropriate measure of prediction accuracy for this method. Hence, for TEPITOPE we evaluate the predictive performance using only the other two measures.
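The three performance measures can be sketched in plain Python as follows. This is a minimal illustration rather than the evaluation code used for the benchmark: the toy IC50 values and prediction scores are invented, the negative base-10 logarithm is an assumed stand-in for the log-transformation, and the 500 nM binder threshold is the one used elsewhere in the text.

```python
import math
from typing import List, Sequence

BINDER_THRESHOLD_NM = 500.0  # binders have an IC50 stronger (lower) than 500 nM


def auc(scores: Sequence[float], is_binder: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, b in zip(scores, is_binder) if b]
    neg = [s for s, b in zip(scores, is_binder) if not b]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): measured IC50 in nM and predicted scores.
ic50 = [12.0, 85.0, 430.0, 900.0, 5000.0, 20000.0]
predicted = [0.81, 0.52, 0.38, 0.35, 0.40, 0.05]

measured = [-math.log10(v) for v in ic50]  # assumed transform, higher = stronger
labels = [v < BINDER_THRESHOLD_NM for v in ic50]

toy_auc = auc(predicted, labels)
toy_pearson = pearson(predicted, measured)
toy_spearman = spearman(predicted, measured)
```

The AUC and the Spearman's rank correlation depend only on the ordering of the prediction scores, which is why they remain meaningful for a method whose scores are not linearly related to the log-transformed affinities.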
Table 1 gives a summary of the HLA-DR benchmark calculation results.
Table 1. Summary of the HLA-DR benchmark results.
The table demonstrates the predictive power of the SMM-align method as compared to that of the other methods. Note that caution should be taken when evaluating the predictive performance of the ARB method. This method has been trained on data from the IEDB database, and thus very likely has been trained on data included in the benchmark evaluation set. The Gibbs sampler has relatively poor performance compared to the SMM-align method. There are many possible reasons for this low performance. Most importantly, the Gibbs sampler method is trained on qualitative data only. Before applying the Gibbs sampler method, the data are classified into binders and non-binders, and only the set of binders is included when estimating the binding motif weight matrix. The SMM-align method, on the other hand, includes both binding and non-binding data when estimating the motif weight matrix.
The details of the benchmark calculation as evaluated in terms of the AUC performance measure are shown in Table 2 (evaluation in terms of the Pearson's and Spearman's rank correlations are shown in Supplementary materials table 1 [see Additional file 1]).
Table 2. Details of the benchmark calculation covering the 14 HLA-DR alleles.
Additional file 1. Details of the benchmark calculation covering the 14 HLA-DR alleles. The predictive performance is shown in terms of the Pearson's correlation (upper table) and the Spearman's rank correlation (lower table) for the SMM-align, Gibbs sampler, TEPITOPE, SVRMHC, MHCpred, and ARB methods, respectively. The SMM-PFR method refers to the extended SMM-align method including penalties for long peptides and short amino terminal peptide flanking residues, and the NetMHCII method refers to the final extended SMM-align method including direct encoding of peptide flanking residues and penalties for longer peptides and short amino terminal peptide flanking residues. The first column gives the allele names as 1*0101 for DRB1*0101 etc. The last column gives the number of peptide data included for each allele. For each allele, the performance of the SMM-align, Gibbs sampler, and NetMHCII methods was estimated using five-fold cross-validation as described in the text.
The SMM-align, ARB and TEPITOPE methods all have similar predictive performances. Compared to the other methods, the SMM-align method has the highest predictive performance, followed by the Gibbs sampler and SVRMHC methods. The MHCpred method performs the poorest. The direct relation between the SMM-align prediction scores and the log-transformed IC50 binding affinity is confirmed by the fact that a least-squares linear fit to the log-transformed binding data as a function of the SMM-align prediction values has a slope close to unity (data not shown).
Incorporating the peptide length in prediction of peptide:MHC class II binding
Chang et al. recently proposed a strategy for incorporating peptide length into prediction of peptide-MHC class II binding, and demonstrated that, at least for some alleles, the approach led to significant improvements in prediction accuracy.
Here, we evaluate the peptide length-based approach on two data sets covering the three alleles included in the work by Chang et al. The data are taken from two sources. AntiJen: Data from the AntiJen database as downloaded from the supplementary data in Chang et al., and IEDB: Data from the IEDB database. For each allele in the two data sets, we train the SMM-align method in three distinct manners: a) using no peptide length information, b) including a peptide length affinity baseline estimated from the training data as described by Chang et al., and c) including a peptide length affinity baseline estimated from the data in the alternative data set, i.e., using the DRB1*0101 AntiJen data set to estimate the baseline correction for the training of the IEDB DRB1*0101 data and vice versa. The result of the benchmark calculation is shown in Table 3.
Table 3. Predictive performance in terms of the area under the ROC curve (AUC) of the different methods evaluated on six data sets.
Focusing on the first three rows in the table, the predictive performance of the SMM method is seen to compare favorably to that of both ISC-PLS and TEPITOPE. The average predictive performance in terms of AUC for the ISC-PLS, TEPITOPE, and SMM methods on the three alleles in the AntiJen data sets is 0.692, 0.692, and 0.738, respectively. Further, the table shows that the performance gain reported by Chang et al. for the AntiJen DRB1*0101 and DRB1*1501 allele data sets is recovered in our implementation (see SMM-regr). However, it is striking to observe that the alternative baseline correction consistently, for all alleles, leads to a dramatic drop in predictive performance (see SMM-regr-alter). This suggests that the baseline subtraction, rather than being allele specific, is highly data set dependent, and that the performance gain observed when including an affinity baseline correction does not necessarily reflect a genuine feature of MHC class II binding predictions, but may arise as a result of over-fitting the method to the length distribution and binding profile of a particular data set.
We observed a large discrepancy between the affinity-length profiles for the peptide data in the AntiJen, IEDB and SYFPEITHI databases (details of the calculation are shown in Supplementary materials, Figure 1 [see Additional file 2]). The short peptides (length < 15 amino acids) in the IEDB data set seem to follow an affinity profile in agreement with the observed length profile for natural MHC-II ligands in the SYFPEITHI database. This is in contrast to what was observed for the peptides in the AntiJen data set. For longer peptides, both the AntiJen and IEDB data sets followed a similar affinity profile that deviated strongly from the length profile of natural MHC-II ligands. The large discrepancy between the affinity-length profiles for the two databases provides a possible explanation as to why the alternative baseline correction gives such poor predictive performance. While the AntiJen data sets tend to have high binding affinities for short peptides, the opposite is the case for the IEDB data sets, and applying an AntiJen derived baseline correction to an IEDB data set thus could give no improvement to the prediction results.
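An affinity-length profile of the kind compared here can be computed with a few lines of code. The sketch below follows the description given for Additional file 2 (mean affinity per peptide length, lengths without data set to the overall mean, running mean of length three). The shortened window at the two ends of the profile, the function name, and the toy values are assumptions; this is not the baseline correction of Chang et al. itself.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def length_profile(
    data: Sequence[Tuple[int, float]], min_len: int, max_len: int
) -> Dict[int, float]:
    """Smoothed mean affinity per peptide length.

    data holds (peptide length, affinity) pairs on any consistent scale.
    """
    by_len: Dict[int, List[float]] = defaultdict(list)
    for length, affinity in data:
        by_len[length].append(affinity)
    overall = sum(a for _, a in data) / len(data)

    lengths = list(range(min_len, max_len + 1))
    raw = []
    for n in lengths:
        values = by_len.get(n)
        # Lengths without data fall back to the mean over the whole data set.
        raw.append(sum(values) / len(values) if values else overall)

    smoothed = []
    for i in range(len(raw)):
        # Running mean of length three; shorter window at the edges (assumed).
        window = raw[max(0, i - 1):i + 2]
        smoothed.append(sum(window) / len(window))
    return dict(zip(lengths, smoothed))


# Toy data (hypothetical): no peptides of length 10 are present.
toy = [(9, 0.2), (9, 0.4), (11, 0.6), (12, 0.8)]
profile = length_profile(toy, min_len=9, max_len=12)
```

Computing such a profile separately for two data sets makes it easy to see whether a length baseline estimated on one of them would transfer to the other.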
Figure 1. Length distribution of amino terminal PFRs for MHC-II binding and non-binding peptides. All peptide data for the three alleles in the AntiJen and IEDB data sets are included in the figure. Binding peptides have an affinity stronger than 500 nM. The PFR is defined as the residues flanking the peptide-binding core as determined by the SMM-align method.
Additional file 2. MHC-II binding affinity as a function of peptide length for three MHC-II alleles, DRB1*0101, DRB1*0401, and DRB1*1501. In the left figure, results for the DRB1*0101 allele are displayed, and the right figure shows an average over the three alleles. For each data set, the mean binding affinity for peptides of a given length is shown as a function of the peptide length. The black curves show the data in the AntiJen data set. The red curves show the data in the IEDB data set. The green curves show histograms of the length distribution of natural MHC ligands as downloaded from the SYFPEITHI database. As suggested by Chang et al., values for peptide lengths where no affinity data are available are set to the mean binding value over the entire data set. All curves are smoothed using a running mean of length three. It is clear from the figure that the AntiJen and IEDB data sets have very distinct mean binding profiles for short peptides (length < 15 amino acids). In this regime of peptide lengths, the IEDB data set, in contrast to the AntiJen data set, seems to follow an affinity profile in agreement with the observed length profile for natural MHC-II ligands. For longer peptides, both the AntiJen and IEDB data sets follow a similar affinity profile that deviates strongly from the length profile of natural MHC-II ligands.
Including peptide flanking residues
Examining differences in the length distribution of the amino terminal peptide flanking residues (PFR), as identified by the SMM-align method, suggests a feature common to both the AntiJen and IEDB data that separates binding from non-binding peptides.
In Figure 1, the length distribution of the amino terminal PFRs is shown for the combined set of binding and non-binding peptides in the AntiJen and IEDB data sets, respectively. In the figure, a PFR length of zero indicates that the MHC-II binding core starts at the amino terminus of the peptide, leaving a flanking region of zero amino acids. From the figure, it is apparent that a significantly larger fraction of the non-binding peptides have amino terminal PFRs shorter than two amino acids (p < 0.001, Fisher's exact test), suggesting that amino terminal PFRs do play an important role in stabilizing the peptide:MHC complex. A similar picture is not observed at the C terminus (data not shown).
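To make the PFR definition and the statistical comparison concrete, the sketch below derives the flank lengths from a core offset and evaluates a one-sided Fisher's exact test on a 2x2 table. The counts in the example are invented for illustration and are not taken from the AntiJen or IEDB data; the function names are likewise hypothetical.

```python
from math import comb
from typing import Tuple

CORE_LEN = 9  # length of the binding core


def pfr_lengths(peptide: str, core_start: int) -> Tuple[int, int]:
    """Return (amino terminal, carboxy terminal) PFR lengths for a core offset."""
    n_term = core_start
    c_term = len(peptide) - core_start - CORE_LEN
    if n_term < 0 or c_term < 0:
        raise ValueError("core does not fit inside the peptide")
    return n_term, c_term


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing a or more counts in the top-left
    cell when all row and column totals are held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, col1)


# Hypothetical counts, invented for illustration only.
# Rows: non-binders, binders. Columns: PFR < 2, PFR >= 2.
p_value = fisher_one_sided(40, 60, 15, 85)
```

A core starting at offset 0 gives an amino terminal PFR length of zero, matching the convention used in Figure 1.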
The requirement for presence of amino terminal flanking amino acids in combination with the observation that the SMM algorithm tends to over-predict the binding affinity for longer peptides (data not shown), suggests a simple scheme suitable for modifying the SMM predictions
where S is the original prediction score from the SMM method, and p is a parameter determining the penalty for short PFRs and longer peptides. In a small cross validation experiment, an optimal value for p was determined to be equal to 0.1.
The predictive performance using this ad-hoc modification scheme is shown in Table 1 through Table 3 as SMM-PFR. From the tables, it is apparent that the modification improves the predictive performance for all alleles in both data sets. The average performance in terms of AUC for the SMM-align method on the six alleles common to the IEDB and AntiJen data sets is 0.729. Using the proposed modification scheme, this number is increased to 0.748. For the alleles in the IEDB data set, the average predictive performance of the SMM-align method is 0.730. This value is increased to 0.749 using the PFR modification scheme. In total, the PFR modification scheme improved the predictive performance for 15 of the 17 (14 IEDB and 3 AntiJen) data sets, making the improvement highly statistically significant (p = 0.001, Fisher's exact test).
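As a simple cross-check of how unlikely 15 improvements out of 17 data sets would be by chance, one can compute a one-sided binomial sign test. Note that this is a swapped-in test chosen for illustration, assuming improvement and deterioration are equally likely under the null hypothesis; it is not the Fisher's exact test named in the text.

```python
from math import comb


def sign_test_p(improved: int, total: int) -> float:
    """One-sided binomial sign test: P(X >= improved) with success rate 0.5."""
    tail = sum(comb(total, k) for k in range(improved, total + 1))
    return tail / 2 ** total


# 15 of the 17 data sets improved with the PFR modification scheme.
p_value = sign_test_p(15, 17)  # 154 / 131072, about 0.0012
```

Under these assumptions the probability is about 0.0012, which is of the same order as the reported significance level.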
An attempt to directly encode the amino acid composition of the PFRs as input to the SMM-align method gave further improvements to the prediction accuracy. Here, the SMM weight matrix was extended to a length of 11 to incorporate the effect of PFRs. PFRs were encoded to the SMM-align method as the average Blosum62 score over a maximum length of three amino acids. The average predictive performance in terms of the AUC using this PFR encoding scheme in combination with the penalty for longer peptides and short amino terminal peptide flanking residues was 0.756 for the alleles in the IEDB data set, and 0.750 for the six alleles in the combined AntiJen and IEDB data set. The performance excluding PFR sequence encoding and including only the penalty for longer peptides and short amino terminal peptide flanking residues was 0.749 and 0.748, respectively, for the two data sets. The gain in predictive performance is minor. However, the performance is consistently increased for all three alleles in the AntiJen data set, and is increased for 11 of the alleles in the IEDB data set, making the improvement highly statistically significant (p = 0.001, Fisher's exact test) and suggesting that the amino acid composition of the PFRs does play some role in stabilizing the peptide:MHC complex.
Mouse H2-IA alleles
Next, the SMM-align method was applied to derive a method for prediction of MHC-II binding affinities for the set of three mouse H2-IA alleles in the benchmark dataset. The predictive performance of the method was estimated using five-fold cross-validation. The methods SMM-align, ARB and PredBalbc were included in the benchmark. Table 4 gives the results in terms of the AUC predictive performance of the benchmark calculation.
Table 4. Summary of the mouse H2-IA benchmark.
The table demonstrates the predictive power of the SMM-align method, as the performance is higher than or comparable to that of the ARB method. Note that caution should also be taken here when evaluating the predictive performance of the ARB method. This method has been trained on data from the IEDB database, and thus very likely has been trained on data included in the evaluation set. The PredBalbc method seems to perform significantly worse than the other two methods. Noteworthy is the limited gain in predictive performance of the SMM-align method when including peptide flanking residues and the penalty for long peptides and short amino terminal peptide flanking residues. Here, the H2-IAb allele shows a drop in predictive performance when including PFRs. However, the H2-IAb allele is trained on a very limited amount of peptide data, and one could speculate that PFRs might improve the predictive performance also for this allele as more peptide training data become available.
The final NetMHCII prediction method
The final MHC class II prediction method covers 14 HLA-DR and three H2-IA alleles. For each allele, the method is trained in a five-fold cross-validated manner as described in Methods using multiple sequence encoding schemes, Gibbs sampler derived position specific weight matrices, direct encoding of PFRs and penalties for longer peptides and short amino terminal peptide flanking residues. We denote the final method NetMHCII.
Visualization of the peptide binding motifs
The difference in predictive performance between the SMM-align and TEPITOPE methods is striking for several alleles. The binding motifs can be visualized in a highly condensed manner using sequence logos. Figure 2 shows such Kullback-Leibler logos for the binding motifs determined by the SMM-align, Gibbs sampler, and TEPITOPE methods, respectively, for the alleles DRB1*0101 and DRB1*1302. The logos are determined from the top 1% of 10,000 random natural peptides selected from the SWISS-PROT database. For the DRB1*0101 allele, the logos show a clear agreement between all three methods, with three major anchors at positions 1, 4 and 6. However, for the DRB1*1302 allele, the logos are in strong disagreement both with regard to the location of and the preferred amino acids at the anchor positions. The TEPITOPE method identifies positions 1 and 4 as anchors, with all anchors except P1 preferring basic amino acids. The SMM-align method, on the other hand, identifies positions 1 and 3 as primary anchors, all with a strong preference for hydrophobic or neutral amino acids.
Figure 2. Kullback-Leibler logo visualizations of peptide binding motifs. The upper panel depicts the motif for the DRB1*0101 allele, and the lower panel the motif for the DRB1*1302 allele. From left, the different columns show the motif estimated by the SMM (NetMHCII), Gibbs sampler, and TEPITOPE methods, respectively. The height of a column in the logo is proportional to the relative information content in the sequence motif, and the letter height is proportional to the amino acid frequency.
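The column heights of a Kullback-Leibler logo can be illustrated with the following sketch, which computes the per-position information content of a set of aligned 9-mer cores relative to a background distribution. The uniform background, the function name, and the four toy cores are assumptions for the example; a real logo would typically use natural amino acid background frequencies and the cores of the top-scoring peptides.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(
    cores: List[str], background: Optional[Dict[str, float]] = None
) -> List[float]:
    """Kullback-Leibler information (in bits) for each aligned position."""
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    info = []
    for pos in range(len(cores[0])):
        counts = Counter(core[pos] for core in cores)
        total = sum(counts.values())
        bits = 0.0
        for aa, count in counts.items():
            p = count / total
            bits += p * math.log2(p / background[aa])
        info.append(bits)
    return info


# Toy cores (hypothetical): position 1 is fully conserved, position 2 is not.
toy_cores = ["LKAGTSDEV", "LRCHSTNQI", "LHDKQAEGM", "LQEPVGKRF"]
info = column_information(toy_cores)
```

A fully conserved position reaches the maximum of log2(20) bits under a uniform background, whereas a degenerate anchor that accepts many amino acids yields a much lower column.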
We have developed an integrated alignment and motif identification algorithm, SMM-align. The method is a hybrid between the SMM method proposed by Peters et al. and the Gibbs sampler method. The method is trained on quantitative MHC:peptide binding data, allowing for a direct prediction of MHC:peptide binding affinities. The peptide data are encoded to the SMM-align method using several sequence schemes including sparse, Blosum and position specific weight matrix encoding. The binding prediction is determined as the ensemble average over the predictions obtained from the different encoding schemes. The search for the optimal SMM-align solution is performed using a Metropolis Monte Carlo (MC) search. To allow for an effective sampling of the potentially large number of local minima in the weight space, an ensemble average of suboptimal MC solutions was included in the SMM-align method. Finally, for the human HLA-DR alleles, prior knowledge of the preferred amino acids at the P1 position in the binding motif was implemented to guide the MC search. The final method is termed NetMHCII and covers 14 HLA-DR and three mouse H2-IA alleles.
The large-scale MHC class II peptide binding benchmark covering 14 HLA-DR and three mouse H2-IA alleles enabled an evaluation of the predictive performance of a set of different publicly available prediction methods including the Gibbs sampler, TEPITOPE, SVRMHC, MHCpred, and ARB methods. For each allele, the peptide binding data were split into five groups, each with minimal sequence overlap and thus ideally suited for cross-validated method validation. The benchmark calculation demonstrated that for the HLA-DR alleles the NetMHCII method outperforms most of the other methods. Only the ARB method had a comparable performance. However, a direct comparison to the predictive performance of the ARB method is difficult since the ARB method most likely is trained on the data included in the evaluation sets. The MHCpred method was shown to have the poorest performance. A general tendency was observed for small training data sets, where the TEPITOPE and NetMHCII prediction methods achieve comparable predictive performances, underlining the need for large data sets in order to generate accurate MHC class II prediction methods.
Incorporation of peptide length into MHC class II binding prediction algorithms as suggested by Chang et al. was demonstrated to result in a potentially strong over-fitting of the predictive performance. In a cross-validation experiment using affinity data from both the AntiJen and IEDB databases, the two data sets were shown to have highly different peptide length binding profiles, suggesting that the performance gain reported by Chang et al. does not necessarily reflect a genuine feature of MHC class II binding predictions, but could arise as a result of over-fitting the method to the length distribution and binding profile of a particular data set. One can speculate why the two data sets have such different affinity-length profiles. A possible reason could be that a large fraction (13%) of the peptides in the AntiJen data have an unnatural amino acid composition (more than 70% alanine, for instance). The fraction of unnatural peptides is less than 1% for the IEDB data set. The similarity in profile between the IEDB and SYFPEITHI data sets suggests that short peptides with length less than 13–14 amino acids do indeed bind poorly to MHC class II molecules. Both the IEDB and AntiJen affinity profiles show that the likelihood of binding has limited dependence on peptide length for longer peptides. This is in large disagreement with the length profile for natural MHC-II ligands, where the likelihood of observing a peptide of a given length decreases rapidly as the peptide length passes 16 amino acids. However, here it is important to stress the different nature of the three data sets. Both the IEDB and AntiJen data sets contain quantitative data on in vitro binding of peptides to MHC-II molecules. The SYFPEITHI data set, on the other hand, reflects the length of peptides that are naturally presented through the class II antigen presentation pathway. A major event in this pathway is binding to the MHC-II molecule.
The difference between the affinity profile and the profile of natural ligands supports the notion put forward by Nelson et al. that antigen processing continues after peptide binding to the MHC class II molecule: the longer peptides first bind to MHC class II and are then trimmed by exopeptidases before presentation.
The predictive performance of the SMM-align method is relatively poor when compared to the performance values obtained when predicting peptide binding for MHC class I alleles. Here, the predictive performance in terms of the area under the ROC curve tends to fall in the range 0.9–0.95 depending on the allele and number of data points available for the training. There are many possible explanations for the poor predictive performance. Most importantly, the MHC class II binding motif is more degenerate than that of MHC class I. For MHC class I, the anchor positions are highly conserved, often allowing accommodation of only a few different amino acids. As seen from the binding motifs in Figure 2, the situation is quite different for MHC class II. Here, even the most dominant anchor positions allow for a large number of different amino acids. Due to this high degeneracy one might expect a generally lower predictive performance. However, there are other issues affecting the predictive performance of the SMM-align (and most other MHC class II binding prediction) methods. The SMM-align method takes as a fundamental assumption that the peptide:MHC binding affinity is determined solely from the nine amino acids in the binding core motif. This is clearly a large oversimplification, since it is known that peptide flanking residues (PFR) on both sides of the binding core may contribute to the binding affinity and stability. An example of such influence of the peptide flanking amino acids can be observed for the DRB1*0401 restricted peptide WIILGLNKIVRMYSPTSI. Here, IVRMYSPTS is the core region as identified by both the SMM-align and TEPITOPE methods. The binding affinity for the peptide is 1.37 nM. However, a truncated version of the peptide, LNKIVRMYSPTSI, also exists in the data set. This peptide shares the binding core sequence with the complete peptide, but its binding affinity is more than 100-fold weaker (177.80 nM).
This example clearly illustrates the significance of the peptide flanking amino acids in determining the peptide binding affinity. Here, we have incorporated the PFRs both by directly encoding the amino acid composition of the PFRs as input to the SMM-align method, and through an ad-hoc strategy that disfavors binding registers with a short amino terminal PFR length and binding of longer peptides, and demonstrated that these PFR modification schemes indeed lead to a significant improvement in predictive performance.
Comparing the binding motifs identified by the NetMHCII and TEPITOPE methods highlighted a series of fundamental discrepancies. For the DRB1*1302 allele, for instance, the TEPITOPE method favors basic amino acids at most anchor positions, whereas the NetMHCII method identifies a preference for hydrophobic or neutral amino acids at the anchors. The TEPITOPE and NetMHCII methods are very different in nature. The TEPITOPE binding motif is derived using "virtual" matrices obtained by alignment of binding pocket amino acids and experimentally derived binding specificities. The NetMHCII binding motif, on the other hand, is derived directly from peptide binding data. It remains to be determined which amino acid preference conforms to the experimental binding motif.
Other groups have reported prediction algorithms with very high predictive performance values also for MHC class II binding. However, these studies have been limited to small data sets covering a single or a few different MHC molecules [6,8,30]. Here, we have designed a benchmark setup allowing for large-scale validation and comparison of MHC class II prediction algorithms. Future work based on this type of large-scale benchmark analysis should help identify which methodologies are most suitable for development of algorithms for MHC class II binding.
Quantitative peptide:HLA binding data were downloaded from the IEDB database November 2006 . Only HLA DR alleles with more than 100 unique peptides and mouse H2-IA data with more than 75 unique peptides were included. The final data set covers 14 HLA-DR and three H2-IA alleles, with a total number of peptide IC50 values of 5147. This dataset is thereafter referred to as the IEDB data set.
The SMM-align method includes a weight matrix encoding the amino acid preferences identified by a Gibbs sampler trained on HLA ligand data (peptides known to bind a given HLA complex). HLA ligand data were downloaded from the SYFPEITHI database. Only peptides that were at least nine amino acids long and not present in the IEDB data set were included. A total of 360 HLA ligands were included in the SYF data set.
A summary of the data is shown in Table 5.
Table 5. Data included in the benchmark calculation.
Designing a benchmark is considerably more difficult for class II than for class I binding prediction, owing to the broad length variation between peptides and the potential data redundancy this imposes. To evaluate a prediction method, the evaluation set must be defined so that none of its 9mer sequences are present in the training data. We have designed a simple Hobohm1-inspired algorithm that aims at minimizing the overlap between training and evaluation data. The algorithm is applied separately to the peptide data for each allele. Each new peptide is added to the list of non-redundant peptide sequences, NR, if it shares no identical nonamer with any of the peptides already on the NR list. Otherwise, the new peptide is added to the cluster defined by its first hit on the NR list. Next, all peptides on the NR list (together with their cluster members) are split into five subsets. We are aware that this approach does not ensure zero overlap at the 9mer level between the different data subsets. However, the overlap is minor and for most alleles on the order of 0.5–2% (data not shown).
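The clustering and partitioning steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and the way clusters are assigned to the five subsets (a seeded shuffle followed by round-robin assignment) is an assumption, since the text does not specify it.

```python
import random


def nonamers(peptide, k=9):
    """Return the set of all length-k substrings of a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def cluster_peptides(peptides, k=9):
    """Greedy Hobohm1-style clustering on shared k-mers.

    A peptide is compared only against the representatives on the NR list,
    not against cluster members, so some k-mer overlap between clusters can
    remain. The first member of each returned cluster is its representative.
    """
    clusters = []  # each entry: (k-mers of the representative, members)
    for pep in peptides:
        kmers = nonamers(pep, k)
        for rep_kmers, members in clusters:
            if kmers & rep_kmers:  # first hit on the NR list
                members.append(pep)
                break
        else:
            clusters.append((kmers, [pep]))  # new NR representative
    return [members for _, members in clusters]


def split_clusters(clusters, n_subsets=5, seed=0):
    """Distribute whole clusters over n_subsets partitions (assumed scheme)."""
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)
    subsets = [[] for _ in range(n_subsets)]
    for i, members in enumerate(order):
        subsets[i % n_subsets].extend(members)
    return subsets
```

Keeping each cluster intact when assigning subsets is what prevents a representative and its overlapping cluster members from ending up on opposite sides of a training/evaluation split.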
Gibbs sampler weight matrices
For each allele, a weight matrix describing the binding motif was constructed based on the relevant data in the SYF data set and the set of binding peptides in the IEDB training set, using the Gibbs sampler method as described by Nielsen et al. An IC50 threshold of 500 nM was used to identify peptide binders from the IEDB data set.
The SMM-align method
The binding motif for all MHC class II alleles is defined in terms of a 9 × 20 weight matrix, where 9 is the length of the binding motif and 20 is the number of different amino acids. The SMM-align method seeks to identify a weight matrix that optimally reproduces the measured IC50 values for each peptide. Inspired by the work on MHC class I binding, the IC50 affinity values in nM units are log-transformed using the relation 1 - log50k(IC50), where log50k denotes the logarithm to base 50,000, before optimizing the weights in the matrix. Peptides with affinity values greater than 50,000 nM are assigned a log-transformed value of zero. The weight matrix is next optimized so that the mean square error between predicted and measured log-transformed IC50 values is minimal.
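As a small worked example of the log-transformation, the sketch below maps IC50 values in nM onto the transformed scale and back. The function names are illustrative. No upper clipping is applied for values below 1 nM, because the text does not describe any.

```python
import math

MAX_IC50_NM = 50000.0  # base of the logarithm used in the transformation


def transform_ic50(ic50_nm):
    """Map an IC50 value in nM to 1 - log50k(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    if ic50_nm > MAX_IC50_NM:
        return 0.0  # non-binders above 50,000 nM are set to zero
    return 1.0 - math.log(ic50_nm) / math.log(MAX_IC50_NM)


def inverse_transform(score):
    """Map a transformed score back to an IC50 value in nM."""
    return MAX_IC50_NM ** (1.0 - score)
```

On this scale, the 500 nM threshold commonly used to define binders corresponds to a transformed value of about 0.426, and strong binders approach 1.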
The predicted binding affinity for a peptide sequence is determined as the highest nonamer peptide score within the peptide, where a nonamer peptide score is calculated as

$$S = \sum_{l=1}^{9} \sum_{a'} w_{la'} \, v_{a_l a'}$$

where $w_{la'}$ is the binding motif weight at position $l$ for amino acid $a'$, and $v_{a_l a'}$ is the sequence-encoding value for amino acid $a'$ given the amino acid $a_l$ found at position $l$ of the nonamer. The peptide sequences are presented to the SMM-align method using several sequence-encoding schemes. The first is the conventional sparse encoding, where each amino acid is encoded as a 20-digit binary number (a single 1 and 19 zeros). The second is the Blosum50 encoding, in which each amino acid is encoded as the Blosum50 scores for replacing it with each of the 20 amino acids. Note that since the sequence encoding for each amino acid is thus a constant "vector", the relation for the peptide score can be simplified to the conventional scoring scheme

$$S = \sum_{l=1}^{9} W_{l a_l}, \qquad W_{la} = \sum_{a'} w_{la'} \, v_{a a'}$$
For both the sparse and Blosum encoded SMM-align matrices, the prediction score thus reduces to a simple matrix sum.
The final prediction score for a nonamer peptide is calculated as the average of the sparse and Blosum encoded predictions.
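The scoring scheme can be illustrated with a short sketch. It assumes that the weights are stored as one dictionary per motif position and that an encoding maps each amino acid to a dictionary of 20 values. All names are hypothetical. No Blosum50 values are reproduced here, so any second encoding must be supplied by the caller.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encoding():
    """Each amino acid is a 20-component vector with a single 1."""
    return {a: {b: 1.0 if a == b else 0.0 for b in AMINO_ACIDS}
            for a in AMINO_ACIDS}


def collapse(weights, encoding):
    """Fold a constant encoding into the weights.

    Computes W[l][a] = sum over b of w[l][b] * v[a][b], so that scoring
    becomes a plain matrix sum.
    """
    return [{a: sum(w_l[b] * encoding[a][b] for b in AMINO_ACIDS)
             for a in AMINO_ACIDS} for w_l in weights]


def average_matrices(first, second):
    """Average two collapsed matrices position by position."""
    return [{a: 0.5 * (m1[a] + m2[a]) for a in AMINO_ACIDS}
            for m1, m2 in zip(first, second)]


def score_nonamer(nonamer, matrix):
    """Simple matrix sum over the positions of the binding core."""
    return sum(matrix[l][a] for l, a in enumerate(nonamer))


def score_peptide(peptide, matrix, k=9):
    """Return (best score, start offset of the best-scoring nonamer)."""
    if len(peptide) < k:
        raise ValueError("peptide is shorter than the binding core")
    scored = [(score_nonamer(peptide[i:i + k], matrix), i)
              for i in range(len(peptide) - k + 1)]
    return max(scored, key=lambda pair: pair[0])
```

Because the nonamer score is linear in the matrix, averaging the two collapsed matrices and scoring once gives the same result as averaging the two per-nonamer predictions.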
A Metropolis Monte Carlo (MC) procedure is invoked to search for the optimal weight matrix. Initially, random weights are assigned to the matrix, keeping the sum at each position equal to zero. In each Monte Carlo step, a position is selected at random, and the weights on two amino acids are updated, keeping the sum of the weights equal to zero. The energy function guiding the Monte Carlo search is

$$E = \frac{1}{N} \sum_{i=1}^{N} (s_i - m_i)^2 + \lambda \, \frac{1}{L \cdot A} \sum_{l=1}^{L} \sum_{a=1}^{A} w_{la}^2$$

where $s_i$ is the prediction score, $m_i$ is the measured (log-transformed) binding affinity, $N$ is the number of peptide data, $L$ is the binding motif length, $A$ is the number of different amino acids, $w_{la}$ are the weight matrix elements, and the term weighted by the parameter $\lambda$ is introduced to avoid over-fitting. This term penalizes high weights and thereby forces weights that do not significantly lower the energy function towards small values. In a small-scale cross-validation experiment using only sparse sequence encoding, the parameter $\lambda$ was determined to have an optimal value of 0.02.
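A minimal sketch of such an energy function is given below. It assumes a mean squared error over the N peptides plus a squared-weight penalty normalized by the number of matrix elements (L times A) and scaled by λ. This normalization is an assumption inferred from the symbols defined in the text, and the function name is illustrative.

```python
def energy(scores, measured, weights, lam=0.02):
    """Mean squared error plus a penalty on large weights.

    scores, measured: predicted and measured log-transformed affinities.
    weights: the weight matrix as a list of rows, one row per motif position.
    lam: regularization parameter (0.02 is the value reported in the text).
    """
    if len(scores) != len(measured) or not scores:
        raise ValueError("scores and measured must be non-empty and aligned")
    mse = sum((s - m) ** 2 for s, m in zip(scores, measured)) / len(scores)
    n_weights = sum(len(row) for row in weights)  # L * A for a full matrix
    penalty = sum(w * w for row in weights for w in row) / n_weights
    return mse + lam * penalty
```

With a small λ the fit to the data dominates, and the penalty mainly shrinks weights that contribute little to lowering the error.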
The SMM-align method can readily be extended to include the Gibbs sampler weight matrix, by expanding the space of the SMM weights to include additional weights for the L positions in the Gibbs sampler matrix. In the Monte Carlo search, the number of weights is then 189 for a nonamer binding motif. Only non-negative values are allowed for the weights on the Gibbs sampler matrix. The final weight matrix is determined as the weighted mean of the SMM-align and Gibbs sampler matrices, with relative weights on the Gibbs sampler matrix determined from the Monte Carlo search.
As demonstrated earlier, the Gibbs sampler performance can be greatly improved by restricting the number of allowed amino acids at the P1 position in the binding motif. We adopt a similar approach to improve the performance of the SMM-align method for HLA-DR alleles. In the Monte Carlo step modifying the weights at position P1 in the matrix, hydrophobic amino acids (ILVMFYW) are forced to take only non-negative values, and non-hydrophobic amino acids are forced to take only non-positive values.
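The sign restriction at P1 can be expressed as a simple check on a proposed weight. The sketch below assumes that a move violating the restriction is rejected outright. The text does not state how the restriction is enforced, so this is only one possible reading, and the names are hypothetical.

```python
HYDROPHOBIC_P1 = frozenset("ILVMFYW")


def p1_weight_allowed(amino_acid, new_weight):
    """Check the sign restriction for a weight at motif position P1.

    Hydrophobic residues must keep non-negative weights; all other
    residues must keep non-positive weights.
    """
    if amino_acid in HYDROPHOBIC_P1:
        return new_weight >= 0.0
    return new_weight <= 0.0
```

A Monte Carlo move that changes two P1 weights would then be proposed only if both updated weights pass this check.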
The probability of accepting a move in the Monte Carlo search is determined by the relation

$$P = \min\left(1, e^{-dE/T}\right) \qquad (2)$$

where $dE$ is the difference in energy between the end and start configurations, and $T$ is a scalar.
Each MC search was initiated with random weights. The scalar T was initialized to 0.01 and lowered to 0.000001 in 20 uniform steps. At each value of T, 2500 Monte Carlo moves were performed. The acceptance of a move was determined using Equation 2. The motif length, L, was fixed at nine amino acids.
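The search procedure can be sketched as a simulated-annealing loop. The code below assumes the standard Metropolis acceptance rule min(1, exp(-dE/T)), reads "20 uniform steps" as a linear temperature schedule, and draws each weight change uniformly from a small interval. The step size, the function names and the in-place update are illustrative choices, not details taken from the text.

```python
import math
import random


def temperature_schedule(t_start=0.01, t_end=1e-6, n_steps=20):
    """Linearly spaced temperatures from t_start down to t_end (assumed)."""
    step = (t_start - t_end) / (n_steps - 1)
    return [t_start - i * step for i in range(n_steps)]


def accept(d_energy, temperature, rng):
    """Metropolis criterion: accept with probability min(1, exp(-dE/T))."""
    if d_energy <= 0:
        return True
    return rng.random() < math.exp(-d_energy / temperature)


def anneal(weights, energy_fn, moves_per_temperature=2500,
           step_size=0.1, seed=0):
    """Minimize energy_fn(weights) in place; each row keeps its sum."""
    rng = random.Random(seed)
    current = energy_fn(weights)
    for temperature in temperature_schedule():
        for _ in range(moves_per_temperature):
            # Pick a position and two amino acid columns at random.
            row = weights[rng.randrange(len(weights))]
            a, b = rng.sample(range(len(row)), 2)
            delta = rng.uniform(-step_size, step_size)
            row[a] += delta  # shifting weight from b to a
            row[b] -= delta  # leaves the row sum unchanged
            proposed = energy_fn(weights)
            if accept(proposed - current, temperature, rng):
                current = proposed
            else:
                row[a] -= delta  # undo the rejected move
                row[b] += delta
    return weights, current
```

Repeating `anneal` with different seeds and initial weights gives a set of final matrices from which the best-scoring ones can be retained.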
The configuration-space of the peptide sequences contains many local minima with close to identical energy. In order to achieve an effective sampling of these local minima, the MC calculations were repeated 25 times with different initial weight configurations. For each run the final energy and weight matrix were recorded, and the top 10 scoring matrices were kept in the matrix ensemble.
In the five-fold cross-validated training, the peptides were split into five subsets as described earlier. One of the subsets was left out from the SMM training and used as the evaluation set. The remaining subsets were used to train the weight matrix. In this manner, all peptides will in turn be part of the evaluation set, and the predictive performance can be estimated on all data. This approach lowers the possible effects of over-fitting while keeping the size of the evaluation data set maximal.
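The cross-validation loop can be sketched generically. Here `train_fn` and `predict_fn` are hypothetical callables standing in for the matrix training and scoring steps, and each data item is assumed to be a (peptide, measured value) pair.

```python
def cross_validate(subsets, train_fn, predict_fn):
    """Leave each subset out in turn and predict it with a model
    trained on the remaining subsets.

    Returns a list of (item, prediction) pairs covering every item once.
    """
    predictions = []
    for i, evaluation in enumerate(subsets):
        # Training data: every subset except the one being evaluated.
        training = [item for j, subset in enumerate(subsets) if j != i
                    for item in subset]
        model = train_fn(training)
        predictions.extend((item, predict_fn(model, item))
                           for item in evaluation)
    return predictions
```

Since every peptide is predicted by a model that never saw it during training, performance measures computed on the pooled predictions use all data without reusing training examples for evaluation.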
The SVRMHC predictions were obtained using the default parameter settings of the SVRMHC web server. The server returns pIC50 prediction scores for each nonamer within the query peptide, and the maximum score was assigned as the binding pIC50 prediction value for the query peptide.
The MHCpred predictions were obtained using the default parameter settings of the MHCpred web server. The server returns IC50 prediction scores for each nonamer within the query peptide, and the minimum score was assigned as the binding IC50 prediction value for the query peptide.
The ARB predictions were obtained using the default parameter settings of the ARB web server.
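The per-peptide aggregation used for the SVRMHC and MHCpred outputs amounts to taking the strongest predicted nonamer. The sketch below illustrates this and also includes a conversion between the two scales, assuming that pIC50 is defined as the negative base-10 logarithm of the IC50 in mol/L. That the servers use this definition is an assumption, and the function names are hypothetical.

```python
def peptide_pic50(nonamer_pic50s):
    """Highest pIC50 over all nonamers (higher means stronger binding)."""
    return max(nonamer_pic50s)


def peptide_ic50(nonamer_ic50s):
    """Lowest IC50 over all nonamers (lower means stronger binding)."""
    return min(nonamer_ic50s)


def pic50_to_ic50_nm(pic50):
    """Convert pIC50 = -log10(IC50 in mol/L) to an IC50 value in nM."""
    return 10.0 ** (9.0 - pic50)
```

Both rules select the same kind of nonamer, the one predicted to bind most strongly, because the pIC50 and IC50 scales run in opposite directions.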
MN developed the SMM-align method, designed the MHC class II benchmark, trained the prediction method, and performed the comparison between the different prediction methods. CL prepared the IEDB and SYFPEITHI peptide data sets. All authors read and corrected the manuscript.
This work was supported by NIH contract HHSN266200400083C.
Bui HH, Sidney J, Peters B, Sathiamurthy M, Sinichi A, Purton KA, Mothe BR, Chisari FV, Watkins DI, Sette A: Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications.
Sturniolo T, Bono E, Ding J, Raddrizzani L, Tuereci O, Sahin U, Braxenthaler M, Gallazzi F, Protti MP, Sinigaglia F, Hammer J: Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices.
Toseland CP, Clayton DJ, McSparron H, Hemsley SL, Blythe MJ, Paine K, Doytchinova IA, Guan P, Hattotuwagama CK, Flower DR: AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data.
J Chem Phys 1953, 21:1087-1092.
Peters B, Bui HH, Frankild S, Nielsen M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson SS, Sidney J, Lund O, Buus S, Sette A: A community resource benchmarking predictions of peptide binding to MHC-I molecules.