MHC2SKpan: a novel kernel based approach for pan-specific MHC class II peptide binding prediction

Guo, Linyuan; Luo, Cheng; Zhu, Shanfeng

doi:10.1186/1471-2164-14-S5-S11

Volume 14 Supplement 5

Twelfth International Conference on Bioinformatics (InCoB2013): Computational Biology

Research
Open access
Published: 16 October 2013

BMC Genomics volume 14, Article number: S11 (2013)

Abstract

Background

Computational methods for the prediction of Major Histocompatibility Complex (MHC) class II binding peptides play an important role in facilitating the understanding of immune recognition and the process of epitope discovery. To develop an effective computational method, we need to consider two important characteristics of the problem: (1) the length of binding peptides is highly flexible; and (2) MHC molecules are extremely polymorphic, and for the vast majority of them the available training data are insufficient.

Methods

We develop a novel string kernel method, MHC2SK (MHC-II String Kernel), to measure the similarities among peptides of variable lengths. By taking the distinct features of the MHC-II peptide binding prediction problem into account, MHC2SK differs significantly from the recently developed kernel based method, the GS (Generic String) kernel, in how it computes similarities. Furthermore, we extend MHC2SK to MHC2SKpan for pan-specific MHC-II peptide binding prediction by leveraging the binding data of various MHC molecules.

Results

MHC2SK outperformed GS in allele-specific prediction on a benchmark dataset, which demonstrates the effectiveness of MHC2SK. Furthermore, we evaluated the performance of MHC2SKpan using various benchmark datasets from several perspectives: leave-one-allele-out (LOO), 5-fold cross validation and independent data testing. MHC2SKpan achieved performance comparable to NetMHCIIpan-2.0 and outperformed NetMHCIIpan-1.0, TEPITOPEpan and MultiRTA with statistical significance. MHC2SKpan can be freely accessed at http://datamining-iip.fudan.edu.cn/service/MHC2SKpan/index.html.

Background

Binding of antigenic peptides to major histocompatibility complex (MHC) molecules is a core step in the adaptive (specific) immune response. There are two major categories of MHC molecules: class I MHC (MHC-I) molecules and class II MHC (MHC-II) molecules. In contrast to MHC-I molecules, which mainly recognize peptides from intracellular antigens, MHC-II molecules are mainly responsible for binding peptides from extracellular antigens. These binding peptides are then presented on cell surfaces to the receptors of T helper (Th) cells, by which the adaptive immune system recognizes the antigen and starts specific responses, such as activating B cells to secrete antibodies that neutralize the pathogen [1]. Therefore, the accurate prediction of MHC binding peptides is important for understanding the mechanism of immune recognition and facilitating the process of epitope based vaccine design [2]. With the advantages of low financial cost and rapid deployment, computational methods have become increasingly important. They have already been used to select a small number of promising candidate epitopes that are further verified by biochemical experiments [3].

Although many computational methods have been developed to predict MHC class II binding peptides in the last few years [4–15], recent experimental results on benchmark datasets show that the performance of these methods needs to be improved [16–18]. Two distinct characteristics make the MHC-II peptide binding prediction problem very difficult. Firstly, the binding groove of MHC class II molecules is open at both ends. This results in a large variation in the length of binding peptides (usually 11-20 amino acids) [19]. Several computational methods, such as TEPITOPE [9], SMM-align [4] and NN-align [5], try to locate the binding core of a peptide in the modeling process; the core is a nonamer sitting in the binding groove of the MHC molecule. However, the identified core may not be accurate, and important sequence information outside the core may be lost. Secondly, MHC molecules are extremely polymorphic, with a few thousand allelic variants. By October 2012, IMGT/HLA had accumulated more than 1800 HLA (human leukocyte antigen, the name of MHC in humans) class II allelic variants [20]. Many earlier computational methods, such as SMM-align and NN-align, are allele-specific: they use the binding data of the target MHC molecule to train a model that predicts its binding specificity. However, the vast majority of MHC-II molecules do not have sufficient binding data to train a reliable prediction model. In fact, fewer than 35 HLA class II molecules have several hundred peptides with binding affinities in IEDB [21]. To address this problem, pan-specific approaches have recently been proposed to make predictions for any allele with a known protein sequence [18]. The basic idea of pan-specific methods is to identify the relationships among MHC alleles so that the binding preferences of target MHC molecules can be captured.

MULTIPRED is the first pan-specific predictor for HLA-I [22]. It trains a supertype-specific model by incorporating the binding data of the same supertype, a set of MHC molecules with similar peptide binding preferences [23]. Our previous work has shown that incorporating binding data of MHC-I molecules in the same supertype can alleviate the scarcity of binding data and improve the prediction accuracy [24]. Moreover, in the last few years, several pan-specific methods have been developed for predicting the binding specificity of MHC-II molecules based on different principles [9–15], such as position specific scoring matrices (PSSMs), artificial neural networks (ANNs) and kernel based methods. TEPITOPE [9] and TEPITOPEpan [15] are two PSSM based methods. TEPITOPE is a pioneering MHC-II pan-specific predictor, with the limitation of covering only 51 out of more than 1000 HLA-DR alleles. To overcome this limitation, we have developed TEPITOPEpan, which covers all possible HLA-DR alleles. Its main idea is to extrapolate the binding preferences of the 51 HLA-DR molecules covered by TEPITOPE to all uncharacterized HLA-DR molecules. Both NetMHCIIpan-1.0 [10] and NetMHCIIpan-2.0 [11] are ANN based methods. Both versions use an ensemble of ANNs with different network structures and initialization parameters; the main difference between them is the way the binding core is determined. MultiRTA [14] is based on a regularized thermodynamic model and considers all possible binding core configurations. MHCIIMulti [12] is a kernel based method that uses a multi-instance technique for measuring the similarity between peptides. According to several recent benchmark studies, NetMHCIIpan-2.0 performed the best overall, whereas TEPITOPE and TEPITOPEpan were good at identifying the binding core and achieved good accuracy in recognizing T-cell epitopes as well as HLA ligands [15, 18].

Compared with feature vector based methods, kernel-based methods can deal with the flexibility of peptide lengths more naturally. With carefully designed kernels, these methods can perform very well without undertaking the complicated tasks of feature extraction and selection [25]. Most recently, Giguère et al. developed a generic string (GS) kernel for learning peptide-protein binding affinity [26], and the GS kernel has achieved good prediction accuracy in several applications, such as peptide-protein binding prediction on data from the PepX database, MHC-II binding prediction and quantitative structure affinity prediction. The similarity between two peptides defined by GS is a sum of similarity scores over substring comparisons. Because GS was designed for the general problem of peptide-protein binding prediction, it does not take into consideration some distinct features of MHC-II binding peptides. Firstly, GS considers very short substrings of even one or two amino acids when computing similarity, and whether long substrings are considered at all depends on its parameter L. However, a short substring pattern is less significant and may introduce noise, while long substring patterns should be favored. Secondly, GS penalizes the similarity of two substrings if their starting positions in the two peptides are different. This kind of penalization is unreasonable for MHC-II binding peptides. For example, it is common for the binding cores of two peptides to start at different positions. The similarity between these two binding cores computed by GS would be very low because of the penalty, even if the cores are identical. To overcome these drawbacks of GS, we propose a new string kernel for MHC-II, MHC2SK, which emphasizes the long substrings of peptides and accommodates the variation of peptide lengths.

MHC2SK outperformed GS in the allele-specific prediction task on a benchmark dataset, which demonstrates the effectiveness of MHC2SK. Furthermore, we extended MHC2SK to MHC2SKpan for pan-specific MHC-II peptide binding prediction by leveraging the binding data of various MHC molecules. We evaluated the performance of MHC2SKpan on three benchmark datasets from several perspectives: leave-one-allele-out (LOO), 5-fold cross validation and independent data testing. MHC2SKpan achieved performance comparable to NetMHCIIpan-2.0 and outperformed TEPITOPEpan, NetMHCIIpan-1.0 and MultiRTA with statistical significance.

Materials and methods

Data

We used four benchmark datasets, NielsenSet1, NielsenSet2, NielsenSet3 and EpanSet4, to evaluate the performance of different MHC-II peptide binding prediction methods. Specifically, NielsenSet1 was used for comparing the performance of MHC2SK with a kernel based allele-specific method, GS. The remaining three were used for comparing the performance of MHC2SKpan with four other well-known pan-specific predictors: NetMHCIIpan-2.0, NetMHCIIpan-1.0, TEPITOPEpan and MultiRTA.

NielsenSet1 consists of 4603 peptides covering 14 HLA-DR molecules. It was originally used for developing the SMM-align method [4]. NielsenSet2 was obtained from [10], and it is composed of 14607 peptides associated with 14 HLA-DR molecules. NielsenSet3 was taken from [11], and it consists of 33931 peptides covering 24 HLA-DR molecules. EpanSet4 was from [15] and was composed of 2412 peptides covering 14 HLA-DR molecules. These 14 molecules are neither in NielsenSet1, nor in NielsenSet2, with only two of them appearing in NielsenSet3. This is why the dataset was originally used for evaluating the performance of different pan-specific methods on novel MHC molecules [15].

Method

In this section, we briefly describe several string kernels related to our work. After presenting the notations, we first introduce Spectrum RBF string kernel (SRBF), which is closely related to GS and MHC2SK. After that, we describe GS and our newly developed MHC2SK kernel. Finally, we extend MHC2SK to MHC2SKpan for pan-specific MHC-II binding prediction.

Notation

Let Σ be the alphabet of amino acids. For each amino acid a ∈ Σ we define an encoding function $φ : Σ \to ℝ^{d}$, where φ(a) = (φ₁(a), φ₂(a), ..., φ_d(a)) is a vector and φ_i(a) represents one of the d properties of the amino acid a. In the experiments we use the widely used Blosum62 matrix [27] to define the encoding function φ. In the following subsections, s and s' denote two amino acid chains of length |s| and |s'|, respectively. Similarly, y and y' denote two peptides; $y_{i \to i+l-1}$ is the substring of y of length l that starts at position i and ends at position i + l - 1, and $y'_{j \to j+l-1}$ is the substring of y' of length l that starts at position j and ends at position j + l - 1. Finally, x and x' denote two MHC molecules (or their pseudo sequence representations).

Spectrum RBF string kernel (SRBF)

The spectrum RBF string kernel was proposed by Toussaint et al. [28] for MHC-I peptide binding prediction. As the spectrum RBF string kernel is directly related to GS and MHC2SK, we review it briefly here. For two sequences s and s' of equal length under a certain encoding scheme, such as Blosum62, we can compute their similarity using the RBF kernel

$$K_{l,σ_c}^{φ}(s, s') = \exp\left(-\frac{\sum_{i=1}^{l} \lVert φ(s_i) - φ(s'_i) \rVert^{2}}{2σ_c^{2}}\right)$$

(1)

where |s| = |s'| = l and s_i denotes the i-th amino acid in sequence s. Similar to the spectrum kernel [29], the similarity between two peptides y and y' of different lengths can be computed by comparing all of their substrings of length l. According to [28], SRBF can be computed as follows

$$K_{SRBF}(y, y', l, σ_c) ≜ \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} K_{l,σ_c}^{φ}(y_{i \to i+l-1}, y'_{j \to j+l-1}) = \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} \exp\left(-\frac{\sum_{k=0}^{l-1} \lVert φ(y_{i+k}) - φ(y'_{j+k}) \rVert^{2}}{2σ_c^{2}}\right)$$

(2)

where $y_{i+k}$ denotes the (i + k)-th amino acid in the sequence y. It is worth noting that, when computing the similarity between y and y', $K_{SRBF}$ only compares their substrings of a fixed length l, which may ignore important information about what y and y' have in common.
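Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: it assumes a one-hot encoding `phi` as a simple stand-in for the Blosum62 encoding used in the paper, and the names `rbf_kernel` and `srbf_kernel` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def phi(aa):
    """One-hot encoding of an amino acid (assumed stand-in for Blosum62 rows)."""
    return [1.0 if aa == b else 0.0 for b in AMINO_ACIDS]


def rbf_kernel(s, t, sigma_c):
    """Equation (1): RBF kernel between two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    # Sum over positions of the squared distance between encodings.
    sq_dist = sum(
        (u - v) ** 2
        for a, b in zip(s, t)
        for u, v in zip(phi(a), phi(b))
    )
    return math.exp(-sq_dist / (2.0 * sigma_c ** 2))


def srbf_kernel(y, y2, l, sigma_c):
    """Equation (2): sum of RBF kernels over all pairs of length-l substrings."""
    total = 0.0
    for i in range(len(y) - l + 1):
        for j in range(len(y2) - l + 1):
            total += rbf_kernel(y[i:i + l], y2[j:j + l], sigma_c)
    return total
```

Under the one-hot assumption every mismatched position adds 2 to the squared distance, so identical substrings contribute exactly 1 to the sum. If either peptide is shorter than l, no substring pair exists and the kernel is 0.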

Generic String kernel (GS)

GS was proposed by Giguère et al. as a general kernel for learning peptide-protein binding [26]. It can be formulated as follows:

$$\begin{aligned} K_{GS}(y, y', L, σ_p, σ_c) &≜ \sum_{l=1}^{L} \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} \exp\left(\frac{-(i-j)^{2}}{2σ_p^{2}}\right) K_{l,σ_c}^{φ}(y_{i \to i+l-1}, y'_{j \to j+l-1}) \\ &= \sum_{l=1}^{L} \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} \exp\left(\frac{-(i-j)^{2}}{2σ_p^{2}}\right) \exp\left(-\frac{\sum_{k=0}^{l-1} \lVert φ(y_{i+k}) - φ(y'_{j+k}) \rVert^{2}}{2σ_c^{2}}\right) \end{aligned}$$

(3)

where L ≥ 1 is the maximum length of the substrings under comparison, and σ_p is the parameter for penalizing the similarity of substrings $y_{i \to i+l-1}$ and $y'_{j \to j+l-1}$ that start at different positions i and j, respectively. From this, we can see that GS is a weighted combination of many SRBFs that take into account substrings of different lengths. However, considering the distinct features of MHC-II binding prediction, the penalization is unreasonable, and the additional parameter σ_p also increases the training time significantly. In addition, GS includes SRBFs over very short substrings, down to a single amino acid (l = 1 in equation (3)). Such short patterns are less significant and may introduce noise into the similarity computation.
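The role of the position penalty in equation (3) can be made concrete with a minimal sketch. It again assumes a one-hot encoding in place of Blosum62, and `position_weight` and `gs_kernel` are hypothetical names rather than the API of the original GS software.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def phi(aa):
    """One-hot encoding (assumed stand-in for the Blosum62 encoding)."""
    return [1.0 if aa == b else 0.0 for b in AMINO_ACIDS]


def rbf_kernel(s, t, sigma_c):
    """Equation (1) for two equal-length substrings."""
    sq_dist = sum(
        (u - v) ** 2
        for a, b in zip(s, t)
        for u, v in zip(phi(a), phi(b))
    )
    return math.exp(-sq_dist / (2.0 * sigma_c ** 2))


def position_weight(i, j, sigma_p):
    """Penalty term of equation (3) for substrings starting at i and j."""
    return math.exp(-((i - j) ** 2) / (2.0 * sigma_p ** 2))


def gs_kernel(y, y2, L, sigma_p, sigma_c):
    """Equation (3): position-weighted sum of SRBF terms for l = 1..L."""
    total = 0.0
    for l in range(1, L + 1):
        for i in range(len(y) - l + 1):
            for j in range(len(y2) - l + 1):
                total += position_weight(i, j, sigma_p) * rbf_kernel(
                    y[i:i + l], y2[j:j + l], sigma_c
                )
    return total
```

With σ_p = 1, the term that compares two identical cores whose starting positions differ by 2 is scaled by `position_weight(0, 2, 1.0)`, that is exp(-2) ≈ 0.135, which is the effect criticized above. As σ_p grows, every weight approaches 1 and the kernel approaches an unweighted sum of SRBFs over l = 1..L.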

MHC-II String Kernel (MHC2SK)

Considering the distinct features of MHC-II binding prediction, we design a novel kernel, MHC2SK, as follows

$$\begin{aligned} K_{MHC2SK}(y, y', L', σ_c) &≜ \sum_{l=L'}^{\min(|y|, |y'|)} \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} K_{l,σ_c}^{φ}(y_{i \to i+l-1}, y'_{j \to j+l-1}) \\ &= \sum_{l=L'}^{\min(|y|, |y'|)} \sum_{i=1}^{|y|-l+1} \sum_{j=1}^{|y'|-l+1} \exp\left(-\frac{\sum_{k=0}^{l-1} \lVert φ(y_{i+k}) - φ(y'_{j+k}) \rVert^{2}}{2σ_c^{2}}\right) \end{aligned}$$

(4)

There are two main differences between MHC2SK and GS. Firstly, MHC2SK removes the penalty term $\exp\left(\frac{-(i-j)^{2}}{2σ_p^{2}}\right)$ from the similarity computation. Omitting the parameter σ_p also reduces the training cost significantly. Secondly, MHC2SK places more emphasis on longer substring patterns when computing similarity. L' is the parameter for the minimum length of the substring patterns considered in MHC2SK, while the maximum length is the largest possible length, min(|y|, |y'|). In contrast, the minimum length of substring patterns in GS is 1, and the maximum length is determined by L. MHC2SK is a sum of SRBFs over different substring lengths, and a sum of positive semi-definite kernels is itself positive semi-definite, so MHC2SK is also positive semi-definite.
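Equation (4) can be sketched as follows. This is a minimal illustration under the same assumption as before (a one-hot encoding standing in for Blosum62); `mhc2sk_kernel` and its argument names are hypothetical, and the peptide strings in the example are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def phi(aa):
    """One-hot encoding (assumed stand-in for the Blosum62 encoding)."""
    return [1.0 if aa == b else 0.0 for b in AMINO_ACIDS]


def rbf_kernel(s, t, sigma_c):
    """Equation (1) for two equal-length substrings."""
    sq_dist = sum(
        (u - v) ** 2
        for a, b in zip(s, t)
        for u, v in zip(phi(a), phi(b))
    )
    return math.exp(-sq_dist / (2.0 * sigma_c ** 2))


def mhc2sk_kernel(y, y2, l_min, sigma_c):
    """Equation (4): unweighted SRBF terms for l = L'..min(|y|, |y'|)."""
    total = 0.0
    for l in range(l_min, min(len(y), len(y2)) + 1):
        for i in range(len(y) - l + 1):
            for j in range(len(y2) - l + 1):
                # No position penalty: offsets i and j are not compared.
                total += rbf_kernel(y[i:i + l], y2[j:j + l], sigma_c)
    return total


# Two invented peptides sharing the same 9-mer at different offsets.
core = "FKGEQGPKG"
shifted = mhc2sk_kernel("AA" + core, core + "AA", 9, 1.0)
```

Because there is no position penalty, the comparison of the shared 9-mer contributes exactly 1 to `shifted` regardless of where the core starts in each peptide. Raising `l_min` drops the short-substring terms that GS always includes.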

MHC-II String Kernel for pan-specific prediction (MHC2SKpan)

To train a pan-specific model for any allele with a known protein sequence, we follow a strategy similar to that of KISS [30] and define the pairwise kernel on allele-peptide pairs (x, y) as the product of an allele kernel and a peptide kernel.

$$K((x, y), (x', y')) ≜ K_{allele}(x, x') \cdot K_{peptide}(y, y')$$

(5)

For the peptide kernel, we use the MHC2SK kernel. For the HLA allele representation, we apply the pseudo sequence proposed by Nielsen et al. [10]. The pseudo sequence is composed of 21 polymorphic amino acid positions in potential contact with the binding peptide. Since all the allele pseudo sequences are of equal length, we use the RBF kernel (equation (1)) as the allele kernel. We can then extend MHC2SK to MHC2SKpan for pan-specific prediction as follows:

$$K_{MHC2SKpan}((x, y), (x', y')) ≜ K_{allele}(x, x') \cdot K_{peptide}(y, y') = K_{|x|,σ_a}^{φ}(x, x') \cdot K_{MHC2SK}(y, y', L', σ_c)$$

(6)

where |x| = |x'| is the length of HLA pseudo sequence (21 in our case).
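The product kernel of equation (6) can be sketched as below. This is a hedged illustration, not the published implementation: the one-hot `phi` stands in for Blosum62, the function names are hypothetical, and the 21-residue pseudo sequences `X1` and `X2` are invented placeholders rather than real HLA-DR pseudo sequences.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def phi(aa):
    """One-hot encoding (assumed stand-in for the Blosum62 encoding)."""
    return [1.0 if aa == b else 0.0 for b in AMINO_ACIDS]


def rbf_kernel(s, t, sigma):
    """Equation (1) for two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    sq_dist = sum(
        (u - v) ** 2
        for a, b in zip(s, t)
        for u, v in zip(phi(a), phi(b))
    )
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mhc2sk_kernel(y, y2, l_min, sigma_c):
    """Equation (4): peptide kernel."""
    total = 0.0
    for l in range(l_min, min(len(y), len(y2)) + 1):
        for i in range(len(y) - l + 1):
            for j in range(len(y2) - l + 1):
                total += rbf_kernel(y[i:i + l], y2[j:j + l], sigma_c)
    return total


def mhc2skpan_kernel(pair, pair2, l_min, sigma_c, sigma_a):
    """Equation (6): allele RBF kernel times MHC2SK peptide kernel."""
    (x, y), (x2, y2) = pair, pair2
    return rbf_kernel(x, x2, sigma_a) * mhc2sk_kernel(y, y2, l_min, sigma_c)


# Invented 21-residue pseudo sequences differing at the last position only.
X1 = "ACDEFGHIKLMNPQRSTVWYA"
X2 = "ACDEFGHIKLMNPQRSTVWYC"
```

For two peptides measured on the same allele the allele kernel equals 1, so the pan-specific kernel reduces to MHC2SK. For different alleles the peptide similarity is scaled down according to how far apart the pseudo sequences are, with σ_a controlling how quickly that happens.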

Results and discussion

Experimental procedure and evaluation metrics

The prediction model was learned by the support vector regression (SVR) algorithm. We used the libsvm tool [31] and its SVR implementation with customized kernels, which were computed by the methods described in the previous section. The libsvm tool can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Two standard metrics, the area under the ROC curve (AUC) and the Pearson correlation coefficient (PCC), were used to evaluate the performance of different prediction methods. In addition, to assess whether the performance difference between two predictors is statistically significant, we used a one-tailed per-allele binomial test.
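The two evaluation metrics named above can be computed with a short, dependency-free sketch. The function names `auc` and `pcc` are hypothetical, and the pairwise AUC formulation with ties counted as one half is a standard choice assumed here, since the paper does not describe its own metric code.

```python
import math


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of binders and non-binders."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def pcc(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)
```

AUC needs only binary binder labels and a ranking of peptides, whereas PCC compares predicted and measured binding values directly.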

For NielsenSet1, NielsenSet2 and NielsenSet3, following the studies that presented these data [4, 10, 11], a peptide with a binding affinity of less than 500 nM was deemed a binder. For EpanSet4, binding affinities are not available, so we used the binary labels in the dataset directly. Similar to several previous studies [4, 10], for computing PCC, the binding value was obtained as 1 - log(IC50)/log(50,000), where IC50 is the binding affinity measured in nM.

We first compared the performance of GS and MHC2SK on NielsenSet1 by 5-fold cross validation. As SRBF is closely related to GS and MHC2SK, we also implemented SRBF as a baseline. We then compared the performance of MHC2SKpan with several well-known pan-specific methods, using leave-one-allele-out (LOO) on NielsenSet2 and 5-fold cross validation on NielsenSet3. Finally, we examined the performance of MHC2SKpan and other pan-specific methods on an independent test set, EpanSet4. These experiments have different focuses. The main purpose of LOO is to examine the generalization ability of pan-specific methods on novel alleles. The main purpose of the 5-fold cross validation is to examine the performance of pan-specific methods when binding data of both the target allele and other alleles are available. The main purpose of the independent test is to examine the performance of pan-specific methods on test data from a different source.

For all the experiments, we used grid search to select the kernel parameters. For the GS kernel, we used the following ranges: σ_p ∈ (0, 15], σ_c ∈ (0, 5] and L ∈ [1, 20]. For the MHC2SK kernel, we used the following ranges: σ_c ∈ (0, 5] and L' ∈ [1, 9]. Compared with MHC2SK, MHC2SKpan has an additional parameter σ_a, which was searched in (0, 15]. For the SRBF kernel, we used the following ranges: σ_c ∈ (0, 5] and l ∈ [1, 9].
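The binder threshold and the affinity transform described above can be written out as a small sketch. The function and constant names are hypothetical, the natural logarithm is used (the ratio does not depend on the base), and no clipping to [0, 1] is applied because the text does not mention any.

```python
import math

BINDER_THRESHOLD_NM = 500.0   # peptides below this IC50 are treated as binders
MAX_IC50_NM = 50000.0         # normalizing constant in the transform


def affinity_to_target(ic50_nm):
    """Map an IC50 value in nM to the regression target 1 - log(IC50)/log(50,000)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 1.0 - math.log(ic50_nm) / math.log(MAX_IC50_NM)


def is_binder(ic50_nm):
    """Binary label used for AUC: binder if IC50 is below 500 nM."""
    return ic50_nm < BINDER_THRESHOLD_NM
```

Under this transform an IC50 of 1 nM maps to 1, 50,000 nM maps to 0, and the 500 nM binder threshold corresponds to a target value of about 0.426.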

Evaluation by NielsenSet1

Table 1 shows the performance comparison of MHC2SK, GS and SRBF on NielsenSet1 using 5-fold cross validation. We obtained the 5-fold partition of the data from the original study [4]. As in [4], in each round, 4 folds are used for training the model and tuning the parameters according to AUC. The best parameters on the training data are used to build the model and make predictions on the test data. As illustrated in Table 1, MHC2SK achieved the best performance in both AUC and PCC. For example, MHC2SK achieved the highest average PCC of 0.450, followed by SRBF (0.419) and GS (0.411). Specifically, MHC2SK outperformed GS in 12 and SRBF in 11 out of all 14 alleles. Both differences are statistically significant (binomial test, p-value < 0.05). In addition, MHC2SK obtained the highest average AUC (0.747), followed by GS (0.727) and SRBF (0.718). Specifically, MHC2SK outperformed SRBF in 11 out of all 14 alleles, which is statistically significant (binomial test, p-value < 0.05), and GS in 9 out of all 14 alleles. From the experimental results, we can clearly see that MHC2SK performed best among the three kernel based methods.
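The per-allele binomial test used for the win counts above can be reproduced with a minimal sketch. It assumes a null hypothesis in which each method is equally likely to win on any allele (success probability 0.5) and no ties; the paper does not state how ties were handled, and the function name is hypothetical.

```python
from math import comb


def one_tailed_binomial_p(wins, n_alleles, p_null=0.5):
    """P(X >= wins) for X ~ Binomial(n_alleles, p_null)."""
    if not 0 <= wins <= n_alleles:
        raise ValueError("wins must be between 0 and n_alleles")
    return sum(
        comb(n_alleles, k) * p_null ** k * (1.0 - p_null) ** (n_alleles - k)
        for k in range(wins, n_alleles + 1)
    )


# Win counts reported for NielsenSet1 (14 alleles).
p_12_of_14 = one_tailed_binomial_p(12, 14)
p_11_of_14 = one_tailed_binomial_p(11, 14)
p_9_of_14 = one_tailed_binomial_p(9, 14)
```

Under these assumptions, winning on 12 of 14 alleles gives p ≈ 0.0065 and 11 of 14 gives p ≈ 0.029, both below 0.05, while 9 of 14 gives p ≈ 0.21, which matches the pattern of significance statements reported for Table 1.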

Table 1 Five-fold cross validation performance of MHC2SK method compared to GS and SRBF methods on NielsenSet1. For each allele, we display the largest value in boldface.

Evaluation by NielsenSet2

Table 2 presents the results of MHC2SKpan and four other well-known predictors, MultiRTA, TEPITOPEpan, NetMHCIIpan-2.0 and NetMHCIIpan-1.0, on NielsenSet2. As TEPITOPEpan does not need any training data, we ran TEPITOPEpan directly on NielsenSet2 to get its prediction results [15]. For all other models, the experimental results were obtained by LOO, where the model was trained on the binding peptides of 13 alleles and then used to make predictions for the one allele left out for testing [10, 11]. The results of MultiRTA, NetMHCIIpan-2.0 and NetMHCIIpan-1.0 were taken from [11, 14]. For MHC2SKpan, we learned the model using the parameters that achieved the best average AUC per allele on the training data, and made predictions on the test allele. The experimental results show that NetMHCIIpan-2.0 and MHC2SKpan are the two best prediction methods, with very close performance. For example, NetMHCIIpan-2.0 achieved the highest average PCC of 0.606, closely followed by MHC2SKpan (0.605), and then NetMHCIIpan-1.0 (0.541), MultiRTA (0.531) and TEPITOPEpan (0.404). Specifically, MHC2SKpan outperformed NetMHCIIpan-2.0 in 8, NetMHCIIpan-1.0 in 13, MultiRTA in 12 and TEPITOPEpan in 14 out of all 14 alleles, with the last three differences being statistically significant (binomial test, p-value < 0.05). Similar results were obtained in terms of AUC. NetMHCIIpan-2.0 obtained the largest average AUC of 0.799, closely followed by MHC2SKpan (0.795), and then MultiRTA (0.773), NetMHCIIpan-1.0 (0.767) and TEPITOPEpan (0.710). Specifically, MHC2SKpan outperformed NetMHCIIpan-2.0 in 6, MultiRTA in 11, NetMHCIIpan-1.0 in 12 and TEPITOPEpan in 13 out of all 14 alleles. The last three differences are statistically significant (binomial test, p-value < 0.05). Overall, MHC2SKpan significantly outperformed NetMHCIIpan-1.0, MultiRTA and TEPITOPEpan, and achieved performance comparable to the state-of-the-art predictor, NetMHCIIpan-2.0.
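The leave-one-allele-out protocol described above amounts to grouping the data by allele and holding out one group at a time. The sketch below illustrates only the splitting step; the record layout, the function name and the toy records (allele names with invented peptides and target values) are hypothetical and are not taken from NielsenSet2.

```python
from collections import defaultdict


def leave_one_allele_out(records):
    """Yield (held_out_allele, train, test) for records of (allele, peptide, target)."""
    by_allele = defaultdict(list)
    for rec in records:
        by_allele[rec[0]].append(rec)
    for held_out in sorted(by_allele):
        # Train on every allele except the held-out one.
        train = [
            rec
            for allele, recs in by_allele.items()
            if allele != held_out
            for rec in recs
        ]
        test = list(by_allele[held_out])
        yield held_out, train, test


# Toy records with invented peptides and target values.
toy_records = [
    ("DRB1*01:01", "AAYSDQATPLLLSPR", 0.61),
    ("DRB1*01:01", "GELIGILNAAKVPAD", 0.12),
    ("DRB1*03:01", "PKYVKQNTLKLATGM", 0.48),
    ("DRB1*04:01", "YDKFLANVSTVLTGK", 0.73),
]
folds = list(leave_one_allele_out(toy_records))
```

With 14 alleles this produces 14 rounds, each training on the peptides of 13 alleles and testing on the remaining one, so the test allele contributes no binding data to its own model.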

Table 2 LOO benchmark comparison of MHC2SKpan with four well-known pan-specific methods on NielsenSet2. MRTA, Tepan, Pan1.0, Pan2.0 and MKpan are the abbreviations for MultiRTA, TEPITOPEpan, NetMHCIIpan-1.0, NetMHCIIpan-2.0 and MHC2SKpan, respectively. For each allele, we display the largest value in boldface.

Evaluation by NielsenSet3

Table 3 compares the performance of MHC2SKpan with TEPITOPEpan and NetMHCIIpan-2.0 on NielsenSet3 using 5-fold cross validation. The partition of the data and the experimental results of NetMHCIIpan-2.0 are from the original paper [11]. As NetMHCIIpan-1.0 and MultiRTA were not trained on NielsenSet3 using 5-fold cross validation, we could not report their results in Table 3. We ran TEPITOPEpan directly on NielsenSet3 to get its prediction results [15]. The 5-fold cross validation results show again that MHC2SKpan achieved performance comparable to NetMHCIIpan-2.0. Since TEPITOPEpan cannot take advantage of the available training data, it did not perform very well. For example, NetMHCIIpan-2.0 achieved an average AUC of 0.846 and MHC2SKpan achieved an average AUC of 0.843, followed by TEPITOPEpan (0.738). Specifically, MHC2SKpan outperformed NetMHCIIpan-2.0 in 11 and TEPITOPEpan in 23 out of 24 alleles. The latter difference is statistically significant (binomial test, p-value < 0.01).

Table 3 Five-fold cross validation comparison of MHC2SKpan and NetMHCIIpan-2.0 on NielsenSet3. For each allele, we display the largest value in boldface.

Evaluation by EpanSet4

Table 4 compares the performance of MHC2SKpan and four other pan-specific methods on an independent test set, EpanSet4. Note that 12 out of all 14 alleles are not in any of NielsenSet1, NielsenSet2 and NielsenSet3, which makes EpanSet4 a good benchmark dataset for examining the performance of pan-specific models on novel alleles. MHC2SKpan was trained on NielsenSet3 using LOO, and the results of the other pan-specific methods are from the original paper [15]. The experimental results show that MHC2SKpan performed best among all five pan-specific methods. MHC2SKpan obtained the largest average AUC (0.734), followed by NetMHCIIpan-2.0 (0.732), TEPITOPEpan (0.712), NetMHCIIpan-1.0 (0.701) and MultiRTA (0.677). MHC2SKpan outperformed both NetMHCIIpan-2.0 and NetMHCIIpan-1.0 in 9, and MultiRTA in 11 out of all 14 alleles. If we exclude the two molecules that appear in NielsenSet3 (DRB1*12:01 and DRB1*03:02), MHC2SKpan still has a clear advantage over the other pan-specific methods. In this case, MHC2SKpan obtained the largest average AUC of 0.730, followed by NetMHCIIpan-2.0 (0.722), TEPITOPEpan (0.707), NetMHCIIpan-1.0 (0.693) and MultiRTA (0.672).

Table 4 The AUC performance comparison of MHC2SKpan with MultiRTA, TEPITOPEpan, NetMHCIIpan-1.0 and NetMHCIIpan-2.0 on EpanSet4. For each allele, we display the largest value in boldface. The last row is the average result after excluding the two alleles that appear in NielsenSet3, DRB1*03:02 and DRB1*12:01.

In this experiment, MHC2SKpan used the same set of parameters to predict the binding specificities of all novel alleles. The parameters were estimated from the training data NielsenSet3 using LOO, and they might not be a good configuration for a particular novel allele. The parameter σ_a of MHC2SKpan controls how similarities among different MHC molecules are measured. A large σ_a incorporates the binding data of more MHC molecules into the training process, which may bring in some unrelated MHC molecules. On the other hand, a small σ_a incorporates the binding data of only a small number of MHC molecules into the training process, which may omit some related MHC molecules. Ideally, a suitable σ_a should be used for each target MHC molecule. To examine the effect of σ_a, we further checked the performance of MHC2SKpan on 4 DRB alleles in EpanSet4: DRB1*12:01, DRB3*02:02, DRB1*13:01 and DRB1*03:02. These four alleles were chosen because (1) they have a large number of binding data (DRB1*12:01, DRB3*02:02 and DRB1*13:01); or (2) they appear in NielsenSet3 (DRB1*03:02 and DRB1*12:01). Figure 1 shows the change of AUC on these 4 alleles with respect to the variation of σ_a. Here σ_a ranges from 0.5 to 15 with an interval of 0.5. σ_a = 6.5 is the parameter learned from NielsenSet3 and used to generate Table 4. We can see that it is actually not a good setting for these alleles, especially for DRB3*02:02. Specifically, for DRB3*02:02, the best AUC is 0.808 with σ_a = 2, which is much higher than its performance (0.789) under the default setting. Another interesting finding is that, for DRB1*03:02, the performance actually improves with a large σ_a. This may suggest that more binding data from other alleles are helpful for DRB1*03:02. All of this indicates that the performance of MHC2SKpan could be further improved if the parameters were customized for the target MHC molecules.
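The effect of σ_a discussed above can be illustrated by sweeping the same grid (0.5 to 15 in steps of 0.5) over the allele RBF kernel. The sketch below is a simplified illustration: it assumes a one-hot encoding, so that each mismatched pseudo sequence position adds exactly 2 to the squared distance, whereas the distances under Blosum62 would differ; the function name and the mismatch count are hypothetical.

```python
import math


def allele_similarity(n_mismatches, sigma_a):
    """Allele RBF kernel value for pseudo sequences differing at n positions.

    Assumes a one-hot encoding, where each mismatch contributes 2 to the
    squared distance in equation (1).
    """
    return math.exp(-2.0 * n_mismatches / (2.0 * sigma_a ** 2))


# Grid used for the sigma_a sweep: 0.5, 1.0, ..., 15.0.
sigma_grid = [0.5 * k for k in range(1, 31)]

# Weight given to training peptides from an allele that differs from the
# target at 5 of the 21 pseudo sequence positions (an invented example).
weights = [allele_similarity(5, sigma_a) for sigma_a in sigma_grid]
```

The weights rise monotonically towards 1 as σ_a grows, so a large σ_a lets distant alleles contribute almost as much as the target allele itself, while a small σ_a effectively restricts training to near-identical pseudo sequences.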

Discussion

Both GS and MHC2SK have their roots in SRBF, which only considers substrings of a fixed length for computing similarities. By considering the characteristics of MHC-II peptide binding prediction, MHC2SK explicitly incorporates two important features into the kernel design: (1) an emphasis on long substrings and (2) the great variation of peptide lengths. In contrast, without this domain knowledge, GS has to tune an additional parameter σ_p, which increases the training cost heavily. It may also lead to unsatisfactory results because of the scarcity of and noise in the training data. The experimental results on NielsenSet1 clearly demonstrate the advantage of MHC2SK over GS and SRBF. More generally, incorporating domain knowledge into model design is becoming increasingly important for achieving good prediction accuracy in bioinformatics [32].

Furthermore, we extended MHC2SK to MHC2SKpan for pan-specific MHC-II binding prediction. The performance of MHC2SKpan and four other well-known pan-specific methods has been extensively evaluated on three benchmark datasets by LOO, cross validation and independent testing. MHC2SKpan achieved good performance in all these experiments. Specifically, the LOO results on NielsenSet2 show that MHC2SKpan significantly outperformed NetMHCIIpan-1.0, TEPITOPEpan and MultiRTA. MHC2SKpan achieved performance comparable to the state-of-the-art model, NetMHCIIpan-2.0, in both LOO on NielsenSet2 and 5-fold cross validation on NielsenSet3. Moreover, MHC2SKpan was the best method in the independent test on EpanSet4. The experimental results also suggest that MHC2SKpan can achieve better prediction results if the parameters are customized for the target MHC molecules. Additionally, in contrast to NetMHCIIpan-2.0, which uses ensemble techniques, MHC2SKpan is a single model. The performance of MHC2SKpan could be further improved by various ensemble techniques [33, 34].

Conclusion

In this work, we present a state-of-the-art kernel based method, MHC2SKpan, for pan-specific MHC-II binding prediction. On the one hand, it can effectively incorporate the physical and chemical properties of amino acids for measuring the similarities among peptides of different lengths. On the other hand, the relationships among different MHC molecules can be directly captured and used for pan-specific binding prediction. Experimental results on various benchmark datasets from different perspectives demonstrate that MHC2SKpan achieved performance comparable to the leading predictor, NetMHCIIpan-2.0, and significantly outperformed three well-known pan-specific methods, NetMHCIIpan-1.0, TEPITOPEpan and MultiRTA. Automatically tuning the parameters of MHC2SKpan for a novel target MHC molecule to improve its performance would be very interesting future work.

Acknowledgements

This work has been partially supported by the National Natural Science Foundation of China (61170097) and the Scientific Research Starting Foundation for Returned Overseas Chinese Scholars, Ministry of Education, China. Shanfeng Zhu would like to thank the China Scholarship Council for financial support of his visit to the University of Illinois at Urbana-Champaign.

Declarations

Publication of this article was funded by the National Natural Science Foundation of China.

This article has been published as part of BMC Genomics Volume 14 Supplement 5, 2013: Twelfth International Conference on Bioinformatics (InCoB2013): Computational biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S5.

Author information

Authors and Affiliations

School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433, China
Linyuan Guo, Cheng Luo & Shanfeng Zhu

Corresponding author

Correspondence to Shanfeng Zhu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Method development: LG, SZ. Conceived and designed the experiment: LG, SZ. Performed the experiment: LG, CL. Designed the web site: LG. Analyzed the data: LG, SZ. Wrote the paper: LG, SZ.

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Guo, L., Luo, C. & Zhu, S. MHC2SKpan: a novel kernel based approach for pan-specific MHC class II peptide binding prediction. BMC Genomics 14 (Suppl 5), S11 (2013). https://doi.org/10.1186/1471-2164-14-S5-S11

Published: 16 October 2013
DOI: https://doi.org/10.1186/1471-2164-14-S5-S11
