Department of Computer Science and Software Engineering, Université Laval, Québec, Canada

Department of Molecular Medicine, Université Laval, Québec, Canada

Abstract

Background

The cellular function of a vast majority of proteins is performed through physical interactions with other biomolecules, which, most of the time, are other proteins. Peptides represent templates of choice for mimicking a secondary structure in order to modulate protein-protein interactions. They are thus an interesting class of therapeutics since they also display strong activity, high selectivity, low toxicity and few drug-drug interactions. Furthermore, predicting which peptides would bind to a specific MHC allele would be of tremendous benefit for improving vaccine-based therapies and possibly for generating antibodies with greater affinity. Modern computational methods have the potential to accelerate and lower the cost of drug and vaccine discovery by selecting potential compounds for testing in silico prior to biological validation.

Results

We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, including the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function kernels. We provide a low-complexity dynamic programming algorithm for the exact computation of the kernel and a linear-time algorithm for its approximation. Combined with kernel ridge regression and sup-CK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of predicting the binding affinity of any peptide to any protein with reasonable accuracy. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets.

Conclusion

On all benchmarks, our method significantly (p-value ≤ 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. Moreover, generating reliable peptide-protein binding affinities will also improve systems biology modelling of interaction pathways. Lastly, the method should be of value to a large segment of the research community, with the potential to accelerate the discovery of peptide-based drugs and to facilitate vaccine development. The proposed kernel is freely available at

Background

The cellular function of a vast majority of proteins is performed through physical interactions with other proteins. Indeed, essentially all of the known cellular and biological processes depend, at some level, on protein-protein interactions (PPI).

Considering the nature of the interaction surface, protein secondary structures are essential for binding specifically to protein interaction domains. Peptides also represent templates of choice for mimicking a secondary structure in order to modulate protein-protein interactions.

Yearly, large sums of money are invested in the process of finding druggable targets and identifying compounds with medicinal utility. The widespread use of combinatorial chemistry and high-throughput screening in the pharmaceutical and biotechnology industries implies that millions of compounds can be tested for biological activity. However, screening large chemical libraries generates significant rates of both false positives and negatives. The process is expensive and faces a number of challenges in testing candidate drugs and validating the hits, all of which must be done efficiently to reduce costs and time. Computational methods with reasonable predictive power can now be envisaged to accelerate the process, thus providing an increase in productivity at a reduced cost.

As an example, peptides ranging from 8 to 12 amino acids represent the recognition unit for the MHC (Major Histocompatibility Complex). Being capable of predicting which peptides bind to a specific MHC allele would be of tremendous benefit for improving vaccine-based therapies, possibly generating antibodies with greater affinity that could yield an improved immune response. Moreover, simply having data on the binding affinity of peptides and proteins could readily assist systems biology modelling of interaction pathways.

The ultimate goal is to build a predictor of the highest binding affinity peptides. This task would be facilitated if one had a fast and accurate binding affinity predictor. Indeed, with this predictor, one could easily predict the binding affinity of huge sets of peptides and select the candidates with the highest predicted binding affinity, or use stochastic search methods such as simulated annealing if the set of peptides were too large. This paper provides a step in this direction with the use of a machine learning algorithm based on kernel methods and a novel kernel.

Traditional machine learning approaches focused on using binary binding data for the classification of compounds (binding, non-binding).

The Immune Epitope Database (IEDB)

We propose a new machine learning approach based on kernel methods.

For the machine learning algorithm itself, we show that kernel ridge regression is well suited for this regression task.

Methods

Statistical machine learning and kernel ridge regression in our context

Given a set of training examples (or cases), the task of a learning algorithm is to build an accurate predictor. In this paper, each example is of the form ((**x**, **y**), e), where **x** represents a peptide, **y** represents a protein, and e is a real number representing the binding energy (or the binding affinity) between the peptide **x** and the protein **y**. A multi-target predictor is a function h that returns an output h(**x**, **y**) when given any input (**x**, **y**). In our setting, the output h(**x**, **y**) is a real number estimate of the “true” binding energy e between **x** and **y**. The predictor h is accurate on example ((**x**, **y**), e) if the predicted output h(**x**, **y**) is very similar to the real output e.

With kernel methods, each input (**x**, **y**) is implicitly mapped to a feature vector **ϕ**(**x**, **y**), and the predictor takes the linear form h_**w**(**x**, **y**) = **w** · **ϕ**(**x**, **y**) for some weight vector **w** learned from the data.

The loss incurred by predicting a binding energy h_**w**(**x**, **y**) on input (**x**, **y**), when the true binding energy is e, is given by a loss function ℓ(**w**, (**x**, **y**), e).

The fundamental assumption in machine learning is that each example ((**x**, **y**), e) is generated according to some unknown distribution D. The task of the learning algorithm is then to find the predictor h_**w** having the smallest possible risk R(h_**w**), defined as the expected loss:

R(h_**w**) ≝ E_{((**x**,**y**),e)∼D} ℓ(**w**, (**x**, **y**), e).

However, the learning algorithm does not have access to D. Instead, it has access to a training set S ≝ {((**x**_1, **y**_1), e_1), …, ((**x**_m, **y**_m), e_m)}, where each example ((**x**_i, **y**_i), e_i) is assumed to be generated independently according to the same (but unknown) distribution D. It is well known that the predictor h_**w** minimizing the regularized empirical risk F(**w**) will have a small risk R(h_**w**) whenever the obtained value of F(**w**) is small. Here, F(**w**) is defined as

F(**w**) ≝ (1/C)‖**w**‖² + ∑_{i=1}^{m} ℓ(**w**, (**x**_i, **y**_i), e_i)

for some suitably-chosen constant C > 0. The first term, which penalizes predictors with a large weight vector **w**, is called a regularizer; the constant C controls the trade-off between the accuracy on the training set and the complexity of the predictor.

The representer theorem tells us that the weight vector **w**^{∗} that minimizes F(**w**) lies in the linear subspace spanned by the training examples. In other words, we can write

**w**^{∗} = ∑_{i=1}^{m} α_i **ϕ**(**x**_i, **y**_i),

where the coefficients α_i are called the dual variables and **ϕ**(**x**_i, **y**_i) denotes the feature vector of the i-th training example.

As was proposed by several authors, the joint feature vector **ϕ**(**x**, **y**) can be chosen as the tensor product ϕ_x(**x**) ⊗ ϕ_y(**y**) of a peptide feature vector ϕ_x(**x**) and a protein feature vector ϕ_y(**y**). Here, the tensor product of vectors **a** = (a_1, …, a_n) and **b** = (b_1, …, b_m) denotes the vector **a** ⊗ **b** = (a_1 b_1, a_1 b_2, …, a_n b_m) of all the products of the components of **a** and **b**. If we now define the peptide kernel k_x(**x**, **x**′) ≝ ϕ_x(**x**) · ϕ_x(**x**′) and the protein kernel k_y(**y**, **y**′) ≝ ϕ_y(**y**) · ϕ_y(**y**′), the joint kernel factorizes as

k((**x**, **y**), (**x**′, **y**′)) = (ϕ_x(**x**) ⊗ ϕ_y(**y**)) · (ϕ_x(**x**′) ⊗ ϕ_y(**y**′)) = k_x(**x**, **x**′) · k_y(**y**, **y**′).
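The factorization behind this tensor-product construction, (**a** ⊗ **b**) · (**a**′ ⊗ **b**′) = (**a** · **a**′)(**b** · **b**′), is easy to verify numerically; here is a minimal NumPy sketch with arbitrary toy vectors:

```python
import numpy as np

# Arbitrary toy feature vectors: a, a2 play the role of peptide features,
# b, b2 the role of protein features.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])
a2 = np.array([0.5, -1.0, 2.0])
b2 = np.array([1.0, 3.0])

# Scalar product of the tensor (Kronecker) products...
lhs = np.dot(np.kron(a, b), np.kron(a2, b2))
# ...equals the product of the individual scalar products.
rhs = np.dot(a, a2) * np.dot(b, b2)
```

This identity is what allows the joint kernel to be computed as a product of a peptide kernel and a protein kernel, without ever forming the tensor-product feature space explicitly.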

Consequently, from the representer theorem, we can write the multi-target predictor as

h(**x**, **y**) = ∑_{i=1}^{m} α_i k_x(**x**_i, **x**) k_y(**y**_i, **y**).

In the case of the quadratic loss ℓ(**w**, (**x**, **y**), e) = (**w** · **ϕ**(**x**, **y**) − e)², F(**w**) is a strongly convex function of **w** for any strictly positive regularization, so it admits a single local minimum, which is also the global minimum, located at the **w**^{∗} where the gradient ∂F(**w**)/∂**w** vanishes. For the quadratic loss, this solution **w**^{∗} is given by the vector **α** = (α_1, …, α_m) of dual variables satisfying

**α** = (**K** + **I**/C)^{−1} **e**,      (1)

where **K** denotes the Gram matrix of kernel values K_{i,j} = k((**x**_i, **y**_i), (**x**_j, **y**_j)), **I** denotes the identity matrix and **e** = (e_1, …, e_m) is the vector of observed binding energies. For an m × m matrix **K**, the inverse of **K** + **I**/C can be computed in O(m³) time with the Gaussian-elimination method and in O(m^{2.376}) time with the Coppersmith-Winograd algorithm.
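As an illustration, the closed-form solution of Equation (1) amounts to solving a single linear system; below is a generic kernel ridge regression sketch (our illustration, not the authors' code) with toy values, where `lam` stands for the regularization constant:

```python
import numpy as np

def krr_fit(K, e, lam):
    """Dual coefficients alpha = (K + lam * I)^(-1) e, via a linear solve."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), e)

def krr_predict(k_vec, alpha):
    """Prediction: scalar product of kernel values with the dual coefficients."""
    return float(np.dot(k_vec, alpha))

# Toy 2-example Gram matrix (symmetric PSD) and binding energies.
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
e = np.array([-1.0, -2.0])

alpha = krr_fit(K, e, lam=0.1)
pred = krr_predict(K[0], alpha)  # prediction for the first training example
```

In practice, solving the linear system (`np.linalg.solve`) is preferred over explicitly inverting the matrix, as it is faster and numerically more stable.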

Finally, we will also consider the single protein target case, where only one protein **y** is of interest. In this setting, the predictor h_**w** predicts the binding energy from a feature vector **ϕ**(**x**) of the peptide **x** alone. The solution is again the minimizer of F(**w**), but with **ϕ**(**x**, **y**) replaced by **ϕ**(**x**); equivalently, the joint kernel k is replaced by the peptide kernel k_x.

Kernel methods have been extremely successful within the last decade, but the choice of the kernel is critical for obtaining good predictors. Hence, confronted with a new application, we must be prepared to design an appropriate kernel. The next subsections show how we have designed and chosen both peptide and protein kernels.

A generic string (GS) kernel for small bio-molecule strings

String kernels for bio-molecules have been applied with success in bioinformatics and computational biology. Kernels for large bio-molecules, such as the local-alignment kernel, were, however, designed with whole proteins in mind rather than short sequences such as peptides.

The proposed kernel, which we call the generic string (GS) kernel, is a similarity measure defined for any pair (**x**, **x**′) of strings of amino acids. Let Σ denote the set of amino acids. Given a string **x** of amino acids (e.g., a peptide), let |**x**| denote the length of string **x**, as measured by the number of amino acids in **x**. The positions of amino acids in **x** are numbered from 1 to |**x**|. In other words, **x** = x_1, x_2, …, x_{|**x**|} with all x_i ∈ Σ.

Now, let **ψ**: Σ → ℝ^d be an encoding function such that

**ψ**(a) = (ψ_1(a), ψ_2(a), …, ψ_d(a))

is a vector where each component ψ_i(a) encodes one of the d physico-chemical properties of amino acid a. For any substring length l, let **ψ**^l(a_1, …, a_l) ≝ (**ψ**(a_1), …, **ψ**(a_l)) denote the concatenated encoding of a string of l amino acids.

Let L be the maximum length of the substrings that will be compared, and let σ_p and σ_c be two positive parameters described below. The GS kernel is defined, for any pair (**x**, **x**′) of strings of length at least L, as

GS(**x**, **x**′, L, σ_p, σ_c) ≝ ∑_{l=1}^{L} ∑_{i=0}^{|**x**|−l} ∑_{j=0}^{|**x**′|−l} exp(−(i−j)² / (2σ_p²)) · exp(−‖**ψ**^l(x_{i+1}, …, x_{i+l}) − **ψ**^l(x′_{j+1}, …, x′_{j+l})‖² / (2σ_c²)).

In other words, this kernel compares each substring x_{i+1}, x_{i+2}, …, x_{i+l} of **x** of size l ≤ L with each substring of **x**′ having the same length. Each substring comparison yields a score that depends on the **ψ**-similarity of their respective amino acids and on a shifting contribution term that decays exponentially rapidly with the distance between the starting positions of the two substrings. The σ_p parameter controls the length scale of this shifting contribution term, while the σ_c parameter controls the penalty incurred when the encoded properties of the compared amino acids differ.
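To make the definition concrete, here is a direct (unoptimized) implementation sketch of the GS kernel; this is our illustration, and the encoding `PSI` is an arbitrary two-property toy stand-in for real physico-chemical descriptors, covering only four amino acids:

```python
import math

# Toy encoding psi: amino acid -> short vector of properties
# (illustrative values only; real descriptors differ).
PSI = {"A": (0.1, 0.5), "C": (0.3, 0.2), "D": (0.9, 0.7), "E": (0.8, 0.6)}

def gs_kernel(x, xp, L, sigma_p, sigma_c):
    """Naive GS kernel: compare every pair of equal-length substrings."""
    total = 0.0
    for i in range(len(x)):
        for j in range(len(xp)):
            # Shifting contribution: decays with the starting-position gap.
            shift = math.exp(-(i - j) ** 2 / (2 * sigma_p ** 2))
            for l in range(1, min(L, len(x) - i, len(xp) - j) + 1):
                # Squared distance between the two encoded substrings.
                dist2 = sum(
                    (u - v) ** 2
                    for t in range(l)
                    for u, v in zip(PSI[x[i + t]], PSI[xp[j + t]])
                )
                total += shift * math.exp(-dist2 / (2 * sigma_c ** 2))
    return total
```

Since every pair of substrings is compared explicitly, this version is slow; the dynamic programming algorithm described later computes the same value far more efficiently.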

Also, note that the GS kernel can be used on strings of different lengths, which is a great advantage over localized string kernels (of fixed length) such as the RBF, the weighted degree, and the weighted degree RBF kernels. Moreover, several known kernels are obtained as limiting cases of the GS kernel: for instance, as σ_p approaches +∞ and σ_c approaches 0, the GS kernel becomes identical to the blended spectrum kernel. The table below lists all the special cases.

| **Fixed parameters** | **Free parameters** | **Kernel name** |
|---|---|---|
| σ_p → 0, σ_c → 0, L = 1 | | Hamming distance |
| σ_p → 0, σ_c → 0, L = string length | | Dirac delta |
| σ_p → ∞, σ_c → 0 | L | Blended Spectrum |
| σ_p → ∞ | L, σ_c | Blended Spectrum RBF |
| σ_c → 0 | L, σ_p | Oligo |
| σ_p → 0, L = string length | σ_c | Radial Basis Function (RBF) |
| σ_p → 0, σ_c → 0 | L | Weighted degree (⋆) |
| σ_p → 0 | L, σ_c | Weighted degree RBF (⋆) |
| | L, σ_p, σ_c | Generic String (GS) |

(⋆) Obtained by substituting **ψ** with an encoding function that preserves only the identity of amino acids.

In contrast, Leslie et al. proposed kernels in which a fixed number of mismatches is tolerated, regardless of which amino acids are substituted. With the **ψ** encoding function, amino acid properties are instead used to obtain a smooth transition between unimportant and critical mutations. Moreover, this transition can be adjusted through the σ_c parameter.

Also, Saigo et al. proposed the local-alignment kernel, which is not guaranteed to be positive semi-definite.

In the next subsection, we prove that the GS kernel is symmetric positive semi-definite and, therefore, defines a scalar product in some large-dimensional feature space. In other words, for any hyperparameter values (L, σ_p, σ_c), there exists a function **ϕ** such that GS(**x**, **x**′, L, σ_p, σ_c) = **ϕ**(**x**) · **ϕ**(**x**′) for all pairs of strings.

Consequently, the solution minimizing the ridge regression functional F(**w**) is given by Equation (1) and is guaranteed to exist whenever the GS kernel is used.

Symmetric positive semi-definiteness of the GS kernel

The fact that the GS kernel is positive semi-definite follows from the following theorem. The proof is provided as supplementary material [see Additional file

**The proof of Theorem 1.** This file presents the proof of Theorem 1, thereby establishing that the GS kernel is symmetric positive semi-definite.


Theorem 1

Let

Then, the kernel

is also symmetric positive semi-definite.

The positive semi-definiteness of the GS kernel comes from the fact that the GS kernel is a particular case of the more general kernel of Theorem 1, obtained by choosing, for each substring length l, the kernel k_l(**y**, **y**′) = exp(−‖**ψ**^l(**y**) − **ψ**^l(**y**′)‖² / (2σ_c²)) for comparing encoded substrings, which is an RBF kernel and hence symmetric positive semi-definite.

Indeed, this equality is a simple specialization of Equation (4.13) of

Finally, it is interesting to point out that Theorem 1 can be generalized to any sequence of symmetric positive semi-definite kernels used for comparing substrings.

Efficient computation of the GS kernel

To cope with today’s data deluge, the presented kernel should have a low computational cost. For this task, we first note that, before computing GS(**x**, **x**′, L, σ_p, σ_c) for each pair (**x**, **x**′) in the training set, we can first compute

E(a, a′) ≝ exp(−‖**ψ**(a) − **ψ**(a′)‖² / (2σ_c²))

for each pair (a, a′) of amino acids. After this pre-computation stage, done in O(|Σ|²) time, each access to E(a, a′) is done in O(1) time. We will not consider the running time of this pre-computation stage in the complexity analysis of the GS kernel, because it only has to be done once to be used for any 5-tuple (**x**, **x**′, L, σ_p, σ_c). Recall that the binding affinity predictor, given by Equation (1), can be built only after we have computed the m² elements of the kernel matrix **K** (for a training set of m examples). Since m² is usually much larger than |Σ|² = 400, we can omit this pre-computation time in the complexity analysis of kernel evaluations.

Now, recall that we have defined **ψ**^l(a_1, …, a_l) as the concatenation (**ψ**(a_1), …, **ψ**(a_l)). Consequently, the squared distance between two encoded substrings decomposes position-wise:

‖**ψ**^l(x_{i+1}, …, x_{i+l}) − **ψ**^l(x′_{j+1}, …, x′_{j+l})‖² = ∑_{t=1}^{l} ‖**ψ**(x_{i+t}) − **ψ**(x′_{j+t})‖².

Following this, we can now write the GS kernel as

GS(**x**, **x**′, L, σ_p, σ_c) = ∑_{i=0}^{|**x**|−1} ∑_{j=0}^{|**x**′|−1} exp(−(i−j)² / (2σ_p²)) ∑_{l=1}^{min(L, |**x**|−i, |**x**′|−j)} ∏_{t=1}^{l} E(x_{i+t}, x′_{j+t}),

where min(L, |**x**| − i, |**x**′| − j) takes into account the ends of strings **x** and **x**′.

Now, for any hyperparameters L, σ_p, σ_c, any strings **x**, **x**′, and any i ∈ {0, …, |**x**| − 1}, j ∈ {0, …, |**x**′| − 1}, let

B_{i,j} ≝ ∑_{l=1}^{min(L, |**x**|−i, |**x**′|−j)} ∏_{t=1}^{l} E(x_{i+t}, x′_{j+t}).

We therefore have

GS(**x**, **x**′, L, σ_p, σ_c) = ∑_{i=0}^{|**x**|−1} ∑_{j=0}^{|**x**′|−1} exp(−(i−j)² / (2σ_p²)) · B_{i,j}.

Since min(L, |**x**| − i, |**x**′| − j) ≤ L, the computation of each entry B_{i,j} seems to involve O(L²) operations, as each of the (up to) L products contains up to L factors. However, we can reduce this to O(L) operations by building the products incrementally. We thus have

B_{i,j} = ∑_{l=1}^{min(L, |**x**|−i, |**x**′|−j)} P_{i,j}^{l},   with P_{i,j}^{l} = P_{i,j}^{l−1} · E(x_{i+l}, x′_{j+l}) and P_{i,j}^{0} ≝ 1.

The computation of each entry B_{i,j} therefore involves only O(L) operations. Hence, the exact computation of the GS kernel takes O(|**x**| · |**x**′| · L) time.
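The complete algorithm can be sketched as follows; this is our illustrative Python version (not the authors' implementation), reusing the same toy four-amino-acid encoding assumption as before. The table E is precomputed once, and the products inside each B_{i,j} are built incrementally so that each entry costs O(L):

```python
import math

# Toy encoding (illustrative values only; real descriptors differ).
PSI = {"A": (0.1, 0.5), "C": (0.3, 0.2), "D": (0.9, 0.7), "E": (0.8, 0.6)}

def precompute_E(sigma_c):
    """E[(a, b)] = exp(-||psi(a) - psi(b)||^2 / (2 sigma_c^2)), for all pairs."""
    return {
        (a, b): math.exp(
            -sum((u - v) ** 2 for u, v in zip(PSI[a], PSI[b]))
            / (2 * sigma_c ** 2)
        )
        for a in PSI
        for b in PSI
    }

def gs_kernel_dp(x, xp, L, sigma_p, sigma_c):
    """GS kernel in O(|x| * |xp| * L): products of E built incrementally."""
    E = precompute_E(sigma_c)
    total = 0.0
    for i in range(len(x)):
        for j in range(len(xp)):
            b_ij, prod = 0.0, 1.0
            for l in range(1, min(L, len(x) - i, len(xp) - j) + 1):
                prod *= E[(x[i + l - 1], xp[j + l - 1])]  # extend by one residue
                b_ij += prod
            total += math.exp(-(i - j) ** 2 / (2 * sigma_p ** 2)) * b_ij
    return total
```

For the real 20-letter alphabet, `precompute_E` fills a 20 × 20 table, matching the O(|Σ|²) pre-computation stage described above.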

To test the efficiency of this dynamic programming algorithm, we conducted an experiment measuring the speedup obtained from using this algorithm versus a naïve implementation of Equation (4) that did not exploit dynamic programming. For peptides of length 15, 35 and 55, we measured the speedup obtained while computing 2,500 kernel values as a function of the kernel parameter L.

For a given value of L, the speedup is defined as the ratio T_n / T_d, where T_n is the running time of the naïve implementation and T_d is the running time of the dynamic programming algorithm.

The results shown in Figure

A benchmark experiment comparing the running times of the GS kernel dynamic programming algorithm and a naïve implementation of the GS kernel

**A benchmark experiment comparing the running times of the GS kernel dynamic programming algorithm and a naïve implementation of the GS kernel.** This figure shows the speedup of the dynamic programming algorithm over a naïve implementation of the GS kernel as a function of the kernel parameter L. The other kernel parameters were fixed to σ_p = 0.5 and σ_c = 0.5.

GS Kernel approximation

In this section, we show how to compute a very close approximation of the GS kernel in linear time. Such a feature is interesting if one wishes to do a pre- or post-treatment where the symmetric positive semi-definite (SPSD) property of the kernel is not required. For example, as opposed to the training stage, where the inverse of (**K** + **I**/C) must be computed from exact kernel values, kernel values in the prediction stage could be approximated. Indeed, the scalar product with the vector **α** of dual variables is defined even for non positive semi-definite kernel values. This scheme would greatly speed up the predictions with a very small loss of accuracy and precision.

The shifting penalizing term exp(−(i−j)² / (2σ_p²)) decays very quickly as |i − j| grows; it becomes negligible once |i − j| exceeds a threshold δ proportional to σ_p. In this case, the contribution of any substring of **x**′ whose starting position j is more than δ positions away from i can be ignored. Hence, instead of computing every entry of the |**x**| × |**x**′| matrix B, it suffices to compute a band of O((|**x**| + |**x**′|) · δ) non-zero values around its diagonal. We can therefore write this approximation of the GS kernel as

GS′(**x**, **x**′, L, σ_p, σ_c) ≝ ∑_{(i,j) : |i−j| ≤ δ} exp(−(i−j)² / (2σ_p²)) · B_{i,j}.

It is clear that only values of B_{i,j} with |i − j| ≤ δ contribute to GS′, so its computation is dominated by the computation of the banded matrix, whose O((|**x**| + |**x**′|) · δ) entries can each be computed in O(L) time. Since δ and L do not depend on the string lengths, GS′ can be computed in O(max(|**x**|, |**x**′|)) time, giving an optimal linear complexity.
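A sketch of the approximation: compared to the exact dynamic programming computation, the only change is that entries B_{i,j} with |i − j| > δ are skipped (same toy encoding assumption as before; `delta` is left as an explicit argument):

```python
import math

# Toy encoding (illustrative values only; real descriptors differ).
PSI = {"A": (0.1, 0.5), "C": (0.3, 0.2), "D": (0.9, 0.7), "E": (0.8, 0.6)}

def E_val(a, b, sigma_c):
    """Pairwise amino-acid similarity E(a, b)."""
    d2 = sum((u - v) ** 2 for u, v in zip(PSI[a], PSI[b]))
    return math.exp(-d2 / (2 * sigma_c ** 2))

def gs_kernel_banded(x, xp, L, sigma_p, sigma_c, delta):
    """Approximate GS kernel: ignore substring pairs shifted by more than delta."""
    total = 0.0
    for i in range(len(x)):
        # Restrict j to the band |i - j| <= delta.
        for j in range(max(0, i - delta), min(len(xp), i + delta + 1)):
            b_ij, prod = 0.0, 1.0
            for l in range(1, min(L, len(x) - i, len(xp) - j) + 1):
                prod *= E_val(x[i + l - 1], xp[j + l - 1], sigma_c)
                b_ij += prod
            total += math.exp(-(i - j) ** 2 / (2 * sigma_p ** 2)) * b_ij
    return total
```

Taking `delta` at least max(|x|, |x′|) recovers the exact kernel value; smaller values trade a little accuracy for a large speedup.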

To determine the speedup that can be obtained by approximating the GS kernel, we conducted an experiment measuring this speedup for different peptide lengths. For a given value of σ_p, the speedup is defined as the ratio T_f / T_a, where T_f is the time required for the computation using the exact GS kernel and T_a is the time required using the approximated GS kernel.

The figure below shows the speedup as a function of the kernel parameter σ_p, which determines the band width δ, for peptides of length 15, 35 and 55. Substantial speedups are obtained for small values of σ_p, and the benefit of the approximation gradually disappears as σ_p grows.

A benchmark experiment comparing the running times of the approximated GS kernel and the GS kernel

**A benchmark experiment comparing the running times of the approximated GS kernel and the GS kernel.** This figure shows the speedup of the approximation algorithm over the full computation of the GS kernel as a function of the kernel parameter σ_p. The running times were recorded while computing 1,000,000 kernel values for peptides of length 15, 35 and 55. The other kernel parameters were fixed to σ_c = 0.5 and a fixed value of L.

Kernel for protein binding pocket

Hoffmann et al. proposed two similarity measures for protein binding pockets: sup-CK and sup-CK_L. Since both scores are invariant by rotation and translation of the pockets, they are not positive semi-definite kernels. To obtain a valid kernel, we have used the so-called empirical kernel map, where each binding pocket **y** is mapped explicitly to the vector (k(**y**_1, **y**), k(**y**_2, **y**), …, k(**y**_m, **y**)) of its similarity scores with the m binding pockets of the training set. To ensure reproducibility and avoid implementation errors, all experiments were done using the implementation provided by the authors. An illustration of the pocket creation for the sup-CK kernel is shown in Figure
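The empirical kernel map itself is generic and easy to sketch: each object is represented by the vector of its similarity scores against the m training objects, and the kernel is the scalar product of two such vectors, which makes the resulting Gram matrix positive semi-definite by construction. The `score` function below is a toy overlap measure, not the actual sup-CK score:

```python
import numpy as np

def empirical_kernel_map(score, train_objs, obj):
    """Map obj to the vector of its similarity scores with the training objects."""
    return np.array([score(t, obj) for t in train_objs])

def empirical_kernel(score, train_objs, y1, y2):
    """Scalar product of two empirical maps (symmetric PSD by construction)."""
    return float(np.dot(
        empirical_kernel_map(score, train_objs, y1),
        empirical_kernel_map(score, train_objs, y2),
    ))

# Stand-in similarity (NOT sup-CK): overlap between two sets of "atom labels".
score = lambda a, b: len(set(a) & set(b))
train = ["abc", "bcd", "cde"]
k12 = empirical_kernel(score, train, "abd", "bce")
```

Over any set of objects, the resulting Gram matrix equals S Sᵀ, where S is the matrix of raw scores against the training objects, hence its positive semi-definiteness.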

A pyMOL illustration of a binding pocket used in the binding pocket kernel

**A pyMOL illustration of a binding pocket used in the binding pocket kernel.** This pyMOL illustration of a binding pocket, used for the binding pocket kernel

Kernel for protein structure

The MAMMOTH kernel is a similarity function between protein secondary structures proposed by Qiu et al.

Metrics and experimental design

When dealing with regression values, classical metrics used for classification, such as the area under the ROC curve (AUC), cannot be applied directly.

**AUC results for experiments on MHC-II.** This file presents AUC values obtained for the experiments on MHC-II datasets and provides an explanation on how these values were calculated.


Fortunately, metrics such as the root mean squared error (RMSE), the coefficient of determination (R²) and the Pearson product-moment correlation coefficient (PCC) are better suited for measuring the performance of predictors on regression problems. Therefore, in this paper, we have used the PCC and the RMSE to evaluate the performance of our method.

Except when otherwise stated, 10-fold nested cross-validation was used for estimating the PCC and the RMSE of the predicted binding affinities (see Figure

Illustration of the nested cross-validation procedure

**Illustration of the nested cross-validation procedure.** Nested 10-fold cross-validation. For each of the 10 outer folds, an inner 9-fold cross-validation scheme was used to select hyperparameters.

More precisely, let S_k, for k ∈ {1, …, 10}, denote the examples of the k-th outer fold. The metrics compare, for each example of S_k, the true binding energy e_i with the prediction h(**x**_i, **y**_i) of the predictor built from the examples outside of S_k.

An algorithm that, on average, produces a predictor that makes the same quadratic error as the constant predictor (always outputting the mean binding energy of the training data) has learned nothing of value.

As for the RMSE, it was computed using

RMSE ≝ √( (1/n) ∑_{i=1}^{n} (h(**x**_i, **y**_i) − e_i)² ).

Therefore, a perfect predictor will give an RMSE of zero.
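Both metrics are straightforward to compute from the pooled out-of-fold predictions; a minimal sketch with toy values:

```python
import numpy as np

def pcc(y_true, y_pred):
    """Pearson product-moment correlation coefficient."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy binding energies and predictions.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.1, -1.9, -3.2, -3.8]

r = pcc(y_true, y_pred)     # close to 1 for a good predictor
err = rmse(y_true, y_pred)  # 0 for a perfect predictor
```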

All the p-values reported in this article were computed using the two-tailed Wilcoxon signed-rank test.

Finally, for all the experiments, hyperparameters for the GS kernel and the learning algorithms were selected by grid search using the following ranges: σ_p ∈ ]0, 18], σ_c ∈ [0, 18] and an integer range for the maximum substring length L.

Data

PepX database

The PepX database

The few complexes with positive binding energies were removed from the dataset. No other modifications were made to the original database.

Major histocompatibility complex class II (MHC-II)

Two different approaches were used for the prediction of MHC class II - peptide binding affinities: single-target and multi-target (pan-specific).

Single-target prediction experiments were conducted using the data from the IEDB dataset proposed by the authors of the RTA method.

Pan-specific experiments were conducted on the IEDB dataset proposed by the authors of the NetMHCIIpan method.

As pan-specific learning requires comparing HLA alleles using a kernel, the allele identifiers contained in the dataset were not directly usable for this purpose. Hence, to obtain a useful similarity measure (or kernel) for pairs of HLA alleles, we used the pseudo-sequences composed of the amino acids at highly polymorphic positions in the alleles’ sequences. These amino acids are potentially in contact with peptide binders.

Quantitative structure affinity model (QSAM) benchmark

Three well-studied benchmark datasets for designing quantitative structure affinity models were also used to compare our approach: 58 angiotensin-I converting enzyme (ACE) inhibitory dipeptides, 31 bradykinin-potentiating pentapeptides and 101 cationic antimicrobial pentadecapeptides. These datasets were recently the subject of extensive studies.

Results and discussion

PepX database

To our knowledge, this is the first attempt at using a kernel method to learn a predictor that takes the protein crystal structure and the peptide sequence as input to predict the binding energy of the complex. Many consider this task a major challenge with important consequences for molecular biology. Standard string kernels for protein primary structures, such as the LA-kernel and the blended spectrum (BS) kernel, were first used on the protein side. They did not yield good results, mainly because they do not consider the protein’s secondary structure information. To validate this hypothesis and improve our results, we tried the MAMMOTH kernel, which did improve the results (see Table

| | **SVR** | **KRR** | **KRR** | **KRR** | **KRR** | **KRR** | **KRR** |
|---|---|---|---|---|---|---|---|
| Protein kernel | sup-CK | sup-CK | sup-CK | BS | MAMMOTH | sup-CK_L | sup-CK_L |
| Peptide kernel | BS | BS | GS | BS | BS | BS | GS |
| PepX Unique | 0.6822 | 0.7072 | **0.7300** | 0.5873 | 0.5828 | 0.7110 | 0.7264 |
| PepX All | 0.8227 | 0.8580 | 0.8648 | 0.7769 | 0.8152 | 0.8601 | **0.8652** |

Best results are highlighted in bold.

Choosing a kernel for the peptides was also a challenging task. Sophisticated kernels for local signals, such as the RBF, the weighted degree, and the weighted degree RBF kernels, could not be used because the peptides are not all of the same length. In fact, peptide lengths vary between 5 and 35 amino acids, which makes the tasks of learning a predictor and designing a kernel even more challenging. This was part of our motivation for designing the GS kernel. For all experiments, the BLOSUM 50 matrix was found during cross-validation to be the optimal choice of amino acid descriptors.

The table above compares the sup-CK and sup-CK_L kernels for binding pockets. It is surprising that the sup-CK_L kernel does not outperform the sup-CK kernel on both benchmarks, since the addition of the atom partial charges should provide more relevant information to the predictor.

The two figures below show the predicted values as a function of the true values, obtained with sup-CK_L, for the PepX Unique and PepX All datasets. For illustration purposes, the absolute value of the binding energy has been plotted. We observe that the predictor has the property of maintaining the ranking of binding affinities. Consequently, peptides with high binding affinity can generally be identified, an important feature for drug discovery. Peptides with the highest binding affinities are the ones that will ultimately serve as precursor drugs or scaffolds in a rational drug design program.

Predicted values as a function of the true values for the PepX Unique dataset

**Predicted values as a function of the true values for the PepX Unique dataset.** Predicted values for all peptide-protein complexes as a function of the true values. A perfect predictor would have all of its predictions lying on the diagonal.

Predicted values as a function of the true values for the PepX All dataset

**Predicted values as a function of the true values for the PepX All dataset.** Predicted values for all peptide-protein complexes as a function of the true values. A perfect predictor would have all of its predictions lying on the diagonal.

Experiments showed that a Pearson correlation coefficient of ≈ 1.0 is attainable on the training set when using the binding pocket kernel, the GS kernel and a large value of the complexity-accuracy trade-off parameter C.

Major histocompatibility complex class II (MHC-II)

Single-target predictions

We performed a single-target prediction experiment using the dataset proposed by the authors of the RTA method.

Three common metrics were used to compare the methods: the Pearson correlation coefficient (PCC), the root mean squared error (RMSE), and the area under the ROC curve (AUC). The PCC and the RMSE results are presented in Table

| **MHC β** | **KRR+GS** (PCC) | **RTA** (PCC) | **KRR+GS** (RMSE, kcal/mol) | **RTA** (RMSE, kcal/mol) | **# of examples** |
|---|---|---|---|---|---|
| DRB1*0101 | **0.632** | 0.530 | **1.20** | 1.43 | 5648 |
| DRB1*0301 | **0.538** | 0.425 | **1.16** | 1.46 | 837 |
| DRB1*0401 | **0.430** | 0.340 | **1.44** | 1.72 | 1014 |
| DRB1*0404 | **0.491** | 0.487 | **1.25** | 1.38 | 617 |
| DRB1*0405 | **0.530** | 0.442 | **1.09** | 1.35 | 642 |
| DRB1*0701 | **0.645** | 0.484 | **1.24** | 1.62 | 833 |
| DRB1*0802 | **0.469** | 0.412 | **1.19** | 1.34 | 557 |
| DRB1*0901 | 0.303 | **0.369** | **1.55** | 1.68 | 551 |
| DRB1*1101 | **0.550** | 0.450 | **1.17** | 1.45 | 812 |
| DRB1*1302 | **0.468** | 0.464 | **1.51** | 1.64 | 636 |
| DRB1*1501 | **0.502** | 0.438 | **1.41** | 1.53 | 879 |
| DRB3*0101 | 0.380 | **0.425** | **1.03** | 1.13 | 483 |
| DRB4*0101 | **0.613** | 0.522 | **1.10** | 1.33 | 664 |
| DRB5*0101 | **0.541** | 0.434 | **1.20** | 1.57 | 835 |
| H2*IA_b | **0.603** | 0.556 | **1.00** | 1.15 | 526 |
| H2*IA_d | 0.325 | **0.563** | **1.44** | 1.53 | 306 |
| Average: | **0.501** | 0.459 | **1.25** | 1.46 | |

Best results for each metric are highlighted in bold. The PCC results show that the proposed method (KRR+GS) outperforms the RTA method with a p-value of 0.0308. The RMSE results show that KRR+GS outperforms the RTA method on all 16 allotypes with a p-value of 0.0005.

Pan-specific predictions

To evaluate the performance of our method and the potential of the GS kernel, pan-specific predictions were performed using the dataset proposed by the authors of NetMHCIIpan.

To assess the performance of the proposed method, the PCC and the RMSE results are shown in Table

| **MHC β** | **KRR+GS** (PCC) | **MultiRTA** (PCC) | **NetMHCIIpan-2.0** (PCC) | **KRR+GS** (RMSE, kcal/mol) | **MultiRTA** (RMSE, kcal/mol) | **# of examples** |
|---|---|---|---|---|---|---|
| DRB1*0101 | **0.662** | 0.619 | 0.627 | 1.48 | **1.33** | 5166 |
| DRB1*0301 | **0.743** | 0.438 | 0.560 | **1.29** | 1.36 | 1020 |
| DRB1*0401 | **0.667** | 0.534 | 0.652 | **1.36** | 1.56 | 1024 |
| DRB1*0404 | 0.709 | 0.623 | **0.731** | **1.18** | 1.33 | 663 |
| DRB1*0405 | 0.606 | 0.566 | **0.626** | **1.25** | 1.28 | 630 |
| DRB1*0701 | 0.694 | 0.620 | **0.753** | **1.34** | 1.51 | 853 |
| DRB1*0802 | **0.728** | 0.523 | 0.700 | **1.23** | 1.45 | 420 |
| DRB1*0901 | 0.471 | 0.375 | **0.474** | **1.53** | 2.01 | 530 |
| DRB1*1101 | **0.786** | 0.603 | 0.721 | **1.16** | 1.46 | 950 |
| DRB1*1302 | **0.416** | 0.365 | 0.337 | 1.73 | **1.68** | 498 |
| DRB1*1501 | **0.612** | 0.513 | 0.598 | **1.46** | 1.57 | 934 |
| DRB3*0101 | **0.654** | 0.603 | 0.474 | 1.52 | **1.10** | 549 |
| DRB4*0101 | **0.540** | 0.508 | 0.515 | **1.41** | 1.61 | 446 |
| DRB5*0101 | **0.732** | 0.543 | 0.722 | **1.28** | 1.60 | 924 |
| Average: | **0.644** | 0.531 | 0.606 | **1.37** | 1.49 | |

Best results for each metric are highlighted in bold. The PCC results show that the proposed method (KRR+GS) outperforms MultiRTA with a p-value of 0.001 and NetMHCIIpan-2.0 with a p-value of 0.0574. The RMSE results indicate that KRR+GS outperforms MultiRTA with a p-value of 0.0466.

The PCC results show that our method outperforms the MultiRTA method (p-value of 0.001) and the more recent NetMHCIIpan-2.0 method (p-value of 0.0574).

The RMSE values are only shown for our method and the MultiRTA method, since such values were not provided by the authors of NetMHCIIpan-2.0. The RMSE results indicate that our method globally outperforms MultiRTA with a p-value of 0.0466.

Quantitative structure affinity model (QSAM) benchmark

For all datasets, the extended

Table

| | **SVR** | **KRR** | **KRR** |
|---|---|---|---|
| Peptide kernel | RBF | RBF | GS |
| ACE | 0.8782 | 0.8807 | **0.9044** |
| Bradykinin | 0.7491 | 0.7531 | **0.7641** |
| Cationic | 0.7511 | 0.7417 | **0.7547** |

Best results are highlighted in bold.

We observed that kernel ridge regression (KRR) had a slight accuracy advantage over support vector regression (SVR). Moreover, SVR has one more hyperparameter to tune than KRR: the ε parameter of its ε-insensitive loss function.

Additional results and external validation

To act as an external source of validation for our results and to assess the performance of the GS kernel, we participated in the 2012 Machine Learning Competition in Immunology.

These results support our claim that the GS kernel is a state-of-the-art kernel for peptides and a valuable tool for computational biologists.

Conclusions

We have proposed a new kernel designed for small bio-molecules (such as peptides) and pseudo-sequences of binding interfaces. The GS kernel is an elegant generalization of eight known kernels for local signals. Despite the richness of this new kernel, we have provided a simple and efficient dynamic programming algorithm for its exact computation and a linear-time algorithm for its approximation. Combined with the kernel ridge regression learning algorithm and the binding pocket kernel, the proposed kernel yields promising results on the PepX database. For the first time, a predictor capable of accurately predicting the binding affinity of any peptide to any protein was learned using this database. Our method significantly outperformed RTA on the single-target prediction of MHC-II binding peptides. Impressive state-of-the-art results were also obtained on the pan-specific MHC-II task, outperforming both MultiRTA and NetMHCIIpan-2.0. Moreover, the method was successfully tested on three well-studied datasets for the quantitative structure affinity model.

A predictor trained on the whole IEDB or PDB database, as opposed to benchmark datasets, would be a valuable tool for the community. Unfortunately, learning a predictor on very large datasets (over 25,000 examples) is still a major challenge for most machine learning methods, as the similarity (Gram) matrix becomes hard to fit into the memory of most computers. We propose to extend the presented method to very large datasets as future work. The proposed kernel is freely available at

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SG designed the GS kernel and the algorithms for its computation, implemented the learning algorithm and conducted experiments on the PepX and QSAM datasets. MM designed the learning algorithm. FL and MM did the proof of the symmetric positive semi-definiteness of the GS kernel. AD conducted experiments on MHC-II datasets. JC provided biological insight and knowledge. This work was done under the supervision of MM, FL and JC. All authors contributed to, read and approved the final manuscript.

Acknowledgements

Computations were performed on the SciNet supercomputer at the University of Toronto, under the auspice of Compute Canada. The operations of SciNet are funded by the Canada Foundation for Innovation (CFI), the Natural Sciences and Engineering Research Council of Canada (NSERC), the Government of Ontario and the University of Toronto. JC is the Canada Research Chair in Medical Genomics. This work was supported in part by the Fonds de recherche du Québec - Nature et technologies (FL, MM & JC; 2013-PR-166708) and the NSERC Discovery Grants (FL; 262067, MM; 122405).