This article is part of the supplement: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB)

Open Access Research

Exploiting physico-chemical properties in string kernels

Nora C Toussaint1*, Christian Widmer2, Oliver Kohlbacher1 and Gunnar Rätsch2

Author Affiliations

1 Center for Bioinformatics, Eberhard-Karls-Universität, Sand 14, 72076 Tübingen, Germany

2 Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

BMC Bioinformatics 2010, 11(Suppl 8):S7  doi:10.1186/1471-2105-11-S8-S7

Published: 26 October 2010



String kernels are commonly used for the classification of biological sequences, both nucleotide and amino acid sequences. Although string kernels are already very powerful, they have a major shortcoming when applied to amino acid sequences: they ignore the physico-chemical properties of the amino acids being compared, such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is scarce. So far, only very few approaches have aimed at combining string kernels with physico-chemical descriptors.


We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with those of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position-specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that incorporating amino acid properties into string kernels yields improved performance compared to standard string kernels and to previously proposed non-substring kernels.
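To make the general idea concrete, the following minimal sketch contrasts a standard spectrum kernel, which counts only exactly matching substrings, with a variant that compares substrings through an RBF kernel on their physico-chemical descriptors. It is an illustration under stated assumptions, not the formulation used in the paper: the two-dimensional descriptor (Kyte-Doolittle hydropathy plus a crude net charge), the function names, and the parameters k and sigma are all choices made here for the example. The double loop over substring pairs is written for clarity only and does not reflect the computational complexity reported for the proposed kernels.

```python
import math
from collections import Counter

# Illustrative descriptors (an assumption of this sketch, not the paper's choice):
# Kyte-Doolittle hydropathy index and a crude net charge at neutral pH.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}  # all others treated as 0


def descriptor(kmer):
    """Concatenate the per-residue descriptors of a substring."""
    vec = []
    for aa in kmer:
        vec.extend((HYDROPATHY[aa], CHARGE.get(aa, 0.0)))
    return vec


def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Standard spectrum kernel: only identical substrings contribute."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return float(sum(n * ct[u] for u, n in cs.items()))


def rbf_spectrum_kernel(s, t, k=3, sigma=2.0):
    """Spectrum-style kernel in which every pair of substrings contributes
    according to an RBF kernel on their descriptor vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    total = 0.0
    for u, nu in cs.items():
        du = descriptor(u)
        for v, nv in ct.items():
            dv = descriptor(v)
            sq_dist = sum((a - b) ** 2 for a, b in zip(du, dv))
            total += nu * nv * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total
```

With this sketch, the peptides AIA and ALA share no identical 3-mer, so the standard spectrum kernel returns zero, whereas the RBF variant assigns them a high similarity because isoleucine and leucine have similar properties. Replacing the central residue with aspartate instead gives a value close to zero.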


In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference.


Data sets, code and additional information are available from webcite. Implementations of the developed kernels are available as part of the Shogun toolbox.