Research article (Open Access)

A detailed error analysis of 13 kernel methods for protein–protein interaction extraction

Domonkos Tikk1,2*, Illés Solt1,3, Philippe Thomas1 and Ulf Leser1

Author Affiliations

1 Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, 10099 Berlin, Germany

2 Software Engineering Institute, Óbuda University, 1034 Budapest, Hungary

3 Department of Telecommunications and Telematics, Budapest University of Technology and Economics, 1117 Budapest, Hungary

BMC Bioinformatics 2013, 14:12  doi:10.1186/1471-2105-14-12

Published: 16 January 2013

Abstract

Background

Kernel-based classification is the current state of the art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which differ mainly in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared with each other on their overall performance on different gold standard corpora, but little is known about their respective performance at the instance level.

Results

We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles from dissimilar kernels leads to a significant performance gain. However, our analysis also reveals that difficult pairs share few characteristics, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance.
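
To make the notions of "difficult" and "easy" pairs concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each method outputs a binary label per candidate pair. The toy data, the function names, and the 0.75 and 0.25 thresholds are hypothetical; the abstract only says "most methods". The majority vote is likewise just one simple way of combining classifiers and may differ from the ensembles evaluated in the article.

```python
from typing import List

# Toy data (made up for illustration): 4 classifiers, 5 candidate pairs.
# 1 = "interacting", 0 = "not interacting".
GOLD = [1, 0, 1, 0, 1]
PREDICTIONS = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]


def error_rates(predictions: List[List[int]], gold: List[int]) -> List[float]:
    """Fraction of classifiers that misclassify each pair."""
    n_methods = len(predictions)
    return [
        sum(preds[i] != gold[i] for preds in predictions) / n_methods
        for i in range(len(gold))
    ]


def categorize(rates: List[float], hard_at: float = 0.75,
               easy_at: float = 0.25) -> List[str]:
    """Label each pair by how many classifiers get it wrong.

    The thresholds are assumptions chosen for this sketch.
    """
    labels = []
    for rate in rates:
        if rate >= hard_at:
            labels.append("difficult")
        elif rate <= easy_at:
            labels.append("easy")
        else:
            labels.append("mixed")
    return labels


def majority_vote(predictions: List[List[int]]) -> List[int]:
    """Simple ensemble: predict 1 only if more than half the classifiers do."""
    n_methods = len(predictions)
    return [int(2 * sum(votes) > n_methods) for votes in zip(*predictions)]


if __name__ == "__main__":
    rates = error_rates(PREDICTIONS, GOLD)
    print(rates)
    print(categorize(rates))
    print(majority_vote(PREDICTIONS))
```

In this toy example the third pair is missed by every classifier, so voting cannot recover it. This illustrates why an ensemble gains the most when its members make dissimilar errors.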

Conclusions

Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.

Keywords:
Protein–protein interaction; Relation extraction; Kernel methods; Error analysis; Kernel similarity