Open Access Highly Accessed Research article

A realistic assessment of methods for extracting gene/protein interactions from free text

Renata Kabiljo1*, Andrew B Clegg2 and Adrian J Shepherd1

Author affiliations

1 School of Crystallography and Institute of Structural and Molecular Biology, Birkbeck College, University of London, Malet Street, London WC1E 7HX UK

2 Research Department of Structural and Molecular Biology and Institute of Structural and Molecular Biology, University College London, Gower St, London, WC1E 6BT UK

Citation and License

BMC Bioinformatics 2009, 10:233  doi:10.1186/1471-2105-10-233

Published: 28 July 2009

Abstract

Background

The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. To that end, we have specifically avoided methods that are complex to install or require reimplementation, and we have coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger.

Results

Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions.
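
To make the terms above concrete, the following is a minimal sketch of what a keyword-based benchmark and an F-score calculation could look like. It is not the algorithm evaluated in the paper: the keyword list, the function names `extract_pairs` and `f_score`, and the use of plain substring matching in place of a named entity tagger are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical keyword list for illustration; the paper's keywords are not given here.
INTERACTION_KEYWORDS = {"binds", "interacts", "activates", "inhibits", "phosphorylates"}


def extract_pairs(sentence, entities):
    """Propose every pair of mentioned entities if the sentence has an interaction keyword.

    `entities` stands in for the output of a named entity tagger (an assumption);
    here an entity counts as mentioned if its name occurs as a substring.
    """
    tokens = {t.strip(".,;:()").lower() for t in sentence.split()}
    if not tokens & INTERACTION_KEYWORDS:
        return set()
    present = sorted({e for e in entities if e in sentence})
    return {frozenset(pair) for pair in combinations(present, 2)}


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over sets of pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a sentence naming three entities alongside one keyword yields three candidate pairs, so a single true interaction gives perfect recall but low precision. Errors in the entity list (for example, names missed by a tagger) would remove candidate pairs before scoring, which is one way that tagged rather than gold standard names can lower the F-score.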

Conclusion

In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.