Methodology article (Open Access)

A text-mining system for extracting metabolic reactions from full-text articles

Jan Czarnecki (1), Irene Nobeli (1), Adrian M Smith (2) and Adrian J Shepherd (1, *)

Author Affiliations

1 Department of Biological Sciences and Institute of Molecular and Structural Biology, Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK

2 Unilever R&D, Colworth Science Park, Sharnbrook, Bedfordshire, MK44 1LG, UK

BMC Bioinformatics 2012, 13:172  doi:10.1186/1471-2105-13-172

Published: 23 July 2012

Abstract

Background

Increasingly, biological text-mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway, the metabolic pathway, has been largely neglected.

Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein–protein interactions.
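To make the idea of scoring entity permutations concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the system described in the article: the keyword stems, the prefix-matching stand-in for stemming, the weights, and the function names (`keyword_positions`, `score_assignment`, `rank_reactions`) are all hypothetical. The sketch assumes entities have already been tagged as enzymes or metabolites, enumerates every ordered (substrate, product) pair for each enzyme, and rewards keywords that fall between the substrate and the product.

```python
from itertools import permutations

# Hypothetical keyword stems; the article's actual keyword list is not given here.
KEYWORD_STEMS = ("convert", "catalys", "catalyz", "produc", "yield", "form")


def keyword_positions(tokens):
    """Indices of tokens that begin with one of the keyword stems.

    Prefix matching is a crude stand-in for a real stemmer.
    """
    return [i for i, tok in enumerate(tokens) if tok.lower().startswith(KEYWORD_STEMS)]


def score_assignment(substrate_pos, product_pos, kw_positions):
    """Toy scoring rule (an assumption): a keyword located between the
    substrate and the product counts fully, any other keyword counts a little."""
    score = 0.0
    for k in kw_positions:
        score += 1.0 if substrate_pos < k < product_pos else 0.25
    return score


def rank_reactions(tokens, entities):
    """Rank candidate (enzyme, substrate, product) assignments for one sentence.

    entities maps a token index to "enzyme" or "metabolite".
    Returns (score, enzyme, substrate, product) tuples, best first.
    """
    kws = keyword_positions(tokens)
    enzymes = [i for i, kind in entities.items() if kind == "enzyme"]
    metabolites = [i for i, kind in entities.items() if kind == "metabolite"]
    candidates = []
    for e in enzymes:
        # Each ordered pair of distinct metabolites is one role permutation.
        for s, p in permutations(metabolites, 2):
            candidates.append(
                (score_assignment(s, p, kws), tokens[e], tokens[s], tokens[p])
            )
    return sorted(candidates, reverse=True)


tokens = "Glucose is converted to glucose-6-phosphate by hexokinase".split()
entities = {0: "metabolite", 4: "metabolite", 6: "enzyme"}
best = rank_reactions(tokens, entities)[0]
print(best)  # (1.0, 'hexokinase', 'Glucose', 'glucose-6-phosphate')
```

In this toy sentence the keyword "converted" sits between "Glucose" and "glucose-6-phosphate", so the assignment with glucose as substrate outranks the reversed one. A real system would need a proper stemmer, multi-token entity names, and a tuned scoring scheme.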

Results

When evaluated on a set of manually curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein–protein interaction extraction task.
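For readers unfamiliar with the criteria mentioned above, precision is the fraction of extracted reactions that are correct, and recall is the fraction of curated reactions that were extracted. The sketch below computes both, assuming that each reaction is represented as an (enzyme, substrate, product) tuple and that a prediction counts as correct only on an exact match. That matching rule, the function name `precision_recall`, and the toy data are assumptions for illustration, not the article's evaluation protocol or results.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted reactions against a curated gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up toy data, not taken from the article.
gold = {
    ("hexokinase", "glucose", "glucose-6-phosphate"),
    ("enolase", "2-phosphoglycerate", "phosphoenolpyruvate"),
    ("aldolase", "fructose-1,6-bisphosphate", "glyceraldehyde-3-phosphate"),
    ("pyruvate kinase", "phosphoenolpyruvate", "pyruvate"),
}
predicted = {
    ("hexokinase", "glucose", "glucose-6-phosphate"),
    ("enolase", "2-phosphoglycerate", "phosphoenolpyruvate"),
    ("enolase", "phosphoenolpyruvate", "2-phosphoglycerate"),  # wrong direction
}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```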

Conclusions

We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein–protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. We hope that these results will provide an impetus for further research and serve as a useful benchmark for judging the performance of more sophisticated methods yet to be developed.