
This article is part of the supplement: A critical assessment of text mining methods in molecular biology.

Open Access Report

Exploring the boundaries: gene and protein identification in biomedical text

Jenny Finkel1, Shipra Dingare2, Christopher D Manning1, Malvina Nissim2, Beatrice Alex2 and Claire Grover2

1Department of Computer Science, Stanford University, Stanford CA 94305-9040, USA

2Institute for Communicating and Collaborative Systems, University of Edinburgh, United Kingdom


BMC Bioinformatics 2005, 6(Suppl 1):S5 doi:10.1186/1471-2105-6-S1-S5

Published: 24 May 2005

Abstract

Background

Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools.

Methods

We present a maximum-entropy-based system that incorporates a diverse set of features for identifying gene and protein names in biomedical abstracts.

Results

This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the "open" evaluation and a precision of 0.78 and recall of 0.85 in the "closed" evaluation.
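For readers who want a single figure per track, precision and recall are commonly combined into a balanced F-score (their harmonic mean). The abstract itself reports only precision and recall, so the sketch below is an illustrative calculation from those four numbers rather than a result stated by the authors; the function name `f_score` and the dictionary layout are hypothetical.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall exactly as reported in the abstract.
reported = {"open": (0.83, 0.84), "closed": (0.78, 0.85)}

# Derived values, not figures from the paper.
f1_by_track = {track: f_score(p, r) for track, (p, r) in reported.items()}

for track, f1 in f1_by_track.items():
    print(f"{track}: F1 = {f1:.3f}")
```

Because the inputs are rounded to two decimal places, the derived scores may differ slightly from any F-score computed on the unrounded evaluation counts.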

Conclusion

The central contributions are a rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources, including full MEDLINE abstracts and web searches.

