This article is part of the supplement: A critical assessment of text mining methods in molecular biology

Open Access Report

Automatically annotating documents with normalized gene lists

Jeremiah Crim, Ryan McDonald and Fernando Pereira

Department of Computer and Information Science, University of Pennsylvania, Levine Hall, 3330 Walnut Street, Philadelphia, Pennsylvania, USA, 19104

BMC Bioinformatics 2005, 6(Suppl 1):S13 doi:10.1186/1471-2105-6-S1-S13

Published: 24 May 2005

Abstract

Background

Document gene normalization is the problem of creating a list of unique identifiers for genes that are mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is primarily based on standard pattern matching and information extraction techniques. The second and more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms.
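To make the task concrete, the following is a minimal sketch of document-level gene normalization by synonym lookup. It is not the authors' system: the lexicon, the identifiers (GENE:0001, GENE:0002) and the function name normalize_genes are all hypothetical, and matching is reduced to a case-insensitive, word-bounded string search. A real system would also need to decide which matches are valid gene mentions, which is the role the abstract assigns to the statistical classifier.

```python
import re

# Hypothetical synonym lexicon: surface form -> gene identifier.
# Both the synonyms and the identifiers are invented for illustration.
SYNONYMS = {
    "abc1": "GENE:0001",
    "abc-1": "GENE:0001",
    "xyz kinase": "GENE:0002",
}


def normalize_genes(text, synonyms=SYNONYMS):
    """Return the sorted list of unique gene identifiers whose synonyms occur in text."""
    lowered = text.lower()
    found = set()
    for synonym, gene_id in synonyms.items():
        # Require non-word characters on both sides so that "abc1"
        # does not match inside a longer token such as "abc10".
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(gene_id)
    return sorted(found)


doc = "We show that ABC-1 interacts with XYZ kinase; Abc1 mutants are viable."
print(normalize_genes(doc))
```

Note that the two spellings "ABC-1" and "Abc1" collapse to a single identifier, so the output is a list of unique genes for the document rather than a list of mentions.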

Results

We compare the results of the two systems, analyze their merits and argue that the classification-based system is preferable for several reasons, including performance, simplicity and robustness. Our best systems attain balanced precision and recall in the range of 74%–92%, depending on the organism.
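As an illustration of how precision and recall apply to this task, the sketch below scores a predicted gene list against a reference list for a single document. The function name precision_recall and the identifiers are hypothetical, and how scores are aggregated across documents in the actual evaluation is not stated in this abstract, so this should be read as one plausible reading rather than the authors' scoring procedure.

```python
def precision_recall(predicted, gold):
    """Score a predicted list of gene identifiers against a reference list.

    Precision is the fraction of predicted identifiers that are correct;
    recall is the fraction of reference identifiers that were predicted.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers: two of the three predictions are correct,
# and two of the three reference genes are recovered.
p, r = precision_recall(["G1", "G2", "G3"], ["G1", "G2", "G4"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example the two scores coincide, which is the sense in which a system's precision and recall are described as balanced.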

