This article is part of the supplement: A critical assessment of text mining methods in molecular biology

BioCreAtIvE Task1A: entity identification with a stochastic tagger

Shuhei Kinoshita1,2, K Bretonnel Cohen1, Philip V Ogren1,3 and Lawrence Hunter1*

Author Affiliations

1 Center for Computational Pharmacology, University of Colorado School of Medicine, Denver, Colorado

2 Fujitsu Ltd., BioChemical Information Project, 1-9-3 Nakase Mihama-ku Chiba, JAPAN

3 Dept. of Computer Science, University of Colorado at Boulder, Boulder, Colorado

BMC Bioinformatics 2005, 6(Suppl 1):S4  doi:10.1186/1471-2105-6-S1-S4

Published: 24 May 2005



Our approach to Task 1A was inspired by Tanabe and Wilbur's ABGene system [1,2]. Like Tanabe and Wilbur, we approached the problem as one of part-of-speech tagging, adding a GENE tag to the standard tag set. Where their system uses the Brill tagger, we used TnT, the Trigrams 'n' Tags HMM-based part-of-speech tagger [3]. Based on careful error analysis, we implemented a set of post-processing rules to correct both false positives and false negatives. We participated in both the open and the closed divisions; for the open division, we made use of data from NCBI.
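Treating entity identification as part-of-speech tagging means that gene mentions can be read directly off the tagger output as runs of consecutive GENE-tagged tokens. The following is a minimal sketch of that extraction step only. It does not use TnT or reproduce the authors' post-processing rules; the function name `gene_mentions`, the (token, tag) pair format, and the example sentence with its tags are all assumptions made for illustration.

```python
from typing import List, Tuple


def gene_mentions(tagged: List[Tuple[str, str]], gene_tag: str = "GENE") -> List[str]:
    """Collect maximal runs of consecutive tokens carrying the gene tag.

    `tagged` is a hypothetical list of (token, tag) pairs, standing in for
    the output of any part-of-speech tagger whose tag set includes GENE.
    """
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == gene_tag:
            # Still inside a gene mention: extend the current run.
            current.append(token)
        elif current:
            # A non-gene tag closes the current run.
            mentions.append(" ".join(current))
            current = []
    if current:
        # Flush a mention that ends at the end of the sentence.
        mentions.append(" ".join(current))
    return mentions


# Hypothetical tagger output, invented for this example.
tagged_sentence = [
    ("The", "DT"),
    ("p53", "GENE"),
    ("tumor", "GENE"),
    ("suppressor", "GENE"),
    ("binds", "VBZ"),
    ("MDM2", "GENE"),
    (".", "."),
]

mentions = gene_mentions(tagged_sentence)
```

Post-processing rules of the kind described above would then operate on such mentions, for example by extending or trimming their boundaries.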


Our base system without post-processing achieved a precision and recall of 68.0% and 77.2%, respectively, giving an F-measure of 72.3%. The full system with post-processing achieved a precision and recall of 80.3% and 80.5%, respectively, giving an F-measure of 80.4%. For the open division, adding a dictionary-based post-processing step yielded a slight further improvement (F-measure = 80.9%). We placed third in both the open and the closed divisions.
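The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below recomputes them from the precision and recall figures given above; it assumes the balanced (F1) form was used, and the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported for the two configurations.
base = f_measure(68.0, 77.2)   # base system without post-processing
full = f_measure(80.3, 80.5)   # full system with post-processing

print(f"base: {base:.1f}  full: {full:.1f}")
```

Rounded to one decimal place, these values match the 72.3% and 80.4% reported for the base and full systems.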


Our results show that a part-of-speech tagger augmented with post-processing rules yields an entity identification system that competes well with other approaches.