This article is part of the supplement: Proceedings of the BioNLP 08 ACL Workshop: Themes in biomedical language processing

Open Access Research

Accelerating the annotation of sparse named entities by dynamic sentence selection

Yoshimasa Tsuruoka1,2*, Jun'ichi Tsujii1,2,3 and Sophia Ananiadou1,2

Author Affiliations

1 School of Computer Science, The University of Manchester, MIB, 131 Princess Street, Manchester, M1 7DN, UK

2 National Centre for Text Mining (NaCTeM), MIB, 131 Princess Street, Manchester, M1 7DN, UK

3 Department of Computer Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan

BMC Bioinformatics 2008, 9(Suppl 11):S8  doi:10.1186/1471-2105-9-S11-S8

Published: 19 November 2008

Abstract

Background

Previous studies of named entity recognition have shown that a reasonable level of recognition accuracy can be achieved by using machine learning models such as conditional random fields or support vector machines. However, the lack of training data (i.e. annotated corpora) makes it difficult for machine learning-based named entity recognizers to be used in building practical information extraction systems.

Results

This paper presents an active learning-like framework for reducing the human effort required to create named entity annotations in a corpus. In this framework, the annotation work is performed as an iterative and interactive process between the human annotator and a probabilistic named entity tagger. Unlike active learning, our framework aims to annotate all occurrences of the target named entities in the given corpus, so that the resulting annotations are free from the sampling bias which is inevitable in active learning approaches.

Conclusion

We evaluate our framework by simulating the annotation process on two named entity corpora and show that our approach can reduce the number of sentences that the human annotator needs to examine. The cost reduction achieved by the framework could be drastic when the target named entities are sparse.
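
The iterative process between annotator and tagger, and the way a simulation can count the sentences examined, might be sketched as follows. This is only an illustrative toy, not the tagger or the stopping criterion used in the paper: the token-count scorer, the function names (train, score, simulate), the fixed threshold, and the toy corpus are all assumptions introduced here. Gold labels stand in for the human annotator, as in a simulated annotation run.

```python
from collections import Counter

def train(annotated):
    """Count, per token, how many annotated sentences contain it
    and how many of those contain a target entity."""
    pos, total = Counter(), Counter()
    for tokens, has_entity in annotated:
        for tok in set(tokens):
            total[tok] += 1
            if has_entity:
                pos[tok] += 1
    return pos, total

def score(tokens, model):
    """Crude stand-in for a probabilistic tagger (an assumption):
    the highest smoothed positive rate among the sentence's tokens."""
    pos, total = model
    return max(((pos[t] + 0.1) / (total[t] + 1.0) for t in set(tokens)),
               default=0.0)

def simulate(corpus, seed_indices, batch_size=1, threshold=0.3):
    """Repeatedly pick the unannotated sentences most likely to contain an
    entity, 'annotate' them with gold labels, and retrain. Stops when no
    remaining sentence scores at or above the (hypothetical) threshold.
    Returns (sentences examined, entity-bearing sentences found)."""
    annotated = {i: corpus[i] for i in seed_indices}
    while True:
        model = train(annotated.values())
        candidates = sorted(
            ((score(corpus[i][0], model), i)
             for i in range(len(corpus)) if i not in annotated),
            reverse=True,
        )
        batch = [i for s, i in candidates[:batch_size] if s >= threshold]
        if not batch:
            break
        for i in batch:
            annotated[i] = corpus[i]  # simulated annotator reveals the gold label
    found = sum(1 for _, has_entity in annotated.values() if has_entity)
    return len(annotated), found

# Toy corpus: (tokens, does the sentence contain a target entity?)
TOY_CORPUS = [
    (["the", "p53", "gene", "is", "mutated"], True),
    (["cells", "were", "grown", "overnight"], False),
    (["expression", "of", "the", "brca1", "gene"], True),
    (["samples", "were", "stored", "cold"], False),
    (["the", "gene", "tp63", "was", "cloned"], True),
    (["buffer", "solution", "added", "slowly"], False),
]

if __name__ == "__main__":
    examined, found = simulate(TOY_CORPUS, seed_indices=[0])
    print(f"examined {examined} of {len(TOY_CORPUS)} sentences, "
          f"found {found} with entities")
```

In this toy run the loop examines only the sentences that share vocabulary with already-annotated positive sentences and never visits the three entity-free ones, which mirrors the intuition that savings grow as the target entities become sparser. A real tagger would assign probabilities to entity spans rather than whole sentences, and the stopping rule would need to estimate how many entities remain unannotated.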