This article is part of the supplement: Proceedings of the BioNLP 08 ACL Workshop: Themes in biomedical language processing

Open Access Research

The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes

Veronika Vincze1*, György Szarvas1*, Richárd Farkas2, György Móra1 and János Csirik2

Author Affiliations

1 University of Szeged, Department of Informatics, Human Language Technology Group, Árpád tér 2., Szeged, Hungary

2 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi Vértanúk tere 1., Szeged, Hungary

BMC Bioinformatics 2008, 9(Suppl 11):S9 doi:10.1186/1471-2105-9-S11-S9

Published: 19 November 2008

Abstract

Background

Detecting uncertain and negative assertions is essential in most biomedical text mining tasks, where the aim is generally to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource, the BioScope corpus, for research on handling negation and uncertainty in biomedical texts.

Results

The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation was carried out by two independent linguist annotators and a chief linguist, who was also responsible for setting up the annotation guidelines and who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation, and over 10% of them contain one or more linguistic annotations suggesting negation or uncertainty.
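
As an illustration only, the two annotation layers described above (a keyword marked at the token level and its scope marked within the sentence) could be modelled along the following lines. This is a minimal sketch, not the corpus's actual file format: the names `Annotation`, `Sentence`, `annotated_fraction` and `scope_text` are hypothetical, the two example sentences are invented, and the choice to place the keyword inside its own scope is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Annotation:
    """One keyword and its scope (hypothetical representation)."""
    kind: str               # "negation" or "speculation"
    cue: Tuple[int, int]    # token span of the keyword, end exclusive
    scope: Tuple[int, int]  # token span of the scope, end exclusive


@dataclass
class Sentence:
    tokens: List[str]
    annotations: List[Annotation] = field(default_factory=list)


def annotated_fraction(sentences: List[Sentence]) -> float:
    """Share of sentences that carry at least one annotation."""
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if s.annotations) / len(sentences)


def scope_text(sentence: Sentence, annotation: Annotation) -> str:
    """Return the tokens covered by the annotation's scope as a string."""
    start, end = annotation.scope
    return " ".join(sentence.tokens[start:end])


# Invented example sentences, not taken from the corpus.
corpus = [
    Sentence(
        tokens=["The", "scan", "showed", "no", "evidence", "of", "pneumonia", "."],
        # Assumption: the keyword "no" is counted as part of its own scope.
        annotations=[Annotation("negation", cue=(3, 4), scope=(3, 7))],
    ),
    Sentence(tokens=["The", "patient", "was", "discharged", "."]),
]

print(annotated_fraction(corpus))
print(scope_text(corpus[0], corpus[0].annotations[0]))
```

Storing spans as token offsets rather than as copied strings keeps the keyword and scope layers aligned with the same tokenisation, which makes a statistic such as the share of annotated sentences a simple count.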

Conclusion

Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is available free of charge for academic purposes. Apart from its intended role as a common resource for training, testing and comparing biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.