Open Access Research article

MBA: a literature mining system for extracting biomedical abbreviations

Yun Xu1,2, ZhiHao Wang1,2, YiMing Lei1,2, YuZhong Zhao1,2 and Yu Xue3

1Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China

2Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China

3School of Life Science, University of Science and Technology of China Hefei, Anhui 230027, PR China


BMC Bioinformatics 2009, 10:14 doi:10.1186/1471-2105-10-14

Published: 9 January 2009

Abstract

Background

The explosive growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the heavy use of abbreviations. Extracting abbreviations and their definitions accurately helps biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule-based, machine-learning-based, text-alignment-based and statistics-based. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method classifies abbreviations as acronym-type or non-acronym-type; their corresponding definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter.
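
The classification step can be illustrated with a minimal sketch. The abstract does not give MBA's actual scoring function, so the score used here (the fraction of the abbreviation's letters that match word-initial letters of the candidate definition, in order) and the names initial_letter_score, classify and the 0.6 threshold are all assumptions made for illustration only.

```python
import re


def initial_letter_score(abbr: str, candidate: str) -> float:
    """Fraction of the abbreviation's letters that match, in order,
    the initial letters of the words in the candidate definition.

    Illustrative assumption only; not the scoring method used by MBA.
    """
    letters = [c.lower() for c in abbr if c.isalpha()]
    if not letters:
        return 0.0
    initials = [w[0].lower() for w in re.findall(r"[A-Za-z0-9]+", candidate)]

    matched = 0
    pos = 0  # next word position that may still be matched
    for ch in letters:
        try:
            idx = initials.index(ch, pos)
        except ValueError:
            continue  # this letter has no word-initial match; skip it
        matched += 1
        pos = idx + 1
    return matched / len(letters)


def classify(abbr: str, candidate: str, threshold: float = 0.6) -> str:
    """Label a pair as acronym-type when enough letters align with
    word initials; the threshold is a hypothetical value."""
    score = initial_letter_score(abbr, candidate)
    return "acronym-type" if score >= threshold else "non-acronym-type"


print(classify("CNS1", "cyclophilin seven suppressor"))
print(classify("Fas", "CD95"))
```

Under these assumptions, a pair such as <CNS1, cyclophilin seven suppressor> would be routed to a text alignment step, while a pair such as <Fas, CD95>, whose letters share nothing with the definition, would be left to a statistical method.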

Results

A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold standard evaluation corpus.
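
For readers who want a single summary number, the reported precision and recall can be combined into a balanced F-measure (the harmonic mean of the two). The abstract itself reports only precision and recall; the F-measure below is derived arithmetic rather than a figure stated by the authors, and the function name f_measure is an illustrative choice.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: precision 91%, recall 88%.
# The resulting F-measure is derived here, not reported in the source.
score = f_measure(0.91, 0.88)
print(f"F-measure: {score:.3f}")  # prints "F-measure: 0.895"
```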

Conclusion

We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that MBA outperforms the other systems compared. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular ones (e.g., <CNS1, cyclophilin seven suppressor>), but also of non-acronym-type abbreviations (e.g., <Fas, CD95>).

