BMC Bioinformatics Volume 10
|
Viewing options:Associated material:Related literature:- Articles citing this article
- Other articles by authors
- Related articles/pages
Tools:Post to:
|
Research articleMBA: a literature mining system for extracting biomedical abbreviationsYun Xu1,2 , ZhiHao Wang1,2 , YiMing Lei1,2 , YuZhong Zhao1,2 and Yu Xue3  1Department of Computer Science and Technology, University of Science and Technology of China Hefei, Anhui 230027, PR China 2Anhui Province-MOST Co-Key Laboratory of High Performance Computing and Its Application Hefei, Anhui 230027, PR China 3School of Life Science, University of Science and Technology of China Hefei, Anhui 230027, PR China author email corresponding author email
BMC Bioinformatics 2009,
10:14doi:10.1186/1471-2105-10-14
|
|
| Published: |
9 January 2009 |
Abstract
Background
The exploding growth of the biomedical literature presents many challenges for biological researchers. One such challenge is from the use of a great deal of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State of the art methods either focus exclusively on acronym-type abbreviations, or could not recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. At first a scoring method is used to classify the abbreviations into acronym-type and non-acronym-type abbreviations, and then their corresponding definitions are identified by two different methods: text alignment algorithm for the former, statistical method for the latter.
Results
A literature mining system MBA was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, called Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at the precision of 91% on the Medstract gold-standard EVALUATION Corpus.
Conclusion
We present a new literature mining system MBA for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than the others. It can identify the definition of not only acronym-type abbreviations including a little irregular acronym-type abbreviations (e.g., <CNS1, cyclophilin seven suppressor>), but also non-acronym-type abbreviations (e.g., <Fas, CD95>). |