This article is part of the supplement: Twelfth International Conference on Bioinformatics (InCoB2013): Computational Biology

Open Access Research

Literature classification for semi-automated updating of biological knowledgebases

Lars Rønn Olsen1,2, Ulrich Johan Kudahl2,3, Ole Winther1,4 and Vladimir Brusic2,5*

Author Affiliations

1 Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark

2 Cancer Vaccine Center, Dana-Farber Cancer Institute, Boston, MA, USA

3 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark

4 Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark

5 Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA

BMC Genomics 2013, 14(Suppl 5):S14  doi:10.1186/1471-2164-14-S5-S14

Published: 16 October 2013

Abstract

Background

As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature.

Results

We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature.
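
The classification step described above can be illustrated with a minimal sketch of a k-NN text classifier evaluated by five-fold cross-validation. This is not the authors' implementation: the bag-of-words term counts, the cosine similarity measure, the value k = 3, and the function names are all assumptions made for illustration, and the sample abstracts are invented placeholders rather than PubMed records.

```python
import math
import random
import re
from collections import Counter

# Invented placeholder texts (NOT real PubMed abstracts), labelled by hand.
SAMPLES = [
    ("tumor antigen epitope recognized by cytotoxic t cells", "relevant"),
    ("hla restricted epitope from a melanoma tumor antigen", "relevant"),
    ("t cell epitope identified in a cancer testis antigen", "relevant"),
    ("peptide epitope of a tumor antigen presented by hla", "relevant"),
    ("novel tumor antigen epitope elicits t cell response", "relevant"),
    ("soil bacteria growth under varying nitrogen levels", "irrelevant"),
    ("genome assembly of a drought tolerant wheat cultivar", "irrelevant"),
    ("bacteria diversity in soil from wheat fields", "irrelevant"),
    ("nitrogen uptake and growth in wheat seedlings", "irrelevant"),
    ("genome sequencing of soil bacteria communities", "irrelevant"),
]


def vectorize(text):
    """Turn text into a bag-of-words term-count vector (assumed features)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to query."""
    nearest = sorted(train, key=lambda item: cosine(item[0], query),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(samples, k=3, folds=5, seed=0):
    """Return overall accuracy of k-NN under n-fold cross-validation."""
    data = [(vectorize(text), label) for text, label in samples]
    random.Random(seed).shuffle(data)
    correct = 0
    for fold in range(folds):
        # Every folds-th item (offset by fold) is held out for testing.
        test = data[fold::folds]
        train = [d for i, d in enumerate(data) if i % folds != fold]
        correct += sum(knn_predict(train, vec, k) == label
                       for vec, label in test)
    return correct / len(data)
```

Calling `cross_validate(SAMPLES)` returns the fraction of held-out texts that were labelled correctly across the five folds. On real abstracts, the feature representation and the choice of k would need to be tuned rather than fixed as they are here.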

Conclusion

Here we propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.

Keywords:
Text mining; machine learning; biological databases; automation