This article is part of the supplement: A critical assessment of text mining methods in molecular biology

Open Access Report

Mining protein function from text using term-based support vector machines

Simon B Rice¹, Goran Nenadic²,³ and Benjamin J Stapley¹,³*

Author Affiliations

1 Faculty of Life Sciences, University of Manchester, UK

2 School of Informatics, University of Manchester, UK

3 National Centre for Text Mining, Manchester, UK

BMC Bioinformatics 2005, 6(Suppl 1):S22  doi:10.1186/1471-2105-6-S1-S22

Published: 24 May 2005

Abstract

Background

Text mining has attracted considerable interest in biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed the assignment of Gene Ontology (GO) terms to human proteins and the selection of relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and to select passages that support the assignments. As classification features, we used terms that co-occur with a protein and were automatically extracted from documents.
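The idea of classifying a protein by the terms that co-occur with it can be illustrated with a minimal sketch. It assumes a single yes/no decision for one GO term, plain word tokens as terms, and a linear support vector machine trained with a Pegasos-style subgradient method; none of these details is given in this abstract. All protein names, sentences, and function names below are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' term extractor).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "by", "with"}

def cooccurring_terms(sentences, protein):
    """Count words in sentences that mention `protein`, excluding the protein itself."""
    name = protein.lower()
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        if name in tokens:  # only sentences mentioning the protein contribute
            counts.update(t for t in tokens if t != name and t not in STOPWORDS)
    return counts

def train_linear_svm(examples, epochs=50, lam=0.01):
    """Pegasos-style subgradient descent on the regularised hinge loss.

    examples: list of (feature Counter, label), with label +1 or -1.
    Examples are visited in a fixed order so the result is deterministic.
    """
    w = {}
    step = 0
    for _ in range(epochs):
        for feats, y in examples:
            step += 1
            eta = 1.0 / (lam * step)      # decaying learning rate
            margin = y * sum(w.get(t, 0.0) * v for t, v in feats.items())
            shrink = 1.0 - 1.0 / step     # equals 1 - eta * lam (regularisation)
            for t in w:
                w[t] *= shrink
            if margin < 1:                # hinge loss is active: move towards y * x
                for t, v in feats.items():
                    w[t] = w.get(t, 0.0) + eta * y * v
    return w

def predict(w, feats):
    """Return +1 if the linear score is positive, otherwise -1."""
    score = sum(w.get(t, 0.0) * v for t, v in feats.items())
    return 1 if score > 0 else -1

# Hypothetical training data for one GO term ("kinase activity"): +1 = annotated, -1 = not.
TRAINING = [
    ("ABC1", ["ABC1 phosphorylates the substrate in vitro.",
              "ABC1 kinase activity requires ATP."], 1),
    ("XYZ2", ["XYZ2 binds DNA at the promoter.",
              "The transcription factor XYZ2 represses expression."], -1),
    ("ABC3", ["ABC3 is a kinase that phosphorylates serine residues."], 1),
    ("XYZ4", ["XYZ4 binds promoter DNA."], -1),
]

weights = train_linear_svm(
    [(cooccurring_terms(sents, prot), label) for prot, sents, label in TRAINING]
)

# Classify an unseen (hypothetical) protein from the sentences that mention it.
new_feats = cooccurring_terms(
    ["NEW5 phosphorylates tau.", "An unrelated sentence about DNA."], "NEW5"
)
print(predict(weights, new_feats))
```

In this toy setting the second sentence is ignored because it does not mention NEW5, so only terms that genuinely co-occur with the protein reach the classifier. A real system would need one such decision per GO term, or a multi-class scheme, and far more training text.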

Results

The results evaluated by curators were modest and varied considerably between problems: in many cases GO terms were assigned to proteins relatively well, but the selected supporting text was typically not relevant (precision ranging from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent.

Conclusion

A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data are available and a significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2.