This article is part of the supplement: First International Workshop on Text Mining in Bioinformatics (TMBio) 2006

Open Access Proceedings

GO for gene documents

Padmini Srinivasan1,2 and Xin Ying Qiu2

1School of Library and Information Science, University of Iowa, Iowa City, IA, USA

2Management Sciences Department, University of Iowa, Iowa City, IA, USA

BMC Bioinformatics 2007, 8(Suppl 9):S3 doi:10.1186/1471-2105-8-S9-S3

Published: 27 November 2007

Abstract

Background

Annotating genes and their products with Gene Ontology (GO) codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. Following this approach, our goal in this paper is to develop automatic annotation methods that can supplement the expensive manual annotation processes currently in place.

Results

Using a set of Support Vector Machine (SVM) classifiers, we achieved F-scores of 0.49, 0.41 and 0.33 for codes of the molecular function, cellular component and biological process GO hierarchies, respectively. We find that alternative term weighting strategies do not differ from each other in performance and that feature selection strategies reduce performance. The best thresholding strategy is one where a single threshold is picked for each hierarchy. Hierarchy level is important, especially for molecular function and biological process. The cellular component hierarchy stands apart from the other two in many respects, which may be due to fundamental differences in link semantics. This research shows that it is possible to beneficially exploit the hierarchical structures by defining and testing relaxed criteria for classification correctness. Finally, it is possible to build classifiers for codes with very few associated documents but, as expected, a huge penalty is paid in performance.
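To make the per-hierarchy thresholding idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that classifier scores for all codes in one hierarchy are pooled, that the F-score is the balanced F1 computed over the pooled decisions, and that candidate thresholds are the observed scores. The function names `f_score` and `pick_hierarchy_threshold` are hypothetical.

```python
from typing import List, Tuple


def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_hierarchy_threshold(
    decisions: List[Tuple[float, bool]],
) -> Tuple[float, float]:
    """Return the (threshold, F-score) pair that maximizes F.

    decisions holds (classifier score, true label) pairs pooled across
    every code in one hierarchy (an assumption of this sketch). A code
    is assigned to a document when its score is >= the threshold.
    """
    best_t, best_f = 0.0, -1.0
    # Try each distinct observed score as a candidate threshold.
    for t in sorted({score for score, _ in decisions}):
        tp = sum(1 for s, y in decisions if s >= t and y)
        fp = sum(1 for s, y in decisions if s >= t and not y)
        fn = sum(1 for s, y in decisions if s < t and y)
        f = f_score(tp, fp, fn)
        if f > best_f:  # ties keep the lowest threshold
            best_t, best_f = t, f
    return best_t, best_f


# Illustrative, made-up scores for one hierarchy.
toy = [(0.9, True), (0.8, True), (0.4, False),
       (0.3, True), (-0.2, False), (-0.5, False)]
threshold, best = pick_hierarchy_threshold(toy)
```

In practice the threshold would be chosen on held-out validation data for each of the three hierarchies and then applied unchanged to that hierarchy's test documents.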

Conclusion

The GO annotation problem is complex. Several key observations have been made, for example about topic drift, that may be important to consider in annotation strategies.

