Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

Automatic annotation of protein motif function with Gene Ontology terms

Xinghua Lu1*, Chengxiang Zhai2, Vanathi Gopalakrishnan3 and Bruce G Buchanan3

Author Affiliations

1 Dept. of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, 135 Cannon St. Suite 303, Charleston, SC 29425, USA

2 Dept of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL 61801 USA

3 Center for Biomedical Informatics, University of Pittsburgh, 200 Lothrop Street, Suite 8084, Pittsburgh, PA 15213 USA

For all author emails, please log on.

BMC Bioinformatics 2004, 5:122  doi:10.1186/1471-2105-5-122

Published: 2 September 2004

Abstract

Background

Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist foridentifying candidate protein motifs at the whole genome level. However, amuch needed and importanttask is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base.

Results

This paperpresents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifsis viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for mode training and functional prediction. The mutual information of a motif and aGO term association isfound to be a very useful feature. We take advantageof the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correctassociation. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria.

Conclusions

In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical result demonstrated that the methods have a great potential for detecting and augmenting information about thefunctions of newly discovered candidate protein motifs.

Keywords:
Gene Ontology; data mining; supervised learning; protein motif; feature extraction; logistic regression