This article is part of the supplement: The 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009)

Open Access Research

Predicting gene function using few positive examples and unlabeled ones

Yiming Chen13, Zhoujun Li2*, Xiaofeng Wang4, Jiali Feng4 and Xiaohua Hu5*

Author Affiliations

1 Computer School of National University of Defense Technology, Changsha, Hunan, China

2 School of Computer Science and Engineering, Beihang University, Beijing, China

3 College of Information Science and Technology, Hunan Agricultural University, Changsha, China

4 College of Information Engineering, Shanghai Maritime University, Shanghai, China

5 College of Information Science and Technology, Drexel University, Philadelphia, PA, 19104, USA

BMC Genomics 2010, 11(Suppl 2):S11  doi:10.1186/1471-2164-11-S2-S11

Published: 2 November 2010

Abstract

Background

The growing volume of functional genomic data makes it possible to predict gene function computationally, using known functional annotations and the relationships between unknown genes and known ones to map unknown genes to GO functional terms. The prediction procedure is usually formulated as a binary classification problem, and training a binary classifier requires positive and negative examples of roughly equal size. However, for most functional terms the available annotation databases provide only a few positively annotated genes, i.e. only a few positive training examples, which makes direct prediction of gene function infeasible.

Results

We propose a novel approach, SPE_RNE, to train a classifier for each functional term. First, the positive example set is enlarged by creating synthetic positive examples. Second, representative negative examples are selected by iteratively training an SVM (support vector machine) to move the classification hyperplane to an appropriate place. Finally, an optimal SVM classifier is trained using a grid search technique. On a combined kernel of yeast protein sequence, microarray expression, protein-protein interaction and GO functional annotation data, we compare SPE_RNE with three other typical methods using three classical performance measures, recall R, precision P and their combination F: twoclass treats all unlabeled genes as negative examples, twoclassbal randomly selects the same number of negative examples from the unlabeled genes, and PSoL selects a set of negative examples that are far from the positive examples and far from each other.
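The three stages can be illustrated with a short sketch. This is not the authors' implementation: it assumes scikit-learn-style SVMs, approximates the synthetic positive examples with a SMOTE-like interpolation, and uses a simplified version of the iterative negative selection; the neighbourhood size k, the number of selection rounds, the score threshold and the C/gamma grid are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


def synthesize_positives(P, n_new, k=3, seed=0):
    """Stage 1 (sketch): create synthetic positives by interpolating between
    a positive example and one of its k nearest positive neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(P))
        dist = np.linalg.norm(P - P[i], axis=1)
        j = rng.choice(np.argsort(dist)[1:k + 1])      # a nearby positive
        synthetic.append(P[i] + rng.random() * (P[j] - P[i]))
    return np.vstack(synthetic)


def select_negatives(P, U, n_rounds=5):
    """Stage 2 (sketch): grow a set of representative negatives from the
    unlabeled pool by repeatedly training an SVM and keeping the unlabeled
    examples it places confidently on the negative side."""
    # Seed negatives: unlabeled examples farthest from the positive centroid.
    far = np.argsort(np.linalg.norm(U - P.mean(axis=0), axis=1))[-len(P):]
    neg_idx = set(far.tolist())
    for _ in range(n_rounds):
        N = U[sorted(neg_idx)]
        X = np.vstack([P, N])
        y = np.r_[np.ones(len(P)), np.zeros(len(N))].astype(int)
        clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
        scores = clf.decision_function(U)              # > 0: positive side
        neg_idx |= set(np.where(scores < -1.0)[0].tolist())
    return U[sorted(neg_idx)]


def train_final_svm(P, N):
    """Stage 3 (sketch): choose C and gamma by grid search, scoring with F."""
    X = np.vstack([P, N])
    y = np.r_[np.ones(len(P)), np.zeros(len(N))].astype(int)
    grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    return GridSearchCV(SVC(kernel="rbf"), grid, scoring="f1", cv=3).fit(X, y)
```

The resulting classifier would then be evaluated on held-out genes with recall R, precision P and their combination F, for example via sklearn.metrics.precision_recall_fscore_support.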

Conclusions

On both the test data and the unknown-gene data, we compute the average and variance of the measure F. The experiments show that our approach has better generalization performance and practical prediction capacity. In addition, our method can also be applied to other organisms such as human.