BMC Bioinformatics


Open Access Research article

The combination approach of SVM and ECOC for powerful identification and classification of transcription factor

Guangyong Zheng1,2,3, Ziliang Qian3,6, Qing Yang2, Chaochun Wei4,5, Lu Xie5, Yangyong Zhu2,5* and Yixue Li4,5*

Author Affiliations

1 School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, PR China

2 Department of Computing and Information Technology, Fudan University, 220 Handan Road, Shanghai 200433, PR China

3 Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, PR China

4 College of Life Sciences and Technology, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai 200240, PR China

5 Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, Shanghai 200235, PR China

6 Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039, PR China


BMC Bioinformatics 2008, 9:282 doi:10.1186/1471-2105-9-282

Published: 16 June 2008

Abstract

Background

Transcription factors (TFs) are core functional proteins that play important roles in the control of gene expression, and they are key elements in the construction of gene regulatory networks. Traditionally, they have been identified and classified through experimental approaches. To save time and reduce costs, many computational methods have been developed to identify TFs among new proteins and to classify the resulting TFs. Although these methods have facilitated TF screening to some extent, low accuracy remains a common problem. With the rapidly growing number of new proteins, more precise algorithms for identifying TFs among new proteins and classifying the identified TFs are in high demand.

Results

The support vector machine (SVM) algorithm was used to construct an automatic detector for TF identification, with protein domains and functional sites employed as feature vectors. The error-correcting output coding (ECOC) algorithm, which originated in the fields of information and communication engineering, was combined with the SVM methodology for TF classification. The overall success rates of identification and classification were 88.22% and 97.83%, respectively. Finally, a web site was constructed to give users access to our tools (see the Availability and requirements section for the URL).
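To illustrate how protein domains and functional sites might be turned into feature vectors for an SVM, the following is a minimal sketch. It assumes a simple binary presence/absence encoding, which the abstract does not specify; the protein names, feature identifiers, and the helper functions `build_vocabulary` and `encode` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def build_vocabulary(annotations: Dict[str, Set[str]]) -> List[str]:
    """Collect every feature ID seen across all proteins, in sorted order."""
    return sorted({feat for feats in annotations.values() for feat in feats})


def encode(features: Set[str], vocabulary: List[str]) -> List[int]:
    """Return a 0/1 vector: 1 where the protein carries the feature, else 0."""
    return [1 if feat in features else 0 for feat in vocabulary]


# Hypothetical annotations: each protein maps to the set of domains and
# functional sites detected in its sequence (placeholder identifiers).
annotations = {
    "protein_A": {"domain_1", "site_2"},
    "protein_B": {"domain_1", "domain_3"},
    "protein_C": {"site_2"},
}

vocabulary = build_vocabulary(annotations)
vectors = {name: encode(feats, vocabulary) for name, feats in annotations.items()}

for name, vec in vectors.items():
    print(name, vec)
```

Vectors of this kind, one per protein, could then be passed to any SVM implementation as training or test input.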

Conclusion

The SVM method is a valid and stable means of TF identification when protein domains and functional sites are used as feature vectors. The error-correcting output coding (ECOC) algorithm is a powerful method for multi-class classification problems. When combined with the SVM method, it can remarkably increase the accuracy of TF classification using protein domains and functional sites as feature vectors. In addition, our work implies that the ECOC algorithm may succeed in a broad range of applications in biological data mining.
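The error-correcting idea behind ECOC can be sketched as follows. Each class is assigned a binary codeword, one binary classifier (an SVM in this work) is trained per codeword bit, and a new sample is assigned to the class whose codeword is nearest in Hamming distance to the predicted bits. The four class names and the 7-bit exhaustive codebook below are assumptions made for illustration, not the code design used in the paper, and the classifier outputs are simulated rather than produced by trained SVMs.

```python
from typing import Dict, List

# Hypothetical 4-class codebook (exhaustive code of length 2^(4-1) - 1 = 7).
# Every pair of codewords differs in 4 positions, so any single wrong
# bit among the 7 binary classifiers can still be corrected.
CODEBOOK: Dict[str, List[int]] = {
    "class_1": [1, 1, 1, 1, 1, 1, 1],
    "class_2": [0, 0, 0, 0, 1, 1, 1],
    "class_3": [0, 0, 1, 1, 0, 0, 1],
    "class_4": [0, 1, 0, 1, 0, 1, 0],
}


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(predicted_bits: List[int]) -> str:
    """Return the class whose codeword is closest to the predicted bits."""
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], predicted_bits))


# Simulated output of the 7 binary classifiers for a class_3 sample,
# with the fifth classifier making an error (0 flipped to 1).
noisy_bits = [0, 0, 1, 1, 1, 0, 1]
print(decode(noisy_bits))
```

Because the codewords are well separated, the single erroneous bit does not change the decoded class, which is the property that distinguishes ECOC from simpler one-versus-rest voting.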