Protein sequences classification by means of feature extraction with substitution matrices
1 LIMOS - Blaise Pascal University - Clermont University, BP 10448, Clermont-Ferrand 63000, France
2 LIMOS - CNRS UMR 6158, Aubière 63173, France
3 Department of Computer Science - FSJ - University of Jendouba, UMA Street, Jendouba 8100, Tunisia
4 URPAH - FST - University of Tunis El Manar, Academic Campus, Tunis 2092, Tunisia
5 Department of Computer Science - FSG - University of Gafsa, Campus of Sidi Ahmed Zarroug, Gafsa 2112, Tunisia
BMC Bioinformatics 2010, 11:175. doi:10.1186/1471-2105-11-175. Published: 8 April 2010
This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address this task: it has been widely used to encode biological sequences into feature vectors, the format required by well-known machine-learning classifiers. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.
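The idea of letting a substitution matrix decide when two motifs count as the same feature can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method as published: the score table `TOY_SCORES` holds invented values (not a published matrix), motifs are taken to be fixed-length 3-grams, the similarity rule (a score reaching a fraction of the self-score) and the greedy grouping are simplifications, and the function names are hypothetical.

```python
# Toy symmetric substitution scores (illustrative values, not a published matrix).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("V", "V"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("L", "I"): 2, ("L", "V"): 1, ("I", "V"): 3, ("K", "R"): 2,
}

def score(a, b):
    """Look up a pair score in either order; unlisted pairs get a penalty of -2."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -2))

def motif_score(m1, m2):
    """Sum of position-wise substitution scores between two equal-length motifs."""
    return sum(score(a, b) for a, b in zip(m1, m2))

def similar(m1, m2, threshold=0.5):
    """Assumed rule: m2 can stand in for m1 if it reaches a fraction of m1's self-score."""
    return motif_score(m1, m2) >= threshold * motif_score(m1, m1)

def ngrams(seq, n=3):
    """All overlapping length-n substrings of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_features(sequences, n=3, threshold=0.5):
    """Greedily keep one representative motif per group of similar motifs."""
    reps = []
    for seq in sequences:
        for m in ngrams(seq, n):
            if not any(similar(r, m, threshold) for r in reps):
                reps.append(m)
    return reps

def encode(seq, reps, n=3, threshold=0.5):
    """Binary feature vector: 1 if the sequence contains a motif similar to the representative."""
    grams = ngrams(seq, n)
    return [int(any(similar(r, g, threshold) for g in grams)) for r in reps]

if __name__ == "__main__":
    reps = build_features(["LIVK", "LLVR"])
    print(reps)
    print(encode("LLVR", reps))
```

In this toy setting, exact-match 3-grams over the two sequences would yield four distinct features, whereas grouping similar motifs leaves two representatives. That is the kind of reduction in attribute count the abstract refers to, although the actual figures depend on the matrix and the threshold.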
To demonstrate the efficiency of this approach, we compared several encoding methods using several machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. The results indicated that SVM generally outperforms the other classifiers, whichever encoding method is used. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied how the choice of substitution matrix affects the quality of our method and hence the classification quality. Our method achieves good classification accuracy with all the substitution matrices, and the accuracy varies only slightly from one matrix to another. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of previously published datasets allowed us to carry out a comparison with several related works.
The outcomes of our comparative experiments confirm the efficiency of our encoding method for representing protein sequences in classification tasks.