This article is part of the supplement: Ninth Annual MCBIOS Conference. Dealing with the Omics Data Deluge

Open Access Proceedings

A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins

Ruchi Verma and Ulrich Melcher*

Author affiliations

Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA

Citation and License

BMC Bioinformatics 2012, 13(Suppl 15):S9  doi:10.1186/1471-2105-13-S15-S9

Published: 11 September 2012

Abstract

Background

Members of the phylum Proteobacteria are the most prominent among the bacteria that cause plant diseases, which reduce the quantity and quality of food produced by agriculture. To ameliorate these losses, infections need to be identified at an early stage. Recent developments in next-generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem, but none has emerged as the best protocol. Here we attempt a systematic way to determine the organismal origins of peptides using a machine learning algorithm, the Support Vector Machine (SVM).

Results

The amino acid compositions of proteobacterial proteins were found to differ from those of plant proteins. We developed SVM models based on amino acid and dipeptide compositions to distinguish proteobacterial proteins from plant proteins. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% and a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on datasets not used in training to assess their validity.
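
For readers unfamiliar with the features and the metric named above, the following is a minimal Python sketch of how AAC, DC, and hybrid feature vectors and the MCC are commonly computed. It is not the authors' implementation. The function names are hypothetical, and expressing compositions as percentages and skipping non-standard residues are assumptions not stated in the abstract.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids and the 400 ordered dipeptides built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
_STANDARD = set(AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return 20 values: percentage of each standard residue in seq.

    Assumption: non-standard symbols (e.g. X) are ignored.
    """
    residues = [c for c in seq.upper() if c in _STANDARD]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: percentage of each adjacent residue pair in seq.

    Assumption: pairs containing a non-standard symbol are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate AAC (20) and DC (400) into one 420-value vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 if undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Vectors like these could then be passed, one per protein, to any SVM implementation as training and test inputs, with the MCC computed from the resulting confusion matrix.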

Conclusion

The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences.

Keywords:
proteobacteria; plant proteins; SVM; machine learning; amino acid composition; dipeptide composition