Email updates

Keep up to date with the latest news and content from BMC Genomics and BioMed Central.

This article is part of the supplement: Proceedings of the 6th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2010)

Open Access Proceedings

Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

Lívia Márcia Silva1, Felipe Carvalho de Souza Teixeira2, José Miguel Ortega3, Luis Enrique Zárate1 and Cristiane Neri Nobre2*

Author Affiliations

1 Departamento de Ciência da Computação - Pontifícia Universidade Católica de Minas Gerais, Brazil

2 Departamento de Ciência da Computação - Universidade Federal de São João del-Rei, Brazil

3 Departamento of Bioquímica and Imunologia - Universidade Federal de Minas Gerais, Brazil

For all author emails, please log on.

BMC Genomics 2011, 12(Suppl 4):S9  doi:10.1186/1471-2164-12-S4-S9

Published: 22 December 2011

Abstract

Background

The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation. However, obtaining an accurate prediction is not always a simple task and can be modeled as a problem of classification between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each molecule of mRNA has a unique translation initiation site and various others that are not initiators. Therefore, this study focuses on the problem from the perspective of balancing classes and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow.

Results

Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the Mus musculus and Rattus norvegicus organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Nasonia vitripennis. The precision increases significantly by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the Translation Initiation Site, improves the rate of sensitivity by approximately 7%. Using the M-Clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus).

Conclusions

In order to solve the problem of TIS prediction, the results indicate that the methodology proposed in this work is adequate, particularly when using the concept of acquired knowledge which increased the accuracy in all databases evaluated.