This article is part of the supplement: Proceedings of the 8th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2012)
Context-based preprocessing of molecular docking data
1 GPIN - Grupo de Pesquisa em Inteligência de Negócio, PPGCC, Faculdade de Informática, PUCRS, Av. Ipiranga, 6681 - Prédio 32, sala 628, 90619-900, Porto Alegre, RS, Brasil
2 LABIO - Laboratório de Bioinformática, Modelagem e Simulação de Biossistemas, PPGCC, Faculdade de Informática, PUCRS, Av. Ipiranga, 6681 - Prédio 32, sala 602, 90619-900, Porto Alegre, RS, Brasil
BMC Genomics 2013, 14(Suppl 6):S6 doi:10.1186/1471-2164-14-S6-S6
Published: 25 October 2013
Data preprocessing is a major step in data mining. During preprocessing, known techniques can be applied, or new ones developed, to improve data quality so that the mining results become more accurate and intelligible. Bioinformatics is one area with a high demand for the generation of comprehensive models from large datasets. In this article, we propose a context-based data preprocessing approach for mining data from molecular docking simulation results. The test cases used a fully-flexible receptor (FFR) model of the Mycobacterium tuberculosis InhA enzyme (FFR_InhA) and four different ligands.
We generated an initial set of attributes and their respective instances. To improve this initial set, we applied two selection strategies. The first was based on our context-based approach, while the second used the CFS (Correlation-based Feature Selection) machine learning algorithm. Additionally, we produced an extra dataset containing the features selected by combining our context strategy with the CFS algorithm. To demonstrate the effectiveness of the proposed method, we evaluated its performance using predictive measures (RMSE, MAE, Correlation, and Nodes) and context measures (Precision, Recall, and FScore).
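As a rough illustration of the measures named above, the following Python sketch computes RMSE, MAE, and the Pearson correlation for a list of predicted values, and Precision, Recall, and FScore for a selected feature set. It is a minimal sketch using the standard textbook definitions, not the authors' implementation. Treating the context measures as the overlap between the selected features and a set of features considered relevant is an assumption, and all function names and feature names are hypothetical. The Nodes measure is not covered.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def precision_recall_fscore(selected, relevant):
    """Set-overlap precision, recall, and F-score.

    Assumption: `relevant` is a hypothetical reference set of features
    judged important in context; `selected` is what a strategy kept.
    """
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the study.
    actual = [1.0, 2.0, 3.0]
    predicted = [1.0, 2.0, 5.0]
    print(rmse(actual, predicted), mae(actual, predicted))
    print(precision_recall_fscore(["f1", "f2", "f3", "f4"], ["f1", "f2", "f5"]))
```

Lower RMSE and MAE and a higher correlation indicate better predictive fit, while higher Precision, Recall, and FScore indicate closer agreement between the selected features and the reference set.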
Statistical analysis of the results shows that the proposed context-based data preprocessing approach significantly improves both predictive and context measures and outperforms the CFS algorithm. Context-based data preprocessing improves mining results by producing superior, interpretable models, which makes it well suited for practical applications in molecular docking simulations using FFR models.