Abstract
Background
Data preprocessing is a major step in data mining. In data preprocessing, several known techniques can be applied, or new ones developed, to improve data quality such that the mining results become more accurate and intelligible. Bioinformatics is one area with a high demand for generation of comprehensive models from large datasets. In this article, we propose a context-based data preprocessing approach to mine data from molecular docking simulation results. The test cases used a fully-flexible receptor (FFR) model of Mycobacterium tuberculosis InhA enzyme (FFR_InhA) and four different ligands.
Results
We generated an initial set of attributes as well as their respective instances. To improve this initial set, we applied two selection strategies. The first was based on our context-based approach while the second used the CFS (Correlation-based Feature Selection) machine learning algorithm. Additionally, we produced an extra dataset containing features selected by combining our context strategy and the CFS algorithm. To demonstrate the effectiveness of the proposed method, we evaluated its performance based on various predictive (RMSE, MAE, Correlation, and Nodes) and context (Precision, Recall and F-score) measures.
Conclusions
Statistical analysis of the results shows that the proposed context-based data preprocessing approach significantly improves predictive and context measures and outperforms the CFS algorithm. Context-based data preprocessing improves mining results by producing superior interpretable models, which makes it well-suited for practical applications in molecular docking simulations using FFR models.
Background
Data preprocessing is a major step in data mining. Although time-consuming, it improves data quality so that the data can be properly mined, thus producing more accurate, interpretable, and applicable models. Many techniques can be applied to data preprocessing [1], including data cleaning, data integration, and data transformation. In predictive machine learning problems, there is an input x and an output y; the task is to learn how to map the input to the output. Such a mapping can be defined as a function y = g(x|θ) where g(.) is the model and θ its parameters [2].
Although numerous algorithms are available for prediction, many of them only produce a predictive function that indicates to which target value the objects belong. However, in some data mining problems, it is necessary to have a better comprehension of the induced models. Decision trees are models well understood by users. Indeed, Freitas et al. [3] support the use of decision tree models, instead of black-box algorithms such as Support Vector Machine (SVM) or Neural Network models, to represent graphically the patterns revealed by data mining. According to the same authors [3], the hierarchical structure of the tree can emphasize the importance of the attributes used for prediction.
The incorporation of context-aware data preprocessing to improve mining results is an active area of research. Baralis et al. [4] developed CAS-Mine, a context-based framework to extract generalized association rules, providing a high-level abstraction of both user habits and service characteristics, depending on the context. Nam et al. [5] discuss how the context can help classify face images. Although these authors discuss the importance of considering the context in data mining applications and develop their work according to a context-aware definition, the context involved is intrinsically specific to each working background. Hence, their methodologies are not suitable to the molecular docking simulations context explored in this work.
There are many areas of application where a comprehensible model is fundamental to its proper use. In bioinformatics, a set of data and a set of data mining models alone may not be enough. The data and the results must represent the context in which they are embedded. Bioinformatics is a clear example of where we believe data preprocessing is instrumental. Our contribution is within the context of rational drug design (RDD). The interactions between biological macromolecules, called receptors, and small molecules, called ligands, constitute the fundamental principle of RDD. In-silico molecular docking simulations, an important phase of RDD, investigate the best binding pose and conformation of a ligand in a receptor. The best ligands are tested by in-vitro and/or in-vivo experiments. If the results are promising, a new drug candidate can be produced [6].
Proper data preprocessing may induce decision-tree models that are able to identify important features of the receptor-ligand interactions from molecular docking simulations. In the present work, we propose and apply a predictive regression decision-tree on the context-based preprocessed data from docking results and show that bioinformaticians can easily understand, explore, and apply the induced models. We apply four preprocessing techniques. Firstly, we produce and arrange all attributes based on the domain knowledge. Secondly, still based on a context domain, we improve the input by selecting two appropriate features. Thirdly, we apply a conventional machine learning feature selection to the initial set of attributes. Finally, we combine the feature selection generated using the first and second strategies with those from the third strategy. We assess the results for the model's accuracy and interpretability. Then, we demonstrate how careful and value-added data preprocessing can produce more effective models.
Methods
The molecular docking context
Interaction between drug candidates (ligands) and target proteins (receptors), through molecular docking simulations, is the computational basis of RDD. Given a receptor, molecular docking simulations sample a large number of orientations and conformations of a ligand inside its binding site. The simulations also evaluate the Free Energy of Binding (FEB) and rank the orientations/conformations according to their FEB scores [7]. The majority of molecular docking algorithms only consider the ligand as flexible, whereas the receptor remains rigid, due to the computational cost involved in considering the receptor's explicit flexibility. However, biological macromolecules, like protein receptors, are intrinsically flexible in their cellular environment. The receptor may modify its shape upon ligand binding, moulding itself to be complementary to the ligand [8]. This increases favourable contacts and reduces adverse interactions, which, in turn, minimizes the total FEB [9]. Therefore, it is important to consider the receptor's explicit flexibility in molecular docking simulations.
In this work, we model the full receptor explicit flexibility in the molecular docking simulations [10] using a set of different conformations for the receptor, generated by molecular dynamics (MD) simulations [11]. This type of representation, named a fully-flexible receptor (FFR) model [10], requires executing a large number of docking simulations and produces voluminous results to be analysed. Indeed, one of the current major challenges in bioinformatics is how to handle large amounts of data [12], or big data [13].
Data modelling and acquisition
The InhA enzyme from Mycobacterium tuberculosis (Mtb) [14] is the target protein receptor in this work. It contains 268 amino acid residues and 4,008 atoms. The 3D structure (PDB ID: 1ENY) of the crystal, rigid receptor [14], was retrieved from the Protein Data Bank [15]. The FFR model of InhA (FFR_InhA) contains 3,100 snapshots from a 3.1 ns MD simulation [11]. Machado et al. [10] performed molecular docking simulations of FFR_InhA against each of the four different ligands: TCL [16], PIF [17], ETH [18] and NADH [14].
All docking results and snapshots of the FFR_InhA model were stored in a dedicated repository [19]. We developed this repository to integrate FFR models and docking results, allowing users to query the database from different points of view [20]. In fact, queries can traverse relationships between receptors and ligands' atoms and vice-versa, including their conformations and 3D coordinates. This repository enables us to produce effective inputs to use in different data mining tasks with their corresponding algorithms.
Attributes arrangements
A major objective of this work is to reduce the number of snapshots used as input in docking simulations of an FFR model against a given ligand. In this sense, by mining the data from the FFR model and its docking results, we expect to select a subset of all available receptor conformations that are most relevant and capable of indicating whether a given ligand is a promising compound. Machado et al. [21][22] demonstrated how data mining can address this question. Winck et al. [23] obtained encouraging results by applying a context-based preprocessing to data mining of biological text. Hence, we focus our efforts on context-based data preprocessing. In our database [19] there are many available features. Choosing the most important ones directly impacts the choice of the proper data mining algorithm. A predictive data mining task is defined by the target attribute [24]. In the following sections we define the target and predictive attributes of the domain-specific knowledge of this work.
Target attribute definition
One way to evaluate a molecular docking simulation with AutoDock3.0.5 [25] is by examining the values of the resulting free energy of binding (FEB): the most negative FEB values generally indicate the best receptor-ligand binding affinity. AutoDock3.0.5 predicts the bound conformations of a ligand to a receptor. It combines an algorithm of conformation search with a rapid grid-based method of energy evaluation [25]. The AutoGrid module of AutoDock3.0.5 pre-calculates a 3D energy-based grid of interactions for various atom types. Figure 1 shows an example of the grid box used in this work.
Figure 1. 3D-Grid considering the InhA receptor and the PIF ligand. This 3D-Grid measures 60.0 Å along each of the x, y and z axes. The distance between neighbouring grid points is 0.375 Å.
We adopt the FEB as our target attribute because it discriminates docking results. There is no consensus about what is the reasonable range of FEB values. Each ligand has to be considered and evaluated individually. Analysis of FEB values from the docking simulations of the FFR_InhA with the four ligands produced different ranges of minimum, maximum and average FEB values (Table 1).
Table 1. Range of FEB (kcal/mol) values for each ligand considered.
Analysis of Table 1 shows that the difference between the lowest and highest values is very subtle. Although there is an absolute difference between these extreme values (for instance, for ETH it is 2.95 kcal/mol), many instances differ only in their decimal places, and even a small difference between two FEB values (for instance, -6.71 and -6.03 kcal/mol for ETH) can be significant. In previous work, Machado et al. [26][27], using the same four ligands, discretized the FEB values using three different procedures: by equal frequency, by equal width, and by an original method based on the mode and standard deviation of FEB values. The authors split the FEB into five classes: Excellent, Good, Regular, Bad, and Very Bad. This preprocessing step generated the input data upon which the J48 decision tree algorithm was executed. The resulting performance measures showed that discretization by equal frequency is not satisfactory. Discretization by equal width was evaluated well for only two of the four ligands [27]. In these cases, J48 did not generate legible trees. Discretization by the mode and standard deviation, however, had better performance measures for two ligands and produced more legible decision trees for all four ligands [27]. Although the J48 algorithm produced encouraging results, we found it challenging to discretize FEB values whose differences were particularly small. For instance, it was difficult to decide whether a FEB value of -8.10 kcal/mol is Good or Regular since the difference to the next FEB value was only 0.10 kcal/mol. Because of the significance of the decimal values, we may lose important information when applying this discretization to FEB values. Therefore, the FEB is taken as a real value, which implies the use of a regression predictive task of data mining.
Predictive attributes definition
According to Jeffrey [28] and da Silveira et al. [29], meaningful contact between two atoms can be established at a distance as large as 4.0 Å. In molecular docking, the FEB value is dependent on the shortest distance between atoms of the receptor's residues and ligands. This is because receptor-ligand atom pairs within 4.0 Å engage in favourable hydrogen bonds (HB) and hydrophobic contacts (HP) [28]. Hence, for each receptor (R) residue, we calculate the Euclidean distance (ED) between its atoms and the atoms of the ligand (L). We define min(Dist_{R,L}) as the predictive attribute representing the shortest distance between the ligand and the receptor's residues. Thus, min(Dist_{R,L}) with a 4.0 Å threshold indicates the presence of receptor-ligand favourable contacts (HBs and HPs). Only min(Dist_{R,L}) is recovered from the repository [19]. If we used all receptor-ligand distances, the input file would have an enormous number of attributes: for the PIF ligand, which has 24 atoms, the entry would have more than 96,000 attributes (4,008 receptor atoms × 24 ligand atoms = 96,192). This number of predictive attributes would generate model trees with huge numbers of nodes, which would therefore not be interpretable. Using min(Dist_{R,L}) instead, each of the 3,100 snapshots of the FFR_InhA has 268 attributes, one per residue. We repeat the same procedure for all four ligands. In the end, we have one preprocessed input for each of the four ligands.
Data preprocessing strategies
Our database stores the FFR_InhA model, which contains 3,100 snapshots (Sn), each with 4,008 atoms (AtR). This totals Sn × AtR = 12,424,800 receptor coordinates (CoordR). Because each docking simulation is made of 10 runs, we obtain 31,000 docking results for each ligand. However, some docking simulation runs did not converge or had positive FEB values. This occurs when the number of runs and the number of cycles defined as parameters of the algorithm are not enough to find a good position to bind the ligand into the receptor. The docking simulations were performed using the Simulated Annealing (SA) algorithm, which explores conformations using the Monte Carlo approach. Since a random movement is applied inside the binding site at each step of the execution, the ligand sometimes remains in a non-favourable position for the established number of runs. If this happens during many runs, the docking result does not converge, that is, it does not present any interaction position/energy at the end of the execution of a given experiment. We considered these data as outliers and did not include them in the preprocessing step. We also defined the parameter ValDoc as the total number of valid docking simulations per ligand. Since AtLig is the number of atoms of each ligand, the sum of the products AtLig × ValDoc over all four ligands determines the total number of ligand coordinates (LigCoord). In summary, we have:
• CoordR = 3,100 × 4,008 = 12,424,800 records
• LigCoord_{NADH} = 52 × 11,284 = 586,768 records
• LigCoord_{TCL} = 18 × 28,370 = 510,660 records
• LigCoord_{PIF} = 24 × 30,420 = 730,080 records
• LigCoord_{ETH} = 13 × 30,430 = 395,590 records
• LigCoord = 586,768 + 510,660 + 730,080 + 395,590 = 2,223,098 records
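The record counts listed above can be re-derived with a few lines of Python. This is only an arithmetic check using the figures quoted in the text; the variable names are illustrative and do not come from the authors' repository.

```python
# Figures quoted in the text; names are illustrative only.
SNAPSHOTS = 3_100        # snapshots in the FFR_InhA model (Sn)
RECEPTOR_ATOMS = 4_008   # atoms per snapshot (AtR)

# ligand -> (atoms per ligand AtLig, valid docking results ValDoc)
LIGANDS = {
    "NADH": (52, 11_284),
    "TCL": (18, 28_370),
    "PIF": (24, 30_420),
    "ETH": (13, 30_430),
}

# Receptor coordinates: one record per atom per snapshot.
coord_r = SNAPSHOTS * RECEPTOR_ATOMS

# Ligand coordinates: one record per ligand atom per valid docking result.
lig_coord = {name: atoms * valid for name, (atoms, valid) in LIGANDS.items()}
lig_coord_total = sum(lig_coord.values())

print(f"CoordR = {coord_r:,}")
for name, records in lig_coord.items():
    print(f"LigCoord_{name} = {records:,}")
print(f"LigCoord = {lig_coord_total:,}")
```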
Data generation
To generate an initial dataset we need to combine the 12,424,800 CoordR and the 2,223,098 LigCoord, calculate their interactions, and find their respective min(Dist_{R,L}). For that, we developed the Dataset algorithm. It executes the first preprocessing step by handling the input data and by producing the best receptor-ligand interactions stored in an output file: the [Input] matrix. [Input] contains ValDoc lines and 269 columns. The first 268 columns contain the min(Dist_{R,L}) of the 268 receptor residues. To generate a proper dataset for data mining, we added the target attribute, which is the corresponding FEB value, as the last column. It is important to emphasize that, at this stage, min(Dist_{R,L}) can have any positive value.
Dataset Algorithm
Let R be a receptor
Let L be a ligand
Let t be a snapshot of R
Let r be a residue of R
Let a be an atom of r in snapshot t
Let l be an atom in L
Let Dist be the distance between L and R atoms in t
Let DistanceMatrix be a matrix where each line corresponds to an atom a of residue r and each cell corresponds to the distance between a and l
Let Result be a matrix that stores, for each snapshot t and each residue r, the minimum distance between a and l
Let Input be a matrix containing Result and, for each t, its respective FEB value
FOR each t in Total_Snapshots_{R}
    [Result_{t,*}] ← null
    FOR each r in Total_Residues_{R}
        [DistanceMatrix_{*,*}] ← null
        FOR each a in Total_Atoms_Residue_Snapshot_{R,t}
            FOR each l in Total_Atoms_Ligand_{L}
                Dist_{Ra,Ll} ← ED(a, l)
                [DistanceMatrix_{a,l}] ← Dist_{Ra,Ll}
            ENDFOR
        ENDFOR
        [Result_{t,r}] ← min([DistanceMatrix_{*,*}])
    ENDFOR
    [Input_{t,*}] ← [Result_{t,*} + FEB_{L}]
ENDFOR
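The Dataset Algorithm can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each residue is available as a list of (x, y, z) atom coordinates and each docked ligand as a list of (x, y, z) coordinates, whereas the original work retrieves these from a database. The function names and the toy coordinates and FEB value are hypothetical.

```python
import math

def min_residue_distances(residues, ligand_atoms):
    """Return, for each residue, the shortest receptor-ligand atom distance.

    residues: list of residues, each a list of (x, y, z) atom coordinates.
    ligand_atoms: list of (x, y, z) coordinates of the docked ligand.
    """
    return [
        min(math.dist(a, l) for a in residue for l in ligand_atoms)
        for residue in residues
    ]

def build_input(docking_results):
    """Build the [Input] matrix, one row per valid docking result.

    docking_results: iterable of (residues, ligand_atoms, feb) triples.
    Each row holds the per-residue minimum distances followed by the FEB.
    """
    return [
        min_residue_distances(residues, ligand_atoms) + [feb]
        for residues, ligand_atoms, feb in docking_results
    ]

# Hypothetical toy data: one snapshot with two residues and a two-atom ligand.
toy_residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # residue 1 (two atoms)
    [(10.0, 0.0, 0.0)],                  # residue 2 (one atom)
]
toy_ligand = [(0.0, 3.0, 0.0), (4.0, 0.0, 0.0)]
toy_input = build_input([(toy_residues, toy_ligand, -7.5)])
print(toy_input)
```

For the real data each row would carry 268 distances plus the FEB, giving the 269 columns described in the text.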
Dataset improvement
The initial dataset generated by the Dataset Algorithm contains 268 predictive attributes and a target attribute. To help improve the models induced by the data mining task, we must further reduce the number of features. Jeffrey [28] states that the largest distance that allows a meaningful contact between receptor and ligand atoms is 4.0 Å. The dataset produced by the Dataset Algorithm, however, includes distances larger than 4.0 Å, which means that the corresponding receptor residue does not establish a favourable contact with any of the ligand atoms [29]. If a residue makes no contact in any docking result, it is improbable that this attribute can adequately predict the FEB value. Therefore, we removed all attributes (residues) whose shortest distance over all docking results is above the 4.0 Å threshold. The ContextFS Algorithm generates a new input from the [Input] matrix produced by the Dataset Algorithm. To compare our context-based feature selection with a well-known machine learning feature selection algorithm, we generated one more dataset seeking to improve the initial one produced by the Dataset Algorithm. We believe that a subset of representative attributes can further improve the mining results.
ContextFS Algorithm
Let R be a receptor
Let t be a snapshot of R
Let r be a residue of R
Let Input be the matrix produced by the Dataset Algorithm
Let InputFS be the result of our context-based feature selection
FOR each r in Total_Residues_{R}
    IF min([Input_{*,r}]) ≤ 4
        FOR each t in Input
            [InputFS_{t,r}] ← [Input_{t,r}]
        ENDFOR
    ENDIF
ENDFOR
FOR each t in InputFS
    [InputFS_{t,FEB}] ← [Input_{t,Total_Residues_{R}+1}]
ENDFOR
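A minimal Python sketch of the ContextFS step is given below. It assumes the [Input] matrix is held as a list of rows whose last element is the FEB; the function name, the returned index list and the toy values are illustrative assumptions, not part of the original work.

```python
CONTACT_THRESHOLD = 4.0  # angstroms, the contact limit cited in the text

def context_fs(input_matrix, threshold=CONTACT_THRESHOLD):
    """Keep residue columns whose minimum distance over all rows is <= threshold.

    input_matrix: rows of [d_1, ..., d_n, feb]; the last column is the target.
    Returns (kept_column_indices, reduced_matrix).
    """
    n_residues = len(input_matrix[0]) - 1
    kept = [
        r for r in range(n_residues)
        if min(row[r] for row in input_matrix) <= threshold
    ]
    # Copy the surviving distance columns and re-append the FEB target.
    reduced = [[row[r] for r in kept] + [row[-1]] for row in input_matrix]
    return kept, reduced

# Hypothetical toy matrix: three residue columns plus the FEB column.
toy = [
    [3.2, 8.1, 4.0, -7.9],
    [5.0, 6.4, 4.6, -6.3],
]
kept, reduced = context_fs(toy)
print(kept)
print(reduced)
```

In this toy case the second residue never comes within 4.0 Å of the ligand, so its column is dropped, while the other two columns and the target are retained.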
Only a limited number of the existing feature selection algorithms can work effectively on regression predictive tasks. Among these, the Correlation-based Feature Selection (CFS) [30] algorithm implemented in Weka [24] can perform feature selection on our datasets. Therefore, we applied CFS to each input generated by the Dataset Algorithm, with a different input for each of the four ligands. CFS is based on a filter approach that ranks feature subsets according to a correlation-based heuristic evaluation function (Equation 1). It looks for a subset that contains features uncorrelated with each other, but highly correlated with the target attribute.
\[ M_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}} \qquad (1) \]
where: M_{S} is the heuristic merit of a subset S that contains k features; \bar{r}_{cf} is the mean feature-target correlation (f ∈ S) and \bar{r}_{ff} is the average feature-feature intercorrelation. Equation 1 forms the core of CFS [30]. Table 2 shows the number of attributes selected after applying our feature selection methodology to the original dataset. Additionally, we generated one more dataset (Table 2, fourth column) which combines the features selected by the ContextFS Algorithm with those selected by CFS [30].
Table 2. Number of attributes selected after applying feature selection approaches.
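To make the CFS heuristic more concrete, the sketch below evaluates the merit of a feature subset from its size and its two average correlations. It illustrates only the evaluation function; the subset search that Weka performs around it is not reproduced, and the function name and the correlation values are hypothetical.

```python
import math

def cfs_merit(k, mean_feature_target_corr, mean_feature_feature_corr):
    """CFS merit of a subset with k features.

    mean_feature_target_corr: average correlation between features and target.
    mean_feature_feature_corr: average intercorrelation among the features.
    """
    numerator = k * mean_feature_target_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Hypothetical values: four features, each correlated 0.5 with the target.
independent = cfs_merit(4, 0.5, 0.0)  # features uncorrelated with each other
redundant = cfs_merit(4, 0.5, 1.0)    # features fully redundant
print(independent, redundant)
```

With equal feature-target correlation, the subset of mutually uncorrelated features scores higher than the redundant one, which is the behaviour described in the text.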
Mining and evaluating the preprocessed data
Regression is a data mining task suitable to problems for which the attribute to be predicted is continuous. Since our target attribute is numeric, regression is the technique applied to the mining experiments in this study. Our models must be understandable and must also represent well the context in which they are inserted. Decision trees are algorithms that cover these needs and can be applied to both classification and regression problems, that is, to predict both discrete and continuous values. The results are regression or classification models arranged in a tree structure. For continuous values, there are two main types of trees: regression trees and model trees. In regression trees, each leaf node stores a continuous-valued prediction, which is the average of the target attribute for the training tuples that reach that leaf. In model trees, each leaf stores a regression model called Linear Model (LM), which is a multivariate linear equation for the target attribute [1]. Our goal is to induce models that use residue distances to predict a given FEB value. We expect our model to help us discover whether a snapshot, when docked to a given ligand, will lead to favourable estimated FEB values. For this, we use the M5P [31] machine learning model tree algorithm.
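The way a model tree produces a prediction can be illustrated with a small Python sketch. It does not implement M5P induction; it only shows how an already induced tree is traversed and how the linear model at the reached leaf is evaluated. The tree layout, the attribute names (res_A, res_B) and all coefficients are hypothetical.

```python
def predict(node, sample):
    """Walk a model tree to a leaf and evaluate that leaf's linear model.

    Non-leaf nodes carry a "split" (attribute, threshold) and two children;
    leaves carry an "intercept" and per-attribute "weights".
    """
    while "split" in node:
        attribute, threshold = node["split"]
        node = node["low"] if sample[attribute] <= threshold else node["high"]
    return node["intercept"] + sum(
        weight * sample[attribute]
        for attribute, weight in node["weights"].items()
    )

# Hypothetical tree: one split on a residue distance, one LM per leaf.
toy_tree = {
    "split": ("res_A", 4.0),
    "low": {"intercept": -9.0, "weights": {"res_B": 0.2}},
    "high": {"intercept": -6.0, "weights": {"res_B": 0.1}},
}
close_contact = predict(toy_tree, {"res_A": 3.5, "res_B": 5.0})
no_contact = predict(toy_tree, {"res_A": 6.0, "res_B": 5.0})
print(close_contact, no_contact)
```

A regression tree would differ only at the leaves, which would return a stored average instead of evaluating a linear equation.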
Evaluation of the induced models
Several measures can verify whether the induced models produce acceptable numerical predictions. They are called predictive measures. In the case of model tree algorithms, the most widespread measures are: root mean-squared error (RMSE, Equation 2), mean absolute error (MAE, Equation 3), and correlation coefficient [24]. Smaller values of RMSE and MAE indicate better models. All of these measures make use of the predicted values p_{1} . . . p_{n} and the actual values a_{1} . . . a_{n}.
\[ RMSE = \sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}} \qquad (2) \]
\[ MAE = \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n} \qquad (3) \]
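The correlation coefficient (Equation 4) measures the statistical correlation between a and p. The values range from 1, for perfectly correlated results, through 0, when there is no correlation, to -1, for a perfect inverse correlation. We look for perfectly correlated results or correlation coefficients closer to 1.
\[ Correlation = \frac{S_{PA}}{\sqrt{S_P\,S_A}} \qquad (4) \]
where
\[ S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n - 1}, \quad S_P = \frac{\sum_i (p_i - \bar{p})^2}{n - 1} \quad \text{and} \quad S_A = \frac{\sum_i (a_i - \bar{a})^2}{n - 1}, \]
with \bar{a} and \bar{p} being the corresponding a and p averages.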
In addition to these measures, some investigations also make use of a model interpretability metric, which is the number of nodes in the model tree. The model tree with the fewest nodes is considered the most interpretable [32].
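The three predictive measures described in this section can be computed as in the following Python sketch, which follows their standard definitions over predicted values p and actual values a. The function names and the toy FEB values are illustrative assumptions; in the original work these figures are reported by Weka.

```python
import math

def rmse(predicted, actual):
    """Root mean-squared error between predicted and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Mean absolute error between predicted and actual values."""
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n

def correlation(predicted, actual):
    """Correlation coefficient S_PA / sqrt(S_P * S_A)."""
    n = len(actual)
    p_bar = sum(predicted) / n
    a_bar = sum(actual) / n
    s_pa = sum((p - p_bar) * (a - a_bar)
               for p, a in zip(predicted, actual)) / (n - 1)
    s_p = sum((p - p_bar) ** 2 for p in predicted) / (n - 1)
    s_a = sum((a - a_bar) ** 2 for a in actual) / (n - 1)
    return s_pa / math.sqrt(s_p * s_a)

# Hypothetical FEB values in kcal/mol.
toy_actual = [-8.0, -7.0, -6.0]
toy_predicted = [-7.5, -7.0, -6.5]
print(rmse(toy_predicted, toy_actual))
print(mae(toy_predicted, toy_actual))
print(correlation(toy_predicted, toy_actual))
```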
Evaluation based on the context
The measures shown in the previous section were used during the evaluation of the models generated. However, as we are interested in the usefulness of the induced models, we propose a new context-based measure. We also analyze the induced model trees and their contents. Figure 2 shows a model tree generated upon application of our context-based preprocessing (ContextFS Algorithm) to NADH. This model contains five non-leaf nodes, each representing a selected amino acid residue, and six LMs. Equation 5 depicts the sixth LM (LM6) composed of a selected number of predictive attributes (receptor residues) weighted by their effect in the target attribute (FEB) plus a constant value.
Figure 2. Model Tree generated by the M5P Algorithm.
We evaluate the models taking into account the receptor residues present in both the non-leaf nodes and the LMs, bearing in mind that the docking software calculates the FEB value only for the residues within the grid box around the receptor binding site (Figure 1). Consequently, if we are inducing model trees to predict FEB values, models that consider residues located outside the grid box have no direct significance. Usually, a specialist defines which residues belong to the receptor active site. These residues shape the active site for the complementary ligand binding. For InhA, the specialist selected 52 residues, here denoted by ESR. Subsequently, by inspecting each model, we identified which model's residues (MR) appear in the tree or the LMs (Figure 2 and Equation 5). Now we are able to evaluate MR and compare it with ESR by calculating the Precision (Equation 6), Recall (Equation 7) and F-score (Equation 8) measures [1].
\[ Precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|} \qquad (6) \]
\[ Recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|} \qquad (7) \]
\[ F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (8) \]
In the context of this analysis:
• {Relevant} ∩ {Retrieved} can be defined as ESR ∩ MR
• {Relevant} can be defined as ESR
• {Retrieved} can be defined as MR
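With ESR and MR represented as sets, the context measures reduce to a few set operations. The Python sketch below is illustrative only: the function name is an assumption and the residue labels are placeholders, not the residues actually selected by the specialist or found in the models.

```python
def context_measures(esr, mr):
    """Return (precision, recall, f_score) of model residues MR against ESR."""
    esr, mr = set(esr), set(mr)
    hits = len(esr & mr)  # |ESR ∩ MR|
    precision = hits / len(mr) if mr else 0.0
    recall = hits / len(esr) if esr else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Placeholder residue labels (hypothetical).
toy_esr = {"res_A", "res_B", "res_C", "res_D"}  # specialist-selected residues
toy_mr = {"res_B", "res_C", "res_X"}            # residues used by a model
precision, recall, f_score = context_measures(toy_esr, toy_mr)
print(precision, recall, f_score)
```

Here two of the three model residues belong to ESR, and two of the four ESR residues are recovered by the model.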
Results
We evaluated the models by means of the predictive and context measures presented. The measures were applied separately to each of the four distinct ligands: NADH, PIF, TCL, and ETH. For each one of them, we observed the four data preprocessing strategies:
1. The results obtained by the initial dataset, generated by Dataset Algorithm;
2. The results obtained by the context-based feature selection, generated by ContextFS Algorithm;
3. The results obtained by the feature selection generated by CFS [30];
4. The results obtained by combining the features selected by ContextFS Algorithm and by CFS (Table 2, fourth column).
The initial dataset was the first and, possibly, the most important context-based data preprocessing. Without previous knowledge about the context, it would not be possible to generate an input that produces interpretable models as we expected. Based on the fact that the initial dataset was constructed considering minimum distances (min(Dist_{R,L})), our hypothesis is that the context-based data preprocessing we proposed, including feature selection, produces better results than using a well-established feature selection approach in which the context is not observed. Hence, we expected that the results from the third strategy would not be better than the others. On the other hand, we expected the context-based feature selection (second strategy) to give better results than the others. The second strategy was applied considering both the context already employed in the initial dataset and the context used to select appropriate features. To evaluate the results in terms of their statistical significance, we applied the Friedman Test [33] with a significance level of α = 0.05.
Table 3. Model evaluation predictive measures.
In Table 3 we evaluate whether strategy 3 is significantly worse than the others. We obtained p = 0.040 for MAE and p = 0.054 for RMSE. At α = 0.05 only the MAE result is significant, while the RMSE result falls just short of the threshold, which suggests that strategy 3 is probably worse than strategies 1, 2 and 4. However, more effort is needed to confirm this. With respect to the context measures (Table 4), we evaluate whether strategy 2 is significantly better than the other ones. In this case, we can assert that it is, because we obtained p = 0.014. We infer that our feature selection approach improves the initial results. Therefore, since we are interested in the quality of the induced models, our context-based measure can be considered the most appropriate.
Table 4. Model evaluation context measures.
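For readers unfamiliar with the Friedman test, the sketch below computes its test statistic for a table of scores with one row per block (for example, one ligand) and one column per strategy. It is a simplified illustration that assumes no ties within a row, and the scores are hypothetical, not values from Tables 3 or 4. A p-value would then be obtained from the chi-square distribution with k - 1 degrees of freedom, or from exact tables when the number of blocks is small.

```python
def friedman_statistic(blocks):
    """Friedman chi-square statistic for n blocks and k treatments.

    blocks: list of rows, each with one score per treatment (no ties assumed).
    Within each row the treatments are ranked from 1 (lowest) to k (highest).
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)

# Hypothetical error scores: rows are ligands, columns are strategies 1 to 4.
toy_scores = [
    [0.52, 0.48, 0.61, 0.55],
    [0.40, 0.37, 0.49, 0.44],
    [0.63, 0.60, 0.70, 0.66],
    [0.35, 0.31, 0.42, 0.38],
]
statistic = friedman_statistic(toy_scores)
print(statistic)
```

In this toy table every ligand ranks the strategies in the same order, so the statistic reaches its maximum value n(k - 1) = 12.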
It is noticeable in Tables 3 and 4 that the results are different for each ligand, despite employing the same strategy in the preprocessing. This is so because different ligands have different sizes, as well as different molecular interaction properties. They bind in different regions of the receptor's binding site. As a result, the target attribute FEB has different ranges of values for the distinct ligands (Table 1) and that is why the models are induced for individual ligands. Although they are not interchangeable, we expect them to be used to select ligands that belong to a similar class (high molecular similarity).
Conclusions
Data preprocessing is a significant step in data mining. In data preprocessing, different techniques are applied to improve data quality such that the mining results are more accurate and more interpretable. There are many techniques available to preprocess data, mainly for model quality measures. However, some applications, like bioinformatics, often demand well-suited models. Hence, when the data mining process is based on the context involved, a context-based preprocessing can improve the quality of the induced models.
In this article, we presented a case of mining data from flexible-receptor molecular docking simulation results. Here the goal was to identify features that could characterize the best fit of ligands into a given receptor. Our experiments were conducted considering the InhA receptor from M. tuberculosis and four distinct ligands: NADH, PIF, TCL, and ETH. We showed that an appropriate context-based data preprocessing could provide improved results.
We concentrated on four main preprocessing steps which: 1) consider the context to choose an initial set of attributes and the proper instances for each ligand input file; 2) perform feature selection on the initial dataset, taking into account the characteristics of the docking results from each ligand; 3) perform feature selection, for each ligand, based on the CFS machine learning algorithm; and 4) combine features selected by our context-based approach (ContextFS Algorithm) and those selected by the CFS algorithm. We hypothesized that mining the preprocessed data would provide better results, with respect to the original dataset, by using the second strategy.
We performed mining experiments using the M5P model tree algorithm implemented in Weka. The values of the RMSE error measure, as well as a context-based metric that considers the tree interpretability, suggested that we can obtain better results when using our feature selection approach (second strategy). Statistical analysis of the results, with the Friedman test, showed that our context-based approach significantly improves predictive measures while CFS worsens context measures. We concluded that data preprocessing, which considers the context involved, can improve the mining results and produce better interpretable models. As future studies, we plan to use the induced models, generated using the second strategy, to select the most promising subset of snapshots, out of a very large ensemble, for a given ligand.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ATW and KSM executed the preprocessing for the data mining experiments, performed all the data mining experiments, evaluated the model results and wrote the first draft of the article. DDAR helped to conceive the test cases and to evaluate the models. ONS helped to analyze the results and to write the final version of the article. All authors read and approved the final manuscript.
Authors' information
ATW current address:
LaCA - Laboratório de Computação Aplicada, Departamento de Computação Aplicada, Universidade Federal de Santa Maria (UFSM), Santa Maria, RS, Brasil.
KSM current address:
ComBiL, Grupo de Biologia Computacional, Centro de Ciências Computacionais, Universidade Federal do Rio Grande (FURG), Rio Grande, RS, Brasil.
Acknowledgements
We thank the reviewers for their comments and suggestions which helped to improve the manuscript. This work was supported in part by grants (305984/2012-8, 559917/2010-4, 551209/2010-0) from the Brazilian National Research and Development Council (CNPq) to ONS. ONS is a CNPq Research Fellow. ATW and KSM PhD scholarships were funded by CNPq and Brazilian Coordination of Improvement of Higher Education Personnel (CAPES). ATW and DDAR research missions to University of Newcastle were funded by European Commission FP7 Marie Curie IRSES grant, CILMI project.
Declarations
Publication of this article has been funded by the authors.
This article has been published as part of BMC Genomics Volume 14 Supplement 6, 2013: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S6.