Abstract
Background
This paper addresses the prediction of the free energy of binding of a drug candidate with the enzyme InhA associated with Mycobacterium tuberculosis. This problem arises in rational drug design, where interactions between drug candidates and target proteins are verified through molecular docking simulations. In this application, it is important not only to correctly predict the free energy of binding, but also to provide a comprehensible model that can be validated by a domain specialist. Decision-tree induction algorithms have been successfully used in drug-design related applications, especially considering that decision trees are simple to understand, interpret, and validate. There are several decision-tree induction algorithms available for general use, but each one has a bias that makes it more suitable for a particular data distribution. In this article, we propose and investigate the automatic design of decision-tree induction algorithms tailored to particular drug-enzyme binding data sets. We investigate the performance of our new method for evaluating binding conformations of different drug candidates to InhA, and we analyze our findings with respect to decision-tree accuracy, comprehensibility, and biological relevance.
Results
The empirical analysis indicates that our method is capable of automatically generating decision-tree induction algorithms that significantly outperform the traditional C4.5 algorithm with respect to both accuracy and comprehensibility. In addition, we provide a biological interpretation of the rules generated by our approach, reinforcing the importance of comprehensible predictive models in this particular bioinformatics application.
Conclusions
We conclude that automatically designing a decision-tree induction algorithm tailored to molecular docking data is a promising alternative for predicting the free energy of binding of a drug candidate with a flexible receptor.
Background
The pharmaceutical industry is under increasing pressure to continuously deliver new drugs to the market [1]. Since the costs involved in the development of new drugs have exceeded one billion dollars, Rational Drug Design (RDD) has become an emerging technology for cost reduction and fast development of new drugs [2].
Interaction between drug candidates (ligands) and target proteins (receptors) through molecular docking simulations is the computational basis of RDD. Given a receptor, molecular docking simulations sample a large number of orientations and conformations of a ligand inside the protein binding site. The simulations also evaluate the Free Energy of Binding (FEB) and rank the orientations/conformations according to their FEB scores [3].
Nowadays, the majority of molecular docking algorithms consider only the ligand as flexible, whereas the receptor remains rigid, owing to the computational cost of modeling its flexibility. However, biological macromolecules such as protein receptors are intrinsically flexible in their cellular environment: the receptor may modify its shape upon ligand binding, moulding itself to be complementary to its ligand. This increases favorable contacts and reduces adverse interactions, which in turn minimizes the total FEB [4]. Therefore, it is important to consider receptor flexibility during molecular docking.
Among all available methodologies to explicitly include receptor flexibility in molecular docking simulations, a possible alternative is to select a series of different conformations derived from a molecular dynamics (MD) simulation of the target receptor [5]. We name this type of receptor representation the fully flexible-receptor (FFR) model [6,7], and we investigate this methodology with the target receptor InhA, an enzyme from Mycobacterium tuberculosis (Mtb) [8], which was modeled as a set of 3,100 snapshots derived from a 3.1 ns MD simulation trajectory [9]. For that, we generated molecular docking data sets with data from docking simulations of FFR-InhA [10] with six different ligands: nicotinamide adenine dinucleotide (NADH) [8], triclosan (TCL) [11], pentacyano(isoniazid)ferrate(II) (PIF) [12], ethionamide (ETH) [13], isoniazid (INH) [14], and triclosan derivative 20 (JPM) [15]. Explicitly including receptor flexibility in docking simulations is computationally demanding and generates large amounts of data, which need to be analyzed and interpreted. The concept of molecular docking is illustrated in Figure 1.
Figure 1. Molecular docking simulation. Figure 1 is divided as follows: (A) shows an example of a docking simulation from InhA, where the protein in ribbon is depicted in gray, the ETH ligand in its initial position is highlighted in red, and the final position of ETH after a molecular docking experiment is highlighted in blue; (B) presents an example of the distances between the ETH ligand and the receptor residue GLY95 (Glycine 95).
Decision-tree induction algorithms have been successfully used in drug-design related applications [16-19]. One of the main advantages of these algorithms when compared to other machine learning techniques (e.g., SVMs and neural networks) is that decision trees are simple to understand, interpret, and validate. Thus, domain specialists (e.g., biologists, physicians, chemists) can easily verify whether the data present interesting patterns, increasing their confidence in making predictions and creating new hypotheses. Several decision-tree induction algorithms have been proposed for general use, but each has a bias that makes it more suitable for a particular data distribution. Hence, a common choice for decision-tree applications is to employ the state-of-the-art decision-tree induction algorithm C4.5 [20], regardless of the fact that it was not tailored to the biological domain of interest.
In this article, we investigate a new data mining approach that automatically generates decision-tree algorithms tailored to a specific domain. We employ this new approach for analyzing data from fully flexible-receptor molecular docking experiments, looking for receptor snapshots to which a particular ligand binds more favorably. With the models induced by these automatically-designed algorithms, we expect that the inferred knowledge will help us point out which of the conformations generated by the fully flexible-receptor model are more promising for future docking experiments. This, in turn, allows a reduction of the flexible-receptor model dimensionality and permits faster docking simulations of flexible receptors [6]. We analyze whether the decision trees generated by the automatically-designed algorithms have higher predictive accuracy and are more comprehensible than decision trees generated by the state-of-the-art decision-tree induction algorithm C4.5 [20]. In addition, we interpret and validate our findings with the help of a domain specialist.
Method
In this section, we describe the proposed method for automatically generating decision-tree induction algorithms tailored to flexible-receptor molecular docking data, namely the Hyper-heuristic Evolutionary Algorithm for Automatically Designing Decision-Tree algorithms (HEAD-DT) [21]. First, we briefly introduce decision trees and the importance of generating comprehensible models.
Decision trees background
Automatically generating rules in the form of decision trees has been a key active research topic in the development of data exploration techniques [22]. Disciplines such as engineering (pattern recognition), statistics, decision theory, and more recently artificial intelligence (machine learning) have a large number of works dedicated to the generation and application of decision trees.
Formally, a basic top-down decision-tree induction algorithm can be recursively defined in only two steps, in the so-called Hunt's algorithm. Let X_t be a set of training instances associated with node t and y = {y_1, y_2, …, y_k} be the set of class labels in a k-class problem [23]:
1) if all the instances from X_t belong to the same class y_t, then t is a leaf node labeled as y_t;
2) if X_t contains instances that belong to more than one class, an attribute test condition is selected to partition the instances into subsets. A child node is created for each outcome of the test, and the instances in X_t are distributed to the children based on the outcomes. Recursively apply the algorithm to each child.
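The two steps above can be sketched as a short recursive routine. This is a minimal illustration, not a production algorithm: `select_test` is a placeholder for whatever attribute-selection criterion a concrete algorithm plugs in, and the node layout (plain dictionaries) is an assumption of this sketch.

```python
from collections import Counter

def hunt(instances, attributes, select_test):
    """Minimal sketch of Hunt's recursive scheme.

    `instances` is a list of (feature_dict, label) pairs; `select_test`
    is a caller-supplied function returning (test, partitions), where
    `partitions` maps each test outcome to its subset of instances.
    """
    labels = [y for _, y in instances]
    # Step 1: a pure node becomes a leaf labeled with its single class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Step 2: choose a test, partition the data, and recurse on each
    # non-empty child subset.
    test, partitions = select_test(instances, attributes)
    children = {outcome: hunt(subset, attributes, select_test)
                for outcome, subset in partitions.items() if subset}
    return {"test": test, "children": children,
            "majority": Counter(labels).most_common(1)[0][0]}
```

As the text notes, this bare scheme only terminates cleanly on inconsistency-free data; real algorithms add stopping criteria and pruning on top of it.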
This simplified algorithm is the basis for all current top-down decision-tree induction algorithms. Nevertheless, its assumptions are too stringent for practical use. For instance, it would only work if every combination of attribute values is present in the training data, and if the training data is inconsistency-free (each combination has a unique class label). Hunt's algorithm has been improved in many ways. Its stopping criterion, for example, as expressed in step 1, requires all leaf nodes to be pure (i.e., containing only instances of the same class). In most practical cases, this constraint leads to enormous decision trees, which tend to suffer from overfitting. Possible solutions to overcome this problem are prematurely stopping the tree growth when a minimum level of impurity is reached, or performing a pruning step after the tree has been fully grown. Another design issue is how to properly select the split test to partition the instances into smaller subsets. In Hunt's original approach, a cost-driven function was responsible for partitioning the tree. Subsequent algorithms such as ID3 [24] and C4.5 [20] make use of information-theory-based functions for partitioning nodes into purer subsets. Finally, dealing with missing values is also a major design issue one has to face when developing a new decision-tree induction algorithm.
Alternatives to the top-down approach were proposed in the last decades, such as bottom-up induction [25], evolutionary induction [26-30], and ensembles of trees [31]. Notwithstanding, no strategy has been more successful in generating accurate and comprehensible decision trees with low computational effort than the greedy top-down induction strategy. Due to its popularity, a large number of approaches have been proposed for each of the design components of top-down decision-tree induction algorithms. Considering that the manual improvement of decision-tree design components has been carried out for the past 40 years, we believe that automatically designing decision-tree induction algorithms could provide a faster, less tedious, and equally effective strategy for improving decision-tree algorithms. Hence, we propose in this work to automatically generate new and effective decision-tree algorithms tailored to flexible-receptor molecular docking data.
We recall that decision-tree induction algorithms are widely used for knowledge discovery and pattern recognition tasks, due to their advantage of producing a comprehensible classification model. Notwithstanding, these algorithms are usually underestimated in bioinformatics, given that researchers tend to prefer methods such as support vector machines or neural networks [32-34]. These methods are usually very effective in terms of predictive accuracy, but they are “black box” methods, providing little biologically-meaningful explanation for their predictions and giving users few new insights about the data or the application domain [35]. In many bioinformatics applications, however, the discovered model should be interpreted and validated in the context of current biological knowledge.
HEAD-DT
HEAD-DT is a hyper-heuristic algorithm able to automatically design top-down decision-tree algorithms [21,36]. Hyper-heuristics can automatically generate new heuristics suited to a given problem or class of problems. This is carried out by combining, through an evolutionary algorithm, components or building blocks of human-designed heuristics [37]. HEAD-DT is a regular generational evolutionary algorithm in which individuals are collections of building blocks of decision-tree algorithms. Figure 2 illustrates the evolutionary scheme followed by HEAD-DT. Each individual is encoded as an integer string, and each gene has a different range of supported values. We divided the genes into four categories, representing the major building blocks (design components) of a decision-tree algorithm: split genes, stopping criteria genes, pruning genes, and missing values genes. We detail each category next.
Figure 2. HEAD-DT evolutionary scheme. Figure 2 presents the evolutionary scheme followed by HEAD-DT. A random initial population of individuals (decision-tree algorithms) is created and evaluated according to the performance of their corresponding trees in a meta-training set. Then, a selection procedure is responsible for choosing individuals that will undergo breeding operations. After a new population is complete, it is once again evaluated, and the process continues until a maximum number of generations is reached. The best decision-tree induction algorithm is then executed over a meta-test set, which estimates its performance on unseen data.
Split genes
These genes are used for selecting the attribute to split the data in the current node of the decision tree. A decision rule based on the selected attribute is thus generated, and the input data is filtered according to the outcomes of this rule. This process continues recursively. We used two genes to model the split component of a decision-tree algorithm. The first gene, with an integer value, indexes one of the 15 splitting criteria implemented: information gain [24], Gini index [38], mutual information [39], G statistics [40], Mantaras criterion [41], hypergeometric distribution [42], Chandra-Varghese criterion [43], DCSM [44], χ² [45], mean posterior improvement [46], normalized gain [47], orthogonal criterion [48], twoing [38], CAIR [49], and gain ratio [20]. The second gene, with a binary value, represents the split type of a decision-tree algorithm, indicating whether the splits of a decision tree will be necessarily binary or multi-edged. In a binary tree, every split has only two outcomes (edges). Thus, nominal attributes with many categories have to be divided into two subsets, each representing an aggregation over several categories. In a multi-edge tree, nominal attributes are divided according to their number of categories, i.e., one edge for each category. In both cases, numeric attributes always partition the tree into two subsets (att ≤ threshold, att > threshold).
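As an illustration, two of the listed criteria, information gain [24] and the Gini index [38], can be computed directly from class frequencies. This is a minimal sketch, not the HEAD-DT implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_sets):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_sets)
    return entropy(parent_labels) - weighted
```

A split that perfectly separates two equally frequent classes yields an information gain of 1 bit, while a split that leaves each child with the parent's class mixture yields a gain of 0.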
Stopping criteria genes
The second category of genes concerns the stopping criteria component of decision-tree induction algorithms. The top-down induction of a decision tree is recursive, and it continues until a stopping criterion is satisfied. We implemented the following stopping criteria:
1) Reaching class homogeneity — when all instances that reach a given node belong to the same class, there is no reason to split this node any further. This strategy can be combined with any of the following strategies;
2) Reaching the maximum tree depth — a parameter tree depth can be specified to avoid deep trees. We have fixed its range in the interval [2,10] levels;
3) Reaching the minimum number of instances for a non-terminal node — a parameter minimum number of instances for a non-terminal node can be specified to avoid (or at least alleviate) the data fragmentation problem in decision trees. Range: [1,20] instances;
4) Reaching the minimum percentage of instances for a non-terminal node — same as before, but instead of an absolute number of instances, we set the minimum percentage of instances. Its range is [1%,10%] of the total number of instances;
5) Reaching an accuracy threshold within a node — an accuracy threshold can be specified to stop the growth of the tree when the accuracy within a node (the proportion of the majority class) reaches a given value. Possible values are {70%,75%,80%,85%,90%,95%,99%}.
The first of the stopping criteria genes selects one of the five strategies for stopping the tree growth. The second gene holds a value within the range [0,100], which is dynamically mapped to the range of the corresponding strategy. For example, if the strategy selected by the first gene is reaching the maximum tree depth, the following mapping function is executed: result = (value mod 9) + 2. This function maps from [0,100] to [2,10], which is the range defined for this strategy.
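A sketch of this mapping is shown below. Only the maximum-tree-depth mapping is given explicitly in the text; the mappings for the other strategies are hypothetical, assuming the same modulo scheme over the ranges stated above.

```python
def map_stop_param(strategy, value):
    """Map the raw gene value in [0, 100] to the parameter range of the
    selected stopping strategy. Only 'max_depth' is from the text; the
    other mappings are assumed analogues."""
    if strategy == "max_depth":        # [2, 10] levels (from the text)
        return (value % 9) + 2
    if strategy == "min_instances":    # [1, 20] instances (assumed mapping)
        return (value % 20) + 1
    if strategy == "min_percentage":   # [1%, 10%] (assumed mapping)
        return (value % 10) + 1
    if strategy == "accuracy":         # one of seven thresholds (assumed mapping)
        return (70, 75, 80, 85, 90, 95, 99)[value % 7]
    return None                        # class homogeneity needs no parameter
```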
Pruning genes
Pruning is usually performed on decision trees for enhancing tree comprehensibility by reducing tree size while maintaining (or even improving) accuracy. We implemented the following well-known pruning strategies: i) reduced-error pruning; ii) pessimistic error pruning; iii) minimum error pruning; iv) cost-complexity pruning; and v) error-based pruning.
1) Reduced-error pruning (REP) is a conceptually simple strategy proposed by Quinlan [50]. It uses a pruning set to evaluate the goodness of a given subtree of T. The idea is to evaluate each non-terminal node t with regard to the classification error in the pruning set. If such an error decreases when we replace the subtree T^{(t)} rooted at t by a leaf node, then T^{(t)} must be pruned. Quinlan imposes a constraint: a node t cannot be pruned if it contains a subtree that yields a lower classification error in the pruning set. The practical consequence of this constraint is that REP should be performed in a bottom-up fashion.
2) Pessimistic error pruning (PEP) [50] uses the training set for both growing and pruning the tree. The apparent error rate is optimistically biased and cannot be used to decide whether pruning should be performed or not. Quinlan thus proposes adjusting the apparent error according to the continuity correction for the binomial distribution in order to provide a more realistic error rate. PEP is computed in a top-down fashion, and if a given node t is pruned, its descendants are not examined, which makes this pruning strategy efficient in terms of computational effort.
3) Minimum error pruning (MEP) [51] is a bottom-up approach that seeks to minimize the expected error rate for unseen cases. It uses an ad-hoc parameter m for controlling the level of pruning. Usually, the higher the value of m, the more severe the pruning. Cestnik and Bratko [51] suggest that a domain expert should set m according to the level of noise in the data. Alternatively, a set of trees pruned with different values of m could be offered to the domain expert, so he/she can choose the best one according to his/her experience.
4) Cost-complexity pruning (CCP) is the post-pruning strategy of the CART system [38]. It consists of two steps: (i) generate a sequence of increasingly smaller trees, beginning with T and ending with the root node of T, by successively pruning the subtree yielding the lowest cost-complexity, in a bottom-up fashion; (ii) choose the best tree in the sequence based on its relative size and accuracy (either on a pruning set, or provided by cross-validation). The idea within step (i) is that the pruned tree T_{i+1} is obtained by pruning the subtrees that show the lowest increase in the apparent error (error on the training set) per pruned leaf. Regarding step (ii), CCP chooses the smallest tree whose error (either on the pruning set or on cross-validation) is not more than one standard error greater than the lowest error observed in the sequence of trees.
5) Error-based pruning (EBP) was proposed by Quinlan, and it is implemented as the default pruning strategy of C4.5 [20]. It is an improvement over PEP, based on a far more pessimistic estimate of the expected error. Unlike PEP, EBP performs a bottom-up search, and it carries out not only the replacement of non-terminal nodes by leaves but also the grafting of a subtree T^{(t)} onto the place of its parent t. For deciding whether to replace a non-terminal node by a leaf (subtree replacement), to graft a subtree onto the place of its parent (subtree raising), or not to prune at all, a pessimistic estimate of the expected error is calculated by using an upper confidence bound.
We designed two genes in HEAD-DT for pruning. The first gene indexes one of the five approaches for pruning a decision tree (plus the option of not pruning at all). The second gene is in the range [0,100], and its value is dynamically mapped by a function, according to the pruning method selected (similar to the second stopping criteria gene). For REP, the parameter is the percentage of training data to be used as the pruning set (varying within the interval [10%,50%]). For PEP, the parameter is the number of standard errors (SEs) used to adjust the apparent error, in the set {0.5,1,1.5,2}. For MEP, the parameter m may range within [0,100]. For CCP, there are two parameters: the number of SEs (in the same range as PEP) and the pruning-set size (in the same range as REP). Finally, for EBP, the parameter CF may vary within [1%,50%].
Missing values genes
Handling missing values is an important issue both in decision-tree induction and in the use of the induced tree for classification. We designed three genes for dealing with missing values in distinct scenarios: (i) during split evaluation; (ii) during instance distribution; and (iii) during classification, as follows.
For the split criterion evaluation of node t based on attribute a_i, we implemented the following strategies: 1) ignore all instances whose value of a_i is missing; 2) impute missing values with either the mode (nominal attributes) or the mean (numeric attributes) of all instances in t; 3) weight the splitting criterion value (calculated in node t with regard to a_i) by the proportion of missing values; 4) impute missing values with either the mode (nominal attributes) or the mean (numeric attributes) of all instances in t whose class attribute is the same as that of the instance whose a_i value is being imputed.
For deciding to which child node a training instance x_j should go, considering a split in node t over a_i, we adopted the following options: 1) ignore instance x_j; 2) treat instance x_j as if it has the most common value of a_i, regardless of the class; 3) treat instance x_j as if it has the most common value of a_i among the instances that belong to the same class as x_j; 4) assign instance x_j to all partitions; 5) assign instance x_j to the partition with the largest number of instances; 6) weight instance x_j according to the partition probability; 7) assign instance x_j to the most probable partition, considering the class of x_j.
Finally, for classifying a new test instance x_j, considering a split in node t over a_i, we use the following strategies: 1) explore all branches of t, combining the results; 2) take the route to the most probable partition (largest subset); 3) stop the classification process and assign instance x_j to the majority class of node t.
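Strategy 1 (explore all branches and combine the results) can be sketched as follows. The node layout here ("dist", "attr", "children", "weights") and the weighting by each child's training share are assumptions of this example, not the HEAD-DT data structure.

```python
def class_distribution(node, x):
    """Return a dict class -> weight for instance x (a feature dict that
    may lack the split attribute). Leaves carry a 'dist' of class weights;
    internal nodes carry 'attr', 'children' (value -> node) and 'weights'
    (value -> fraction of training instances routed to that child)."""
    if "children" not in node:
        return node["dist"]
    value = x.get(node["attr"])
    if value is not None and value in node["children"]:
        return class_distribution(node["children"][value], x)
    # Missing value: explore every branch and combine the resulting class
    # distributions, weighting each branch by its training share.
    combined = {}
    for v, child in node["children"].items():
        for cls, w in class_distribution(child, x).items():
            combined[cls] = combined.get(cls, 0.0) + node["weights"][v] * w
    return combined

def predict(node, x):
    """Predict the class with the largest combined weight."""
    dist = class_distribution(node, x)
    return max(dist, key=dist.get)
```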
Evolution and fitness evaluation
The evolution of individuals in HEAD-DT follows the scheme presented in Figure 2. The 9-gene linear genome of an individual in HEAD-DT comprises the building blocks described in the earlier sections: [split criterion, split type, stopping criterion, stopping parameter, pruning strategy, pruning parameter, mv split, mv distribution, mv classification]. One possible individual encoded by that linear string is [4,1,2,77,3,91,2,5,1], which accounts for the following algorithm:
1) Recursively split nodes with the G statistics criterion;
2) Create one edge for each category in a nominal split;
3) Perform step 1 until class homogeneity or the maximum tree depth of 7 levels ((77 mod 9) + 2) is reached;
4) Perform MEP pruning with m = 91;
5) When dealing with missing values:
5.1) Impute missing values with the mode/mean during split calculation;
5.2) Distribute missing-valued instances to the partition with the largest number of instances;
5.3) For classifying an instance with missing values, explore all branches and combine the results.
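The decoding of such a genome can be sketched as follows. The lookup tables assume 1-based indices following the listing order of each component in the previous sections, which is consistent with the worked example above but is nonetheless an assumption about the exact HEAD-DT encoding.

```python
# Hypothetical lookup tables (1-based indices assumed; remaining split
# criteria, stopping strategies, and missing-value options omitted).
SPLIT_CRITERIA = {1: "information gain", 2: "Gini index",
                  3: "mutual information", 4: "G statistics",
                  5: "Mantaras criterion"}  # ... entries 6-15 omitted
STOP_CRITERIA = {1: "class homogeneity", 2: "maximum tree depth",
                 3: "minimum instances", 4: "minimum percentage",
                 5: "accuracy threshold"}
PRUNING = {0: "no pruning", 1: "REP", 2: "PEP", 3: "MEP", 4: "CCP", 5: "EBP"}

def decode(genome):
    """Decode a 9-gene linear genome into a readable description of the
    decision-tree algorithm it represents."""
    (split, split_type, stop, stop_val, prune, prune_val,
     mv_split, mv_dist, mv_class) = genome
    return {
        "split criterion": SPLIT_CRITERIA[split],
        "split type": "binary" if split_type == 0 else "multi-edge",
        "stopping criterion": STOP_CRITERIA[stop],
        # For maximum tree depth the text gives: (value mod 9) + 2.
        "stopping parameter": (stop_val % 9) + 2 if stop == 2 else stop_val,
        "pruning": PRUNING[prune],
        "pruning parameter": prune_val,
        "missing values (split/distribution/classification)":
            (mv_split, mv_dist, mv_class),
    }
```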
Figure 3 presents an example of how linear genomes are decoded into algorithms, and how they participate in the evolutionary cycle. The first step of HEAD-DT is the generation of the initial population, in which a population of 100 individuals is randomly generated (random number generation within each gene's acceptable range of values). Then, the individuals participate in a pairwise tournament selection procedure to define those that will undergo genetic operators. Individuals may participate in either one-point crossover (80% probability), random uniform gene mutation (15% probability), or reproduction (5% probability), the three mutually-exclusive genetic operators employed in HEAD-DT. In addition, HEAD-DT employs an elitism strategy, in which the best 5 individuals are kept from one generation to the next.
Figure 3. HEAD-DT in action. Figure 3 presents an illustration of HEAD-DT — a set of linear-genome individuals is decoded into algorithms and executed over the meta-training set. The performance of these individuals is measured, and the evolutionary cycle starts with typical operations such as selection, crossover, and mutation. At the end of the cycle, the best (fitness-wise) individual is selected for inducing decision trees from the meta-test set.
During fitness evaluation, a meta-training set is used for assessing the quality of each individual throughout evolution. The meta-test set is used to assess the quality of the decision-tree induction algorithm evolved by HEAD-DT (the best individual in Figure 2). There are two distinct approaches for dealing with the meta-training and meta-test sets: (i) evolving a decision-tree induction algorithm tailored to one specific data set; and (ii) evolving a single decision-tree induction algorithm to be employed on multiple data sets. In the first case, we have a specific data set for which we want to design a decision-tree induction algorithm. In the second case, we have one (or several) data set(s) comprising the meta-training set, and multiple data sets comprising the meta-test set.
We perform experiments with both approaches described above. In the first set of experiments, we evolved a decision-tree induction algorithm tailored to each molecular docking data set. In the second set, we evolved a single decision-tree induction algorithm, using only one data set, and applied this algorithm to all data sets. In the first scenario, we use the classification accuracy on a validation set (25% of the size of the training set) to evolve the individuals (decision-tree algorithms) of HEAD-DT. In the second scenario, we use the classification accuracy obtained by performing 10-fold cross-validation on the meta-training set as our fitness function.
Results
The hypothesis we try to confirm in this paper is that automatically-designed decision-tree induction algorithms can be more effective than human-designed, general-use decision-tree induction algorithms for solving problems of a particular domain. More specifically, we investigate the problem of RDD through flexible-receptor molecular docking simulations.
One way to evaluate a molecular docking simulation (e.g., in AutoDock) is by examining the resulting FEB value: the smaller the FEB value, the better the binding of the ligand in the receptor binding pocket. AutoDock is a suite of programs used to predict the bound conformations of a ligand to a receptor, which applies a technique that combines a conformation-searching algorithm with a rapid grid-based method of energy evaluation [52]. This grid-based method is performed by the AutoDock module called AutoGrid, which pre-calculates a three-dimensional energy-based grid of interactions of various atom types. AutoGrid generates a grid map for each atom type in the ligand, considering a probe atom that visits each grid point. The interaction energy between the ligand and the probe atom is then calculated and stored. To estimate the final FEB value, AutoDock applies an empirical binding free energy function in which the molecular-mechanics-based and empirical terms are multiplied by coefficients obtained by linear regression analysis [52].
An important feature related to FEB is the Euclidean distance (measured in Angstroms, Å) between atoms in the receptor's residues and in the ligand. Thus, for each receptor amino-acid residue, we calculate the distances between its atoms and the ligand atoms. Among all calculated distances, we consider only the shortest distance for each receptor residue. Therefore, each receptor residue may be seen as a predictive attribute in a data mining problem. Given that the InhA receptor contains 268 amino-acid residues, each docking data set has 268 predictive attributes plus the FEB class, which is the attribute we are interested in predicting [53]. Each instance in a molecular docking data set is a receptor snapshot. We produced a distinct data set for each of the six previously mentioned ligands (see Table 1).
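The construction of these distance-based attributes can be sketched as follows, with hypothetical coordinates. The `select_attributes` helper applies the 5 Å contact cutoff used in the attribute selection described below.

```python
import math

def shortest_residue_distances(residue_atoms, ligand_atoms):
    """For each receptor residue, keep only the shortest Euclidean
    distance (in Angstroms) between any of its atoms and any ligand atom,
    yielding one predictive attribute per residue. Coordinates are
    (x, y, z) tuples; `residue_atoms` maps residue name -> atom list."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return {res: min(dist(a, l) for a in atoms for l in ligand_atoms)
            for res, atoms in residue_atoms.items()}

def select_attributes(distances, cutoff=5.0):
    """Discard residues whose shortest distance exceeds the cutoff,
    since they do not establish a meaningful receptor-ligand contact."""
    return {res: d for res, d in distances.items() if d <= cutoff}
```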
Table 1. Summary of the data sets
Since FEB is a continuous variable, we employ a discretization technique detailed in [54], which divides FEB into five levels of binding quality. This technique makes use of the mode and standard deviation of FEB, dividing its sorted values into intervals whose border values fit each instance into its proper class. The border values are shown in Equation 1, where M_o and σ represent the mode and standard deviation of the FEB distribution.
Moreover, we perform attribute selection to reduce the 268-dimensional data sets using the following procedure: we remove all attributes (residues) whose shortest distance to the ligand is larger than 5 Å (distances larger than 5 Å do not establish a meaningful contact between receptor and ligand atoms). The number of discarded attributes is shown within parentheses in Table 1.
We perform two different kinds of experiments — one for each fitness strategy previously detailed. In the first experiment, the meta-training set is comprised of the NADH data set (NADH is the natural ligand of receptor InhA). The remaining five data sets (the meta-test set) are then used for assessing the performance of the decision-tree algorithm that was tailored to the NADH data. For each of the five data sets, a 10-fold cross-validation procedure is performed. In the second experiment, HEAD-DT automatically designs a decision-tree algorithm tailored to each of the six data sets. Thus, each data set is divided into training and test sets, in a 10-fold cross-validation procedure, and HEAD-DT then designs an algorithm tailored to each of the training folds. We analyze the average performance of HEAD-DT over the 10 folds, considering both test accuracy and tree size (total number of nodes) of the corresponding decision trees.
In order to provide some reassurance about the validity and non-randomness of the obtained results, we present the results of the corrected resampled t-test statistic, following the approach proposed by Nadeau and Bengio [55]. Considering that the standard t-test has a high Type-I error when used in conjunction with random subsampling, Nadeau and Bengio [55] observe that this is due to an underestimation of the variance, because the samples are not independent (i.e., the different training and test sets overlap). Consequently, they propose to correct the variance estimate by taking this dependency into account. Let a_j and b_j be the accuracies of algorithms A and B, respectively, measured on run j (1 ≤ j ≤ k). Assume that in each run, n_1 instances are used for training and the remaining n_2 instances for testing. Let dif_j be the difference dif_j = a_j − b_j, and let mean(dif) and σ²_dif be the estimates of the mean and variance of the k differences. The statistic of the corrected resampled t-test is:

t = mean(dif) / sqrt((1/k + n_2/n_1) · σ²_dif)
Given that we employ a 10-fold cross-validation procedure, k = 10 and (n_2/n_1) = (1/9). To reject the null hypothesis of equal performance between the algorithms, the value of t is compared against the Student's t distribution with k−1 degrees of freedom and a per-tail significance level of (1−α)/2, i.e., t_{k−1,(1−α)/2}. Considering α = 0.95 and 9 degrees of freedom, i.e., t_{9,0.025} = 2.26216, the null hypothesis is rejected if t > 2.26216.
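The corrected statistic can be implemented directly; for 10-fold cross-validation, the correction factor uses n_2/n_1 = 1/9. The per-fold accuracies in the usage below are hypothetical.

```python
import math

def corrected_resampled_t(a, b, n_train, n_test):
    """Corrected resampled t-test of Nadeau and Bengio: the variance of
    the per-fold accuracy differences is inflated by the train/test
    overlap factor n_test/n_train. `a` and `b` are per-fold accuracies
    of the two algorithms (same folds, same order)."""
    k = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / k
    # Unbiased sample variance of the k differences.
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt((1.0 / k + n_test / n_train) * var)
```

The resulting t is then compared against the critical value t_{9,0.025} = 2.26216 quoted above.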
Experiment 1 — An evolved decision-tree algorithm tailored to the NADH data set
Table 2 shows the classification accuracy and tree size of HEAD-DT and C4.5. For HEAD-DT, the table presents the results of the decision-tree algorithm tailored to the NADH data set and tested on the remaining five data sets. It reports the average accuracy and tree size over the 10-fold cross-validation procedure, together with the average and the variance of the per-fold differences.
Table 2. Results of Experiment 1
First, let us consider the accuracy results depicted in Table 2. For each data set, the t value is computed from the per-fold accuracy differences using the corrected resampled t-test statistic presented above.
Regarding the ETH data set, the value of t is 6.25, and since 6.25 > 2.26, HEAD-DT significantly outperforms C4.5 on the ETH data set. The same can be said for the PIF (t = 6.22), TCL (t = 9.12), INH (t = 8.28), and JPM (t = 7.62) data sets. These results suggest that evolving a decision-tree algorithm tailored to a particular domain (in this experiment, the domain was represented by the NADH data set) is a good strategy for generating more accurate decision trees.
We can also verify in Table 2 whether the trees generated by the evolved algorithm are more comprehensible (i.e., smaller) than those generated by C4.5. We observe that HEAD-DT generates much smaller trees than C4.5 for every data set. In the ETH data set, HEAD-DT generates trees that are, on average, 12 times smaller than those generated by C4.5. In PIF, this difference is even greater: HEAD-DT generates trees that are, on average, 40 times smaller than the C4.5-generated trees. The difference in the TCL data set is also largely in favor of HEAD-DT: trees 15 times smaller, on average. The same behavior is observed in the INH (23 times smaller) and JPM (17 times smaller) data sets. These very large differences are confirmed by the statistical test.
The t values for tree size indicate that HEAD-DT clearly outperforms C4.5 with statistical significance regarding tree size. The evolved algorithm that was tailored to NADH and then applied to the remaining five data sets is the following:
1) Recursively split nodes with the Gain Ratio criterion;
2) Create one edge for each category in a nominal split;
3) Perform step 1 until class homogeneity is achieved or the minimum number of 15 instances is reached;
4) Perform EBP pruning with cf = 5%;
5) When dealing with missing values:
5.1) Ignore missing values in the split calculation;
5.2) Weight missing values according to the partition probability;
5.3) To classify an instance with missing values, go to the most probable partition.
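The core induction loop described by steps 1-3 can be sketched as follows. This is a minimal illustration, not the evolved implementation: it handles only nominal attributes (step 2), omits EBP pruning (step 4) and the missing-value policies (step 5), and all function names and the toy data are our own:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Information gain of a multiway split on `attr`, normalised by
    the split information (Quinlan's gain ratio)."""
    n = len(rows)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0
    return (entropy(labels) - remainder) / split_info


def build_tree(rows, labels, attrs, min_instances=15):
    # Step 3: stop on class homogeneity or fewer than 15 instances.
    if len(set(labels)) == 1 or len(rows) < min_instances or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Step 1: choose the split with the highest gain ratio.
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    node = {"attr": best, "children": {}}
    # Step 2: one edge per category of the chosen nominal attribute.
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[best], []).append((row, y))
    for value, pairs in groups.items():
        sub_rows, sub_labels = zip(*pairs)
        node["children"][value] = build_tree(
            list(sub_rows), list(sub_labels), attrs - {best}, min_instances)
    return node


# Toy illustrative data: one nominal attribute perfectly separates classes.
rows = [{"ph": "low"}] * 3 + [{"ph": "high"}] * 3
labels = ["GOOD"] * 3 + ["EXCELLENT"] * 3
print(build_tree(rows, labels, {"ph"}, min_instances=2))
```

With the real 15-instance threshold of step 3, nodes smaller than 15 instances simply become majority-class leaves.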
Experiment 2: An evolved decision-tree algorithm for each data set
In this experiment, we use HEAD-DT to evolve a decision-tree algorithm tailored to each ligand data set (one algorithm per data set). Table 3 presents the results of this strategy.
Table 3. Results of Experiment 2
Observe that HEAD-DT generates more accurate trees than C4.5 for all data sets. The values of t for each data set are 6.28 (ETH), 6.39 (PIF), 10.85 (TCL), 2.28 (NADH), 4.57 (INH), and 5.90 (JPM), which means that the trees generated by the algorithms evolved by HEAD-DT outperform those generated by C4.5 with statistical significance on every data set, regarding test accuracy. The next step is, once again, to verify whether the trees generated by the evolved algorithms are more comprehensible than those generated by C4.5. We can see in Table 3 that the trees generated by HEAD-DT are much smaller than those generated by C4.5. The values of t for ETH, PIF, TCL, NADH, INH, and JPM are, respectively, 23.95, 16.84, 20.99, 12.18, 14.16, and 12.23. Hence, the trees generated by HEAD-DT are significantly smaller than those provided by C4.5.
Discussion
We conclude from Experiments 1 and 2 that both strategies for automatically generating decision-tree algorithms are effective for the problem of RDD with flexible-receptor docking data. Experiment 1 seems to be a better option than Experiment 2, considering that it requires a single execution of HEAD-DT instead of multiple runs, one for each data set. After evolving a single algorithm in Experiment 1, the computational cost of applying the evolved algorithm to new data sets is very low: building a tree takes O(m × n log n) time (where m is the number of attributes and n the number of instances), plus the individual complexity of the pruning method. A single execution of HEAD-DT requires operations such as breeding and fitness evaluation. Breeding takes negligible time, a known fact in evolutionary algorithms that manipulate strings. Fitness evaluation, on the other hand, is the bottleneck of HEAD-DT, because each individual has to generate a decision tree for a given data set. We can thus estimate the time complexity of HEAD-DT as O(i × g × m × n log n), where i is the number of individuals and g the number of generations. In practice, however, the number of evaluations is much smaller than i × g, because repeated individuals are not re-evaluated. Likewise, individuals selected by elitism and reproduction are not re-evaluated, saving computational time.
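The saving from not re-evaluating repeated individuals can be realized with a simple cache keyed by the individual's genome. The sketch below is illustrative (the names and the toy "evaluation" are our own, not HEAD-DT's actual code); in HEAD-DT the wrapped function would induce a decision tree and measure its fitness, which is the expensive step:

```python
def make_cached_fitness(evaluate):
    """Wrap an expensive fitness function with a cache so that repeated
    genomes (and elitist survivors) are scored only once."""
    cache = {}
    calls = {"count": 0}

    def fitness(genome):
        key = tuple(genome)  # genomes are strings of design choices
        if key not in cache:
            calls["count"] += 1  # actual tree inductions performed
            cache[key] = evaluate(genome)
        return cache[key]

    fitness.evaluations = calls
    return fitness


# Toy example: the "evaluation" just sums the genome, standing in for
# inducing a decision tree and computing its validation accuracy.
fitness = make_cached_fitness(lambda g: sum(g))
population = [(1, 2), (3, 4), (1, 2), (1, 2)]  # repeated individuals
scores = [fitness(g) for g in population]
print(fitness.evaluations["count"])  # fewer evaluations than individuals
```

Here four individuals cost only two evaluations, which is why the effective number of evaluations stays well below i × g.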
Our experiments suggest that the trees produced by the tailored algorithms are not only more accurate than those generated by C4.5, but also significantly smaller (i.e., more comprehensible). The importance of generating comprehensible predictive models in several application domains has already been established. In bioinformatics, it provides advantages such as [35]: (i) improving the biologist's confidence in the prediction; (ii) giving the biologist new insight into the data; (iii) giving the biologist ideas for hypothesis creation; and (iv) allowing the detection of errors in the model or in the data. Hence, we believe HEAD-DT is a good alternative to C4.5 for domains in which comprehensible and accurate models are of great importance, such as molecular docking.
Considering the importance of model comprehensibility to confirm or reject hypotheses regarding the available data, we briefly discuss some intriguing facts regarding the decision tree generated in Experiment 1 with data from the ETH ligand. The generated tree has a total of 49 nodes and 25 leaves (see Figure 4). We first highlight that only the three levels of FEB that indicate a reasonable conformation of InhA to ETH (REGULAR, GOOD, EXCELLENT) appear in the leaves. More specifically, the decision tree ignored rules that would predict conformations that are irrelevant with respect to the docking experiments.
Regarding our findings within the decision-tree model, note that ETH binds to InhA as an adduct (ETH-NADH) formed with the NADH coenzyme [56]. The active site of the InhA receptor, composed of the coenzyme- and substrate-binding cavities, is almost completely filled during the InhA-ETH-NADH interaction [13]. Almost all residues found in the ETH decision-tree model are directly related or very close to the residues that define the receptor's active site. For instance, we verify that three of them (PHE41, GLY192, PRO193) are listed among the top-25 most significant residues for InhA flexibility, as reported in [53]. PHE41 helps fit the adenine portion of NADH into its binding site, while GLY192 and PRO193 are located in the substrate-binding loop region. This is a remarkable prediction, since these residues actually make direct contact with the adduct in the crystal structure [13], meaning that they are determinant for the inhibition of InhA by ETH. In addition, of the other six residues that appear in the model (VAL12, ILE15, ILE120, ALA124, ASN159, VAL163), one (ILE15) interacts directly with the adduct, while the others are structural nearest neighbors of the residues that belong to the top-25. Having such information at hand allowed the biologist to improve his/her understanding of InhA flexibility and function, as well as its behavior when docking with ETH and other similar ligands.
Looking at the leaves that classify snapshots as EXCELLENT, we noticed that the residues in their paths are quite far from the receptor binding pocket. Even though these residues seem irrelevant at first sight, they actually determine whether the residues closer to the receptor can provide lower FEB values, and thus better conformations of InhA to ETH. For instance, we obtained the following rules in the decision-tree model:
(1) IF (ILE15 > 13.37) AND (ASN159 <= 4.97)
    THEN FEB = EXCELLENT
(2) IF (ILE15 > 13.37) AND (ASN159 > 4.97) AND (LEU218 > 2.63) AND (ILE258 <= 10.81) AND (VAL163 > 11.45) AND (GLY263 <= 6.08)
    THEN FEB = EXCELLENT
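For illustration, the two rules above translate directly into predicates over a snapshot's residue-ligand distances (in Å). The sketch below is our own; the dictionary keys are the residue names from the tree, and the sample snapshot values are made up:

```python
def feb_excellent(d):
    """Return True if a snapshot's residue distances (in Angstroms)
    satisfy rule (1) or rule (2) of the ETH decision tree."""
    rule1 = d["ILE15"] > 13.37 and d["ASN159"] <= 4.97
    rule2 = (d["ILE15"] > 13.37 and d["ASN159"] > 4.97
             and d["LEU218"] > 2.63 and d["ILE258"] <= 10.81
             and d["VAL163"] > 11.45 and d["GLY263"] <= 6.08)
    return rule1 or rule2


# Hypothetical snapshot: ILE15 far from the pocket, ASN159 close,
# so rule (1) fires regardless of the other residues.
snapshot = {"ILE15": 14.0, "ASN159": 4.5, "LEU218": 3.0,
            "ILE258": 9.0, "VAL163": 12.0, "GLY263": 5.0}
print(feb_excellent(snapshot))
```

Applying such predicates to the full set of snapshots is one way a specialist could pre-select promising receptor conformations for future docking runs.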
Rule (1) indicates that an excellent conformation of InhA to ETH implies that the receptor residue Isoleucine 15 is far from the binding pocket (at distances greater than 13.37 Å) while, at the same time, residue Asparagine 159 is relatively close to the receptor (at distances smaller than 4.97 Å). Inspection of the InhA-ETH-NADH crystal structure [13] confirms this finding.
Conversely, rule (2) indicates that a conformation of InhA to ETH may also be excellent if residue Asparagine 159 is far from the receptor (at a distance greater than 4.97 Å), provided that the remaining residues in rule (2) meet their distance requirements. Figure 5 shows the residue-attribute mapping for the ETH ligand. In general, the agreement between predicted and experimental binding modes is very satisfactory, as illustrated by the mapping of rule (2) onto the experimental structure (Figure 5B). Only two residues, ILE258 and GLY263, did not satisfy the rule. These exceptions can be explained by the fact that the rules are based on a fully-flexible (FFR) model of the InhA receptor. We believe that structural differences between snapshots in the FFR model create a space in the receptor's binding cavities that differs from the one observed in the rigid crystal structure. The FFR model of InhA can therefore accommodate a more diverse range of ligand conformations and orientations.
Figure 5. Residue-attribute mapping for the ETH decision tree. Figure 5 is composed as follows: (A) The main chain of the FFR-InhA receptor is represented by grey ribbons. The amino acid residues identified by the decision tree are represented as space-filled atoms. For clarity, only the residues classified as Excellent (green) and Good (orange) are shown. (B) The main chain of the experimental structure (PDB ID: 2HI9) is shown as transparent grey ribbons, together with the mapped residues of rule (2) (Excellent) of the decision tree and their distances (in Å) to the ligand. The ETH ligand is represented by stick models (carbon: grey; nitrogen: blue; oxygen: red; sulphur: yellow).
All the information presented so far is exactly what we expect from a comprehensible model generated from flexible-receptor molecular docking data: it helps to significantly reduce the number of snapshots in future docking experiments. Hence the importance of providing comprehensible models, like those produced by decision trees and decision rules. At the same time, it must be pointed out that very large decision trees (such as those induced by C4.5) are not easy to interpret, since the specialist would need several hours to visually inspect interesting rules in a 500-node decision tree. Whenever two decision-tree models of similar performance are compared, one should prefer the smaller one, as stated by the principle of Occam's Razor. Recall that HEAD-DT evolved algorithms that were able to induce significantly more accurate and comprehensible decision trees than C4.5.
Conclusions
In this work, we proposed HEAD-DT, a hyper-heuristic algorithm capable of automatically designing decision-tree induction algorithms tailored to specific domains. We investigated its effectiveness in predicting the free energy of binding of drug candidates with the enzyme InhA, associated with Mycobacterium tuberculosis. More specifically, we performed several experiments with flexible-receptor docking data for rational drug design, a bioinformatics application of known importance.
We compared the algorithms automatically designed by HEAD-DT to the traditional state-of-the-art decision-tree induction algorithm C4.5 [20]. We assessed the performance of HEAD-DT through two different measures: accuracy and tree size. The experimental analysis suggested that HEAD-DT can generate algorithms that perform significantly better than C4.5 and induce significantly smaller trees in the domain of flexible-receptor docking data. The resulting decision trees were analyzed by a biologist, who was able to extract several interesting rules for guiding future docking experiments. For instance, the biologist verified that three residues selected by the decision-tree model (PHE41, GLY192, PRO193) are determinant for the inhibition of InhA by ETH.
We believe these results indicate that HEAD-DT can be an effective algorithm for domain-specific applications of decision trees, especially in bioinformatics. As future work, we plan to investigate whether a more sophisticated search system, such as grammar-based genetic programming, can outperform our current HEAD-DT implementation. We intend to employ HEAD-DT in further experiments with flexible-receptor molecular docking data, to help in the discovery of new drug candidates against Mycobacterium tuberculosis. In addition, considering the good performance of HEAD-DT on flexible-receptor docking data, we intend to test it on other relevant bioinformatics problems.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RB implemented the method and wrote the manuscript. RB, MB, and AC designed the experiments, evaluated the experimental results, and assessed their statistical significance. AW, KM, DR, and ONS generated and preprocessed the biological data, and also interpreted the resulting decision trees. ONS interpreted and validated the biological findings. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), for funding this research.
References

Lyne PD: Structure-based virtual screening: an overview.
Drug Discov Today 2002, 7:1047-1055.

Adams C, Brantner V: Spending on new drug development.
Health Econ 2010, 19:130-141.

Huang SY, Zou X: Ensemble docking of multiple protein structures: Considering protein structural variations in molecular docking.
Proteins 2006, 66:399-421.

Verkhivker GM, Bouzida D, Gehlhaar DK, Rejto PA, Freer ST, Rose PW: Computational detection of the binding-site hot spot at the remodeled human growth hormone-receptor interface.
Proteins 2003, 53(2):201-219.

Lin JH, Perryman AL, Schames JR, McCammon JA: The relaxed complex method: Accommodating receptor flexibility for drug design with an improved scoring scheme.
Biopolymers 2003, 68:47-62.

Machado KS, Winck AT, Ruiz DD, Norberto de Souza O: Mining flexible-receptor docking experiments to select promising protein receptor snapshots.
BMC Genomics 2010, 11(5):113.

Machado KS, Winck AT, Ruiz DD, Norberto de Souza O: Mining flexible-receptor molecular docking data.
WIREs Data Mining Knowl Discov 2011, 1(6):532-541.

Dessen A, Quemard A, Blanchard J, Jacobs W, Sacchettini J: Crystal Structure and Function of the Isoniazid Target of Mycobacterium tuberculosis.
Science 1995, 267:1638-1641.

Schroeder E, Basso L, Santos D, Norberto de Souza O: Molecular Dynamics Simulation Studies of the Wild-Type, I21V, and I16T Mutants of Isoniazid-Resistant Mycobacterium tuberculosis Enoyl Reductase (InhA) in Complex with NADH: Toward the Understanding of NADH-InhA Different Affinities.
Biophys J 2005, 89:876-884.

Machado KS, Schroeder EK, Ruiz DD, Norberto de Souza O: Automating molecular docking with explicit receptor flexibility using scientific workflows.

Kuo M, Morbidoni H, Alland D, Sneddon S, Gourlie B, Staveski M, Leonard M, Gregory J, Janjigian A, Yee C, Musser J, Kreiswirth B, Iwamoto H, Perozzo R, Jacobs W, Sacchettini J, Fidock D: Targeting tuberculosis and malaria through inhibition of Enoyl Reductase: compound activity and structural data.
J Biol Chem 2003, 278(23):20851-20859.

Oliveira JS, Sousa EHS, Basso LA, Palaci M, Dietze R, Santos DS, Moreira I: An inorganic iron complex that inhibits wild-type and an Isoniazid-resistant Mutant 2-trans-enoyl-ACP (CoA) Reductase from Mycobacterium tuberculosis.

Wang F, Langley R, Gulten G, Dover L, Besra G, Jacobs WJ, Sacchettini J: Mechanism of thioamide drug action against tuberculosis and leprosy.
J Exp Med 2007, 204:73-78.

Middlebrook G: Sterilization of tubercle bacilli by isonicotinic acid hydrazide and the incidence of variants resistant to the drug in vitro.
Am Rev Tuberc 1952, 65:765-767.

Freundlich J, Wang F, Vilcheze C, Gulten G, Langley R, Schiehser G, Jacobus D, Jacobs WJ, Sacchettini J: Triclosan derivatives: towards potent inhibitors of drug-sensitive and drug-resistant Mycobacterium tuberculosis.
ChemMedChem 2009, 4(2):241-248.

Andres C, Hutter M: CNS Permeability of drugs predicted by a Decision Tree.
QSAR Comb Sci 2006, 25(4):305-309.

Lee S, Yang J, Oh KW: Prediction of molecular bioactivity for drug design using a decision tree algorithm.
Discovery Science '03 2003, 344-351.

Han L, Wang Y, Bryant S: Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem.
BMC Bioinformatics 2008, 9:401.

Blower PE, Cross KP: Decision tree methods in pharmaceutical research.
Curr Top Med Chem 2006, 6:31-39.

Quinlan JR: C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann; 1993.

Barros RC, Basgalupp MP, de Carvalho AC, Freitas AA: A hyper-heuristic evolutionary algorithm for automatically designing decision-tree algorithms. In Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation, GECCO '12. New York: ACM; 2012:1237-1244.

Murthy SK: Automatic construction of decision trees from data: a multi-disciplinary survey.
Data Min Knowl Disc 1998, 2(4):345-389.

Tan PN, Steinbach M, Kumar V: Introduction to Data Mining. Boston: Addison-Wesley; 2005.

Barros RC, Cerri R, Jaskowiak PA, de Carvalho ACPLF: A bottom-up oblique decision tree induction algorithm.
11th International Conference on Intelligent Systems Design and Applications 2011, 450-456.

Barros RC, Basgalupp MP, de Carvalho ACPLF, Freitas AA: A survey of evolutionary algorithms for decision-tree induction.

Barros RC, Ruiz DD, Basgalupp MP: Evolutionary model trees for handling continuous classes in machine learning.
Inf Sci 2011, 181:954-971.

Barros RC, Basgalupp MP, Ruiz DD, de Carvalho ACPLF, Freitas AA: Evolutionary model tree induction.

Basgalupp MP, Barros RC, de Carvalho ACPLF, Freitas AA, Ruiz DD: LEGAL-Tree: a lexicographic multi-objective genetic algorithm for de.

Basgalupp MP, de Carvalho ACPLF, Barros RC, Ruiz DD, Freitas AA: Lexicographic multi-objective evolutionary induction of decision trees.
Int J Bio-inspired Comput 2009, 1(1/2):105-117.

Mach Learn 2001, 45:5-32.

Jensen L, Gupta R, Staerfeldt HH, Brunak S: Prediction of human protein function according to gene ontology categories.
Bioinformatics 2003, 19(5):635-642.

Vinayagama A, Konig R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S: Applying support vector machines for gene ontology based gene function prediction.
BMC Bioinformatics 2004, 5:116.

Weinert WR, Lopes H: Neural networks for protein classification.
Appl Bioinformatics 2004, 3:41-48.

Freitas AA, Wieser DC, Apweiler R: On the importance of comprehensible classification models for protein function prediction.

Barros RC, Basgalupp MP, de Carvalho ACPLF, Freitas AA: Towards the automatic design of decision tree induction algorithms. In Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO 2011), GECCO '11. New York: ACM; 2011:567-574.

Burke EK, Hyde MR, Kendall G, Ochoa G, Ozcan E, Woodward JR: Exploring hyper-heuristic methodologies with genetic programming. In Collaborative Computational Intelligence. Berlin/Heidelberg: Springer; 2009:177-201.

Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Monterey: Wadsworth; 1984.

Gleser M, Collen M: Towards automated medical decisions.
Comput Biomed Res 1972, 5(2):180-189.

Mingers J: Expert systems - rule induction with statistical data.

De Mántaras RL: A distance-based attribute selection measure for decision tree induction.
Mach Learn 1991, 6:81-92.

Martin J: An exact probability metric for decision tree splitting and stopping.
Mach Learn 1997, 28(2):257-291.

Chandra B, Varghese PP: Moving towards efficient decision tree construction.
Inf Sci 2009, 179(8):1059-1069.

Chandra B, Kothari R, Paul P: A new node splitting measure for decision tree construction.
Pattern Recogn 2010, 43(8):2725-2731.

Mingers J: An empirical comparison of selection measures for decision-tree induction.

Taylor PC, Silverman BW: Block diagrams and splitting criteria for classification trees.
Stat Comput 1993, 3:147-161.

Jun B, Kim C, Song YY, Kim J: A new criterion in selection and discretization of attributes for the generation of decision trees.
IEEE T Pattern Anal 1997, 19(2):1371-1375.

Fayyad U, Irani K: The attribute selection problem in decision tree generation.
National Conference on Artificial Intelligence 1992, 104-110.

Ching J, Wong A, Chan K: Class-dependent discretization for inductive learning from continuous and mixed-mode data.
IEEE T Pattern Anal 1995, 17(7):641-651.

Quinlan JR: Simplifying decision trees.
Int J Man Mach Stud 1987, 27:221-234.

Cestnik B, Bratko I: On estimating probabilities in tree pruning. In European Working Session on Learning. Berlin/Heidelberg: Springer; 1991:138-150.

Morris GM, Goodsell DS, Halliday R, Huey R, Hart W, Belew RK, Olson AJ: Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function.
J Comput Chem 1998, 19(14):1639-1662.

Winck A, Machado K, Norberto de Souza O, Ruiz DD: Supporting intermolecular interaction analyses of flexible-receptor docking simulations.
International Conference on Applied Computing 2010, 183-190.

Machado K, Winck A, Ruiz DD, Norberto de Souza O: Comparison of discretization methods of flexible-receptor docking data for analyses by decision trees.
International Conference on Applied Computing 2010, 223-229.

Nadeau C, Bengio Y: Inference for the generalization error.
Mach Learn 2003, 52(3):239-281.

Schroeder E, Norberto de Souza O, Santos D, Blanchard J, Basso L: Drugs that inhibit mycolic acid biosynthesis in Mycobacterium tuberculosis.
Curr Pharm Biotechnol 2002, 3(3):197-225.