Abstract
Background
Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization.
Results
On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI.
Conclusions
On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
Background
With the advent of high-throughput techniques, such as DNA microarrays and mass spectrometry, transcriptomic and proteomic studies are generating an abundance of high-dimensional biomedical data. The analysis of such data presents significant analytical and computational challenges, and increasingly data mining techniques are being applied to these data with promising results [1-4]. A typical task in such analysis, for example, entails the learning of a mathematical model from gene expression or protein expression data that predicts well a phenotype, such as disease or health. In data mining, such a task is called classification and the model that is learned is termed a classifier. The variable that is predicted is called the target variable (or simply the target), which in statistical terminology is referred to as the response or the dependent variable. The features used in the prediction are called the predictor variables (or simply the predictors), which are referred to as the covariates or the independent variables in statistical terminology.
A large number of data mining methods have been developed for classification; several of these methods are unable to use continuous data and require discrete data [1-3]. For example, most rule learning methods that induce sets of IF-THEN rules and several of the popular methods that learn Bayesian networks require data that are discrete. Some methods that accept continuous data, as for example methods that learn classification trees, discretize the data internally during learning. Other methods, such as the naïve Bayes classifier, that accept both continuous and discrete data, may perform better with discrete data [3,4]. A variety of discretization methods have been developed for converting continuous data to discrete data [5-11], and one that is commonly used is Fayyad and Irani's (FI) discretization method [9].
In this paper, we present an efficient Bayesian discretization method and evaluate its performance on several high-dimensional transcriptomic and proteomic datasets, and we compare its performance to that of the FI discretization method. The remainder of this paper is structured as follows. The next section provides some background on discretization and briefly reviews the FI discretization method. The results section describes the efficient Bayesian discretization (EBD) method and gives the results of an evaluation of EBD and FI on biomedical transcriptomic and proteomic datasets. The final section discusses the results and draws conclusions.
Discretization
Numerical variables may be continuous or discrete. A continuous variable is one which takes an infinite number of possible values within a range or an interval. A discrete variable is one which takes a countable number of distinct values. A discrete variable may take few values or a large number of values. Discretization is a process that transforms a variable, either discrete or continuous, so that it takes fewer values by creating a set of contiguous intervals (or equivalently a set of cut points) that spans the range of the variable's values. The set of intervals or the set of cut points produced by a discretization method is called a discretization.
Discretization has several advantages. It broadens the range of classification algorithms that can be applied to datasets since some algorithms cannot handle continuous attributes. In addition to being a necessary pre-processing step for classification methods that require discrete data, discretization has been shown to increase the accuracy of some classifiers, increase the speed of classification methods especially on high-dimensional data, and provide better human interpretability of models such as IF-THEN rule sets [8,10,11]. The impact of discretization on the performance of classifiers is not only due to the conversion of continuous values to discrete ones, but also due to filtering of the predictor variables [4]. Variables that are discretized to a single interval are effectively filtered out and discarded by classification methods since they are not predictive of the target variable. Due to redundancy and noise in the predictor variables in high-dimensional transcriptomic and proteomic data, such filtering of variables has the potential to improve classification performance. Even classification methods like Support Vector Machines and Random Forests that handle continuous variables directly and are robust to high dimensionality of the data may benefit from discretization [4]. The main disadvantage of discretization is the loss of information entailed in the process that has the potential to reduce performance of classifiers if the information loss is relevant for classification. However, this theoretical concern may or may not be a practical one, depending on the particular machine-learning situation.
Discretization methods can be classified as unsupervised or supervised. Unsupervised methods do not use any information about the target variable in the discretization process while supervised methods do. Examples of unsupervised methods include the Equal-Width method, which partitions the range of a variable's values into a user-specified number of intervals, and the Equal-Frequency method, which partitions the range of a variable's values into intervals that each contain a user-specified fraction of the instances. Compared to unsupervised methods, supervised methods tend to be more sophisticated and typically yield classifiers that have superior performance [8,10,11]. Most supervised discretization methods consist of a score to measure the goodness of a set of intervals (where goodness is a measure of how well the discretized predictor variable predicts the target variable), and a search method to locate a good-scoring set of intervals in the space of possible discretizations. The commonly used FI method is an example of a supervised method.
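The two unsupervised methods mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the Equal-Frequency variant is parameterized by the number of intervals (an assumption made here for simplicity) rather than by a fraction of instances per interval.

```python
def equal_width_cut_points(values, num_intervals):
    """Cut points that split [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    return [lo + i * width for i in range(1, num_intervals)]


def equal_frequency_cut_points(values, num_intervals):
    """Cut points that place roughly the same number of instances in each interval.

    Assumes len(values) >= num_intervals.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, num_intervals):
        idx = i * n // num_intervals  # index of the first value of the next interval
        # Place the cut at the midpoint between the two neighbouring values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2.0)
    return cuts


# Illustrative values only (the predictor values of a small toy dataset).
values = [1.2, 1.4, 1.6, 3.7, 3.9, 4.1]
print(equal_width_cut_points(values, 2))
print(equal_frequency_cut_points(values, 3))
```

Neither function looks at the target variable, which is what makes both methods unsupervised.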
A second way to categorize discretization methods is as univariate versus multivariate methods. Univariate methods discretize a continuous-valued variable independently of all other predictor variables in the data, while multivariate methods take into consideration the possible interactions of the variable being discretized with the other predictor variables. Multivariate methods are rarely used in practice since they are computationally more expensive than univariate methods and have been developed for specialized applications [12,13]. The FI discretization method is a typical example of a univariate method.
We now introduce terminology that will be useful for describing discretization. Let D be a dataset of n instances consisting of the list ((X_{1}, Z_{1}), (X_{2}, Z_{2}), ..., (X_{k}, Z_{k}), ..., (X_{n}, Z_{n})) that is sorted in ascending order of X_{k}, where X_{k} is a real value of the predictor variable X and Z_{k} is the associated integer value of the target variable Z. For example, suppose that the predictor variable represents the expression level of a gene that takes real values in the range 0 to 5.0 and the target variable represents the phenotype that takes the values healthy or diseased (Z = 0 or Z = 1, respectively). Then, an example dataset D is ((1.2, 0), (1.4, 0), (1.6, 0), (3.7, 1), (3.9, 1), (4.1, 1)). Let S_{a, b} be a list of the first elements of D, starting at the a^{th} pair in D and ending at the b^{th} pair. Thus, for the above example, S_{4, 6} = (3.7, 3.9, 4.1). For brevity, we denote by S the list S_{1, n}. Let T_{b} be a set that represents a discretization of S_{1, b}. For the above example of D, a possible 2-interval discretization is T_{6} = {S_{1, 3}, S_{4, 6}} = {(1.2, 1.4, 1.6), (3.7, 3.9, 4.1)}. Equivalently, this 2-interval discretization denotes a cut point between 1.6 and 3.7, and typically the midpoint is chosen, which is 2.65 in this example. Thus, all values below 2.65 are considered as a single discrete value and all values equal to or greater than 2.65 are considered another discrete value. For brevity, we denote by T a discretization T_{n} of S.
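To make the notation concrete, the following Python sketch builds S and the 2-interval discretization T_{6} for the example dataset, derives the midpoint cut point, and maps each value to an interval index. The helper names are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_right

# Example dataset D from the text: (predictor value, target value) pairs sorted by predictor.
D = [(1.2, 0), (1.4, 0), (1.6, 0), (3.7, 1), (3.9, 1), (4.1, 1)]
S = [x for x, _ in D]  # S = S_{1,n}, the list of predictor values


def cut_points_from_intervals(intervals):
    """Midpoint between the last value of one interval and the first value of the next."""
    return [(left[-1] + right[0]) / 2.0 for left, right in zip(intervals, intervals[1:])]


def discretize(value, cut_points):
    """Interval index of value; a value equal to a cut point falls in the upper interval."""
    return bisect_right(cut_points, value)


T = [S[0:3], S[3:6]]                 # {S_{1,3}, S_{4,6}} in the 1-based notation of the text
cuts = cut_points_from_intervals(T)  # a single cut point near 2.65
labels = [discretize(x, cuts) for x in S]
print(cuts, labels)
```

Using `bisect_right` matches the convention stated above that values equal to or greater than the cut point belong to the upper interval.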
Fayyad and Irani's (FI) Discretization Method
Fayyad and Irani's discretization method is a univariate supervised method that is widely used and has been cited over 2000 times according to Google Scholar^{1}. The FI method consists of i) a score that is the entropy of the target variable induced by the discretization of the predictor variable, and ii) a greedy search method that recursively discretizes each partition at a cutpoint that minimizes the joint entropy of the two resulting subintervals until a stopping criterion based on the minimum description length (MDL) is met.
For a list S_{a, b} derived from a predictor variable X and a target variable Z that takes J values, the entropy Ent(S_{a, b}) is defined as:
$$\mathrm{Ent}(S_{a,b}) = -\sum_{j=1}^{J} P(Z = z_{j}) \log_{2} P(Z = z_{j}) \tag{1}$$
where P(Z = z_{j}) is the proportion of instances in S_{a, b} where the target takes the value z_{j}. The entropy of Z can be interpreted as a measure of its uncertainty or disorder. Let a cut point C split the list S_{a, b} into the lists S_{a, c} and S_{c + 1, b} to create a 2-interval discretization {S_{a, c}, S_{c + 1, b}}. The entropy Ent(C; S_{a, b}) induced by C is given by:
$$\mathrm{Ent}(C; S_{a,b}) = \frac{|S_{a,c}|}{|S_{a,b}|}\,\mathrm{Ent}(S_{a,c}) + \frac{|S_{c+1,b}|}{|S_{a,b}|}\,\mathrm{Ent}(S_{c+1,b}) \tag{2}$$
where |S_{a, b}| is the number of instances in S_{a, b}, |S_{a, c}| is the number of instances in S_{a, c}, and |S_{c + 1, b}| is the number of instances in S_{c + 1, b}. The FI method selects the cut point C from all possible cut points that minimizes Ent(C; S_{a, b}) and then recursively selects a cut point in each of the newly created intervals in a similar fashion. As partitioning always decreases the entropy of the resulting discretization, the process of introducing cut points is terminated by an MDL-based stopping criterion. Intuitively, minimizing the entropy results in intervals where each interval has a preponderance of one value for the target.
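The entropy score and a single greedy split can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the entropy is taken as the usual Shannon entropy in bits, the weighted entropy of a split is the size-weighted sum of the entropies of the two sides, the function names are hypothetical, and the MDL-based stopping criterion and the recursion are deliberately omitted. The naive scan below is also not the efficient implementation that gives FI its O(n log n) running time.

```python
from collections import Counter
from math import log2


def entropy(targets):
    """Shannon entropy (in bits) of the target values in a list."""
    n = len(targets)
    return -sum((c / n) * log2(c / n) for c in Counter(targets).values())


def cut_entropy(targets, c):
    """Size-weighted entropy when the first c instances go left and the rest go right."""
    left, right = targets[:c], targets[c:]
    n = len(targets)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def best_cut(targets):
    """Return (c, entropy) for the split position c in 1..n-1 with minimum weighted entropy."""
    candidates = ((c, cut_entropy(targets, c)) for c in range(1, len(targets)))
    return min(candidates, key=lambda pair: pair[1])


# Target values of the example dataset, already sorted by the predictor value.
Z = [0, 0, 0, 1, 1, 1]
print(entropy(Z))
print(best_cut(Z))
```

For the example dataset the best split separates the first three instances from the last three, which leaves each side with a single target value.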
Overall, the FI method is very efficient and runs in O(n log n) time, where n is the number of instances in the dataset. However, since it uses a greedy search method, it does not examine all possible discretizations and hence is not guaranteed to discover the optimal discretization, that is, the discretization with the minimum entropy.
Minimum Optimal Description Length (MODL) Discretization Method
To our knowledge, the closest prior work to the EBD algorithm, which is introduced in this paper, is the MODL algorithm that was developed by Boulle [5]. MODL is a univariate, supervised, discretization algorithm. Both MODL and EBD use dynamic programming to search over discretization models that are scored using a Bayesian measure. EBD differs from MODL in two important ways. First, MODL assumes uniform prior probabilities over the discretization, whereas EBD allows an informative specification of both structure and parameter priors, as discussed in the next section. Thus, although EBD can be used with uniform prior probabilities as a special case, it is not required to do so. If we have background knowledge or beliefs that may influence the discretization process, EBD provides a way to incorporate them into the discretization process.
Second, the MODL optimal discretization algorithm has a run time that is O(n^{3}), whereas the EBD optimal discretization algorithm has a run time of O(n^{2}), where n is the number of instances in the dataset. In essence, EBD uses a more efficient form of dynamic programming than does MODL. Their difference in computational time complexity can have significant practical consequences in terms of which datasets are feasible to use. A dataset with, for example, 10,000 instances might be practical to use in performing discretization using EBD, but not using MODL.
While heuristic versions of MODL have been described [5], which give up optimality guarantees in order to improve computational efficiency, and heuristic versions of EBD could be developed that further decrease its time complexity as well, the focus of the current paper is on optimal discretization.
In the next section, we introduce the EBD algorithm and then describe an evaluation of it on a set of bioinformatics datasets.
Results
An Efficient Bayesian Discretization Method
We now introduce a new supervised univariate discretization method called efficient Bayesian discretization (EBD). EBD consists of i) a Bayesian score to evaluate discretizations, and ii) a dynamic programming search method to locate the optimal discretization in the space of possible discretizations. The dynamic programming method examines all possible discretizations and hence is guaranteed to discover the optimal discretization, that is, the discretization with the highest Bayesian score.
Bayesian Score
We first describe a discretization model and define its parameters. As before, let X and Z denote the predictor and target variables, respectively, let D be a dataset of n instances consisting of the list ((X_{1}, Z_{1}), (X_{2}, Z_{2}), ..., (X_{k}, Z_{k}), ..., (X_{n}, Z_{n})), as described above, and let S denote a list of the first elements of D. A discretization model M is defined as:
$$M \equiv \{W, T, \Theta\}$$
where W is the number of intervals in the discretization, T is a discretization of S, and Θ is defined as follows. For a specified interval i, the distribution of the target variable P(Z | W = i) is modeled as a multinomial distribution with the parameters {θ_{i1}, θ_{i2}, ..., θ_{ij}, ..., θ_{iJ}} where j indexes the distinct values of Z. Considering all the intervals, Θ = {θ_{ij}} over 1 ≤ i ≤ I and 1 ≤ j ≤ J and Θ specifies all the multinomial distributions for all the intervals in M. Given data D, EBD computes a Bayesian score for all possible discretizations of S and selects the one with the highest score.
We now derive the Bayesian score used by EBD to evaluate a discretization model M. The posterior probability P(M | D) of M is given by Bayes' rule as follows:
$$P(M \mid D) = \frac{P(M)\,P(D \mid M)}{P(D)} \tag{3}$$
where P(M) is the prior probability of M, P(D | M) is the marginal likelihood of the data D given M, and P(D) is the probability of the data. Since P(D) is the same for all discretizations, the Bayesian score evaluates only the numerator on the right hand side of Equation 3 as follows:
$$\mathrm{Score}(M) = P(M)\,P(D \mid M) \tag{4}$$
The marginal likelihood P(D | M) in Equation 4 is derived using the following equation:
$$P(D \mid M) = \int P(D \mid M, \Theta)\,P(\Theta \mid M)\,d\Theta \tag{5}$$
where Θ are the parameters of the multinomial distributions as defined above. Equation 5 has a closed-form solution under the following assumptions: (1) the values of the target variable were generated according to i.i.d. sampling from P(Z | W = i), which is modeled with a multinomial distribution, (2) the distribution P(Z | W = i) is modeled as being independent of the distribution P(Z | W = h) for all values of i and h such that i ≠ h, (3) for all values i, prior belief about the distribution P(Z | W = i) is modeled with a Dirichlet distribution with hyperparameters α_{ij}, and (4) there are no missing data. The closed-form solution to the marginal likelihood is given by the following expression [14,15]:
$$P(D \mid M) = \prod_{i=1}^{I} \frac{\Gamma(\alpha_{i})}{\Gamma(\alpha_{i} + n_{i})} \prod_{j=1}^{J} \frac{\Gamma(\alpha_{ij} + n_{ij})}{\Gamma(\alpha_{ij})} \tag{6}$$
where Γ(·) is the gamma function, n_{i} is the number of instances in the interval i, n_{ij} is the number of instances in the interval W_{i} that have target-value j, α_{ij} are the hyperparameters in a Dirichlet distribution which define the prior probability over the θ_{ij} parameters, and α_{i} = Σ_{j} α_{ij}. The hyperparameters can be viewed as prior counts, as for example from a previous (or a hypothetical) dataset of instances in the interval i that belong to the value j. For the experiments described in this paper, we set all the α_{ij} to 1, which can be shown to imply that a priori we assume all possible distributions of P(Z | W = i) to be equally likely, for each interval i.^{2} If all α_{ij} = 1, then all α_{i} = J. With these values for the hyperparameters, and using the fact that Γ(n) = (n-1)!, Equation 6 becomes the following:
$$P(D \mid M) = \prod_{i=1}^{I} \frac{(J-1)!}{(J-1+n_{i})!} \prod_{j=1}^{J} n_{ij}! \tag{7}$$
The term P(M) in Equation 4 specifies the prior probability on the number of intervals and the location of the cut points in the discretization model M; we call these the structure priors. The structure priors may be chosen to penalize complex discretization models with many intervals to prevent overfitting. In addition to the structure priors, the marginal likelihood P(D | M) includes a specification of the prior probabilities on the multinomial distribution of the target variable in each interval; we call these the parameter priors. In Equation 6, the alphas specify the parameter priors.
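The marginal likelihood can be computed numerically in log space. The Python sketch below assumes the standard Dirichlet-multinomial closed form, in which each interval contributes Γ(α_{i}) / Γ(α_{i} + n_{i}) multiplied by the product over target values of Γ(α_{ij} + n_{ij}) / Γ(α_{ij}), and compares it with the factorial form that results when every α_{ij} equals 1. The function names and the `counts` layout (one row of class counts per interval) are assumptions made for this illustration.

```python
from math import factorial, lgamma, log


def log_marginal_likelihood(counts, alphas=None):
    """Log marginal likelihood; counts[i][j] = n_ij, alphas[i][j] = alpha_ij (default: all 1)."""
    if alphas is None:
        alphas = [[1.0] * len(row) for row in counts]
    total = 0.0
    for n_row, a_row in zip(counts, alphas):
        a_i, n_i = sum(a_row), sum(n_row)
        total += lgamma(a_i) - lgamma(a_i + n_i)
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_row, n_row))
    return total


def log_marginal_likelihood_uniform(counts):
    """Same quantity with all alpha_ij = 1, written with factorials."""
    total = 0.0
    for n_row in counts:
        J, n_i = len(n_row), sum(n_row)
        total += log(factorial(J - 1)) - log(factorial(J - 1 + n_i))
        total += sum(log(factorial(n)) for n in n_row)
    return total


# Class counts per interval for the example dataset.
two_intervals = [[3, 0], [0, 3]]  # cut between 1.6 and 3.7
one_interval = [[3, 3]]           # no cut point
print(log_marginal_likelihood(two_intervals), log_marginal_likelihood_uniform(two_intervals))
print(log_marginal_likelihood(one_interval))
```

Working with `lgamma` avoids overflow for intervals with many instances. On the example dataset the two-interval model has marginal likelihood 1/16, compared with 1/140 for the single-interval model, before any structure prior is applied.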
The prior probability P(M) is modeled as follows. Let X_{k} denote a real value of the predictor variable, as described above, and Z_{k} denote the associated integer value of the target variable. Let Prior(k) be the prior probability of there being at least one cut point between X_{k} and X_{k + 1}. In the Methods section, we describe the use of a Poisson distribution with mean λ to implement Prior(k), where λ is a structure prior parameter. Consider the prior probability for an interval i that represents the sequence X_{a}, X_{a + 1}, ..., X_{b} in a discretization model M. In general, we assume that the prior probability for interval i is independent of the prior probabilities for the other intervals in M. The prior probability for interval i in terms of the Prior function is defined as follows:
$$\mathrm{Prior}(b) \prod_{k=a}^{b-1} \bigl(1 - \mathrm{Prior}(k)\bigr) \tag{8}$$
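Expression 8 gives the prior probability that no cut points are present between any consecutive pairs of values of X in the sequence X_{a}, X_{a + 1}, ..., X_{b} and at least one cut point is present between the values X_{b} and X_{b + 1}. Using the above notation and assumptions, and substituting Equations 7 and 8 into Equation 4, we obtain the specialized EBD score:
$$\mathrm{Score}(M) = \prod_{i=1}^{I} \left[ \mathrm{Prior}(b_{i}) \prod_{k=a_{i}}^{b_{i}-1} \bigl(1 - \mathrm{Prior}(k)\bigr) \right] \frac{(J-1)!}{(J-1+n_{i})!} \prod_{j=1}^{J} n_{ij}! \tag{9}$$
where interval i spans the values X_{a_{i}} through X_{b_{i}}.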
The above score assumes that the n values of X in the dataset D are all distinct. However, the implementation described below easily relaxes that assumption.
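As a rough illustration of how the structure prior and the marginal likelihood combine, the Python sketch below scores a given discretization in log space. Several things here are assumptions rather than details taken from the paper: the function names, the constant `prior` function (a stand-in for the Poisson-based Prior(k) described in the Methods section), and the convention that the end of the data is treated as a boundary with probability 1. All α_{ij} are set to 1 and the predictor values are assumed to be distinct.

```python
from math import lgamma, log


def log_interval_score(targets, a, b, num_classes, prior):
    """Log score of the interval covering positions a..b (1-based, inclusive)."""
    n = len(targets)
    counts = [0] * num_classes
    for z in targets[a - 1:b]:
        counts[z] += 1
    n_i = b - a + 1
    # Marginal likelihood with all alpha_ij = 1: (J-1)! / (J-1+n_i)! * prod_j n_ij!
    score = lgamma(num_classes) - lgamma(num_classes + n_i)
    score += sum(lgamma(c + 1) for c in counts)
    # Structure prior: no cut point inside the interval ...
    score += sum(log(1.0 - prior(k)) for k in range(a, b))
    # ... and a cut point right after it.
    # ASSUMPTION: the end of the data counts as a boundary with probability 1.
    if b < n:
        score += log(prior(b))
    return score


def log_score(targets, boundaries, num_classes, prior):
    """Log score of a discretization; boundaries lists the last position of each interval."""
    total, a = 0.0, 1
    for b in boundaries:
        total += log_interval_score(targets, a, b, num_classes, prior)
        a = b + 1
    return total


Z = [0, 0, 0, 1, 1, 1]


def constant_prior(k):
    """Hypothetical structure prior: the same cut probability (0.2) for every gap."""
    return 0.2


print(log_score(Z, [3, 6], 2, constant_prior))  # cut between the 3rd and 4th instance
print(log_score(Z, [6], 2, constant_prior))     # single interval, no cut point
```

With this arbitrary prior the two-interval discretization of the example dataset scores higher than the single-interval one; a different choice of prior can change that ordering.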
Dynamic Programming Search
The EBD method finds the discretization that maximizes the score given in Equation 9 using dynamic programming to search the space of possible discretizations. The pseudocode for the EBD search method is given in Figure 1. It is globally optimal in that it is guaranteed to find the discretization with the highest score. Additional details about the search method used by EBD and its time complexity are provided in the Methods section.
Figure 1. Pseudocode for the efficient Bayesian discretization (EBD) method. The EBD method uses dynamic programming and runs in O(n^{2}) time as indicated by the two for loops (n is the number of instances in the dataset).
The number of possible discretizations for a predictor variable X in a dataset with n instances is 2^{n-1}, and this number is typically too large for each discretization to be evaluated in a brute force manner. The EBD method addresses this problem by the use of dynamic programming that at every stage uses previously computed optimal solutions to subproblems. The use of dynamic programming reduces considerably the number of possible discretizations that have to be evaluated explicitly without sacrificing the ability to identify the optimal discretization.
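The idea of reusing optimal solutions for prefixes of the sorted data can be sketched as a two-loop dynamic program in Python. This is not the pseudocode of Figure 1; it is an independent sketch under stated assumptions: all α_{ij} equal 1, the predictor values are distinct, `prior(k)` is a user-supplied cut probability for the gap after position k (a constant is used below as a stand-in for the Poisson-based prior), and the end of the data is treated as a boundary with probability 1.

```python
from math import lgamma, log


def ebd_search(targets, num_classes, prior):
    """Return (best_log_score, boundaries); boundaries are 1-based last positions of intervals."""
    n = len(targets)
    best = [0.0] + [float("-inf")] * n  # best[b]: best log score for the prefix 1..b
    back = [0] * (n + 1)                # back[b]: position where the previous interval ends
    for b in range(1, n + 1):
        counts = [0] * num_classes
        log_fact = 0.0    # running sum of log(n_ij!) for the last interval
        log_no_cut = 0.0  # running sum of log(1 - prior(k)) inside the last interval
        # ASSUMPTION: the end of the data counts as a boundary with probability 1.
        log_cut = log(prior(b)) if b < n else 0.0
        for a in range(b, 0, -1):  # grow the last interval leftwards: positions a..b
            z = targets[a - 1]
            counts[z] += 1
            log_fact += log(counts[z])  # n! grows by a factor equal to the new count
            if a < b:
                log_no_cut += log(1.0 - prior(a))
            n_i = b - a + 1
            interval = (lgamma(num_classes) - lgamma(num_classes + n_i)
                        + log_fact + log_no_cut + log_cut)
            candidate = best[a - 1] + interval
            if candidate > best[b]:
                best[b], back[b] = candidate, a - 1
    # Walk the back-pointers to recover the interval boundaries.
    boundaries, b = [], n
    while b > 0:
        boundaries.append(b)
        b = back[b]
    return best[n], boundaries[::-1]


Z = [0, 0, 0, 1, 1, 1]
score, boundaries = ebd_search(Z, 2, lambda k: 0.2)  # 0.2 is an arbitrary illustrative prior
print(score, boundaries)
```

Because the counts and prior terms of the last interval are updated incrementally in the inner loop, each of the n(n+1)/2 candidate intervals is scored in constant time, giving the quadratic running time. For the example dataset and this arbitrary prior, the search returns a single cut after the third instance.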
An example of the application of the EBD method on the example dataset D = ((1.2, 0), (1.4, 0), (1.6, 0), (3.7, 1), (3.9, 1), (4.1, 1)) is given in Figure 2. Although there are 2^{5 }= 32 possible discretizations for a dataset of six instances, as in this example, EBD explicitly evaluates only 6 of them in determining the highest scoring discretization.
Figure 2. An example of the application of the efficient Bayesian discretization (EBD) method. This example shows the progression of the EBD method when applying the pseudocode given in Figure 1 to the dataset of six instances that is introduced in the main text. An asterisk denotes the discretization with the highest EBD score in a given iteration, as indexed by a. There are 2^{5 }= 32 possible discretizations for a dataset of six instances; for this dataset EBD explicitly evaluates only the 6 discretizations shown in bold font.
As described in the Methods section, the EBD algorithm runs in O(n^{2}) time, where n is the number of instances of a predictor X. Although EBD is slower than FI, it is still feasible to apply EBD to high-dimensional data with a large number of variables.
Evaluation of the Efficient Bayesian Discretization (EBD) Method
We evaluated the EBD method and compared its performance to the FI method on 24 biomedical datasets (see Table 1) using five measures: accuracy, area under the Receiver Operating Characteristic curve (AUC), robustness, stability, and the mean number of intervals per variable (a measure of model complexity). The last three measures evaluate the discretized predictors directly while the first two measures evaluate the performance of classifiers that are learned from the discretized predictors. We performed this comparison using the FI method, because it is so commonly used (1) in practice and (2) as a standard algorithmic benchmark for discretization methods.
Table 1. Description of datasets
For computing the evaluation measures we performed 10 × 10 cross-validation (10-fold cross-validation done ten times to generate a total of 100 training and test folds). For a pair of training and test folds, we learned a discretization model for each variable (using either FI or EBD) for the training fold only and applied the intervals from the model to both the training and test folds to generate the discretized variables. For the experiments, we set λ, which is a user-specified parameter introduced in Figure 1 and in Equation 10 (see the Methods section), to 0.5. The parameter λ is the expected number of cut points in the discretization of the variables in the domain. Our previous experience with discretizing some of the datasets used in the experiments with FI indicated that the majority of the variables in these datasets have 1 or 2 intervals (that correspond to 0 or 1 cut points). We chose λ to be 0.5 as the average of 0 and 1 cut points.
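The protocol of learning cut points on the training fold only and then applying them to both folds can be sketched in Python as follows. The sketch rests on several assumptions: the folds are unstratified random splits (the text does not say whether stratification was used), the data are synthetic, and `learn_cut_points` is a hypothetical placeholder (a single cut at the training median) standing in for FI or EBD.

```python
import random
from bisect import bisect_right


def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


def learn_cut_points(train_values, train_targets):
    """Hypothetical placeholder for FI or EBD: one cut at the median of the training values."""
    ordered = sorted(train_values)
    mid = len(ordered) // 2
    return [(ordered[mid - 1] + ordered[mid]) / 2.0]


# Synthetic data for illustration only.
X = [0.5 * i for i in range(20)]
Z = [0] * 10 + [1] * 10

splits = list(repeated_kfold_indices(len(X)))
discretized = []
for train, test in splits:
    # Learn the cut points from the training fold only ...
    cuts = learn_cut_points([X[i] for i in train], [Z[i] for i in train])
    # ... and apply the same cut points to both the training and the test fold.
    train_disc = [bisect_right(cuts, X[i]) for i in train]
    test_disc = [bisect_right(cuts, X[i]) for i in test]
    discretized.append((train_disc, test_disc))
print(len(splits))
```

Keeping the test fold out of the discretization step prevents information about the test targets from leaking into the intervals, which matters for supervised methods such as FI and EBD.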
We used two classifiers in our experiments, namely, C4.5 and naïve Bayes (NB). C4.5 is a popular tree classifier that accepts both continuous and discrete predictors and has the advantage that the classifier can be interpreted as a set of rules. The NB classifier is simple, efficient, robust, and accepts both continuous and discrete predictors. It assumes that the predictors are conditionally independent of each other given the target value. Given an instance, it applies Bayes theorem to compute the probability distribution over the target values. This classifier is very effective when the independence assumptions hold in the domain; however, even if these assumptions are violated, the classification performance is often excellent, even when compared to more sophisticated classifiers [16].
Accuracy is a widely used measure of predictive performance (see the Methods section). The mean accuracies for EBD and FI for C4.5 and NB are given in Table 2. EBD has higher mean accuracy on 17 datasets for both C4.5 and NB. FI has higher mean accuracy on 4 datasets and 3 datasets for C4.5 and NB, respectively. EBD and FI have the same mean accuracy on 4 datasets and 3 datasets for C4.5 and NB, respectively. Overall, EBD shows an increase in accuracy of 2.02% and 0.76% for C4.5 and NB, respectively. This increased performance is statistically significant at the 5% significance level on the Wilcoxon signed rank test for both C4.5 and NB.
Table 2. Accuracies for the EBD and FI discretization methods
The AUC is a measure of the discriminative performance of a classifier that accounts for datasets that have a highly skewed distribution over the target variable (see the Methods section). The mean AUCs for EBD and FI for C4.5 and NB are given in Table 3. For C4.5, EBD has higher mean AUC on 17 datasets, FI has higher mean AUC on 5 datasets, and both discretization methods have the same mean AUC on 2 datasets. For NB, EBD has higher mean AUC than FI on 16 datasets, lower mean AUC on 6 datasets, and the same mean AUC on two datasets. Overall, EBD shows an improvement in AUC of 1.07% and 1.12% for C4.5 and NB, respectively, and both increases in AUC are statistically significant at the 5% level on the Wilcoxon signed rank test.
Table 3. AUCs for the EBD and FI discretization methods
Robustness is the ratio of the accuracy on the test dataset to that on the training dataset expressed as a percentage (see the Methods section). The mean robustness for EBD and FI for C4.5 and NB are given in Table 4. For C4.5, EBD has higher mean robustness on 10 datasets, FI has higher mean robustness on 11 datasets, and both have equivalent mean robustness on three datasets. For NB, EBD has better performance than FI on 9 datasets, worse performance on 13 datasets, and similar performance on two datasets. Overall, EBD shows a small decrease in mean robustness of 0.26% and 0.68% for C4.5 and NB, respectively, that are not statistically significant at the 5% level on the Wilcoxon signed rank test.
Table 4. Robustness for the EBD and FI discretization methods
Stability quantifies how different training datasets affect the variables being selected (see the Methods section). The mean stabilities for EBD and FI are given in Table 5. Overall, EBD has higher stability than FI, but only at an overall average of 0.02, which nevertheless is statistically significant at the 5% significance level on the Wilcoxon signed rank test.
Table 5. Stabilities for the EBD and FI discretization methods
Table 6 gives the mean number of intervals obtained by EBD and FI. The first column gives for each dataset the proportion of predictor variables that were discretized into a single interval, that is, there were no cut points. Such predictors are considered uninformative and are not used for learning a classifier. The second column gives for each dataset the mean number of intervals among those predictors that were discretized to more than one interval. The third column reports the mean number of intervals over all predictors, including intervals that contain no cut points. Overall, the application of EBD resulted in more predictors with more than one interval, relative to the application of FI, by an overall average of 9%. Also, the mean number of intervals per predictor was greater for EBD than for FI, but this difference was not statistically significant at the 5% level on the Wilcoxon signed rank test. This implies that while the average for the EBD complexity is slightly greater (1.27 versus 1.16 intervals per predictor), overall, EBD and FI are similar in terms of complexity of the discretizations produced.
Table 6. Mean number of intervals per predictor variable for the EBD and FI discretization methods
The results of the statistical comparison of the EBD and FI discretization methods using the Wilcoxon paired samples signed rank test are given in Table 7. As shown in the table, the accuracy and AUC of C4.5 and NB classifiers were statistically significantly better at the 5% level when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI.
Table 7. Statistical comparison of EBD and FI discretization methods
Running Times
We conducted the experiments on an AMD X2 4400+ 2.2 GHz personal computer with 2 GB of RAM that was running Windows XP. For the 24 datasets included in our study, EBD took on average 20 seconds per training fold to discretize all the predictor variables in a dataset, while FI took 5 seconds per training fold.
Discussion
We have developed an efficient Bayesian discretization method that uses a Bayesian score to evaluate a discretization and employs dynamic programming to efficiently search and identify the optimal discretization. We evaluated the performance of EBD on several measures and compared it to the performance of FI. Table 8 shows the number of wins, draws and losses when comparing EBD to FI on accuracy, AUC, stability and robustness. On both accuracy and AUC, which are measures of discrimination performance, EBD demonstrated statistically significant improvement over FI. EBD was more stable than FI, which indicates that EBD is less sensitive to the variability of the training datasets. FI was moderately better in terms of robustness, but not statistically significantly so. On average, EBD produced slightly more intervals per predictor variable, as well as a greater proportion of predictors that had more than one interval. Thus, EBD produced slightly more complex discretizations than FI.
Table 8. Summary of wins, draws and losses of EBD versus FI
A distinctive feature of EBD is that it allows the specification of parameter and structure priors. Although we used noninformative parameter priors in the evaluation reported here, EBD readily supports the use of informative prior probabilities, which enables users to specify background knowledge that can influence how a predictor variable is discretized. The alpha parameters in Equation 6 are the parameter priors. Suppose there are two similar biomedical datasets A and B containing the same variables, but different populations of individuals, and we are interested in discretizing the variables. The data in A could provide information for defining the parameter priors in Equation 6 before its application to the data in B. There is a significant amount of flexibility in defining this mapping for using data in a similar (but not identical) biomedical dataset to influence the discretization of another dataset. The lambda parameter in Equation 10 (described in the Methods section) allows the user to provide a structure prior. This is where prior knowledge might be particularly helpful by specifying (probabilistically) the expected number of cut points per predictor variable. Although we have presented a structure prior that is based on a Poisson distribution, the EBD algorithm can be readily adapted to use other distributions. In doing so, the main assumption is that a structure prior of an interval can be composed as a product of the structure priors of its subintervals.
The running times show that although EBD runs slower than FI, it is sufficiently fast to be applicable to real-world, high-dimensional datasets. Overall, our results indicate that EBD is easy to implement and is sufficiently fast to be practical. Thus, we believe EBD is an effective discretization method that can be useful when applied to high-dimensional biomedical data.
We note that EBD and FI differ both in the score used for evaluating candidate discretizations and in the search method employed. As a result, the differences in performance of the two methods may be due to the score, the search method, or a combination of the two. A version of FI could be developed that uses dynamic programming to minimize its cost function, namely entropy, in a manner directly parallel to the EBD algorithm that we introduce in this paper. Such a comparison, however, is beyond the scope of the current paper. Moreover, since the FI method was developed and is implemented widely using greedy search, we compared EBD to it rather than to a modified version of FI using dynamic programming search. It would be interesting in future research to evaluate the performance of a dynamic programming version of FI.
Conclusions
High-dimensional biomedical data obtained from transcriptomic and proteomic studies are often preprocessed for analysis that may include the discretization of continuous variables. Although discretization of continuous variables may result in loss of information, discretization offers several advantages. It broadens the range of data mining methods that can be applied, can reduce the time taken for the data mining methods to run, and can improve the predictive performance of some data mining methods. In addition, the thresholds and intervals produced by discretization have the potential to assist the investigator in selecting biologically meaningful intervals. For example, the intervals selected by discretization for a transcriptomic variable provide a starting point for defining normal, over-, and under-expression for the corresponding gene.
The FI discretization method is a popular discretization method that is used in a wide range of domains. While it is computationally efficient, it is not guaranteed to find the optimal discretization for a predictor variable. We have developed a Bayesian discretization method called EBD that is guaranteed to find the optimal discretization (i.e., the discretization with the highest Bayesian score) and is also sufficiently computationally efficient to be applicable to high-dimensional biomedical data.
Methods
Biomedical Datasets
The performance of EBD was evaluated on a total of 24 datasets that included 21 publicly available transcriptomic datasets and two publicly available proteomic datasets that were acquired on the Surface-Enhanced Laser Desorption/Ionization Time-of-Flight (SELDI-TOF) mass spectrometry platform. Also included was a University of Pittsburgh proteomic dataset that contains diagnostic data on patients with Amyotrophic Lateral Sclerosis; these data were acquired on the SELDI-TOF platform [17]. The 24 datasets along with their types, number of instances, number of variables, and the majority target value proportions are given in Table 1. The 23 publicly available datasets used in our experiments have been extensively studied in prior investigations [17-34].
Additional Details about the EBD Algorithm
In this section, we first provide additional details about the Prior probability function that is used by EBD. Next, we discuss details of the EBD pseudocode that appears in Figure 1.
Let D be a dataset of n instances consisting of the list ((X_{1}, Z_{1}), (X_{2}, Z_{2}), ..., (X_{k}, Z_{k}), ..., (X_{n}, Z_{n})) that is sorted in ascending order of X_{k}, where X_{k} is a real value of the predictor variable and Z_{k} is the associated integer value of the target variable. Let λ be the mean of a Poisson distribution that represents the expected number of cut points between X_{1} and X_{n} in discretizing X to predict Z. Note that zero, one, or more than one cut points can occur between any two consecutive values of X in the training set. Let Prior(k) be the prior probability of there being at least one cut point between values X_{k} and X_{k+1} in the training set. For k from 1 to n-1, we define the EBD Prior function as follows:
Prior(k) = 1 - e^{-λ · d(k, k+1) / d(1, n)}    (10)
where d(a, b) = X_{b} - X_{a} represents the distance between the two values X_{a} and X_{b} of X, and X_{b} is greater than X_{a}. When k = 0 and k = n, boundary conditions occur. We need an interval below the lowest value of X in the training set and above the highest value. Thus, we define Prior(0) = 1, which corresponds to the lowest interval, and Prior(n) = 1, which corresponds to the highest interval.
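As an illustration of the Prior function, the following Python sketch computes Prior(0), ..., Prior(n) for a sorted list of predictor values. It assumes the Poisson reading given above, in which the probability of at least one cut point in a gap is one minus the probability of zero events in that gap. The function name `ebd_prior` and the handling of a zero-width range are assumptions made for this sketch and do not come from Figure 1.

```python
import math

def ebd_prior(x, lam):
    """Return [Prior(0), ..., Prior(n)] for sorted values x = [X_1, ..., X_n].

    Prior(k), for 1 <= k <= n-1, is the probability of at least one cut
    point between X_k and X_{k+1} when cut points follow a Poisson
    distribution with mean lam over the whole range [X_1, X_n].
    """
    n = len(x)
    span = x[-1] - x[0]              # d(1, n)
    prior = [1.0] * (n + 1)          # Prior(0) = Prior(n) = 1 (boundary intervals)
    for k in range(1, n):
        gap = x[k] - x[k - 1]        # d(k, k+1) in the 1-based notation of the text
        # Assumption of this sketch: if all values are equal, no cut is possible.
        prior[k] = 1.0 - math.exp(-lam * gap / span) if span > 0 else 0.0
    return prior

priors = ebd_prior([0.0, 1.0, 3.0, 4.0], lam=1.0)
```

Wider gaps between consecutive values receive a higher prior probability of containing a cut point, and a gap of width zero (tied values) receives a probability of zero.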
The EBD pseudocode shown in Figure 1 works as follows. Consider finding the optimal discretization of the subsequence S_{1,a} for a being some value between 1 and n.^{3} Assume we have already found the highest scoring discretization of X for each of the subsequences S_{1,1}, S_{1,2}, ..., S_{1,a-1}. Let V_{1}, V_{2}, ..., V_{a-1} denote the respective scores of these optimal discretizations. Let Score_{ba} be the score of subsequence S_{b,a} when it is considered as a single interval, that is, it has no internal cut points; this term is denoted as the variable Score_ba in Figure 1. For all b from a to 1, EBD computes V_{b-1} × Score_ba, which is the score for the highest scoring discretization of S_{1,a} that includes S_{b,a} as a single interval. Since this score is derived from two other scores, we call it a composite score. The fact that this composite score is a product of two scores follows from the decomposition of the scoring measure we are using, as given by Equation 9. In particular, both the prior and the marginal likelihood components of that score are decomposable. Over all b, EBD chooses the maximum composite score, which corresponds to the optimal discretization of S_{1,a}; this score is stored in V_{a}. By repeating this process for a from 1 to n, EBD derives the optimal discretization of S_{1,n}, which is our overall goal.
Several lines of the pseudocode in Figure 1 deserve comments. Line 8 incrementally builds a frequency (count) distribution for the target variable, as the subsequence S_{b,a} is extended. Line 11 determines if a better discretization has been found for the subsequence S_{1,a}. If so, the new (higher) score and its corresponding discretization are stored in V_{a} and T_{a}, respectively. Line 15 incrementally updates P to maintain a prior that is consistent with there being no cut points in the subsequence S_{b,a}.
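The dynamic programming recurrence described above can be sketched in Python as follows. This is a minimal illustration and not a transcription of Figure 1. It works in log space, assumes a marginal likelihood with uniform Dirichlet parameter priors (all α equal to 1), assumes the Poisson-based prior for cut points, and stores back-pointers in place of the discretizations T_{a}. The names `log_marginal_likelihood`, `ebd_discretize`, and `back`, and the default value of `lam`, are hypothetical.

```python
import math

def log_marginal_likelihood(counts):
    """Log marginal likelihood of one interval given its target-value counts.

    Assumption of this sketch: uniform Dirichlet prior (all alpha = 1),
    which gives (J-1)! / (J-1+N)! * prod_j N_j!.
    """
    num_classes = len(counts)
    return (math.lgamma(num_classes)
            - math.lgamma(num_classes + sum(counts))
            + sum(math.lgamma(c + 1) for c in counts))

def ebd_discretize(x, z, num_classes, lam=0.5):
    """Return (cut_points, log_score) for sorted values x with labels z in 0..J-1."""
    n = len(x)                      # x is assumed non-empty and sorted ascending
    span = x[-1] - x[0]
    # log Prior(k) and log(1 - Prior(k)); index 0 and n are the boundaries (Prior = 1).
    log_cut = [0.0] * (n + 1)
    log_nocut = [0.0] * (n + 1)
    for k in range(1, n):
        t = lam * (x[k] - x[k - 1]) / span if span > 0 else 0.0
        p = -math.expm1(-t)         # 1 - exp(-t), computed stably
        log_cut[k] = math.log(p) if p > 0 else -math.inf
        log_nocut[k] = -t
    V = [0.0] + [-math.inf] * n     # V[a]: best log score for S_{1,a}; V[0] = log 1
    back = [0] * (n + 1)            # back[a]: index b-1 where the last interval S_{b,a} starts
    for a in range(1, n + 1):
        counts = [0] * num_classes
        log_p = log_cut[a]          # a cut point lies directly above X_a
        for b in range(a, 0, -1):
            counts[z[b - 1]] += 1   # extend S_{b,a} downward by one instance
            score = V[b - 1] + log_p + log_marginal_likelihood(counts)
            if score > V[a]:
                V[a], back[a] = score, b - 1
            log_p += log_nocut[b - 1]   # no cut point between X_{b-1} and X_b
    # Follow the back-pointers from n to recover the cut points (as midpoints).
    cuts, a = [], n
    while back[a] > 0:
        k = back[a]
        cuts.append((x[k - 1] + x[k]) / 2)
        a = k
    cuts.reverse()
    return cuts, V[n]

cuts, log_score = ebd_discretize(
    [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0], [0, 0, 0, 0, 1, 1, 1, 1], num_classes=2)
```

Placing each cut point at the midpoint of the two neighboring values is a convention chosen for this sketch. The two nested loops over a and b, each step costing O(J) for the marginal likelihood, give the quadratic behavior analyzed in the text.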
We can obtain the time complexity of EBD as follows. The statements in lines 1 and 2 clearly require O(1) run time. The outer loop, which starts at line 3, executes n times. In that loop lines 3-5 require O(1) time per execution, and line 6 requires O(J) time per execution, where J is the number of values of the target variable. Thus, the statements in the outer loop require a total of O(J·n) time. The inner loop, which starts at line 7, loops O(n^{2}) times. In it lines 8 and 9 require O(J) time, and the remaining lines require O(1) time. Thus, the statements in the inner loop require a total of O(J·n^{2}) run time.^{4} Therefore, the overall time complexity of EBD is O(J·n^{2}). Assuming there is an upper bound on the value of J, the complexity of EBD is simply O(n^{2}).
The numbers computed within EBD can become very small. Thus, it is most practical to use logarithmic arithmetic. A logarithmic version of EBD, called lnEBD, is given in Additional file 1.
Additional file 1. Logarithmic Version of EBD. Contains pseudocode for a logarithmic version of EBD.
Discretization and Classification
For the FI discretization method, we used the implementation in the Waikato Environment for Knowledge Analysis (WEKA) version 3.5.6 [35]. We implemented the EBD discretization method in Java so that it can be used in conjunction with WEKA. For our experiments, we used the J4.8 classifier (which is WEKA's implementation of C4.5) and the naïve Bayes classifier as implemented in WEKA. Given an instance for which the target value is to be predicted, both classifiers compute the probability distribution over the target values. In our evaluation, the distribution over the target values was used directly; if a single target value was required, the target variable was assigned the value that had the highest probability.
Evaluation Measures
We conducted experiments for the EBD and FI discretization methods using 10 × 10 cross-validation. The discretization methods were evaluated on the following five measures: accuracy, area under the Receiver Operating Characteristic curve (AUC), robustness, stability, and the average number of intervals per variable.
Accuracy is a widely used performance measure for evaluating a classifier and is defined as the proportion of correct predictions of the target made by the classifier relative to the number of test instances (samples). The AUC is another commonly used discriminative measure for evaluating classifiers. For a binary classifier, the AUC can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen instance that has a positive target value than it will to a randomly chosen instance with a negative target value. For datasets in which the target takes more than two values, we used the method described by Hand and Till [36] for computing the AUC.
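The two definitions above can be made concrete with a short Python sketch. It computes accuracy as the proportion of correct predictions, and the binary AUC directly from its probabilistic interpretation by comparing every positive instance with every negative instance. Counting tied scores as one half is a common convention assumed here; the function names are illustrative, and the multi-class method of Hand and Till [36] is not shown.

```python
def accuracy(y_true, y_pred):
    """Proportion of test instances whose predicted target equals the true target."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_auc(y_true, scores):
    """Probability that a random positive instance scores higher than a random negative one.

    y_true holds 1 for positive and 0 for negative instances.
    Ties contribute 0.5 (an assumption of this sketch).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

auc = binary_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])   # 3 of 4 pairs ordered correctly
```

The pairwise form takes time proportional to the number of positive-negative pairs, which is adequate for illustration; rank-based formulations give the same value more efficiently.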
Robustness is defined as the ratio of the accuracy on the test dataset to that on the training dataset expressed as a percentage [5]. It assesses the degree of overfitting of a discretization method.
Stability measures the sensitivity of a variable selection method to differences in training datasets, and it quantifies how different training datasets affect the variables being selected. Discretization can be viewed as a variable selection method, in that variables with a non-trivial discretization are selected while variables with a trivial discretization are discarded when the discretized variables are used in learning a classifier. A variable has a trivial discretization if it is discretized to a single interval (i.e., has no cut points) while it has a non-trivial discretization if it is discretized to more than one interval (i.e., has at least one cut point).
We used a stability measure that is an extension of the measure developed by Kuncheva [37]. To compute stability, first a similarity measure is defined for two sets of variables that, for example, would be obtained from the application of a discretization method to two training datasets on the same variables. Given two sets of selected variables, v_{i} and v_{j}, the similarity score we used is given by the following equation:
Sim(v_{i}, v_{j}) = (r - k_{i}k_{j}/n) / (min(k_{i}, k_{j}) - k_{i}k_{j}/n)    (11)
where k_{i} is the number of variables in v_{i}, k_{j} is the number of variables in v_{j}, r is the number of variables that are present in both v_{i} and v_{j}, n is the total number of variables, min(k_{i}, k_{j}) is the smaller of k_{i} or k_{j} and represents the largest value r can attain, and k_{i}k_{j}/n is the expected value of r that is obtained by modeling r as a random variable with a hypergeometric distribution. This similarity measure computes the degree of commonality between two sets with an arbitrary number of variables, and it varies between -1 and 1, with 0 indicating that the number of variables common to the two sets can be obtained simply by random selection of k_{i} or k_{j} variables from n variables, and 1 indicating that the two sets contain the same variables. When v_{i} or v_{j} or both have no variables, or both v_{i} and v_{j} contain all predictor variables, Sim(v_{i}, v_{j}) is undefined, and we assume the value of the similarity measure to be 0.
Experimental Methods
In performing cross-validation, each training set (fold) contains a set of variables that are assigned one or more cut points; we can consider these as the selected predictor variables for that fold. We would like to measure how similar the selected variables are among all the training folds. For a single run of 10-fold cross-validation, the similarity scores of all possible pairs of folds are calculated using Equation 11. With 10-fold cross-validation, there are 45 pairs of folds, and stability is computed as the average similarity over all these pairs. For the ten runs of 10-fold cross-validation, we averaged the stability scores obtained from the ten runs to obtain an overall stability score. The stability score varies between -1 and 1; a better discretization method will be more stable and hence have a higher score.
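The similarity score and its average over pairs of folds can be sketched in Python as below. The sketch takes the similarity to be the observed overlap minus its hypergeometric expectation k_{i}k_{j}/n, divided by the largest attainable overlap minus the same expectation, as the definitions in the text describe. Returning 0 whenever the denominator vanishes (which also covers the case of exactly one set containing all variables) is an assumption added for this sketch, and the function names are illustrative.

```python
from itertools import combinations

def similarity(v_i, v_j, n):
    """Chance-corrected overlap of two sets of selected variables out of n variables."""
    k_i, k_j = len(v_i), len(v_j)
    if k_i == 0 or k_j == 0:
        return 0.0                       # undefined case, treated as 0 as in the text
    r = len(v_i & v_j)                   # variables selected in both sets
    expected = k_i * k_j / n             # expected overlap under random selection
    denom = min(k_i, k_j) - expected
    if denom == 0:
        return 0.0                       # e.g. both sets contain all n variables
    return (r - expected) / denom

def stability(fold_selections, n):
    """Average similarity over all pairs of folds (45 pairs for 10 folds)."""
    pairs = list(combinations(fold_selections, 2))
    return sum(similarity(a, b, n) for a, b in pairs) / len(pairs)

folds = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3}]
score = stability(folds, n=20)
```

For ten runs of 10-fold cross-validation, `stability` would be applied once per run and the ten results averaged.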
For comparing the performance of the discretization methods, we used the Wilcoxon paired samples signed rank test. This is a nonparametric procedure concerning a set of paired values from two samples that tests the hypothesis that the population medians of the samples are the same [38]. In evaluating discretization methods, it is used to test whether two such methods differ significantly in performance on a specified evaluation measure.
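To make the test concrete, the sketch below computes the signed rank statistics for paired measurements, for example the accuracies obtained with two discretization methods on the same datasets. It follows the usual textbook procedure: zero differences are dropped, absolute differences are ranked with tied values receiving their average rank, and the ranks of positive and negative differences are summed separately. The function name is hypothetical and the p-value computation is omitted; in practice a statistical library routine such as `scipy.stats.wilcoxon` could be used to obtain it.

```python
def signed_rank_statistic(a, b):
    """Return (W_plus, W_minus) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]      # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1                       # mean of 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

w_plus, w_minus = signed_rank_statistic([10, 12, 15, 11], [8, 12, 16, 8])
```

A large imbalance between the two sums indicates that one method tends to outperform the other across the paired datasets.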
Authors' contributions
JLL developed the computer programs, performed the experiments, and drafted the manuscript. GFC and SV designed the EBD method and helped to draft and revise the manuscript. VG assisted JLL in obtaining the biomedical datasets and in the design of the experiments. SV helped JLL in the selection of the evaluation measures. All authors read and approved the final manuscript.
Endnotes
^{1 }This is based on a search with the phrase "Fayyad and Irani's discretization" that we performed on December 24, 2010.
^{2 }However, in general we can use background knowledge and belief to set the values of the α_{ij}.
^{3 }Technically, we should use the term n' here, as it is defined in Figure 1, but we use n for simplicity of notation.
^{4} We note that line 13 requires some care in its implementation to achieve O(1) time complexity, but it can be done by using an appropriate data structure. Also, the MarginalLikelihood function requires computing factorials from 1! to as high as (J - 1 + n)!; these factorials can be precomputed in O(n) time and stored for use in the MarginalLikelihood function.
Acknowledgements
We thank the Bowser Laboratory at the University of Pittsburgh for the use of the Amyotrophic Lateral Sclerosis proteomic dataset. This research was funded by grants from the National Library of Medicine (T15LM007059, R01LM06696, R01LM010020, and HHSN276201000030C), the National Institute of General Medical Sciences (GM071951), and the National Science Foundation (IIS0325581 and IIS0911032).
References

1. Cohen WW: Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning; Tahoe City, CA. Morgan Kaufmann; 1995:115-123.
2. Gopalakrishnan V, Ganchev P, Ranganathan S, Bowser R: Rule learning for disease-specific biomarker discovery from clinical proteomic mass spectra. Springer Lecture Notes in Computer Science 2006, 3916:93-105.
3. Yang Y, Webb G: On why discretization works for Naive-Bayes classifiers. Lecture Notes in Computer Science 2003, 2903:440-452.
4. Lustgarten JL, Gopalakrishnan V, Grover H, Visweswaran S: Improving classification performance with discretization on biomedical datasets. Proceedings of the Fall Symposium of the American Medical Informatics Association; Washington, DC 2008, 445-449.
5. Boullé M: MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning 2006, 65:131-165.
6. Brijs T, Vanhoof K: Cost-sensitive discretization of numeric attributes. In Second European Symposium on Principles of Data Mining and Knowledge Discovery; September 23-26 Edited by Zytkow JM, Quafafou M. 1998, 102-110.
7. Butterworth R, Simovici DA, Santos GS, Ohno-Machado L: A greedy algorithm for supervised discretization. Journal of Biomedical Informatics 2004, 37:285-292.
8. Dougherty J, Kohavi R, Sahami M: Supervised and unsupervised discretization of continuous features. In Proceedings of the Twelfth International Conference on Machine Learning; Tahoe City, California Edited by Prieditis A, Russell SJ. 1995, 194-202.
9. Fayyad UM, Irani KB: Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on AI (IJCAI-93); Chambéry, France 1993, 1022-1027.
10. Kohavi R, Sahami M: Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; Portland, Oregon. AAAI Press; 1996:114-119.
11. Liu H, Hussain F, Tan CL, Dash M: Discretization: An enabling technique. Data Mining and Knowledge Discovery 2002, 6:393-423.
12. Monti S, Cooper GF: A multivariate discretization method for learning Bayesian networks from mixed data. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence; Madison, WI. Morgan Kaufmann; 1998:404-413.
13. Bay SD: Multivariate discretization of continuous variables for set mining. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining; Boston, MA. ACM; 2000.
14. Cooper GF, Herskovits E: A Bayesian method for the induction of probabilistic networks from data.
15. Heckerman D, Geiger D, Chickering DM: Learning Bayesian networks: The combination of knowledge and statistical data.
16. Domingos P, Pazzani M: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 1997, 29:103-130.
17. Ranganathan S, Williams E, Ganchev P, Gopalakrishnan V, Lacomis D, Urbinelli L, Newhall K, Cudkowicz ME, Brown RH Jr, Bowser R: Proteomic profiling of cerebrospinal fluid identifies biomarkers for amyotrophic lateral sclerosis. Journal of Neurochemistry 2005, 95:1461-1471.
18. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America 1999, 96:6745-6750.
19. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002, 30:41-47.
20. Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JMG, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine 2002, 8:816-824.
21. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America 2001, 98:13790-13795.
22. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.
23. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J: Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine 2001, 344:539-548.
24. Iizuka N, Oka M, Yamada-Okabe H, Nishida M, Maeda Y, Mori N, Takao T, Tamesa T, Tangoku A, Tabuchi H, Hamada K, Nakayama H, Ishitsuka H, Miyamoto T, Hirabayashi A, Uchimura S, Hamamoto Y: Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet 2003, 361:923-929.
25. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7:673-679.
26. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN: Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 2003, 63:1602-1607.
27. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415:436-442.
28. Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nature Genetics 2003, 33:49-54.
29. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, Lopez-Guillermo A, et al.: The use of molecular profiling to predict survival after chemotherapy for diffuse Large-B-Cell Lymphoma. New England Journal of Medicine 2002, 346:1937-1947.
30. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 2002, 8:68-74.
31. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1:203-209.
32. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proceedings of the National Academy of Sciences of the United States of America 2001, 98:10787-10792.
33. Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF Jr, Hampton GM: Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research 2001, 61:7388-7393.
34. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415:530-536.
35. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.
36. Hand DJ, Till RJ: A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 2001, 45:171-186.
37. Kuncheva LI: A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial intelligence and applications; Innsbruck, Austria. ACTA Press; 2007.
38. Rosner B: Fundamentals of Biostatistics. 6th edition. Cengage Learning; 2005.