Open Access Research article

Application of an efficient Bayesian discretization method to biomedical data

Jonathan L Lustgarten1,2, Shyam Visweswaran1*, Vanathi Gopalakrishnan1 and Gregory F Cooper1

  • * Corresponding author: Shyam Visweswaran shv3@pitt.edu

  • † Equal contributors

Author Affiliations

1 Department of Biomedical Informatics and the Intelligent Systems Program, University of Pittsburgh, Suite M-183 Vale, Parkvale Building, 200 Meyran Avenue, Pittsburgh, PA 15260, USA

2 University of Pennsylvania School of Veterinary Medicine, 3800 Spruce Street, Philadelphia, PA 19104, USA

BMC Bioinformatics 2011, 12:309  doi:10.1186/1471-2105-12-309

Published: 28 July 2011

Abstract

Background

Several data mining methods require discrete data, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that is fast enough to run on high-dimensional biomedical datasets. The EBD method consists of two components: a Bayesian score to evaluate discretizations, and a dynamic programming procedure to search the space of possible discretizations efficiently. We compared the performance of EBD to that of Fayyad and Irani's (FI) discretization method, which is commonly used.
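The two components described above can be illustrated with a minimal sketch. It is not the scoring function defined in the paper: the per-interval score used here (a marginal likelihood of the class counts under a uniform Dirichlet prior) and the fixed per-interval `penalty` are assumptions chosen for illustration, and the names `interval_log_score` and `discretize` are hypothetical. The sketch only shows how a decomposable score lets dynamic programming find the best set of cut points for one variable.

```python
from math import lgamma


def interval_log_score(counts, penalty=0.0):
    """Log marginal likelihood of class counts in one interval.

    Assumes a uniform Dirichlet prior over class probabilities
    (an illustrative choice, not the paper's score), minus an
    assumed fixed penalty charged for each interval.
    """
    n = sum(counts)
    k = len(counts)
    # log[(k-1)! / (n+k-1)!] + sum_j log(n_j!)
    score = lgamma(k) - lgamma(n + k)
    score += sum(lgamma(c + 1) for c in counts)
    return score - penalty


def discretize(values, labels, penalty=1.0):
    """Return sorted cut points that maximize the summed interval scores."""
    classes = sorted(set(labels))
    idx = {c: i for i, c in enumerate(classes)}
    pairs = sorted(zip(values, labels))
    n = len(pairs)

    # prefix[i][c] = number of instances of class c among the first i
    prefix = [[0] * len(classes)]
    for _, y in pairs:
        row = prefix[-1][:]
        row[idx[y]] += 1
        prefix.append(row)

    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * n   # best[i]: best score for first i instances
    back = [0] * (n + 1)           # back[i]: start index of the last interval

    for i in range(1, n + 1):
        # An interval boundary may not fall between equal values.
        if i < n and pairs[i - 1][0] == pairs[i][0]:
            continue
        for j in range(i):
            if best[j] == neg_inf:
                continue
            counts = [a - b for a, b in zip(prefix[i], prefix[j])]
            s = best[j] + interval_log_score(counts, penalty)
            if s > best[i]:
                best[i], back[i] = s, j

    # Walk the back-pointers to recover cut points (midpoints between values).
    cuts = []
    i = n
    while i > 0:
        j = back[i]
        if j > 0:
            cuts.append((pairs[j - 1][0] + pairs[j][0]) / 2)
        i = j
    return sorted(cuts)
```

For example, `discretize([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])` places a single cut between 3 and 10. This sketch runs in time quadratic in the number of instances per variable; any further efficiency measures in the published method are not reproduced here.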

Results

On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD rather than FI. EBD was also statistically significantly more stable than FI to variability in the datasets. However, EBD was less robust than FI, though not statistically significantly so, and produced slightly more complex discretizations.

Conclusions

On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.