Research article

Multiple-input multiple-output causal strategies for gene selection

Gianluca Bontempi1*, Benjamin Haibe-Kains2, Christine Desmedt3, Christos Sotiriou3 and John Quackenbush2

  • * Corresponding author: Gianluca Bontempi gbonte@ulb.ac.be

Author Affiliations

1 Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Belgium

2 Computational Biology and Functional Genomics Laboratory, Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, USA

3 Breast Cancer Translational Research Laboratory, Department of Medical Oncology, Institut Jules Bordet, Université Libre de Bruxelles, Belgium

BMC Bioinformatics 2011, 12:458  doi:10.1186/1471-2105-12-458

Published: 25 November 2011

Abstract

Background

Traditional strategies for selecting variables in high-dimensional classification problems aim to find sets of maximally relevant variables able to explain the variations of the target. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes, essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting.

Results

We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score that incorporates a causal term. In addition, we show in a meta-analysis of six publicly available breast cancer microarray datasets that the improvement also holds in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.

Conclusions

Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.