Open Access | Highly Accessed | Methodology article

Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data

Carsten O Daub1,4, Ralf Steuer2, Joachim Selbig1 and Sebastian Kloska1,3

1Max Planck Institute of Molecular Plant Physiology, Potsdam, 14424, Germany

2Nonlinear Dynamics Group, Institute of Physics, University of Potsdam, Potsdam, 14415, Germany

3Scienion AG, Volmerstrasse 7a, Berlin, 12489, Germany

4Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, 17177, Sweden


BMC Bioinformatics 2004, 5:118 doi:10.1186/1471-2105-5-118

Published: 31 August 2004

Abstract

Background

The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size.
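
To make the binning step concrete, here is a minimal Python sketch of a naive histogram ("plug-in") estimate of mutual information. It is an illustration only, not the authors' code: the function names, the equal-width bins and the default of 10 bins are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to one of n_bins equal-width bins."""
    if hi == lo:
        return 0  # constant data: everything falls into a single bin
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value goes into the last bin


def binned_mutual_information(x, y, n_bins=10):
    """Naive plug-in estimate of I(X;Y) in bits from equal-width binning.

    Hypothetical helper for illustration; bin count and bin edges are
    assumptions, not values taken from the article.
    """
    n = len(x)
    x_lo, x_hi = min(x), max(x)
    y_lo, y_hi = min(y), max(y)
    xi = [bin_index(v, x_lo, x_hi, n_bins) for v in x]
    yi = [bin_index(v, y_lo, y_hi, n_bins) for v in y]
    count_x = Counter(xi)
    count_y = Counter(yi)
    count_xy = Counter(zip(xi, yi))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi
```

Because each data point is assigned to exactly one bin, a value close to a bin boundary can change bins under a small perturbation. With few samples per bin, this is one way the estimate becomes unreliable.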

Results

In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: the significance, a measure of how well a dependency can be distinguished from random correlation, is markedly increased. The concept is subsequently illustrated on two large-scale gene expression datasets, and the results are compared with those obtained using other similarity measures.
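
The abstract does not spell out the algorithm, so the following Python sketch shows only one plausible reading of the title: each data point is spread over several neighbouring bins with weights given by B-spline basis functions, and the bin probabilities are averages of those weights. The clamped uniform knot vector, the defaults (10 bins, spline order 3), the shuffle-based significance score and all function names are assumptions made for this illustration, not details confirmed by the text above.

```python
import math
import random
import statistics


def knot_vector(n_bins, order):
    """Clamped uniform knots giving n_bins basis functions (assumed layout)."""
    top = n_bins - order + 1
    return [0] * order + list(range(1, top)) + [top] * order


def bspline_weights(z, n_bins, order):
    """Cox-de Boor recursion: weight of z (in [0, n_bins-order+1]) per bin."""
    t = knot_vector(n_bins, order)
    # Order 1: indicator of the knot interval that contains z.
    w = [1.0 if t[i] <= z < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if z >= t[-1]:
        w[n_bins - 1] = 1.0  # right end of the domain: last non-empty interval
    for d in range(2, order + 1):
        nxt = []
        for i in range(len(t) - d):
            left_den = t[i + d - 1] - t[i]
            right_den = t[i + d] - t[i + 1]
            left = (z - t[i]) / left_den * w[i] if left_den else 0.0
            right = (t[i + d] - z) / right_den * w[i + 1] if right_den else 0.0
            nxt.append(left + right)
        w = nxt
    return w  # n_bins non-negative weights that sum to 1


def bspline_mi(x, y, n_bins=10, order=3):
    """Mutual information in bits from B-spline-weighted bin probabilities."""
    def weights(values):
        lo, hi = min(values), max(values)
        scale = (n_bins - order + 1) / (hi - lo) if hi > lo else 0.0
        return [bspline_weights((v - lo) * scale, n_bins, order) for v in values]

    wx, wy = weights(x), weights(y)
    n = len(x)
    p_x = [sum(row[a] for row in wx) / n for a in range(n_bins)]
    p_y = [sum(row[b] for row in wy) / n for b in range(n_bins)]
    mi = 0.0
    for a in range(n_bins):
        for b in range(n_bins):
            p_ab = sum(rx[a] * ry[b] for rx, ry in zip(wx, wy)) / n
            if p_ab > 0.0:
                mi += p_ab * math.log2(p_ab / (p_x[a] * p_y[b]))
    return mi


def significance(x, y, n_shuffles=20, seed=0, **kwargs):
    """Assumed z-score: distance of the observed MI from shuffled surrogates."""
    rng = random.Random(seed)
    observed = bspline_mi(x, y, **kwargs)
    shuffled = list(y)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys the pairing, keeps the marginals
        null.append(bspline_mi(x, shuffled, **kwargs))
    return (observed - statistics.mean(null)) / statistics.stdev(null)
```

With `order=1` every point falls into exactly one bin, so the sketch reduces to ordinary equal-width binning; higher orders let a point near a bin boundary contribute to both neighbours. Note that `significance` raises an error if all shuffled estimates happen to be identical.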

C++ source code for our algorithm is available for non-commercial use upon request from kloska@scienion.de.

Conclusion

The utilisation of mutual information as a similarity measure enables the detection of non-linear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad-hoc basis without further justification, are thereby extended.
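
A small toy example may help to show what "non-linear correlation" means here. The Python sketch below uses made-up discrete data (not gene expression data) in which y is fully determined by x, yet the Pearson correlation is zero. The helper names are hypothetical, and mutual information is computed directly from counts, since the toy values are already discrete.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def discrete_mi(x, y):
    """Mutual information in bits, computed from counts of discrete values."""
    n = len(x)
    c_x, c_y, c_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2(c * n / (c_x[a] * c_y[b]))
               for (a, b), c in c_xy.items())


# Toy data: y = x^2 is a deterministic but non-linear function of x.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]

r = pearson(x, y)       # 0.0: no linear relationship
mi = discrete_mi(x, y)  # about 1.52 bits: y is fully determined by x
```

Because the parabola is symmetric around zero, the linear measure sees no relationship at all, while the mutual information equals the full entropy of y.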

