Open Access Methodology article

Cluster analysis for DNA methylation profiles having a detection threshold

Paul Marjoram1, Jing Chang1, Peter W Laird2 and Kimberly D Siegmund1*

Author Affiliations

1 Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA

2 Norris Cancer Center and Departments of Surgery and Biochemistry & Molecular Biology, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA

BMC Bioinformatics 2006, 7:361  doi:10.1186/1471-2105-7-361

Published: 26 July 2006

Abstract

Background

DNA methylation, a molecular feature used to investigate tumor heterogeneity, can be measured on many genomic regions using the MethyLight technology. Owing to the combination of the underlying biology of DNA methylation and the MethyLight technology, the measurements, although generated on a continuous scale, contain a large number of 0 values. This suggests that conventional clustering methods may not perform well on such data.

Results

We compare the performance of existing methods (such as k-means) with that of two novel methods that explicitly allow for the preponderance of values at 0. We also consider how the ability to cluster such data successfully depends upon the number of informative genes for which methylation is measured and the correlation structure of the methylation values for those genes. We show that when data are collected for a sufficient number of genes, our models improve clustering performance compared with methods, such as k-means, that do not explicitly respect the supposed biological realities of the situation.
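
As a rough illustration of what "explicitly allowing for values at 0" could look like, the sketch below scores a methylation profile under a two-part model: a point mass at zero plus a log-normal density for positive values. The two-part form, the log-normal choice, the independence across genes, and every function and parameter name here are assumptions made for illustration; the abstract does not specify the authors' models, and this is not their method.

```python
import math


def zero_inflated_loglik(values, p_zero, mu, sigma):
    """Log-likelihood of values under a hypothetical two-part model.

    Assumption (not from the source): a value is exactly 0 with
    probability p_zero; otherwise it is log-normal(mu, sigma).
    """
    if not 0.0 < p_zero < 1.0 or sigma <= 0.0:
        raise ValueError("need 0 < p_zero < 1 and sigma > 0")
    total = 0.0
    for v in values:
        if v == 0:
            # Contribution of the point mass at zero.
            total += math.log(p_zero)
        else:
            # Contribution of the continuous (log-normal) part.
            z = (math.log(v) - mu) / sigma
            total += (
                math.log1p(-p_zero)
                - math.log(v * sigma * math.sqrt(2.0 * math.pi))
                - 0.5 * z * z
            )
    return total


def profile_loglik(profile, gene_params):
    """Sum per-gene log-likelihoods, treating genes as independent."""
    return sum(
        zero_inflated_loglik([v], *params)
        for v, params in zip(profile, gene_params)
    )


def assign_cluster(profile, clusters):
    """Return the index of the cluster whose parameters fit the profile best.

    clusters is a list; each entry is a list of (p_zero, mu, sigma)
    tuples, one tuple per gene.
    """
    scores = [profile_loglik(profile, params) for params in clusters]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical clusters over three genes: mostly unmethylated vs. mostly methylated.
clusters = [
    [(0.9, 0.0, 1.0)] * 3,
    [(0.1, 0.0, 1.0)] * 3,
]
mostly_zero = assign_cluster([0.0, 0.0, 0.0], clusters)
mostly_positive = assign_cluster([1.0, 2.0, 0.5], clusters)
```

Unlike a squared-distance criterion such as the one k-means uses, a likelihood of this kind treats an exact 0 as a distinct event rather than as just another point on the continuous scale.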

Conclusion

The performance of analysis methods depends upon how well the assumptions of those methods reflect the properties of the data being analyzed. Different technologies yield data with different properties, and those data should therefore be analyzed differently. Consequently, it is prudent to consider what the properties of the data are likely to be, and which analysis method is most likely to capture those properties.