This article is part of the supplement: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011
A nonparametric Bayesian approach for clustering bisulfate-based DNA methylation profiles
1 School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116, China
2 Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249, USA
3 Department of Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA
BMC Genomics 2012, 13(Suppl 6):S20. doi:10.1186/1471-2164-13-S6-S20. Published: 26 October 2012
DNA methylation occurs in the context of a CpG dinucleotide. It is an important epigenetic modification that can be inherited through cell division. The two major types of methylation are hypomethylation and hypermethylation. Unique methylation patterns have been shown to exist in diseases, including various types of cancer. DNA methylation analysis promises to become a powerful tool in cancer diagnosis, treatment and prognostication. Large-scale methylation arrays are now available for studying methylation genome-wide. The Illumina methylation platform simultaneously measures cytosine methylation at more than 1500 CpG sites associated with over 800 cancer-related genes. Cluster analysis is often used to identify DNA methylation subgroups for prognosis and diagnosis. However, because of the distinctly non-Gaussian characteristics of the data, traditional clustering methods may not be appropriate for DNA methylation data, and determining the optimal number of clusters remains problematic.
A Dirichlet process beta mixture model (DPBMM) is proposed that models DNA methylation levels as an infinite mixture of beta distributions. The model automatically learns the relevant parameters, such as the cluster mixing proportions, the parameters of the beta distribution for each cluster, and, in particular, the number of clusters. Because the model is high dimensional and analytically intractable, we propose a Gibbs sampling "no-gaps" solution for computing the posterior distributions and, from them, the parameter estimates.
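To make the model structure concrete, the following is a minimal sketch of the generative side of a Dirichlet process beta mixture, written as a Chinese restaurant process. It is not the Gibbs "no-gaps" sampler described above, only an illustration of how an unbounded number of beta-distributed clusters can arise. The function name, the concentration parameter `alpha`, and the Gamma(2, 2) base distribution for the beta parameters are assumptions made for illustration and are not taken from the article.

```python
import random


def sample_crp_beta_mixture(n_samples, n_loci, alpha=1.0, seed=0):
    """Draw methylation-like profiles from a Dirichlet process beta mixture.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    cluster_sizes = []   # number of samples currently in each cluster
    cluster_params = []  # per cluster: one (a, b) beta parameter pair per locus
    assignments = []     # cluster label of each sample
    data = []            # one profile (list of values in [0, 1]) per sample

    for _ in range(n_samples):
        # Chinese restaurant process: join existing cluster k with weight n_k,
        # or open a new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]

        if k == len(cluster_sizes):
            # ASSUMPTION: Gamma(2, 2) base distribution for the beta
            # parameters; the article does not specify this choice here.
            cluster_sizes.append(0)
            cluster_params.append(
                [(rng.gammavariate(2.0, 2.0), rng.gammavariate(2.0, 2.0))
                 for _ in range(n_loci)]
            )

        cluster_sizes[k] += 1
        assignments.append(k)
        # Each locus is drawn from the beta distribution of the chosen cluster.
        data.append([rng.betavariate(a, b) for a, b in cluster_params[k]])

    return data, assignments, cluster_params


# Example: 55 simulated samples measured at 5 loci.
data, assignments, params = sample_crp_beta_mixture(n_samples=55, n_loci=5)
print("clusters generated:", len(params))
```

The number of clusters is not fixed in advance: it grows with the data at a rate governed by `alpha`, which is the property that lets a model of this kind learn the cluster number rather than require it as an input.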
The proposed algorithm was tested on simulated data as well as on methylation data from 55 glioblastoma multiforme (GBM) brain tissue samples. To reduce the computational burden caused by the high dimensionality of the data, a dimension reduction method was adopted. DPBMM yielded two GBM clusters on data with different numbers of loci (P-value < 0.1), whereas hierarchical clustering did not yield statistically significant clusters.