School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116, China

Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249, USA

Department of Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA

Abstract

Background

DNA methylation occurs in the context of a CpG dinucleotide. It is an important epigenetic modification that can be inherited through cell division. The two major types of methylation are hypomethylation and hypermethylation. Unique methylation patterns have been shown to exist in diseases, including various types of cancer. DNA methylation analysis promises to become a powerful tool in cancer diagnosis, treatment and prognostication. Large-scale methylation arrays are now available for studying methylation genome-wide. The Illumina methylation platform simultaneously measures cytosine methylation at more than 1500 CpG sites associated with over 800 cancer-related genes. Cluster analysis is often used to identify DNA methylation subgroups for prognosis and diagnosis. However, because of the non-Gaussian characteristics of these data, traditional clustering methods may not be appropriate for DNA methylation data, and determining the optimal number of clusters is still problematic.

Method

A Dirichlet process beta mixture model (DPBMM) is proposed that models DNA methylation expression as an infinite mixture of beta distributions. The model allows automatic learning of the relevant parameters, such as the cluster mixing proportions, the parameters of the beta distribution for each cluster, and especially the number of potential clusters. Since the model is high dimensional and analytically intractable, we propose a Gibbs sampling "no-gaps" solution for computing the posterior distributions and hence the estimates of the parameters.

Results

The proposed algorithm was tested on simulated data as well as on methylation data from 55 glioblastoma multiforme (GBM) brain tissue samples. To reduce the computational burden caused by the high data dimensionality, a dimension reduction method was adopted. DPBMM yielded two GBM clusters across data sets with different numbers of loci (P-value < 0.1), whereas hierarchical clustering did not yield statistically significant clusters.

Background

DNA methylation profiles have become an alternative molecular footprint for classification. DNA methylation occurs in the context of a CpG dinucleotide. It is an important epigenetic modification that can be inherited through cell division. In this chemical modification of the cytosine nucleotide, the 5-carbon position is enzymatically modified by the addition of a methyl group, so that cytosines can occur in a methylated or unmethylated state. CpG islands are usually not methylated in normal tissues but frequently become hypermethylated in cancer.

To this end, clustering analysis is often used to identify methylation subgroups that are distinct from one another in data

In response to the aforementioned limitations, we propose here a nonparametric Dirichlet process beta mixture model (DPBMM) for clustering DNA methylation expression profiles produced by the Illumina Infinium BeadChip. DPBMM makes use of a Dirichlet process mixture to place a prior

Methods

Problem formulation

Model DNA methylation profiles with beta mixture distribution

For a two-color hybridization-based array such as the Illumina Infinium array, the measurements are associated with the percentage of methylated alleles. These measurements are called "beta" values because they can be described by a mixture of beta distributions.

**Examples of beta distributions**. Beta densities with large hyperparameters (

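Consider the problem of clustering $n$ samples $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, where $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{iL})$ contains the beta values of sample $i$ at $L$ loci. If sample $i$ belongs to cluster $k$, then $y_{il}$ follows a beta distribution with parameters $\alpha_{kl}$ and $\beta_{kl}$:

$$p(\mathbf{y}_i \mid z_i = k) = \prod_{l=1}^{L} \mathrm{Beta}(y_{il}; \alpha_{kl}, \beta_{kl})$$

where $z_i$ denotes the cluster label of sample $i$, and

$$\mathrm{Beta}(y_{il}; \alpha_{kl}, \beta_{kl}) = \frac{\Gamma(\alpha_{kl} + \beta_{kl})}{\Gamma(\alpha_{kl})\,\Gamma(\beta_{kl})}\, y_{il}^{\alpha_{kl} - 1} (1 - y_{il})^{\beta_{kl} - 1}$$

where $\alpha_{kl}$ and $\beta_{kl}$ are the beta parameters of cluster $k$ at locus $l$, evaluated for sample $\mathbf{y}_i$.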
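As an illustration of the beta mixture model in this section, the following is a minimal sketch that evaluates the log density of one sample under a finite beta mixture. It assumes that loci are independent given the cluster and that the number of clusters is fixed; the function names (`beta_logpdf`, `sample_loglik`, `mixture_logpdf`) and the argument layout are hypothetical and are not taken from the authors' MATLAB code.

```python
import math

def beta_logpdf(y, a, b):
    """Log density of Beta(a, b) at y, for 0 < y < 1 and a, b > 0."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)

def sample_loglik(y_i, alpha_k, beta_k):
    """Log-likelihood of one sample under one cluster.

    Assumption: loci are independent given the cluster, so the
    per-locus log densities are summed.
    """
    return sum(beta_logpdf(y, a, b) for y, a, b in zip(y_i, alpha_k, beta_k))

def mixture_logpdf(y_i, weights, alphas, betas):
    """Log density of one sample under a finite beta mixture.

    weights[k] is the mixing proportion of cluster k; alphas[k] and
    betas[k] hold the per-locus beta parameters of cluster k.
    """
    terms = [
        math.log(w) + sample_loglik(y_i, a_k, b_k)
        for w, a_k, b_k in zip(weights, alphas, betas)
    ]
    # Log-sum-exp for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Working in the log domain keeps the product over many loci from underflowing, which matters once the number of loci grows beyond a handful.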

Dirichlet process mixture model

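The Dirichlet process is a nonparametric extension of the original Dirichlet distribution. Let $\mathbf{y}_i$ be drawn from a distribution $F(\theta_i)$ with parameter $\theta_i$:

$$\mathbf{y}_i \mid \theta_i \sim F(\theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(G_0, \tau)$$

where $\theta_i$ is drawn from a random distribution $G$, which follows a Dirichlet process with a base distribution $G_0$ and a precision parameter $\tau$.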

**Graphical model**. The model for the Bayesian estimation is built following the principles of graphical model.

where _{0 }is such that _{0 }and has a parametric form, _{0}. The DP of mixtures (DPM) are proposed to model the clustering effect in data. Compared with other clustering models, DPM is very attractive because it allows the cluster number _{i }_{i}_{i}_{i }_{0}, and with positive probability some of the _{i }_{i }_{i }_{0}. Let

Inference

Let Φ = {Φ_{1}, Φ_{2}, ..., Φ_{K}_{1}, ..., _{m}_{1}, ..., _{m}_{i }_{i }_{l}_{i }_{1}, ..., _{k}_{1}, ..., _{m}_{i}_{l }_{α }_{α}_{β }_{β}_{α }_{β}_{0 }as ^{2 }

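There are several useful representations of a Dirichlet process, such as the Chinese Restaurant Process (CRP). Under the CRP, the prior of the cluster label $z_i$ given the preceding labels $z_1, \ldots, z_{i-1}$ can, by exchangeability, be written conditionally on all other labels $z_{-i}$. With $n$ samples, precision parameter $\tau$, and $n_{-i,k}$ the number of samples other than $i$ assigned to cluster $k$,

$$P(z_i = k \mid z_{-i}) = \frac{n_{-i,k}}{n - 1 + \tau}$$

and,

$$P(z_i \neq z_j \ \text{for all}\ j \neq i \mid z_{-i}) = \frac{\tau}{n - 1 + \tau}$$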
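To make the Chinese Restaurant Process concrete, here is a minimal sketch that draws cluster labels sequentially from the CRP prior: each new sample joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to the precision parameter. The function name `crp_assignments` and the parameter name `tau` are assumptions made for this illustration.

```python
import random

def crp_assignments(n, tau, seed=0):
    """Draw n cluster labels from a Chinese Restaurant Process prior.

    tau is the precision parameter (must be positive). Returns the
    list of labels and the list of cluster sizes.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of samples currently in cluster k
    for _ in range(n):
        # Existing clusters are weighted by size, a new cluster by tau.
        weights = counts + [tau]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts
```

Running this for several values of `tau` shows the usual behavior of the prior: larger precision values tend to produce more clusters for the same number of samples.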

Then the conditional posterior distribution for sampling _{i }

Thus the conditional posterior distribution for sampling Φ_{i }

It is obvious that _{0 }is not conjugate with _{i,0 }cannot be evaluated analytically and drawing samples from _{i }

As to

The final Gibbs sampling steps can be summarized by the following steps:

Gibbs sampling for DPBMM

Iterate the following steps and for the

1. For each sample _{i }_{-i }_{i }_{i }_{i }_{-i }

2. For _{i }

For _{i }_{0}.

3. Sample

4. Based on Step 1, we can get the value of
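The label-update step of a Gibbs sampler for a non-conjugate Dirichlet process mixture can be sketched as follows. This sketch uses auxiliary components drawn from the base distribution, in the style of Neal's Algorithm 8, and is not the "no-gaps" sampler described in this paper. The base distribution (independent Gamma(2, 1) draws for each beta parameter), the function names, and the data layout are all hypothetical choices made for illustration.

```python
import math
import random

def beta_logpdf(y, a, b):
    """Log density of Beta(a, b) at y, for 0 < y < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)

def loglik(y_i, params):
    """Log-likelihood of one sample; params is a list of (a, b) per locus."""
    return sum(beta_logpdf(y, a, b) for y, (a, b) in zip(y_i, params))

def draw_from_base(n_loci, rng):
    """Hypothetical base distribution: Gamma(2, 1) for each beta parameter."""
    return [(rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0))
            for _ in range(n_loci)]

def resample_label(i, data, labels, clusters, tau, rng, n_aux=3):
    """Resample labels[i] in place; clusters maps label -> per-locus params."""
    old = labels[i]
    counts = {}
    for j, z in enumerate(labels):
        if j != i:
            counts[z] = counts.get(z, 0) + 1
    # Existing clusters: prior weight proportional to their size without i.
    candidates = [(k, clusters[k], math.log(n_k)) for k, n_k in counts.items()]
    aux = []
    if old not in counts:
        # Sample i was alone: reuse its parameters as one auxiliary component.
        aux.append(clusters.pop(old))
    while len(aux) < n_aux:
        aux.append(draw_from_base(len(data[i]), rng))
    # Auxiliary components share the prior mass tau equally.
    for params in aux:
        candidates.append((None, params, math.log(tau / n_aux)))
    logw = [lp + loglik(data[i], params) for _, params, lp in candidates]
    m = max(logw)
    weights = [math.exp(w - m) for w in logw]
    k, params, _ = rng.choices(candidates, weights=weights)[0]
    if k is None:
        k = max(clusters, default=-1) + 1  # fresh label for a new cluster
        clusters[k] = params
    labels[i] = k
    return k
```

A full sampler would call `resample_label` for every sample in each sweep and then update the parameters of each occupied cluster, for example with a Metropolis step, since the beta likelihood has no conjugate prior.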

Due to the large number of parameters, the initial values for parameters

Results

Test on simulated data

We conducted simulations to test our proposed method. For the first case, the simulated data set is generated based on the model described in (2) with _{α}_{β }

The overall precision

F metrics is used to evaluate the clustering result by combining
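One common way to compute precision, recall and an F measure for a clustering is to count pairs of samples. The sketch below uses that pairwise definition, which may differ from the exact definition used in this paper; the function name `pairwise_scores` is hypothetical.

```python
from itertools import combinations

def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, recall and F measure of a clustering.

    A pair of samples counts as a true positive when both labelings
    put the two samples in the same cluster.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

Because the scores depend only on whether pairs are co-clustered, they are unaffected by how the cluster labels are numbered, which is convenient when the number of estimated clusters differs from the true number.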

Figure

**Clustering evaluation on simulation data set**. The result is based on the simulated data with 4 dimensions. Figure 3(a) shows the number of clusters

For our second case, we used two simulated data set from

Number of classes obtained for RPMM and DPBMM applied to simulated data (Case I: 5 classes).

| Method | J | Median | Mean | SD |
|--------|----|--------|------|------|
| RPMM | 25 | 8 | 7.7 | 2.0 |
| RPMM | 50 | 5 | 5.6 | 1.32 |
| DPBMM | 25 | 5 | 5.16 | 0.93 |
| DPBMM | 50 | 5 | 5.29 | 1.43 |

Number of classes obtained for RPMM and DPBMM applied to simulated data (Case II: 4 classes).

| Method | J | Median | Mean | SD |
|--------|----|--------|------|------|
| RPMM | 5 | 2 | 2.0 | 0.10 |
| RPMM | 10 | 2 | 2.4 | 2.38 |
| DPBMM | 5 | 7 | 6.9 | 1.04 |
| DPBMM | 10 | 4 | 4.09 | 1.60 |

**Clustering evaluation based on different J**. Figure 4(a) shows the F metric vs. recall curve of

Test on real data

We then applied the proposed DPBMM clustering to the GBM methylation microarray dataset in The Cancer Genome Atlas (TCGA). This dataset consists of 74 patients assayed on the Illumina HumanMethylation450 array. Only samples with clinical annotations were selected for DPBMM clustering analysis, which left 55 patients for consideration. Twenty-seven patients were alive at the time of last follow-up, whereas twenty-eight patients had experienced disease progression since the last follow-up. The median follow-up time was 198 days (range, 2-953 days). Each sample includes up to 485,577 CpG dinucleotides spanning gene-associated elements as well as intergenic regions. The detection P-value reported together with the methylation expression data is used as a quality control measure of probe performance. Following the probe excluding method in

Because methylation arrays have few samples and many dimensions, many loci in the data set have low variance and may not contribute to clustering. It is safer to consider only loci that change significantly
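The variance-based selection of loci can be sketched as follows: loci are ranked by the variance of their beta values across samples, and only the top J are kept. This is a minimal illustration of that idea; the function names `top_variable_loci` and `reduce_dimension` are hypothetical, and the use of the population variance is an assumption.

```python
from statistics import pvariance

def top_variable_loci(data, j):
    """Indices of the j loci with the largest variance across samples.

    data is a list of samples; each sample is a list of beta values,
    one per locus.
    """
    n_loci = len(data[0])
    variances = [
        pvariance([sample[locus] for sample in data])
        for locus in range(n_loci)
    ]
    ranked = sorted(range(n_loci), key=lambda locus: variances[locus],
                    reverse=True)
    return ranked[:j]

def reduce_dimension(data, j):
    """Keep only the j most variable loci in every sample."""
    keep = top_variable_loci(data, j)
    return [[sample[locus] for locus in keep] for sample in data]
```

For example, `reduce_dimension(data, 11)` would keep the 11 most variable loci, corresponding to the J = 11 setting reported for the survival analysis.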

**Top 20 variable loci (ranked by variance through samples) selected from the methylation profiles of the 55 GBM samples**.

**The number of uncovered clusters and P-value of overall survival analysis for J ∈ {1, 2, ..., 20}**. P-value is used to test the Kaplan-Meier confidence.

**Estimated clustering structure based on DPBMM and Hierarchical clustering**. The 55 samples from TCGA are separated into two clusters on the basis of the Illumina methylation expression array. The samples (columns) are arranged according to the clusters estimated by DPBMM, while the loci (rows) are arranged according to hierarchical clustering.

**Kaplan-Meier estimate of survival analysis based on uncovered structure of DPBMM method (J = 11)**. The figure shows the survival functions of the two clusters obtained by DPBMM from the top 11 variable loci (P-value = 0.03), a more significant separation than the corresponding result of hierarchical clustering (P-value = 0.51).
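For readers unfamiliar with the Kaplan-Meier estimate used in the survival analysis, the following minimal sketch computes the survival curve for one group of patients. It covers only the estimator itself and does not include the test that produces the reported P-values; the function name `kaplan_meier` and the input encoding (1 for an observed event, 0 for censoring) are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times[i] is the follow-up time of patient i; events[i] is 1 if the
    event was observed at that time and 0 if the patient was censored.
    Returns a list of (time, survival probability) pairs, one per
    distinct event time, in increasing order of time.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve = []
    survival = 1.0
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)
        observed = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve
```

Applying the function separately to the patients of each cluster gives one curve per cluster, which is the kind of comparison shown in the figure.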

The computation time is always an issue for Gibbs sampling methods. Our simulation is carried out on a Linux based high-performance computer cluster. Each processing core is equipped with 2GB RAM. Figure

**The computation time resulting from the real data study for J ∈ {1, 2, ..., 20}**. The figure shows the computation time resulting from the real data study for

Discussion

We next discuss a few distinct features of DPBMM. First, because "beta" values in DNA methylation array data fall in the range of zero to one, we assume a mixture of beta distributions for the data. The beta distribution provides flexible shapes and can therefore describe data of various types. This is different from traditional Gaussian mixture model based clustering methods such as K-means. Second, since most existing methods cannot determine the number of clusters automatically, we adopted a Dirichlet process prior for cluster assignment. This yields a non-conjugate Dirichlet process beta mixture model, whose parameters are hard to estimate. A Gibbs sampling and "no-gaps" sampling solution was developed to overcome this difficulty. This is different from traditional parametric methods, whose results rely on a model parameter that is usually determined in a separate model selection process.

The main limitations of the proposed method are as follows. First, the algorithm is based on Gibbs sampling, a resource-heavy MCMC method, so the computation time is long. Second, the model is computationally too slow to apply to genome-scale methylation data, so the dimensionality must be reduced to keep DPBMM computationally affordable.

In the future, it would be interesting to develop more effective dimension reduction methods for DPBMM. It would also be interesting to integrate information from different data sources, such as gene expression and copy number variation, into one model for cluster analysis.

Conclusions

An infinite Dirichlet process beta mixture model was proposed to unveil the latent cluster structure in Illumina Infinium methylation profiles. By utilizing a Dirichlet process prior for cluster assignment, the number of clusters is determined automatically from the data. A Gibbs sampling and "no-gaps" sampling solution was developed to infer the relevant parameters. The effectiveness and validity of the model and the proposed Gibbs sampler were evaluated on simulated data and on real data. The results demonstrated that DPBMM could yield the cluster structure automatically with better accuracy.

Availability

MATLAB code is available at

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LZ, JM, and YH conceived the idea. LZ, JM, and YH worked out the detailed algorithms and derivations. LZ, JM and HL implemented the algorithm and performed the testing. LZ, JM, HL, and YH wrote the paper.

Acknowledgements

Based on “Clustering DNA methylation expressions using nonparametric beta mixture model”, by Lin Zhang, Jia Meng, Hui Liu and Yufei Huang which appeared in

The work of L. Zhang is supported by "the Fundamental Research Funds for the Central Universities" (2010QNA50). The work of H. Liu is supported by "the Fundamental Research Funds for the Central Universities" (2010QNA47). The work of Y. Huang is supported by Qatar National Research Fund (09-874-3-235).

This article has been published as part of