Abstract
Background
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence. One of the most commonly studied epigenetic alterations is cytosine methylation, which is a well recognized mechanism of epigenetic gene silencing and often occurs at tumor suppressor gene loci in human cancer. Arrays are now being used to study DNA methylation at a large number of loci; for example, the Illumina GoldenGate platform assesses DNA methylation at 1505 loci associated with over 800 cancer-related genes. Model-based cluster analysis is often used to identify DNA methylation subgroups in data, but it is unclear how to cluster DNA methylation data from arrays in a scalable and reliable manner.
Results
We propose a novel model-based recursive-partitioning algorithm to navigate clusters in a beta mixture model. We present simulations that show that the method is more reliable than competing nonparametric clustering approaches, and is at least as reliable as conventional mixture model methods. We also show that our proposed method is more computationally efficient than conventional mixture model approaches. We demonstrate our method on normal tissue samples and show that the clusters are associated with tissue type as well as age.
Conclusion
Our proposed recursively-partitioned mixture model is an effective and computationally efficient method for clustering DNA methylation data.
Background
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in DNA sequence [1]. One of the most commonly studied epigenetic alterations is cytosine methylation, which occurs in the context of a CpG dinucleotide. Concentrations of CpGs known as CpG islands, when sufficiently methylated, are associated with transcriptional gene silencing tantamount to "one hit" as part of Knudson's two hit hypothesis of tumor suppressor gene inactivation [2]. DNA methylation-associated gene silencing is a well recognized epigenetic mechanism that often occurs at tumor suppressor gene loci in human cancer. Hundreds of reports of methylation-induced silencing at tumor suppressor genes in virtually all types of human cancer have been published [3,4].
While there has been a tremendous effort to characterize epigenetic alterations in cancer, surprisingly little work has been done in disease-free tissues. There is a basic need for epigenetic profiling of normal tissues to better understand the contribution of these profiles to tissue specificity, especially in the context of phenotypically important CpGs, where deregulation is associated with human diseases such as cancer. While efforts to characterize the methylation profiles of normal tissues in humans and mice have begun, and certain themes are slowly becoming apparent, relatively few reports have emerged [4-11]. Most CpGs or CpG regions have been found to have a bimodal distribution of methylation profiles, either hypo- or hypermethylated. Another theme is the disproportionate location of cell- or tissue-type-dependent differentially methylated regions in non-CpG island sites [4,8]. Furthermore, the Human Epigenome Project showed that the human major histocompatibility loci have differential methylation across various human tissues, but that differential methylation does not necessarily lead to differential expression [8]. It is therefore critical to first outline the basal state of phenotypically important epigenetic marks that are known to contribute to cancer in order to have a background for comparison to other normal and diseased tissues. An approach which characterizes the basal epigenetic state is best suited to foster the discovery of epigenetic profiles that are associated with particular disease states, patient characteristics, exposures or other covariates that may contribute to pathogenesis.
Cluster analysis is often used to identify methylation subgroups in data [12,13] and, in particular, Siegmund et al. (2004) argue that model-based clustering techniques are often superior to nonparametric approaches [13]. Large-scale methylation arrays are now available for studying methylation genome-wide; the GoldenGate methylation platform from the manufacturer Illumina (San Diego, CA) simultaneously measures cytosine methylation at 1505 phenotypically important loci associated with over 800 cancer-related genes. The result of the array is a sequence of "beta" values, one for each locus, calculated as the average of approximately 30 replicates (approximately 30 beads per site per sample) of the quantity max(M, 0)/(|U| + |M| + Q), where U is the fluorescent signal from an unmethylated allele on a single bead, M is that from a methylated allele, and Q is a constant chosen to ensure that the quantity is well-defined; absolute values are used in the denominator to compensate for negative signals due to background subtraction. Note that each beta value is an approximately continuous variable lying between zero and one, where zero represents an unmethylated locus and one represents a methylated locus. As such, the beta value is appropriately modeled with a beta distribution; we further motivate the choice of beta distribution in the Methods section below. A data set consisting of such sequences produces a high-dimensional data-analysis problem which poses challenges for traditional clustering approaches. In addition, analysis of heterogeneous tissue data can lead to a large number of clusters, as we demonstrate below, which presents further challenges for clustering techniques. For example, nonparametric approaches rely on a choice of metric, which may be difficult to justify in the context of high dimensions and numerous clusters.
On the other hand, in model-based clustering, multimodality of the data likelihood may lead to numerical instability or difficulty in determining the best solution [14].
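The beta-value computation described above can be sketched as follows; the fluorescence values and the offset Q = 100 are illustrative assumptions of ours, not documented Illumina settings.

```python
import numpy as np

# Hypothetical per-bead fluorescence signals for one locus on one sample
# (~30 beads per site per sample); all values here are illustrative only.
rng = np.random.default_rng(0)
U = rng.normal(200.0, 50.0, size=30)    # unmethylated-allele signal per bead
M = rng.normal(1800.0, 100.0, size=30)  # methylated-allele signal per bead
Q = 100.0                               # constant keeping the ratio well-defined

# Per-bead quantity max(M, 0)/(|U| + |M| + Q), averaged over the replicates;
# the result lies in [0, 1], with values near 1 indicating heavy methylation.
beta_value = np.mean(np.maximum(M, 0.0) / (np.abs(U) + np.abs(M) + Q))
```

For a strongly methylated locus such as this one, the averaged beta value lands well above 0.5.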
We propose a novel method for model-based clustering of data of the type produced by Illumina GoldenGate arrays. Our method makes use of a beta mixture model [15]. Although one could use BIC (or similar quantities) to select the number of clusters in the data set, we propose a recursive-partitioning algorithm that provides the number of clusters and a reliable solution in a shorter amount of time than sequential attempts with different numbers of assumed clusters. This is similar in spirit to the idea of recursive partitioning used in Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH, [16]), in which clusters are recursively partitioned using a nonparametric algorithm such as PAM [17]. Our method is also an unsupervised variant of Hierarchical Mixtures of Experts [18], a fuzzy version of CART [19]. We also propose a method for reducing the number of loci considered in the analysis, and selecting the optimal number using an "augmented" BIC statistic. We also present a simulation study comparing its properties to those of competitor methods. Finally, we demonstrate the methodology on GoldenGate methylation array data obtained from 217 normal tissue samples.
Results
Throughout the text, we assume a beta mixture model. For subject i, let Y_{i} = (Y_{i1}, ..., Y_{iJ}) be a vector of J continuous outcomes falling between 0 and 1, and let there be n such vectors. We posit a mixture model having K classes, so that subject i belongs to class C_{i} ∈ {1, ..., K} and, conditional on class membership, each outcome Y_{ij} is an independent Beta-distributed variable with parameters α_{kj} and β_{kj} depending on both class k and dimension j. The objective of our method is to obtain posterior class membership probabilities w_{ik} = P(C_{i} = k | Y_{i}). Each class has a profile that is obtained from the collection of parameters α_{kj} and β_{kj}, or equivalently the first two moments:

μ_{kj} = α_{kj}/(α_{kj} + β_{kj}),   σ²_{kj} = α_{kj}β_{kj}/[(α_{kj} + β_{kj})²(α_{kj} + β_{kj} + 1)].   (1)
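These first two moments are the standard mean and variance of a beta distribution, and can be computed directly from the class parameters, as in the following sketch (function name ours):

```python
def beta_moments(alpha, beta):
    # Mean and variance of a Beta(alpha, beta) variable, i.e. the two
    # moments that summarize a class profile at one locus.
    s = alpha + beta
    mean = alpha / s
    var = alpha * beta / (s ** 2 * (s + 1.0))
    return mean, var

# A symmetric Beta(2, 2) locus has mean 0.5 and variance 0.05.
m, v = beta_moments(2.0, 2.0)
```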
Simulation
Table 1 displays the classification error and computation time resulting from a simulation study described below in the Methods section. In both cases simulated, the mixture models outperformed the nonparametric methods in terms of classification error. For Case I, based on normal tissue data and described below, our proposed recursively-partitioned mixture model outperformed all other methods, including the standard sequentially-fit mixture models where different values of K are assumed and the results compared via Bayesian Information Criterion (BIC). In addition, two different implementations of our method, each using different BIC criteria for splitting, produced identical results, including classification. For Case II, based on artificial parameters representing extremes of mean and variability, all mixture models performed equally well, and both versions of our proposed method produced the same results. In general, the mixture models had longer computation time than the nonparametric methods (hierarchical clustering and HOPACH); however, we note that the mixture models were implemented as interpreted code in R, while the nonparametric methods were precompiled programs with R interfaces. Note that the recursively-partitioned mixture model was anywhere from 3 to 8 times faster than the sequentially-fit mixture model. Table 2 summarizes the number of classes found with hierarchical clustering in combination with dynamic tree cutting, with both versions of HOPACH, and with our proposed method using the standard BIC. For the cases considered, the mixture models almost always found the correct number of classes if J was high enough, while the nonparametric methods often had difficulty finding the correct number of classes. We obtained qualitatively similar results in additional simulations employing different distribution assumptions, though we omit the results.
Table 1. Classification error and computation time for various clustering methods applied to simulated data.
Table 2. Number of classes obtained for various clustering methods applied to simulated data
In the Methods section we propose a variant of BIC to compare model fit for different numbers J of loci. For Case I, the mixture models always minimized the "augmented BIC" at J = 1000, while for Case II, the mixture models always minimized the augmented BIC at J = 25. For Case I, nearly all 1413 dimensions were at least somewhat informative; it is interesting to note that the criterion was always minimized at the highest value of J for this case. For Case II, the number of informative dimensions was 20, so the minimizing value J = 25 was the closest, among the values of J considered in this simulation, to the true number of informative markers. In additional simulations that used a finer mesh of J, the criterion was minimized at J = 20. Similar results were obtained when the classes were less balanced (e.g. Case I with class probabilities respectively 0.15, 0.30, 0.2, 0.25, and 0.1).
Normal Tissue
We applied the recursively-partitioned mixture model algorithm to the normal tissue data described below in the Methods section. For this analysis, we attempted to split a node only if the weight assigned to the node was greater than 5. The augBIC_{J} statistic, which we propose as a comparative measure of model fit for different numbers J of loci, was minimized at J = 1413, and the algorithm found 23 classes, whose profiles [mean values calculated using (1)] are depicted in Figure 1. All posterior class membership probabilities were indistinguishable from 0 or 1 within numerical error. Table 3 displays the cross-classification between mixture model latent class and tissue sample type. Blood samples were completely separated from other solid tissue samples. In addition, adult blood samples were completely separated from newborn blood samples obtained from Guthrie cards. Placenta samples were also separated from other tissues aside from a single pleura sample. For the most part, head and neck tissue and brain were separated from other samples, but were poorly distinguished from each other. These results were consistent with a Random Forests analysis [20] (8.9% total classification error, additional results not shown), in which we found blood perfectly classified, low classification error for placenta, and some confusion among head and neck tissue and brain tissue.
Table 3. Crossclassification of sample type with latent classes obtained from proposed method
Figure 1. Profiles of latent classes among normal tissue samples. Average value (equation 1) depicted by color: yellow = 1.0, black = 0.5, blue = 0.0. Classes are separated by yellow dividing line, with height indicating the relative proportion of subjects within each class. Loci are ordered by their position in a dendrogram obtained via hierarchical clustering.
Using a permutation test with a chi-square statistic, the P value for the hypothesis of no association between class and sample type was less than 0.0001. Thus, our proposed method found clusters relevant to sample type. In addition, a permutation test using a Kruskal-Wallis test statistic produced P < 0.0001 for the hypothesis of no difference in mean age among the classes. Interestingly, when the clustering and subsequent hypothesis test were restricted to blood, the P value remained < 0.0001 for the hypothesis of no difference in mean age among latent classes. Between the two classes found among the adult liquid blood samples, age was significantly different (P < 0.0039), consistent with known associations between age and methylation.
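A permutation test of this kind can be sketched as follows; the function and variable names are ours, and the toy labels in the usage line are not part of the original analysis:

```python
import numpy as np
from scipy.stats import chi2_contingency

def perm_chisq_pvalue(classes, sample_type, n_perm=500, seed=0):
    # Permutation P value for the null of no association between latent
    # class and sample type; the chi-square statistic is used only as a
    # test statistic, so its asymptotic reference distribution is not needed.
    classes = np.asarray(classes)
    sample_type = np.asarray(sample_type)

    def chisq(a, b):
        table = np.zeros((a.max() + 1, b.max() + 1))
        np.add.at(table, (a, b), 1.0)        # build the contingency table
        return chi2_contingency(table)[0]

    observed = chisq(classes, sample_type)
    rng = np.random.default_rng(seed)
    hits = sum(chisq(rng.permutation(classes), sample_type) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)         # add-one-smoothed P value

# Perfectly associated toy labels yield a small P value.
p = perm_chisq_pvalue([0] * 10 + [1] * 10, [0] * 10 + [1] * 10)
```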
Note that the recursion path contains information about relative similarity. For example, in Figure 1, the classes (0...) seem rather distinct from the classes (1...). There is a band of loci that show high methylation in two classes at the top of the figure, indexed as 11101 and 11100, corresponding to brain tissues, but not in the other classes in the left node (1...), mostly corresponding to blood samples; this band also occurs at the bottom of Figure 1, in classes such as 01010, which also include brain tissues. However, as in any hierarchical clustering procedure, the ordering of the children within a node is meaningless, so that classes (11...) and (10...) could have their positions swapped. There is a wider band of loci to the right which, though the contrast may be subtle, appears to drive the distinction between (0...) and (1...). However, because the initial levels of recursion may constitute only a crude splitting of the data, the tree representing the recursion path likely has greater meaning only at lower levels of recursion.
We also analyzed the normal tissue methylation data using HOPACH. The greedy version of the algorithm produced only 4 classes. The "best" version produced 9 clusters, which are cross-classified with tissue type in Table 4 and with the latent classes obtained from our proposed method in Table 5. As Table 5 shows, the classes found by our proposed method were, for the most part, subsets of the 9 classes found using HOPACH, with a few exceptions that involve minor disagreements in classification. While the apparent compactness of the HOPACH classification seems, at first glance, more attractive than the classification produced by our model-based method, we remark that it has a few subtle problems. It has three singletons, clusters 6, 8, and 9, which could be grouped together with cluster 7 to comprise a class that entirely represents placental tissue. While a similar criticism could be made of our proposed method with respect to classification of blood, some of the classes have verifiable meaning; for example, the classes indexed 1100 and 1101 distinguish age among blood samples taken from adults. HOPACH classification also associates one head and neck tissue sample with blood, and two cervical samples with numerous other tissues in the 5th class. The 5th class associates numerous tissue types, and comprises 6 different classes produced by our proposed method. Table 6 shows the cross-classification of tissue type with these 6 classes for the subjects comprising the 5th HOPACH class. The classification correctly isolates two of three cervix samples, and has a tendency to distinguish pleura from lung samples. Using a permutation test with a chi-square statistic, the P value for the hypothesis of no association between class and sample type among the 6 classes from our model-based method was less than 0.0001, demonstrating that the classes produced by the mixture model have additional information with respect to tissue type.
In order to compare the predictive ability of the two classification schemes overall, we applied the Random Forest algorithm to indicator variables representing HOPACH clusters (using all 9 variables for every bootstrap) and to indicator variables representing our model-based classification (using all 23 variables for every bootstrap). In the former case we obtained a misclassification rate of 18.43%, and in the latter case a misclassification rate of 17.97%, the difference being the misclassification of one less sample using the model-based method. Employing the Random Forest algorithm in a similar manner to predict age, we obtained a mean squared residual of 190.1 for HOPACH and 164.8 for the model-based classification, with variance explained equal to 80.6% and 83.2% respectively. Thus, the model-based classification seems to offer modest improvements over HOPACH in ability to make biological distinctions.
Table 4. Crossclassification of sample type with clusters obtained from HOPACH
Table 5. Crossclassification of latent classes obtained from proposed method with clusters obtained from HOPACH
Table 6. Cross-classification of sample type with latent classes obtained from proposed method among subjects within the 5th class obtained by HOPACH
Discussion
Our proposed method is a model-based version of the HOPACH algorithm [16]: we recursively use a beta mixture model [15] to propose a split of an existing cluster, preserving the split only when it is judged on the basis of BIC to better fit the data. We have found that the integrated-classification-likelihood BIC (ICL-BIC) [15] and BIC produce practically identical results; this result is not unexpected, since the two differ only in a term involving entropy and a perfect classification results in zero entropy. A high-dimensional data set will tend to result in approximately perfect classification of most subjects, even though each subject has the opportunity to be portioned out over multiple clusters (a distinction between our method and nonparametric methods such as HOPACH). Consequently, in the deeper levels of recursion, the algorithm will not depend on the distinction between BIC and ICL-BIC. For the earlier recursion stages, we assume that there is a strong preference for splitting, which overcomes the entropy penalty in the ICL-BIC.
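The relationship between the two criteria can be made concrete in a few lines; this sketch (function name ours) assumes a BIC value and a posterior weight matrix computed elsewhere:

```python
import numpy as np

def icl_bic(bic, W):
    # ICL-BIC = BIC plus the entropy term -2 * sum_ik w_ik log(w_ik),
    # with the convention 0*log(0) = 0.  A perfectly hard classification
    # (all w_ik in {0, 1}) has zero entropy, so ICL-BIC equals BIC.
    W = np.asarray(W)
    logW = np.log(np.clip(W, 1e-300, 1.0))   # clip so 0*log(0) -> 0
    return bic - 2.0 * np.sum(W * logW)

hard = [[1.0, 0.0], [0.0, 1.0]]   # zero entropy: ICL-BIC == BIC
fuzzy = [[0.5, 0.5], [0.5, 0.5]]  # positive entropy: ICL-BIC > BIC
```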
Siegmund et al. (2004) argue that model-based clustering is preferred in this context over hierarchical clustering [13], a finding borne out in our simulations. One reason for the superior performance, at least in a high-dimensional context, is that the metric used to characterize the differences in nonparametric contexts may be relatively insensitive to differences in particular dimensions. This may play a role in the apparent differences in classification of normal tissue between our proposed method and HOPACH.
K-means has been used recently to cluster methylation outcomes [12], though the work of van der Laan and Pollard (2003) seems to suggest that HOPACH may yield results that are superior to K-means. In particular, with K-means it is difficult to know how many classes are inherent in the data without resampling-based methods such as the gap statistic [21], with implications for scalability. Also, the "curse of dimensionality" would tend to degrade the performance of procedures such as K-means when there are a large number of clusters and the observed data is of high dimension. In general, nonparametric methods such as the fanny algorithm [17] rely on tuning parameters that are difficult to optimize without resampling. An additional problem with nonparametric procedures is that they typically consider only the first moment (mean) of the underlying distributions, ignoring the second moment (variance), which for DNA methylation as measured by the GoldenGate assay may play a critical role in distinguishing tissues.
We propose a dimension-reduction strategy which simply ranks candidate dimensions on the basis of some criterion such as variance, fits the top J dimensions in a mixture model, and employs an augmented version of BIC to compare model fit across different values of J. This is a departure from the penalized-likelihood methods of the kind described in [22], which would become computationally difficult for truly high-dimensional data. Our approach is similar in spirit to supervised principal components methods such as [23]. Interestingly, for the normal tissue data, all 1413 loci were found to be informative. The implication is that methylation at even the least variable locus, COL6A1_P283_F (a locus on chromosome 21 within a gene that encodes the extracellular matrix component collagen VI), contains information about tissue type. In fact, boxplots showing the distribution of COL6A1_P283_F methylation demonstrate great heterogeneity in apparent distribution by tissue type [see Additional file 1], even though all methylation average beta values were less than 0.05. The Kruskal-Wallis P value for testing the association of COL6A1_P283_F methylation with tissue type was P < 0.0001. While this difference could be a subtle artifact of plate-to-plate variability and heterogeneity in plate sample composition, it also may be that the average beta measured by the GoldenGate assay is in fact an average of methylation status over different cell types: that is, tissue samples may consist of cells having heterogeneous methylation states, with each tissue type having a unique mixture. A separate paper describing analysis of normal tissue in detail is under preparation.
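The ranking step of this strategy is straightforward; a minimal sketch (function name ours), with a small toy matrix as the usage example:

```python
import numpy as np

def top_variable_loci(Y, J):
    # Rank loci (columns of the n-by-L data matrix Y) by sample variance
    # and return the column indices of the J most variable loci, in
    # descending order of variance.
    order = np.argsort(Y.var(axis=0))[::-1]
    return order[:J]

Y = np.array([[0.1, 0.0, 0.50],
              [0.9, 0.0, 0.51],
              [0.1, 0.0, 0.50],
              [0.9, 0.0, 0.52]])
idx = top_variable_loci(Y, 2)  # locus 0 varies most, then locus 2
```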
Additional file 1. Distribution of DNA methylation average beta values by tissue type at least variable locus.
Format: JPEG Size: 60KB
We remark that we did not normalize the different plates used in laboratory analysis, for reasons described below. However, when we used the analytical methods of this paper to classify methylation profiles within plate strata and within tissue type strata, we found much greater variability between tissue types than between plates (results not shown). Our results are consistent with expectations developed from an understanding of the current literature on DNA methylation in normal tissues, though we cannot entirely rule out subtle biases introduced by variation between plates. It may be possible to incorporate plate effects in our proposed methodology by modeling the mean response (e.g., see [22]), but the computational considerations are beyond the scope of the present paper.
Conclusion
In summary, our method appears to have good properties both with respect to classification error and computation time. It achieves these properties by combining the strengths of modelbased and hierarchical methods, navigating the underlying clusters quickly through recursive partitioning, but doing so in a way that makes use of a reasonable probability model. This model is also used to compare different dimensions J of input, thus refining the discriminative ability in a scalable manner.
Methods
Normal Tissue Data
Our proposed method is motivated by methylation array data obtained for normal tissue. We extracted DNA from 217 normal tissue samples, modified it with sodium bisulfite, and processed it on the Illumina GoldenGate methylation platform. Tissues were assembled by a collaborative, multi-institutional network of principal investigators conducting molecular epidemiologic studies of human cancer. Participating institutions include the International Mesothelioma Program at Brigham and Women's Hospital, Brown University, Dartmouth-Hitchcock Medical Center, University of California – San Francisco, Brain Tumor SPORE program, University of Massachusetts – Lowell, and the University of Minnesota. Tissues were obtained through Institutional Review Board approved studies already underway at these institutions, or purchased from the National Disease Research Interchange (NDRI). A variety of normal tissue types were assembled: bladder (n = 5), blood (n = 85), brain (n = 12), cervix (n = 3), head and neck (n = 11), kidney (n = 6), lung (n = 53), placenta (n = 19), pleura (n = 18), and small intestine (n = 5). All tissue samples were from adults except for n = 55 Guthrie card-derived blood samples from newborns. Figure 2 illustrates the methylation pattern for all 217 subjects and 1413 loci passing quality-assurance procedures (median detection P value < 0.05).
Figure 2. Unadjusted Average Beta values obtained from Illumina GoldenGate methylation platform for 1413 tumor suppressor loci on 217 normal tissue samples. Yellow = 1.0, black = 0.5, blue = 0.0. Autosomal chromosomes are grouped to aid visualization. For each chromosome group, loci are ordered by their position in a dendrogram produced by hierarchical clustering. Similarly, within tissue sample groups, samples are ordered by their position in a hierarchical clustering dendrogram.
These data are part of a larger project comparing normal tissue to tumors for many different types of tumor (e.g. bladder, brain, leukemia, lung, mesothelioma, etc.); each of these comparisons constitutes a distinct set of analyses deserving of a separate, focused analysis, so we omit the details from the present manuscript. In addition, a more comprehensive analysis of normal tissue will appear in a separate manuscript targeted to biologists. Within logistical constraints (e.g. availability of normal tissue, which is difficult to obtain for some tissue types, and time and budget constraints) we randomized tissue type to 14 plates (96 wells each). However, because of the great number of different tissue types, the difficulty in collecting normal tissue of specific types (recall that methylation profiles differ profoundly by tissue type), and heterogeneity of both normal and tumor samples, we expect the plate-to-plate heterogeneity to be dominated by the heterogeneity of sample type on each plate. When we used the methods of this paper to classify methylation profiles, stratifying first by plate and examining associations with sample type, then stratifying by distinct tissue type and examining associations with plate, we found that there are unequivocal differences between sample types on a single plate, and generally no differences between plates within a given sample type, with the possible exception of placenta, whose methylation values have greater variability and therefore may have more sensitivity to random heterogeneity with small sample size. On the other hand, attempts to normalize plates using standard techniques may result in overcorrection, since plate-to-plate variability is mostly driven by sample heterogeneity, and corrections will tend to average out the distinctions of interest.
Note that in addition to providing average "beta" values, the GoldenGate platform also provides the averages M̄ and Ū (average fluorescence for methylated and unmethylated alleles). For our data, the quantities M̄/(Ū + M̄) were close to the reported average "beta" values, with the standard deviation of the difference between the two less than 0.01, a fact which has implications below.
Recursive Partitioning for a Beta Mixture Model
Under the assumption that Ū and M̄ approximately follow gamma distributions on the same scale, the quantity M̄/(Ū + M̄), which is close to the reported average "beta" value, follows a beta distribution. In addition, the mean of correlated Bernoulli variables can be modeled by (1), with an appropriate reparameterization of scale parameters [24]; there is consequently a biological motivation for using a beta distribution. For each subject i, we assume class membership C_{i} ∈ {1, ..., K}. For methylation data Y_{ij} at locus j from subject i, we assume the distribution
Y_{ij} | C_{i} = k ~ Beta(α_{kj}, β_{kj}),

which depends on both class k and locus j through the parameters α_{kj} and β_{kj}. Under the assumption that C_{i} = k with probability η_{k}, with η_{1} + ... + η_{K} = 1, and that methylation at each locus is independent conditional on class membership, the likelihood contribution from subject i is

Σ_{k=1}^{K} η_{k} Π_{j=1}^{J} f(y_{ij}; α_{kj}, β_{kj}),

where f( · ; α, β) denotes the Beta(α, β) density.
With observed data D = {y_{1}, ..., y_{n}}, we then maximize the full-data log-likelihood

ℓ(α, β, η) = Σ_{i=1}^{n} log[ Σ_{k=1}^{K} η_{k} Π_{j=1}^{J} f(y_{ij}; α_{kj}, β_{kj}) ]   (2)
with respect to the set of all parameters (α, β, η) to be estimated:
α = (α_{11}, ..., α_{1J}, α_{21}, ..., α_{KJ}), β = (β_{11}, ..., β_{1J}, β_{21}, ..., β_{KJ}), and η = (η_{1}, ..., η_{K-1}). This is easily achieved using an Expectation-Maximization (EM) algorithm [25]. Briefly, we initialize the procedure with an n × K matrix of weights W = (w_{ik}) whose rows sum to one. The rows represent initial guesses at class membership probabilities for each subject. For each k, we set η_{k} = n^{-1} Σ_{i} w_{ik} and maximize the quantity

Σ_{i=1}^{n} w_{ik} Σ_{j=1}^{J} log f(y_{ij}; α_{kj}, β_{kj}) + Q_{k},   (3)
where Q_{k} is constant with respect to parameters, to obtain the α and β parameters corresponding to class k. Reversing the order of summation in (3) shows that each dimension j can be maximized separately. We subsequently update the weights as follows:

w_{ik} = η_{k} Π_{j=1}^{J} f(y_{ij}; α_{kj}, β_{kj}) / Σ_{k'=1}^{K} η_{k'} Π_{j=1}^{J} f(y_{ij}; α_{k'j}, β_{k'j}),
iterating until ℓ(α, β, η) does not change. The final weight w_{ik} represents the posterior probability that subject i belongs to class k, i.e. w_{ik} = P(C_{i} = k | Y_{i}). As for most finite-mixture methods, we might decide on the number of classes K by fitting mixture models for a range of possible values of K, computing the BIC statistic

BIC = -2 ℓ(α̂, β̂, η̂) + (2KJ + K - 1) log n   (4)
and selecting the value of K corresponding to the minimum BIC. In the context of beta mixture models, a modified version of BIC is available [15]: the ICL-BIC is defined as the sum of (4) and the entropy term -2 Σ_{i} Σ_{k} w_{ik} log(w_{ik}) (where 0 log(0) = 0). The entire operation has approximate complexity O(nJK_{max}²), where K_{max} is the maximum number of classes attempted. The square term arises under the assumption that for a single model with K classes, the complexity will be of order nJK.
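The EM iteration and BIC computation described above can be sketched as follows. For brevity, this illustration replaces the per-locus maximum-likelihood beta fit of equation (3) with a weighted method-of-moments fit, and all function and variable names are ours; it is a sketch, not the paper's implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import beta as beta_dist

def em_beta_mixture(Y, W, n_iter=30):
    # EM for a K-class beta mixture on an n-by-J matrix Y of beta values,
    # starting from an n-by-K weight matrix W whose rows sum to one.
    n, J = Y.shape
    K = W.shape[1]
    A = np.ones((K, J))
    B = np.ones((K, J))
    for _ in range(n_iter):
        eta = W.mean(axis=0)                      # class prevalences
        for k in range(K):                        # M-step (moment matching)
            wk = W[:, k] / W[:, k].sum()
            m = wk @ Y
            v = np.maximum(wk @ (Y - m) ** 2, 1e-8)
            c = np.clip(m * (1.0 - m) / v - 1.0, 1e-2, None)
            A[k], B[k] = m * c, (1.0 - m) * c
        # E-step: posterior weights w_ik, computed stably on the log scale.
        logp = np.stack([np.log(eta[k])
                         + beta_dist.logpdf(Y, A[k], B[k]).sum(axis=1)
                         for k in range(K)], axis=1)
        ll = logsumexp(logp, axis=1)              # per-subject log-likelihood
        W = np.exp(logp - ll[:, None])
    bic = -2.0 * ll.sum() + (2 * K * J + K - 1) * np.log(n)  # as in (4)
    return W, bic

# Two well-separated simulated classes, recovered from a crude starting split.
rng = np.random.default_rng(1)
Y = np.vstack([rng.beta(2, 10, (60, 5)), rng.beta(10, 2, (40, 5))])
start = (Y.mean(axis=1) > 0.5).astype(float)
W0 = np.column_stack([1.0 - start, start]) * 0.9 + 0.05
W, bic = em_beta_mixture(Y, W0)
labels = W.argmax(axis=1)
```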
Because likelihoods for model-based clustering algorithms can be multimodal [14,26], commercial mixture model software packages often use multiple starting values for fitting the model, and subsequently choose the estimates corresponding to the maximum likelihood. However, careful choice of starting values can often minimize the effort [22,26]. One option is to use hierarchical clustering to find K clusters (cutting the clustering dendrogram at the appropriate height) and to construct a weight matrix W corresponding to these clusters. Another, similar, option is to use a fuzzy clustering algorithm such as the fanny algorithm [17] available in the R package cluster.
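One such hierarchical-clustering initialization can be sketched with standard tools; the function name, the softening value 0.9, and the average-linkage choice are illustrative assumptions of ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def init_weights(Y, K, soft=0.9):
    # Cut an average-linkage dendrogram at K clusters, then soften the hard
    # labels into a starting n-by-K weight matrix W whose rows sum to one.
    labels = fcluster(linkage(pdist(Y), method="average"), K,
                      criterion="maxclust")
    n = len(labels)
    W = np.full((n, K), (1.0 - soft) / (K - 1))
    W[np.arange(n), labels - 1] = soft       # fcluster labels are 1-based
    return W

# Two tight toy clusters give a clean two-class starting weight matrix.
Y = np.array([[0.10, 0.10], [0.12, 0.10], [0.90, 0.90], [0.88, 0.92]])
W0 = init_weights(Y, 2)
```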
We now propose a recursive method that, on average, has complexity nJK log(K), where K is the true number of classes. Consider the following weighted-likelihood version of (2):

ℓ_{ω}(α, β, η) = Σ_{i=1}^{n} ω_{i} log[ Σ_{k=1}^{K} η_{k} Π_{j=1}^{J} f(y_{ij}; α_{kj}, β_{kj}) ].   (5)
When ω_{i} ≡ 1 for all i, (2) and (5) are equivalent. When 0 < ω_{i} < 1, subject i only partially contributes to estimation, and when ω_{i} = 0, subject i is excluded entirely from consideration in the model. The EM algorithm described above is easily modified by multiplying each w_{ik} by ω_{i} in (3), where the interpretation is that the classes under consideration are only a partial set, and that subject i belongs to one of these classes only with probability ω_{i}. If we begin by fitting a 2-class model to the entire data set, the result is two sets of posterior weights representing the posterior probabilities of membership in each of the two classes. Under the assumption that each of these classes can be further split, and that each subject belongs to the subsequent splits only with probability equal to the weight assigned to the unsplit class, we apply the weighted-likelihood EM algorithm to obtain the two classes corresponding to the new split.
To make this idea more precise, define a concatenation operation τ on a sequence of binary values r = (q_{1}, ..., q_{T}), as τ(r, q) = (q_{1}, ..., q_{T}, q). This provides a natural notation for recursive binary partitioning, where longer sequences represent deeper levels of recursion. The first two-class model, initialized by nonparametric cluster analysis, results in two sets of weights, ω_{(0)} = (w_{11}, ..., w_{n1}) and ω_{(1)} = (w_{12}, ..., w_{n2}). For any sequence r, a mixture model can be attempted using the weighted EM algorithm with weights ω_{r} = (ω_{r,1}, ..., ω_{r,n}). If the EM algorithm fails, then we terminate the recursion at that point, but if the EM algorithm succeeds, we can set new weights ω_{τ(r,0),i} = ω_{r,i}w_{i1} and ω_{τ(r,1),i} = ω_{r,i}w_{i2}, where w_{i1} and w_{i2} are the posterior weights from the two-class model fit at node r, and continue the recursion. Note that at each level of recursion, the weights become smaller; since a mixture model becomes unstable with small weights (corresponding to small numbers of pseudo-subjects), the recursion ultimately terminates completely at a set of leaf nodes corresponding to unsplit classes. We can stabilize this process by terminating the recursion if the sum of the weights is less than some prespecified value (e.g. 5). We can also terminate early if the split leads to a less parsimonious representation of the data. To this end, we propose the following weighted versions of BIC:
where the first set of parameters, defining wtdBIC_{2}, is obtained from the two-class mixture model and the second set, defining wtdBIC_{1}, is obtained from a one-class model. Alternatively, one may add an entropy term to wtdBIC_{2}(r) to produce the ICL version of this BIC value; note that for a one-class model the entropy is always zero. If wtdBIC_{2}(r) is greater than wtdBIC_{1}(r), we terminate the recursion at node r. The worst-case complexity of this algorithm is of order nJ log(n). However, at deeper levels of recursion, two-class models will tend to fit poorly relative to single-class models, and most nodes will terminate before descending to the deepest levels. We have demonstrated empirically that the proposed method tends to terminate with the number of leaf classes equal to the true number of classes, so that the average complexity is typically of approximate order nJK log(K). Furthermore, at the deeper levels, subjects whose weights are negligible can be dropped entirely from the weighted EM algorithm, so that the effective sample size for node-level fits at deeper levels is much less than n.
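The recursion driver itself is simple once the node-level fit is abstracted away. The following Python sketch is ours, not the released R software: `try_split` is a hypothetical callback standing in for the weighted EM fit, expected to return the two responsibility vectors and the two weighted BIC values (or None when the EM fails), and the tuple `r` plays the role of the τ-concatenated node label.

```python
import numpy as np

def rpmm(omega, r, try_split, min_weight=5.0, leaves=None):
    """Recursive partitioning driver. `r` is the binary sequence naming the
    node (the tau concatenation from the text); `try_split(omega)` returns
    responsibilities (w1, w2) for a two-class fit plus (wtdBIC_2, wtdBIC_1),
    or None if the EM fails. A node becomes a leaf when its total weight is
    small, the EM fails, or the split is less parsimonious."""
    if leaves is None:
        leaves = {}
    if omega.sum() < min_weight:          # prespecified minimum weight (e.g. 5)
        leaves[r] = omega
        return leaves
    fit = try_split(omega)
    if fit is None:                       # EM failure terminates this branch
        leaves[r] = omega
        return leaves
    w1, w2, bic2, bic1 = fit
    if bic2 > bic1:                       # wtdBIC_2 > wtdBIC_1: do not split
        leaves[r] = omega
        return leaves
    rpmm(omega * w1, r + (0,), try_split, min_weight, leaves)   # tau(r, 0)
    rpmm(omega * w2, r + (1,), try_split, min_weight, leaves)   # tau(r, 1)
    return leaves
```

Calling `rpmm(np.ones(n), (), try_split)` returns a dictionary mapping each leaf's binary sequence to its final subject weights, from which the leaf-level class prevalences are obtained by summing and dividing by n.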
Dimension reduction
Noninformative loci may lead to excessive noise in the solution. Regularization methods may be used to constrain the degrees of freedom, leading to more precise solutions [22,27]; however, in extremely high dimensions regularization can also increase computation time and curtail scalability. We propose an alternative, in which all L starting loci are ordered with respect to variance, and the J most variable loci are selected for inclusion in the recursive algorithm described above. A final BIC value can be obtained using (4) by considering all leaf-level unsplit classes as distinct clusters, with class prevalence parameter vector η obtained by summing the final weights and dividing by n. However, this BIC is not comparable across different values of J. Note that the exclusion of L − J loci is equivalent to the assumption that all K classes have identical distributions for the excluded loci. Thus, beta distributions can be fit to each excluded locus using maximum likelihood, and the resulting parameter estimates included in a final BIC statistic. Specifically, the likelihood for the full data y = (y^{(1)}, y^{(2)}), where we assume the dimensions have been ordered by descending variance and y^{(2)} represents the data excluded from the mixture model analysis, can be expressed as

f(y) = f_{K}(y^{(1)}; Θ) ∏_{j = J+1}^{L} ∏_{i = 1}^{n} f(y_{ij}; α_{j}, β_{j}),

where f_{K} is the K-class mixture likelihood for the selected loci and f(·; α_{j}, β_{j}) is a single beta density for excluded locus j.
The "augmented" BIC is now

augBIC_{J} = BIC_{J} − 2 ∑_{j = J+1}^{L} ∑_{i = 1}^{n} log f(y_{ij}; α̂_{j}, β̂_{j}) + 2(L − J) log(n),

where BIC_{J} is the BIC computed for just the J selected loci. The augmented BIC is comparable across different values of J. As we have demonstrated, augBIC_{J} leads to sensible dimension reduction. Again, an ICL version of augBIC_{J} may be used instead.
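The variance ordering and BIC augmentation can be sketched as follows. This Python fragment is illustrative only (the authors' software is in R); `bic_J` is assumed to have been computed from the mixture fit on the J selected loci, and the excluded-locus beta fits use a generic Nelder-Mead maximization of the likelihood.

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.optimize import minimize

def beta_mle(y):
    """Maximum-likelihood beta fit for one excluded locus, optimizing over
    log-parameters to keep alpha and beta positive."""
    nll = lambda p: -beta_dist.logpdf(y, np.exp(p[0]), np.exp(p[1])).sum()
    res = minimize(nll, np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x)

def augmented_bic(Y, J, bic_J):
    """augBIC_J = BIC_J - 2 * loglik(excluded loci) + 2 (L - J) log n,
    which makes BICs comparable across different numbers of selected loci."""
    n, L = Y.shape
    order = np.argsort(Y.var(axis=0))[::-1]   # loci by descending variance
    loglik = 0.0
    for j in order[J:]:                       # the L - J excluded loci
        a, b = beta_mle(Y[:, j])
        loglik += beta_dist.logpdf(Y[:, j], a, b).sum()
    return bic_J - 2.0 * loglik + 2.0 * (L - J) * np.log(n)
```

When J = L no loci are excluded and the augmented BIC reduces to BIC_{J}, as the formula requires.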
Simulation
We conducted simulations to compare the properties of our proposed method with those of similar competing methods. For our first case (Case I), each simulated data set consisted of n = 100 subjects, each having 1413 continuous responses lying in the unit interval. Each subject was a member of one of 5 classes, each class occurring with probability 0.2. The classes were defined by beta-distribution parameters for each of L = 1413 methylation loci that were autosomal and passed quality assurance, obtained by fitting a beta model on each locus to one of five data sets from our normal data: adult blood, newborn blood, placenta, lung/pleura, and everything else. Figure 3(A) illustrates a typical data set generated from these parameters. For each data set, we conducted 7 analyses, each using the J most variable loci, J ∈ {25, 50, 500, 1000}. The first analysis used hierarchical clustering, implemented using hclust in R with the Euclidean metric and average linkage, and assigned 5 classes by cutting the resulting dendrogram at the appropriate height using the cutree function. The second analysis used the same clustering procedure, but determined the number of classes using the Dynamic Tree Cutting algorithm [28], implemented via the R package dynamicTreeCut with default settings. The third analysis used HOPACH (R hopach package) to select the "best" classes as defined in the function settings. The fourth analysis used HOPACH with classes obtained by the "greedy" version of the algorithm. The fifth analysis fit 6 sequential mixture models (1 ≤ K ≤ 6), each initialized in two different ways (hierarchical clustering and the fanny algorithm), selecting the value of K producing the lowest BIC. The sixth and seventh analyses were applications of our proposed method, using ICL-BIC and BIC respectively. In the last three types of analysis, we recorded the value of J that produced the best augmented BIC.
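The first comparison analysis has a direct scipy analogue; the sketch below is ours (the paper itself uses R's hclust and cutree), but it reproduces the same choices: Euclidean distances, average linkage, and a cut of the dendrogram into a fixed number of classes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hclust_cut(Y, k):
    """Average-linkage hierarchical clustering of the rows of Y under the
    Euclidean metric, cut to k classes; a scipy analogue of the R
    hclust/cutree analysis described in the text."""
    Z = linkage(Y, method="average", metric="euclidean")
    return fcluster(Z, t=k, criterion="maxclust")
```

For Case I, `k` would be 5, matching the number of simulated classes.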
Note that each mixture model analysis tended to produce perfect classification, so that an ICL-BIC version of the time-consuming fifth analysis was deemed unnecessary.
Figure 3. Examples of simulated data. Yellow = 1.0, black = 0.5, blue = 0.0. True classes indicated and separated by yellow dividing line. Height of region indicates the relative number of subjects in each class.
For our second case (Case II), which represented a lower-dimensional setting (L = 200) with greater variation in the variances of the individual beta distributions, we considered 100 subjects from 4 classes, described as follows. Four sets of 10 "informative" beta parameters were drawn randomly at the beginning of the simulation study: a_{1j} ~ Gamma(10, 10), b_{1j} ~ Gamma(10, 10); a_{2j} ~ Gamma(400, 10), b_{2j} ~ Gamma(100, 10); a_{3j} ~ Gamma(100, 10), b_{3j} ~ Gamma(400, 10); and a_{4j} ~ Gamma(100, 1), b_{4j} ~ Gamma(250, 1). These were used to construct four classes of 20 informative dimensions: α_{1} = (a_{2}, a_{1}), α_{2} = (a_{2}, a_{4}), α_{3} = (a_{3}, a_{1}), α_{4} = (a_{3}, a_{4}), where a_{l} = (a_{lj}), and similarly for the β_{k} parameters with b_{l} = (b_{lj}). Each such 20-dimensional parameter was augmented with a set of 180 "noninformative" parameters, constructed as 60 copies of the vector (100, 1, 50) for α_{k} and 60 copies of the vector (1, 100, 50) for β_{k}. The class probabilities were 0.2, 0.3, 0.2, and 0.3, respectively. Although the pattern corresponding to this collection of parameters may be difficult to visualize from the description, Figure 3(B) shows a typical data set generated under these conditions, and reveals a small set of informative markers, some distinguished by mean and others by variability. Similar analyses were conducted for this simulation, except with J ∈ {5, 10, 25, 50}, and 4 classes assumed for hierarchical clustering.
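The Case II construction is easier to follow in code. The sketch below is our reading of the description, with one labeled assumption: the text's Gamma(a, b) is interpreted in the shape/rate parameterization (numpy's gamma takes shape and scale, hence the 1/rate).

```python
import numpy as np

rng = np.random.default_rng(7)

def gam(shape, rate, size):
    # Gamma(shape, rate) parameterization assumed for the text's Gamma(a, b)
    return rng.gamma(shape, 1.0 / rate, size)

# four sets of 10 "informative" beta parameters, drawn once per study
a = {1: gam(10, 10, 10), 2: gam(400, 10, 10), 3: gam(100, 10, 10), 4: gam(100, 1, 10)}
b = {1: gam(10, 10, 10), 2: gam(100, 10, 10), 3: gam(400, 10, 10), 4: gam(250, 1, 10)}

# 20 informative dimensions per class plus 180 "noninformative" ones,
# the latter repeating (100, 1, 50) for alpha and (1, 100, 50) for beta
noise_a = np.tile([100.0, 1.0, 50.0], 60)
noise_b = np.tile([1.0, 100.0, 50.0], 60)
alpha = [np.r_[a[2], a[1], noise_a], np.r_[a[2], a[4], noise_a],
         np.r_[a[3], a[1], noise_a], np.r_[a[3], a[4], noise_a]]
beta_ = [np.r_[b[2], b[1], noise_b], np.r_[b[2], b[4], noise_b],
         np.r_[b[3], b[1], noise_b], np.r_[b[3], b[4], noise_b]]

def simulate(n=100, probs=(0.2, 0.3, 0.2, 0.3)):
    """Draw n subjects; each row holds one subject's 200 beta-distributed loci."""
    classes = rng.choice(4, size=n, p=probs)
    Y = np.array([rng.beta(alpha[k], beta_[k]) for k in classes])
    return Y, classes
```

The informative block differs across classes in both mean and spread, matching the mixture of mean-shifted and variance-shifted markers visible in Figure 3(B).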
Misclassification rate was assessed for all simulated data sets and analyses. Each estimated class was matched to a true class by minimizing the distance between the J means calculated according to (1). When the number of estimated classes was greater than the true number, multiple estimated classes were assigned to a single matching true class, generating no misclassification error when an estimated class merely partitioned a true class. When the number of estimated classes was smaller than the true number, subjects within true classes that failed to match to an estimated class were considered misclassified; in this case, coarsening of the true classes would lead to the smaller absorbed class being judged as misclassified. In the Results section, we show that HOPACH tends to overestimate the number of classes for the cases we considered, so our strategy, which favors inappropriate partitioning over inappropriate coarsening, is conservative with respect to the comparison with HOPACH in this set of simulations.
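The matching rule can be stated compactly. This Python sketch is an illustration of the scoring logic, not the authors' code; class means here stand in for the J means computed from (1).

```python
import numpy as np

def misclassification(true_means, est_means, true_lab, est_lab):
    """Match each estimated class to its nearest true class (Euclidean
    distance between class mean vectors). Several estimated classes may map
    to one true class, so merely partitioning a true class incurs no error;
    a true class that attracts no estimated class has all of its subjects
    counted as misclassified."""
    true_means = np.asarray(true_means)
    mapping = {k: int(np.argmin(np.linalg.norm(true_means - m, axis=1)))
               for k, m in enumerate(np.asarray(est_means))}
    mapped = np.array([mapping[k] for k in est_lab])
    return float((mapped != np.asarray(true_lab)).mean())
```

Partitioning a true class into two estimated classes therefore scores zero error, while absorbing two true classes into one estimated class charges the smaller absorbed class, exactly the conservative asymmetry described above.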
We conducted additional simulations, for which we omit detailed results. We used normal distributions having the same mean and standard deviation implied by the beta-based simulation described in the text. For assessing classification error, estimated classes were matched to their true classes by matching the fitted alpha and beta parameters to the corresponding parameterization of mean and standard deviation, i.e. inverting (1), in the manner described above. We conducted another simulation based on a different configuration: 200 items, each generated as a random normal with mean and standard deviation as described below, but subsequently transformed as n^{−1}(r_{i} − 0.5), where r_{i} is the rank of the ith value, to obtain a rank-based transformation of the data. The means were generated as 25 random standard normal variables (once at the start of the simulation study) for each of 4 classes, concatenated with 175 zeroes; the standard deviations were generated as Gamma(5, 500) variables (once at the start of the simulation study) for each of 4 classes, concatenated with 175 repeats of 0.01. Note that this configuration represents a substantial departure from the beta distribution, since the marginal distribution of the data is forced to be uniform. For assessing classification error, estimated classes were matched to their true classes by matching the fitted alpha and beta parameters to the corresponding parameterization of conditional mean and standard deviation, obtained via simulation.
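The rank transform used in this last configuration is a one-liner; a minimal sketch:

```python
import numpy as np
from scipy.stats import rankdata

def rank_transform(x):
    """The n^{-1}(r_i - 0.5) transform: replace each value by its rank r_i,
    shift by 0.5, and divide by n, forcing the margin onto a uniform grid
    strictly inside the unit interval."""
    r = rankdata(x)                 # average ranks in case of ties
    return (r - 0.5) / len(x)
```

Because the output depends only on ranks, the marginal distribution of the transformed data is uniform regardless of the generating normal, which is what makes this configuration a stress test for the beta mixture.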
Validation of normal tissue results by Random Forest
To validate results, we conducted a supervised Random Forest analysis [20] using the R package randomForest with 20,000 trees and the default value of the mtry parameter (the number of variables randomly sampled as candidates at each split). We also tried values of 14 and 76 for mtry, obtaining slightly greater classification error. Both tissue type (categorical with 11 levels) and age (continuous) were used as response variables. A detailed discussion of these results will appear in a separate manuscript written for biologists. In addition, we conducted a Random Forest analysis on the classification indices for the 9 HOPACH classes and 23 mixture model classes described in the Results section, with mtry = 9 and mtry = 23, respectively, and 20,000 trees.
Availability
An implementation of the methods described in this paper is available [see Additional file 2]. The R environment is required for the software. Additional supporting functions are available from the authors upon request.
Additional file 2. RPMM Software Public Release 11. Compressed folder containing R code with examples on simulated data.
Format: ZIP. Size: 114 KB.
Authors' contributions
EAH developed statistical methodology, with critical intellectual input from RFY. BCC prepared laboratory samples and coordinated analysis with Illumina; BCC and CJM participated in testing software on methylation data obtained from real tissues. CJM, MRK, MW, HHN, JW, SZ, JKW, and KTK contributed tissue samples. All eleven authors participated in weekly conference calls and discussed the biological implications of the analytical methods.
Acknowledgements
Supported by: a Mesothelioma Applied Research Foundation grant, NIH/NIEHS training grant in environmental health sciences T32ES007155, NIH/NIEHS grant P42ES05947, NCI grants CA48061, CA126939, CA078609, CA100679, and CA121147, and FAMRI.
References
1. Russo V, Martienssen RA, Riggs AD: Epigenetic Mechanisms of Gene Regulation. Cold Spring Harbor Laboratory Press; 1996.
2. Knudson AG: Chasing the cancer demon. Annu Rev Genet 2000, 34:1-19.
3. Jones PA, Baylin SB: The fundamental role of epigenetic events in cancer. Nat Rev Genet 2002, 3:415-428.
4. Sakamoto H, Suzuki M, Abe T, Hosoyama T, Himeno E, Tanaka S, Greally JM, Hattori N, Yagi S, Shiota K: Cell type-specific methylation profiles occurring disproportionately in CpG-less regions that delineate developmental similarity. Genes Cells 2007, 12:1123-1132.
5. Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, et al.: DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006, 38:1378-1385.
6. Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, et al.: Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci USA 2005, 102:10604-10609.
7. Frigola J, Song J, Stirzaker C, Hinshelwood RA, Peinado MA, Clark SJ: Epigenetic remodeling in colorectal cancer results in coordinate gene suppression across an entire chromosome band. Nat Genet 2006, 38:540-549.
8. Rakyan VK, Hildmann T, Novik KL, Lewin J, Tost J, Cox AV, Andrews TD, Howe KL, Otto T, Olek A, et al.: DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS Biol 2004, 2:e405.
9. Schilling E, Rehli M: Global, comparative analysis of tissue-specific promoter CpG methylation. Genomics 2007, 90:314-323.
10. Shann YJ, Cheng C, Chiao CH, Chen DT, Li PH, Hsu MT: Genome-wide mapping and characterization of hypomethylated sites in human tissues and breast cancer cell lines. Genome Res 2008.
11. Song F, Smith JF, Kimura MT, Morrow AD, Matsuyama T, Nagase H, Held WA: Association of tissue-specific differentially methylated regions (TDMs) with differential gene expression. Proc Natl Acad Sci USA 2005, 102:3336-3341.
12. Shen L, Kondo Y, Guo Y, Zhang J, Zhang L, Ahmed S, Shu J, Chen X, Waterland RA, Issa JPJ: Genome-wide profiling of DNA methylation reveals a class of normally methylated CpG island promoters. PLoS Genetics 2007, 3:e181.
13. Siegmund KD, Laird PW, Laird-Offringa IA: A comparison of cluster analysis methods using DNA methylation data. Bioinformatics 2004, 20:1896-1904.
14. Stephens M: Dealing with label switching in mixture models. Journal of the Royal Statistical Society Series B 2000, 62:795-809.
15. Ji Y, Wu C, Liu P, Wang J, Coombes KR: Applications of beta-mixture models in bioinformatics. Bioinformatics 2005, 21:2118-2122.
16. Laan MJ, Pollard KS: A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference 2003, 117:275-303.
17. Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.
18. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
19. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Boca Raton, Florida: Chapman & Hall; 1984.
20. Breiman L: Random forests. Machine Learning 2001, 45:5-32.
21. Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J Royal Statist Soc B 2001, 63:411-423.
22. Houseman EA, Coull BA, Betensky RA: Feature-specific penalized latent class analysis for genomic data. Biometrics 2006, 62:1062-1070.
23. Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2004, 2:e108.
24. Lefkopoulou M, Moore D, Ryan L: The analysis of multiple correlated binary outcomes: application to rodent teratology experiments. Journal of the American Statistical Association 1989, 84:810-815.
25. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 1977, 39:1-38.
26. Leroux BG, Puterman ML: Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models. Biometrics 1992, 48:545-558.
27. Fraley C, Raftery AE: Bayesian regularization for normal mixture estimation and model-based clustering. Technical report, Department of Statistics, University of Washington; 2005.
28. Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 2008, 24:719-720.