The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the “optimal” number of clusters in bio-molecular data. A major problem with stability-based methods is the detection of multi-level structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed.
To assess the statistical significance and to discover multi-level structures in bio-molecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method.
The Bernstein test, due to its loose assumptions, is more sensitive than the chi-square test to the detection of multiple structures simultaneously present in the data. Nevertheless it is less selective, that is subject to more false positives, but adding independence assumptions, a more selective variant of the Bernstein inequality-based test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of bio-molecular data.
Unsupervised cluster analysis of bio-molecular data is one of the main and well-established research lines in bioinformatics . Classes of co-expressed genes, classes of functionally related proteins, or subgroups of patients with malignancies differentiated at bio-molecular level can be discovered through clustering algorithms, and several other tasks related to the analysis of bio-molecular data require the development and application of unsupervised clustering techniques [2-4]. Anyway, in most bioinformatics problems, we need to assess the reliability of the discovered clusters, as well as the proper selection of the “natural” number of clusters underlying the data .
Recently, several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters [6,7]: multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures may be applied to randomly perturb the data, ranging from bootstrapping techniques , to noise injection into the data  or random projections into lower dimensional subspaces [10,11].
A major problem with stability-based methods is the detection of multi-level structures underlying the data (e.g. hierarchical subclasses of diseases, or hierarchical functional classes of genes). For instance, it is possible that data exhibit a hierarchical structure, with subclusters inside other clusters, and we need to detect these multi-level structures, possibly estimating their reliability and statistical significance. In , it is proposed a χ2-based statistical test of hypothesis to assess the significance of the “optimal” number of clusters and to discover multiple structures simultaneously present in bio-molecular data; however, by this approach, on one hand some assumptions about the distribution of the similarity measures are needed to estimate the reliability of the obtained clusterings, and on the other hand test results depend on the choice of user-defined parameters.
In this contribution we propose a distribution-free approach that does not assume any “a priori” distribution of the similarity measures, and that does not require any user-defined additional parameter. The proposed approach is based on the classical Bernstein inequality , and for its loose assumptions about the distribution of the data may in principle be applied to any unsupervised model order selection problem. More precisely the proposed stability-based method may be applied to several tasks related to the unsupervised analysis of complex bio-molecular data: (a) the assessment of the reliability of a given clustering solution; (b) the clustering model order selection, that is the discovery of the “natural” number of clusters in the data; (c) the assessment of the statistical significance of a given clustering solution; (d) the discovery of multiple structures underlying the data, i.e. the detection of multiple reliable clustering solutions at a given significance level.
In the following sections we summarize the characteristics of the stability-based procedures for the assessment of the reliability of clusterings, and we introduce our proposed method based on the Bernstein inequality.
Model order selection through stability based procedures
Let be C a clustering algorithm, ρ(D) a given random perturbation procedure applied to a data set D and sim a suitable similarity measure between two clusterings (e.g. the Jaccard similarity ). Among the random perturbations we recall random projections from a high dimensional to a low dimensional subspace , or bootstrap procedures to sample a random subset of data from the original data set D. Fixing an integer k (the number of clusters), we define Sk (0 ≤ Sk ≤ 1) as the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm C to data pairs D1 and D2 obtained by randomly and independently perturbing the original data D.
If Sk is concentrated close to 1, then the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable. This idea, mutuated by a qualitative method proposed in , can be formalized using the integral g(k) of the cumulative distribution Fk of Sk:
If g(k) is close to 0 then the values of the random variable Sk are close to 1 and hence the k-clustering is stable, while for larger values of g(k) the k-clustering is less reliable. This observation comes from the following fact:
Let be fk(s) the probability density function of Sk; then
Hence, g(k) ≃ 0 implies Var[Sk] ≃ 0. As a consequence, g(k) or equivalently E[Sk] can be used as a good index of the reliability of the k-clusterings (clusterings with k clusters). E[Sk] may be estimated by the empirical mean ξk of n replicated similarity measures between pairs of perturbed clusterings:
where Skj represents the similarity between two k-clusterings obtained through the application of the algorithm C to a pair of perturbed data.
We may perform a sorting of the ξk:
where p is an index permutation such that ξp(1) ≥ ξp(2) ≥ … ≥ ξp(H). In this way we obtain an ordering of the clusterings, from the most to the least reliable one.
Exploiting this ordering, we proposed a χ2-based statistical test to detect and to estimate the statistical significance of multiple-structures discovered by clustering algorithms . The main drawbacks of this approach consists in an implicit normality assumption for the distribution of the Sk (random variables that measure the similarity between two perturbed k-clusterings, see above), and in a user defined threshold parameter that determines when two k-clusterings can be considered similar and “stable”. Indeed, in general we have no guarantee that the Sk random variables are normally distributed; moreover the “optimal” choice of the threshold parameter seems to be application dependent and may affect the overall test results.
In this contribution, to address these problems we propose a new statistical method that, adopting a stability-based approach, makes no assumptions about the distribution of the random variables and does not require any user-defined threshold parameter.
Hypothesis testing based on Bernstein inequality
We briefly recall the Bernstein inequality, because this inequality is used to build-up our proposed hypothesis testing procedure.
Bernstein inequality. If Y1, Y2, …, Yn are independent random variables s.t. 0 ≤ Yi ≤ 1, with then
Using the Bernstein inequality, we would estimate if for a given r, 2 ≤ r ≤ H, there exists a statistically significant difference between the reliability of the best p(1) clustering and the p(r) clustering (eq. 3). In other words we may state the null hypothesis H0 and the alternative hypothesis in the following way:
H0: p(1) clustering is not more reliable than p(r) clustering, that is E[Sp(1)] ≤ E[Sp(r)]
Ha: p(1) clustering is more reliable than p(r) clustering, that is E[Sp(1)] > E[Sp(r)]
To this end, consider the following random variables:
We start considering the first and last ranked clustering p(1) and p(H). In this case the above null hypothesis H0 becomes: E[Sp(1)] ≤ E[Sp(H)], or equivalently E[Sp(1)] − E[Sp(H)] = E[PH] ≤ 0. The distribution of the random variable XH (eq. 5) is in general unknown; anyway note that in the Bernstein inequality no assumption is made about the distribution of the random variables Yi (eq. 4). Hence, fixing a parameter Δ ≥ 0, considering true the null hypothesis E[PH] ≤ 0, and using Bernstein inequality, we have:
Considering an instance (a measured value) of the random variable XH, if we let we obtain the following probability of type I error:
If we reject the null hypothesis: a significant difference between the two clusterings is detected at α significance level and we continue by testing the p(H − 1) clustering. More in general if the null hypothesis has been rejected for the p(H − r + 1) clustering, 1 ≤ r ≤ H − 2 then we consider the p(H − r) clustering, and using the Boole inequality, we can estimate the type I error:
As in the previous case, if Perr(H − r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and p(H − r) clustering and we iteratively continue the procedure estimating Perr(H − r − 1).
This procedure stops if either of these cases succeeds:
I) The null hypothesis is rejected till r = H − 2, that is ∀r, 1 ≤ r ≤ H − 2, Perr(H − r) < α: all the possible null hypotheses have been rejected and the only reliable clustering at α-significance level is the top ranked one, that is the p(1) clustering.
II) The null hypothesis cannot be rejected for r ≤ H − 2, that is, ∃r, 1 ≤ r ≤ H − 2, Perr(H − r) ≥ α: in this case the clusterings that are significantly less reliable than the top ranked p(1) clustering are the p(r + 1), p(r + 2),…, p(H) clusterings.
Note that in this second case we cannot state that there is no significant difference between the first r top-ranked clusterings, since the upper bound provided by the Bernstein inequality is not guaranteed to be tight. To answer to this question, we may apply the χ2-based hypothesis testing proposed in  to the remaining top ranked clusterings to establish which of them are significant at α level, but in this case we need to assume that the similarity measures between pairs of clusterings are distributed according to a normal distribution.
If we assume that the Xi random variables (eq. 5) are (at least approximately) independent, we can obtain a variant of the previous Bernstein inequality-based approach, that we name Bernstein ind. for brevity. By this approach we should in principle obtain lower p values, thus assuring lower false positive rates than the Bernstein test without independence assumptions.
With these independence assumptions the null hypothesis H0 and the alternative hypothesis for the Bernstein ind. test can be formulated as follows:
H0: ∃i, 2 ≤ i ≤ r ≤ H such that E[Sp(1)] ≤ E[Sp(r)]: it does exist at least one p(i)-clustering equally or more reliable than the first one in the group of the first r ordered clusterings.
Ha: ∀i, 2 ≤ i ≤ r ≤ H, E[Sp(1)] > E[Sp(r)]: all the clusterings in the group of the first r ordered clusterings are less reliable than the first one.
If we assume that the null hypothesis is true, using the independence among the Xi random variables, we may obtain the type I error:
Starting from r = H, if Perr(r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and the other first r-clustering and we iteratively continue the procedure estimating Perr(r − 1). As in the Bernstein test, the procedure is iterated until we remain with a single clustering (and this will be the only significant one), or until Perr(r) ≥ α and in this case we cannot reject the null hypothesis and the first r clusterings can be considered equally reliable. Note that, strictly speaking, in this case we can only say that at least one of the first r clusterings is equally or more reliable than the first one.
Results and discussion
In this section we apply the Bernstein test to synthetic and DNA microarray data analysis, and compare it to the previously proposed χ2-based test . For the experiments we used the mosclust R package , and all the data used in the experiments are available from the authors.
Analysis of hierarchical structures in synthetic data
In order to show the effectiveness of the proposed approach we consider synthetic data with a priori known multi-level hierarchical structure. To this end we generated a two-dimensional synthetic data set with a three-level hierarchical structure (Fig. 1) using the clusterv R package : at a first level three large clusters are present in the data (black ovals); at a second level we have a 6-clustering (red ovals) and finally a third-level 12-clustering may be detected (blue ovals).
Figure 1. Synthetic data set: a three-level hierarchical structure with 3, 6 and 12 clusters.
We performed analysis of the stability of the clusters (see Section Model order selection through stability based procedures), by applying random subsample techniques to perturb the data (100 subsample pairs each composed by 80 % of the data randomly drawn without replacement) and Ward's hierarchical clustering algorithm  with dendrogram cuts from k = 2 to k = 15 clusters. Then we computed the similarity indices for each k from 2 to 15: their empirical cumulative distribution is shown in Fig. 2. From Fig. 2 we may observe that 3 and 6-clusterings similarities are closely concentrated near 1, thus showing that these clusterings are clearly detectable by the hierarchical clustering algorithm. Indeed both χ2-based and Bernstein-based test of hypothesis select these clusterings at 10−5 significance level. Nevertheless, the Bernstein test selects also the 7-clustering (false positive) and the 12-clustering (true positive) (Table 1). As a second experiment we considered a 1000-dimensional synthetic multivariate gaussian distributed data set. These data are characterized by a two-level hierarchical structure: at a first level we have two main clusters with inside each one three other clusters, thus resulting in a second level 6-clustering. Each of the six second-level clusters is distributed according to a hyperspherical gaussian distribution and each cluster contains only 20 examples, thus resulting in a sparse data set (low number of examples in a high dimensional space). We applied the Prediction Around Medoids clustering algorithm , and we perturbed the data through Bernoulli random projections , from a 1000 to a 479-dimensional subspace, considering the reliability of clusterings composed from 2 to 10 clusters. In this case both the χ2-based and the Bernstein based iterative procedures correctly detect 2 and 6-clusterings at 10 −4 significance level. With these high dimensional data the Bernstein test is not subject to false positives, but also the χ2 test correctly detects all the structures underlying the data.
Table 1. Synthetic data: comparison of the χ2 and Bernstein inequality-based tests. Bernstein ind. stands for the Bernstein test with assumption of independence between the random variables representing the empirical mean of the similarity measures.
Figure 2. Synthetic data set: empirical cumulative distribution functions of the similarity measures for different number of clusters (Hierarchical clustering).
Discovery of multi-level structures in DNA microarray data
As an example of the application of the Bernstein test to the discovery of multiple structures in bio-molecular data, we consider two classical DNA microarray data sets: Leukemia and Lymphoma. The Leukemia data set is composed by a group of 25 acute myeloid leukemia (AML) samples and another group of 47 acute lymphoblastic leukemia (ALL) samples, that can be subdivided into 38 B-Cell and 9 T-Cell subgroups, resulting in a two-level hierarchical structure.
We applied both resampling and random projections to lower dimensional subspaces to perturb the original data using the R package mosclust  that implements the Bernstein-based test and the stability measures described in Sect. Model order selection through stability based procedures.
Fig. 3 shows the empirical cumulative distributions of the similarity values and Table 2 the p values of the clusterings sorted according to their ξ values (eq. 2), using Bernoulli random projections . Our proposed procedure detects the 2 – clustering as the most reliable at 0.01 significance level, according to the fact that two biologically meaningful groups (ALL, acute lymphoblastic leukemia and AML, acute myeloid leukemia) are present in the data. Choosing a significance level α = 10−5 we cannot reject the null hypothesis that a 2-clustering is less or equally reliable than a 3-clustering: in this case 2 structures (2 and 3-clusterings) are detected as simultaneously present in the data, reflecting the biological fact that ALL can be subdivided into two subclasses (B-cell and T-cell ALL).
Table 2. Leukemia data set: empirical means (ξ) and p values computed according to the Bernstein inequality.
Figure 3. Leukemia data set. Empirical cumulative distributions of the similarity measures for different numbers of clusters k.
The results obtained with two variants of the χ2 and Bernstein based statistical tests are compared in Fig. 4 (k-means algorithm) and Fig. 5 (PAM, Prediction Around Medoids algorithm ) : log p values are represented in ordinate, while in abscissa the number of clusters are sorted according to the empirical mean of the corresponding pairwise similarities (eq. 2). In both figures a straight horizontal dashed line represents a significance level α = 0.001: k-clusterings above the dashed line are significant, that is their reliability significantly differ from the k-clusterings below the dashed horizontal line. Note that the k-means (Fig. 4) and PAM (Fig. 5) clustering algorithms provide a slightly different ranking of the similarity indices, but in most cases 2 and 3 clusterings are considered significantly more reliable than the others, according to the biological characteristics of the data. The Bernstein test, due to its more general assumptions (no particular distributions and no independence are assumed for the random variables that represent the similarity between clusterings) is less selective (in the sense that it may consider reliable a larger number of k-clusterings) than the χ2-based test that make assumptions about the distribution of the random variables. This is confirmed by the fact that Bernstein p values decrease more slowly with respect to the χ2 test (Fig. 4 and 5), thus resulting in a better sensitivity to multiple structures present in the data. The main drawback of this behaviour is the larger probability of false positives. Note that the Bernstein ind. test shows an intermediate trend between the Bernstein and χ2 test (red lines in Fig. 4 and 5): the assumption of independence between the random variables yields a more selective Bernstein inequality-based test less subject to false positives, but potentially less sensitive to multiple structures underlying the data.
Figure 4. K-means clustering: log p value computed for χ2-based and Bernstein-based statistical tests. Ordinate: log p value; abscissa: number of clusters sorted according the computed similarity means.
Figure 5. PAM clustering: log p value computed for χ2-based and Bernstein-based statistical tests. Ordinate: log p value; abscissa: number of clusters sorted according the computed similarity means.
The Lymphoma gene expression data set  comprises three different lymphoid malignancies: Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL) and Chronic Lymphocytic Leukemia (CLL). The data provides expression levels for 4026 genes , with 62 samples subdivided in 42 DLBCL, 11 CLL and 9 FL. We performed pre-processing of the data according to , replacing missing values with 0 and then normalizing the data to zero mean and unit variance across genes. We considered both resampling techniques and random projections to perturb the data. In particular, data have been resampled by randomly drawing 80% of the available data without replacement, and data have been projected using Bernoulli projections with ε = 0.2 corresponding to 413-dimensional subspaces. Fig. 6 and 7 show the empirical cumulative distribution of the similarity measures for different numbers of clusters, using the hierarchical Ward's clustering algorithm and respectively Bernoulli random projections (Fig. 6) and subsampling perturbation techniques (Fig. 7). Considering random projections both the Bernstein and χ2-based tests correctly select 2 and 3-clusterings at 0.001 significance level. On the contrary, using subsampling techniques only the Bernstein inequality based test select as significant both 2 and 3-clusterings, while the χ2 tests select only the 2-clustering (Table 3). These results confirm that the Bernstein test is more sensitive to multiple structures underlying the data.
Table 3. Lymphoma data: comparison of the χ2 and Bernstein inequality-based tests. t represents the threshold level for the χ2-based test.
Figure 6. Lymphoma data set. Empirical cumulative distribution functions of the similarity measures for different number of clusters k. Perturbation technique: random projections
Figure 7. Lymphoma data set. Empirical cumulative distribution functions of the similarity measures for different number of clusters k. Perturbation technique: resampling
Considering the Leukemia and Lymphoma data sets, the proposed Bernstein test achieves results competitive with state-of-the-art stability methods proposed in the literature. Indeed the Model Explorer algorithm, based on subsampling techniques, correctly detect only the 2-clustering structure both in Leukemia and Lymphoma. Another subsampling-based method (Figure of Merit) detects 2, 8 and 19-clusterings in Leukemia and 2 and 9-clusterings in Lymphoma. Stability methods that apply supervised algorithms to assess the quality of the discovered clusterings correctly detect only a 3-clustering in Leukemia and a 2-clustering in Lymphoma[6,24]. Our previously proposed χ2-based test correctly detects both 2 and 3-clusterings in both data sets, if random projections are used as perturbation method, but it fails to detect the 3-clustering in Lymphoma when subsampling techniques are applied. On the contrary, the Bernstein test discovers both the two-level structures in Leukemia and Lymphoma, independently of the applied perturbation method.
The experimental results with both synthetic and gene expression data support the hypothesis that the Bernstein test is more sensitive to multiple structures underlying the data. Indeed in the first experiment with synthetic data it correctly predicts also the third level of structure, that is the 12-clustering; on the other hand it is subject to false positives, as shown by the wrong discovery of a 7-clustering (Table 1). These results are confirmed by the fact that Bernstein p values decrease more slowly with respect to the χ2 test (Fig. 4 and 5): in this way for a given significance level it is likely that the Bernstein test selects larger sets of structures underlying the data. The risk of an increased rate of false positives may be balanced by the assumption of independence between the random variables, yielding to the proposed Bernstein ind. test (eq. 8), less subject to false positives, but potentially less sensitive to multiple structures underlying the data.
In real applications to complex bio-molecular data, we suggest to apply both Bernstein-based and χ2-based procedures: structures discovered by both tests are likely to be significant, and Bernstein-based tests can discover potential structures not detectable with the more selective χ2-based test. Moreover the computational burden due to the application of the χ2 and Bernstein-based iterative procedures is irrelevant with respect to the execution of clustering algorithms.
We proposed a test of hypothesis based on Bernstein inequality to estimate if there is a significant difference between the reliability of different clusterings performed on the same data. Our proposed method can be applied to discover multiple or hierarchical structures, using different clustering algorithms and different perturbation methods. Even if in our experiments we applied the Bernstein test to the analysis of gene expression data, this approach may be in principle applied to discover multiple structures in any type of complex bio-molecular data. Indeed no user-defined parameters are required, and very loose assumptions are made about the distribution of the data and the distribution of the similarity values used to estimate the stability of the discovered clusterings, thus assuring a reliable application of the method to a large range of bioinformatics problems.
Our experiments with synthetic and gene expression data show that Bernstein-based tests are more sensitive than χ2-based tests to multiple structures embedded in the data: in this way not self-evident structures may be detected too, as well as subtle relationships between the data. A drawback of the Bernstein test is its larger expected rate of false positives, but assuming independence between the empirical means of the similarity values a new test (Bernstein ind.), less subject to false positives, has been proposed.
Developments of this work could consist in the adaptation and application of the proposed methods to large scale bioinformatics problems, to discover multiple structures underlying the data when a very large number of clusters is potentially involved.
The authors declare that they have no competing interests.
The authors equally contributed to this paper.
We would like to thank the anonymous reviewers for their comments and suggestions. This work has been developed in the context of CIMAINA Center of Excellence, and it has been funded by the Italian COFIN project Linguaggi formali ed automi: metodi, modelli ed applicazioni.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 2, 2008: Italian Society of Bioinformatics (BITS): Annual Meeting 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S2
Nature Genetics 2003, 33:90-96.
janPubMed Abstract | Publisher Full Text
Machine Learning 2003, 52:91-118. Publisher Full Text
J. Amer. Statist. Assoc. 1963, 58:13-30. Publisher Full Text
ACM Computing Surveys 1999, 31(3):264-323. Publisher Full Text
Achlioptas D: Database-friendly random projections. In Proc. ACM Symp. on the Principles of Database Systems, Contemporary Mathematics. Edited by Edited by Buneman P. New York, NY, USA: ACM Press; 2001:274-281.
Ben-Hur A, Ellisseeff A, Guyon I: A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing. Volume 7. Edited by Edited by Altman R, Dunker A, Hunter L, Klein T, Lauderdale K, Lihue, Hawaii, USA. World Scientific; 2002::6-17.
J. Am. Stat. Assoc. 1963, 58:236-244. Publisher Full Text
Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.