Abstract
Background
The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the “optimal” number of clusters in biomolecular data. A major problem with stabilitybased methods is the detection of multilevel structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chisquare based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed.
Results
To assess the statistical significance and to discover multilevel structures in biomolecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method.
Conclusions
The Bernstein test, due to its loose assumptions, is more sensitive than the chisquare test to the detection of multiple structures simultaneously present in the data. Nevertheless it is less selective, that is subject to more false positives, but adding independence assumptions, a more selective variant of the Bernstein inequalitybased test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of biomolecular data.
Background
Unsupervised cluster analysis of biomolecular data is one of the main and wellestablished research lines in bioinformatics [1]. Classes of coexpressed genes, classes of functionally related proteins, or subgroups of patients with malignancies differentiated at biomolecular level can be discovered through clustering algorithms, and several other tasks related to the analysis of biomolecular data require the development and application of unsupervised clustering techniques [24]. Anyway, in most bioinformatics problems, we need to assess the reliability of the discovered clusters, as well as the proper selection of the “natural” number of clusters underlying the data [5].
Recently, several methods based on the concept of stability have been proposed to estimate the “optimal” number of clusters [6,7]: multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures may be applied to randomly perturb the data, ranging from bootstrapping techniques [8], to noise injection into the data [9] or random projections into lower dimensional subspaces [10,11].
A major problem with stabilitybased methods is the detection of multilevel structures underlying the data (e.g. hierarchical subclasses of diseases, or hierarchical functional classes of genes). For instance, it is possible that data exhibit a hierarchical structure, with subclusters inside other clusters, and we need to detect these multilevel structures, possibly estimating their reliability and statistical significance. In [7], it is proposed a χ^{2}based statistical test of hypothesis to assess the significance of the “optimal” number of clusters and to discover multiple structures simultaneously present in biomolecular data; however, by this approach, on one hand some assumptions about the distribution of the similarity measures are needed to estimate the reliability of the obtained clusterings, and on the other hand test results depend on the choice of userdefined parameters.
In this contribution we propose a distributionfree approach that does not assume any “a priori” distribution of the similarity measures, and that does not require any userdefined additional parameter. The proposed approach is based on the classical Bernstein inequality [12], and for its loose assumptions about the distribution of the data may in principle be applied to any unsupervised model order selection problem. More precisely the proposed stabilitybased method may be applied to several tasks related to the unsupervised analysis of complex biomolecular data: (a) the assessment of the reliability of a given clustering solution; (b) the clustering model order selection, that is the discovery of the “natural” number of clusters in the data; (c) the assessment of the statistical significance of a given clustering solution; (d) the discovery of multiple structures underlying the data, i.e. the detection of multiple reliable clustering solutions at a given significance level.
Methods
In the following sections we summarize the characteristics of the stabilitybased procedures for the assessment of the reliability of clusterings, and we introduce our proposed method based on the Bernstein inequality.
Model order selection through stability based procedures
Let be C a clustering algorithm, ρ(D) a given random perturbation procedure applied to a data set D and sim a suitable similarity measure between two clusterings (e.g. the Jaccard similarity [13]). Among the random perturbations we recall random projections from a high dimensional to a low dimensional subspace [14], or bootstrap procedures to sample a random subset of data from the original data set D[8]. Fixing an integer k (the number of clusters), we define S_{k} (0 ≤ S_{k} ≤ 1) as the random variable given by the similarity between two kclusterings obtained by applying a clustering algorithm C to data pairs D_{1} and D_{2} obtained by randomly and independently perturbing the original data D.
If S_{k} is concentrated close to 1, then the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable. This idea, mutuated by a qualitative method proposed in [15], can be formalized using the integral g(k) of the cumulative distribution F_{k} of S_{k}[7]:
If g(k) is close to 0 then the values of the random variable S_{k} are close to 1 and hence the kclustering is stable, while for larger values of g(k) the kclustering is less reliable. This observation comes from the following fact:
Proof:
Let be f_{k}(s) the probability density function of S_{k}; then
Moreover:
☐.
Hence, g(k) ≃ 0 implies Var[S_{k}] ≃ 0. As a consequence, g(k) or equivalently E[S_{k}] can be used as a good index of the reliability of the kclusterings (clusterings with k clusters). E[S_{k}] may be estimated by the empirical mean ξ_{k} of n replicated similarity measures between pairs of perturbed clusterings:
where S_{kj} represents the similarity between two kclusterings obtained through the application of the algorithm C to a pair of perturbed data.
We may perform a sorting of the ξ_{k}:
where p is an index permutation such that ξ_{p(1)} ≥ ξ_{p(2)} ≥ … ≥ ξ_{p(H)}. In this way we obtain an ordering of the clusterings, from the most to the least reliable one.
Exploiting this ordering, we proposed a χ^{2}based statistical test to detect and to estimate the statistical significance of multiplestructures discovered by clustering algorithms [7]. The main drawbacks of this approach consists in an implicit normality assumption for the distribution of the S_{k} (random variables that measure the similarity between two perturbed kclusterings, see above), and in a user defined threshold parameter that determines when two kclusterings can be considered similar and “stable”. Indeed, in general we have no guarantee that the S_{k} random variables are normally distributed; moreover the “optimal” choice of the threshold parameter seems to be application dependent and may affect the overall test results.
In this contribution, to address these problems we propose a new statistical method that, adopting a stabilitybased approach, makes no assumptions about the distribution of the random variables and does not require any userdefined threshold parameter.
Hypothesis testing based on Bernstein inequality
We briefly recall the Bernstein inequality, because this inequality is used to buildup our proposed hypothesis testing procedure.
Bernstein inequality. If Y_{1}, Y_{2}, …, Y_{n} are independent random variables s.t. 0 ≤ Y_{i} ≤ 1, with then
Using the Bernstein inequality, we would estimate if for a given r, 2 ≤ r ≤ H, there exists a statistically significant difference between the reliability of the best p(1) clustering and the p(r) clustering (eq. 3). In other words we may state the null hypothesis H_{0} and the alternative hypothesis in the following way:
H_{0}: p(1) clustering is not more reliable than p(r) clustering, that is E[S_{p(1)}] ≤ E[S_{p(r)}]
H_{a}: p(1) clustering is more reliable than p(r) clustering, that is E[S_{p(1)}] > E[S_{p(r)}]
To this end, consider the following random variables:
We start considering the first and last ranked clustering p(1) and p(H). In this case the above null hypothesis H_{0} becomes: E[S_{p(1)}] ≤ E[S_{p(H)}], or equivalently E[S_{p(1)}] − E[S_{p(H)}] = E[P_{H}] ≤ 0. The distribution of the random variable X_{H} (eq. 5) is in general unknown; anyway note that in the Bernstein inequality no assumption is made about the distribution of the random variables Y_{i} (eq. 4). Hence, fixing a parameter Δ ≥ 0, considering true the null hypothesis E[P_{H}] ≤ 0, and using Bernstein inequality, we have:
Considering an instance (a measured value) of the random variable X_{H}, if we let we obtain the following probability of type I error:
If we reject the null hypothesis: a significant difference between the two clusterings is detected at α significance level and we continue by testing the p(H − 1) clustering. More in general if the null hypothesis has been rejected for the p(H − r + 1) clustering, 1 ≤ r ≤ H − 2 then we consider the p(H − r) clustering, and using the Boole inequality, we can estimate the type I error:
As in the previous case, if P_{err}(H − r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and p(H − r) clustering and we iteratively continue the procedure estimating P_{err}(H − r − 1).
This procedure stops if either of these cases succeeds:
I) The null hypothesis is rejected till r = H − 2, that is ∀r, 1 ≤ r ≤ H − 2, P_{err}(H − r) < α: all the possible null hypotheses have been rejected and the only reliable clustering at αsignificance level is the top ranked one, that is the p(1) clustering.
II) The null hypothesis cannot be rejected for r ≤ H − 2, that is, ∃r, 1 ≤ r ≤ H − 2, P_{err}(H − r) ≥ α: in this case the clusterings that are significantly less reliable than the top ranked p(1) clustering are the p(r + 1), p(r + 2),…, p(H) clusterings.
Note that in this second case we cannot state that there is no significant difference between the first r topranked clusterings, since the upper bound provided by the Bernstein inequality is not guaranteed to be tight. To answer to this question, we may apply the χ^{2}based hypothesis testing proposed in [7] to the remaining top ranked clusterings to establish which of them are significant at α level, but in this case we need to assume that the similarity measures between pairs of clusterings are distributed according to a normal distribution.
If we assume that the X_{i} random variables (eq. 5) are (at least approximately) independent, we can obtain a variant of the previous Bernstein inequalitybased approach, that we name Bernstein ind. for brevity. By this approach we should in principle obtain lower p values, thus assuring lower false positive rates than the Bernstein test without independence assumptions.
With these independence assumptions the null hypothesis H_{0} and the alternative hypothesis for the Bernstein ind. test can be formulated as follows:
H_{0}: ∃i, 2 ≤ i ≤ r ≤ H such that E[S_{p(1)}] ≤ E[S_{p(r)}]: it does exist at least one p(i)clustering equally or more reliable than the first one in the group of the first r ordered clusterings.
H_{a}: ∀i, 2 ≤ i ≤ r ≤ H, E[S_{p(1)}] > E[S_{p(r)}]: all the clusterings in the group of the first r ordered clusterings are less reliable than the first one.
If we assume that the null hypothesis is true, using the independence among the X_{i} random variables, we may obtain the type I error:
Starting from r = H, if P_{err}(r) < α we reject the null hypothesis: a significant difference is detected between the reliability of the p(1) and the other first rclustering and we iteratively continue the procedure estimating P_{err}(r − 1). As in the Bernstein test, the procedure is iterated until we remain with a single clustering (and this will be the only significant one), or until P_{err}(r) ≥ α and in this case we cannot reject the null hypothesis and the first r clusterings can be considered equally reliable. Note that, strictly speaking, in this case we can only say that at least one of the first r clusterings is equally or more reliable than the first one.
Results and discussion
In this section we apply the Bernstein test to synthetic and DNA microarray data analysis, and compare it to the previously proposed χ^{2}based test [7]. For the experiments we used the mosclust R package [16], and all the data used in the experiments are available from the authors.
Analysis of hierarchical structures in synthetic data
In order to show the effectiveness of the proposed approach we consider synthetic data with a priori known multilevel hierarchical structure. To this end we generated a twodimensional synthetic data set with a threelevel hierarchical structure (Fig. 1) using the clusterv R package [17]: at a first level three large clusters are present in the data (black ovals); at a second level we have a 6clustering (red ovals) and finally a thirdlevel 12clustering may be detected (blue ovals).
Figure 1. Synthetic data set: a threelevel hierarchical structure with 3, 6 and 12 clusters.
We performed analysis of the stability of the clusters (see Section Model order selection through stability based procedures), by applying random subsample techniques to perturb the data (100 subsample pairs each composed by 80 % of the data randomly drawn without replacement) and Ward's hierarchical clustering algorithm [18] with dendrogram cuts from k = 2 to k = 15 clusters. Then we computed the similarity indices for each k from 2 to 15: their empirical cumulative distribution is shown in Fig. 2. From Fig. 2 we may observe that 3 and 6clusterings similarities are closely concentrated near 1, thus showing that these clusterings are clearly detectable by the hierarchical clustering algorithm. Indeed both χ^{2}based and Bernsteinbased test of hypothesis select these clusterings at 10^{−5} significance level. Nevertheless, the Bernstein test selects also the 7clustering (false positive) and the 12clustering (true positive) (Table 1). As a second experiment we considered a 1000dimensional synthetic multivariate gaussian distributed data set. These data are characterized by a twolevel hierarchical structure: at a first level we have two main clusters with inside each one three other clusters, thus resulting in a second level 6clustering. Each of the six secondlevel clusters is distributed according to a hyperspherical gaussian distribution and each cluster contains only 20 examples, thus resulting in a sparse data set (low number of examples in a high dimensional space). We applied the Prediction Around Medoids clustering algorithm [19], and we perturbed the data through Bernoulli random projections [7], from a 1000 to a 479dimensional subspace, considering the reliability of clusterings composed from 2 to 10 clusters. In this case both the χ^{2}based and the Bernstein based iterative procedures correctly detect 2 and 6clusterings at 10 −4 significance level. With these high dimensional data the Bernstein test is not subject to false positives, but also the χ^{2} test correctly detects all the structures underlying the data.
Table 1. Synthetic data: comparison of the χ^{2} and Bernstein inequalitybased tests. Bernstein ind. stands for the Bernstein test with assumption of independence between the random variables representing the empirical mean of the similarity measures.
Figure 2. Synthetic data set: empirical cumulative distribution functions of the similarity measures for different number of clusters (Hierarchical clustering).
Discovery of multilevel structures in DNA microarray data
As an example of the application of the Bernstein test to the discovery of multiple structures in biomolecular data, we consider two classical DNA microarray data sets: Leukemia[20] and Lymphoma[21]. The Leukemia data set is composed by a group of 25 acute myeloid leukemia (AML) samples and another group of 47 acute lymphoblastic leukemia (ALL) samples, that can be subdivided into 38 BCell and 9 TCell subgroups, resulting in a twolevel hierarchical structure.
We applied both resampling and random projections to lower dimensional subspaces to perturb the original data using the R package mosclust [16] that implements the Bernsteinbased test and the stability measures described in Sect. Model order selection through stability based procedures.
Fig. 3 shows the empirical cumulative distributions of the similarity values and Table 2 the p values of the clusterings sorted according to their ξ values (eq. 2), using Bernoulli random projections [7]. Our proposed procedure detects the 2 – clustering as the most reliable at 0.01 significance level, according to the fact that two biologically meaningful groups (ALL, acute lymphoblastic leukemia and AML, acute myeloid leukemia) are present in the data. Choosing a significance level α = 10^{−5} we cannot reject the null hypothesis that a 2clustering is less or equally reliable than a 3clustering: in this case 2 structures (2 and 3clusterings) are detected as simultaneously present in the data, reflecting the biological fact that ALL can be subdivided into two subclasses (Bcell and Tcell ALL).
Table 2. Leukemia data set: empirical means (ξ) and p values computed according to the Bernstein inequality.
Figure 3. Leukemia data set. Empirical cumulative distributions of the similarity measures for different numbers of clusters k.
The results obtained with two variants of the χ^{2}[7] and Bernstein based statistical tests are compared in Fig. 4 (kmeans algorithm) and Fig. 5 (PAM, Prediction Around Medoids algorithm [19]) : log p values are represented in ordinate, while in abscissa the number of clusters are sorted according to the empirical mean of the corresponding pairwise similarities (eq. 2). In both figures a straight horizontal dashed line represents a significance level α = 0.001: kclusterings above the dashed line are significant, that is their reliability significantly differ from the kclusterings below the dashed horizontal line. Note that the kmeans (Fig. 4) and PAM (Fig. 5) clustering algorithms provide a slightly different ranking of the similarity indices, but in most cases 2 and 3 clusterings are considered significantly more reliable than the others, according to the biological characteristics of the data. The Bernstein test, due to its more general assumptions (no particular distributions and no independence are assumed for the random variables that represent the similarity between clusterings) is less selective (in the sense that it may consider reliable a larger number of kclusterings) than the χ^{2}based test that make assumptions about the distribution of the random variables. This is confirmed by the fact that Bernstein p values decrease more slowly with respect to the χ^{2} test (Fig. 4 and 5), thus resulting in a better sensitivity to multiple structures present in the data. The main drawback of this behaviour is the larger probability of false positives. Note that the Bernstein ind. test shows an intermediate trend between the Bernstein and χ^{2} test (red lines in Fig. 4 and 5): the assumption of independence between the random variables yields a more selective Bernstein inequalitybased test less subject to false positives, but potentially less sensitive to multiple structures underlying the data.
Figure 4. Kmeans clustering: log p value computed for χ^{2}based and Bernsteinbased statistical tests. Ordinate: log p value; abscissa: number of clusters sorted according the computed similarity means.
Figure 5. PAM clustering: log p value computed for χ^{2}based and Bernsteinbased statistical tests. Ordinate: log p value; abscissa: number of clusters sorted according the computed similarity means.
The Lymphoma gene expression data set [21] comprises three different lymphoid malignancies: Diffuse Large BCell Lymphoma (DLBCL), Follicular Lymphoma (FL) and Chronic Lymphocytic Leukemia (CLL). The data provides expression levels for 4026 genes [22], with 62 samples subdivided in 42 DLBCL, 11 CLL and 9 FL. We performed preprocessing of the data according to [21], replacing missing values with 0 and then normalizing the data to zero mean and unit variance across genes. We considered both resampling techniques and random projections to perturb the data. In particular, data have been resampled by randomly drawing 80% of the available data without replacement, and data have been projected using Bernoulli projections with ε = 0.2 corresponding to 413dimensional subspaces. Fig. 6 and 7 show the empirical cumulative distribution of the similarity measures for different numbers of clusters, using the hierarchical Ward's clustering algorithm and respectively Bernoulli random projections (Fig. 6) and subsampling perturbation techniques (Fig. 7). Considering random projections both the Bernstein and χ^{2}based tests correctly select 2 and 3clusterings at 0.001 significance level. On the contrary, using subsampling techniques only the Bernstein inequality based test select as significant both 2 and 3clusterings, while the χ^{2} tests select only the 2clustering (Table 3). These results confirm that the Bernstein test is more sensitive to multiple structures underlying the data.
Table 3. Lymphoma data: comparison of the χ^{2} and Bernstein inequalitybased tests. t represents the threshold level for the χ^{2}based test.
Figure 6. Lymphoma data set. Empirical cumulative distribution functions of the similarity measures for different number of clusters k. Perturbation technique: random projections
Figure 7. Lymphoma data set. Empirical cumulative distribution functions of the similarity measures for different number of clusters k. Perturbation technique: resampling
Considering the Leukemia and Lymphoma data sets, the proposed Bernstein test achieves results competitive with stateoftheart stability methods proposed in the literature. Indeed the Model Explorer algorithm, based on subsampling techniques, correctly detect only the 2clustering structure both in Leukemia and Lymphoma[15]. Another subsamplingbased method (Figure of Merit[23]) detects 2, 8 and 19clusterings in Leukemia and 2 and 9clusterings in Lymphoma. Stability methods that apply supervised algorithms to assess the quality of the discovered clusterings correctly detect only a 3clustering in Leukemia and a 2clustering in Lymphoma[6,24]. Our previously proposed χ^{2}based test correctly detects both 2 and 3clusterings in both data sets, if random projections are used as perturbation method, but it fails to detect the 3clustering in Lymphoma when subsampling techniques are applied. On the contrary, the Bernstein test discovers both the twolevel structures in Leukemia and Lymphoma, independently of the applied perturbation method.
The experimental results with both synthetic and gene expression data support the hypothesis that the Bernstein test is more sensitive to multiple structures underlying the data. Indeed in the first experiment with synthetic data it correctly predicts also the third level of structure, that is the 12clustering; on the other hand it is subject to false positives, as shown by the wrong discovery of a 7clustering (Table 1). These results are confirmed by the fact that Bernstein p values decrease more slowly with respect to the χ^{2} test (Fig. 4 and 5): in this way for a given significance level it is likely that the Bernstein test selects larger sets of structures underlying the data. The risk of an increased rate of false positives may be balanced by the assumption of independence between the random variables, yielding to the proposed Bernstein ind. test (eq. 8), less subject to false positives, but potentially less sensitive to multiple structures underlying the data.
In real applications to complex biomolecular data, we suggest to apply both Bernsteinbased and χ^{2}based procedures: structures discovered by both tests are likely to be significant, and Bernsteinbased tests can discover potential structures not detectable with the more selective χ^{2}based test. Moreover the computational burden due to the application of the χ^{2} and Bernsteinbased iterative procedures is irrelevant with respect to the execution of clustering algorithms.
Conclusions
We proposed a test of hypothesis based on Bernstein inequality to estimate if there is a significant difference between the reliability of different clusterings performed on the same data. Our proposed method can be applied to discover multiple or hierarchical structures, using different clustering algorithms and different perturbation methods. Even if in our experiments we applied the Bernstein test to the analysis of gene expression data, this approach may be in principle applied to discover multiple structures in any type of complex biomolecular data. Indeed no userdefined parameters are required, and very loose assumptions are made about the distribution of the data and the distribution of the similarity values used to estimate the stability of the discovered clusterings, thus assuring a reliable application of the method to a large range of bioinformatics problems.
Our experiments with synthetic and gene expression data show that Bernsteinbased tests are more sensitive than χ^{2}based tests to multiple structures embedded in the data: in this way not selfevident structures may be detected too, as well as subtle relationships between the data. A drawback of the Bernstein test is its larger expected rate of false positives, but assuming independence between the empirical means of the similarity values a new test (Bernstein ind.), less subject to false positives, has been proposed.
Developments of this work could consist in the adaptation and application of the proposed methods to large scale bioinformatics problems, to discover multiple structures underlying the data when a very large number of clusters is potentially involved.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
The authors equally contributed to this paper.
Acknowledgments
We would like to thank the anonymous reviewers for their comments and suggestions. This work has been developed in the context of CIMAINA Center of Excellence, and it has been funded by the Italian COFIN project Linguaggi formali ed automi: metodi, modelli ed applicazioni.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 2, 2008: Italian Society of Bioinformatics (BITS): Annual Meeting 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S2
References

Dopazo J: Functional Interpretation of Microarray Experiments.
OMICS 2006, 10(3):398410. PubMed Abstract  Publisher Full Text

Gasch P, Eisen M: Exploring the conditional regulation of yeast gene expression through fuzzy kmeans clustering.
Genome Biology 2002., 3(11) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dyrskjøt L, Thykjaer T, Kruhøffer M, Jensen J, Marcussen N, HamiltonDutoit S, Wolf H, Ørntoft T: Identifying distinct classes of bladder carcinoma using microarrays.
Nature Genetics 2003, 33:9096.
jan
PubMed Abstract  Publisher Full Text 
Kaplan N, Friedlich M, Fromer M, Linial M: A functional hierarchical organization of the protein sequence space.
BMC Bioinformatics 2004, 5:196. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Handl J, Knowles J, Kell D: Computational cluster validation in postgenomic data analysis.
Bioinformatics 2005, 21(15):32013215. PubMed Abstract  Publisher Full Text

Lange T, Roth V, Braun M, Buhmann J: Stabilitybased Validation of Clustering Solutions.
Neural Computation 2004, 16:12991323. PubMed Abstract  Publisher Full Text

Bertoni A, Valentini G: Model order selection for biomolecular data clustering.
BMC Bioinformatics 2007, 8(Suppl 2):S7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Monti S, Tamayo P, Mesirov J, Golub T: Consensus Clustering: A Resamplingbased Method for Class Discovery and Visualization of Gene Expression Microarray Data.
Machine Learning 2003, 52:91118. Publisher Full Text

McShane L, Radmacher D, Freidlin B, Yu R, Li M, Simon R: Method for assessing reproducibility of clustering patterns observed in analyses of microarray data.
Bioinformatics 2002, 18(11):14621469. PubMed Abstract  Publisher Full Text

Smolkin M, Gosh D: Cluster stability scores for microarray data in cancer studies.
BMC Bioinformatics 2003, 4():36. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bertoni A, Valentini G: Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses.
Artificial Intelligence in Medicine 2006, 37(2):85109. PubMed Abstract  Publisher Full Text

Hoeffding W: Probability inequalities for sums of independent random variables.
J. Amer. Statist. Assoc. 1963, 58:1330. Publisher Full Text

Jain A, Murty M, Flynn P: Data Clustering: a Review.
ACM Computing Surveys 1999, 31(3):264323. Publisher Full Text

Achlioptas D: Databasefriendly random projections. In Proc. ACM Symp. on the Principles of Database Systems, Contemporary Mathematics. Edited by Edited by Buneman P. New York, NY, USA: ACM Press; 2001:274281.

BenHur A, Ellisseeff A, Guyon I: A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing. Volume 7. Edited by Edited by Altman R, Dunker A, Hunter L, Klein T, Lauderdale K, Lihue, Hawaii, USA. World Scientific; 2002::617.

Valentini G: Mosclust: a software library for discovering significant structures in biomolecular data.
Bioinformatics 2007, 23(3):387389. PubMed Abstract  Publisher Full Text

Valentini G: Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data.
Bioinformatics 2006, 22(3):369370. PubMed Abstract  Publisher Full Text

Ward J: Hierarchical grouping to optimize an objective function.
J. Am. Stat. Assoc. 1963, 58:236244. Publisher Full Text

Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.

Golub T, et al.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.
Science 1999, 286:531537. PubMed Abstract  Publisher Full Text

Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403:503511. PubMed Abstract  Publisher Full Text

Alizadeh A, et al.: The Lymphochip: a specialized cDNA microarray for genomicscale analysis of gene expression in normal and malignant lymphocytes.

Levine E, Domany E: Resampling method for unsupervised estimation of cluster validity.
Neural Computation 2001, 13(11):25732593. PubMed Abstract  Publisher Full Text

Dudoit S, Fridlyand J: A predictionbased resampling method for estimating the number of clusters in a dataset.
Genome Biol 2002, 3(7):RESEARCH0036. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text