Abstract
Background
Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables can be much higher than the number of observations.
Results
Here, we present a general framework to deal with the clustering of microarray data, based on a threestep procedure: (i) gene filtering; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Via a nonparametric modelbased clustering approach we obtain promising results both in simulated and real data.
Conclusions
The proposed algorithm is a simple and effective tool for the clustering of microarray data, in an unsupervised setting.
Background
The analysis of gene expression microarray data using clustering techniques plays an important role, for instance, in the discovery, validation, and understanding of various classes and subclasses of cancer [1]. There are two ways of clustering a gene expression matrix [2,3]: (i) gene function may be inferred from clusters of genes similarly expressed across the samples and (ii) samples can form groups which show similar expression across the genes. Moreover, genes and samples can be clustered simultaneously, with their interrelationship represented by biclusters [4,5].
The clustering of the genes on the basis of the samples is a standard cluster analysis problem that can be effected by a variety of algorithms [1]. For a comprehensive review see [2].
A more challenging problem is the clustering of the samples on the basis of the genes, where the standard clustering techniques, such as kmeans or hierarchical clustering, fail to capture complex local structures, due to the highdimenionality of the data [2].
In recent years, computational improvement enabled new clustering techniques and contributed to the development of previously unfeasible methods. In this context, McLachlan et al. [1] propose a mixture modelbased approach to cluster microarray expression data. Their scheme accounts for gene selection through mixtures of t distributions, and dimensionality reduction through a mixture of factor analyzers. More precisely, they select a gene on the basis of a likelihood ratio statistic for testing one versus two components in the mixture model. In the second step of their algorithm, they cluster the samples by fitting a twocomponent mixture of factor analyzers.
Although their method sounds like a good approach to clustering samples in a highdimensional space, there are three main limitations. Firstly, the parametric assumptions about clusters distributions can be restrictive [6]; for example, two Gaussian random variables can result in a single mode (one cluster) or even a two component multivariate Gaussian mixture can lead to more than two modes [6]. Moreover, it requires prespecification of the number of the mixture components; this represents a serious limitation from an unsupervised perspective, which assumes that the true number of clusters is unknown. Finally, the number of parameters per component grows as the square of the dimension of the data [7], this is a major shortcoming in highdimensional data.
In this paper, we present a novel strategy, which consists in applying a clustering technique after gene filtering and dimensionality reduction, in order to exploit the most significant dimensions in the definition of the clusters. Our procedure can be thought of as a threestep algorithm: (i) gene filtering; (ii) dimensionality reduction; (iii) clustering in the reduced space.
Several authors outlined the importance of a gene filtering step prior to inferential procedures [8] or cluster analysis [9]. Tritchler et al. [9] empirically showed that principal components and cluster analysis are strongly affected by gene selection, and that filtering out uninformative genes can reduce bias in the clustering of samples. Furthermore, Johnstone and Lu [10] showed, from a theoretical point of view, that some initial reduction in dimensionality is desirable before applying a principal component analysis, when p is larger than n.
Traditional approaches to gene filtering are based on thresholding the mean or the variance of genes across samples. Bourgon et al. [8] found that genebygene filtering by overall variance increased the power of the subsequent ttest. Tritchler et al. [9] considered the covariance structure of the genes, defining filters that preserve the topology of the network.
Nevertheless, from a clustering point of view, these approaches could be unsafe: a gene should be considered relevant if it is important in the definition of the clusters; therefore, it seems more appropriate to retain those genes whose univariate distribution highlights a clear grouping among the observations rather than the ones with higher variance.
To evaluate our general strategy, we implement an algorithm based on a nonparametric modelbased clustering technique by Azzalini and Torelli [11], which we will refer to as pdfCluster (see the Methods Section for a brief introduction). We compare it with a traditional partition algorithm (i.e., kmeans), and with a similar strategy in which we use, instead of pdfCluster, its direct competitor, Mclust [7,12], a stateoftheart mixturemodelbased clustering tool. By using the nonparametric approach based on pdfCluster, we achieve improvements in clustering of samples both in simulated and in real experiments. To be consistent with microarray applications, we use here the typical microarray terminology: we denote by "genes" the p variables and by "samples" the n observations. Nonetheless, it should be clear that the proposed approach is not limited to microarray data, but, in principle, it could be applied to every set of continuous variables with "large p, small n".
Results and Discussion
A novel algorithm to clustering of expression data
As we said, clustering samples using expression data is a challenging statistical problem due to dimensionality issues. Therefore, in this context, it is unfeasible to directly apply a clustering technique to the whole data matrix. Here, we evaluate our strategy implementing an algorithm which exploits the selfdetection of number of clusters feature of pdfCluster.
The algorithm can be summarized as follows: (i) cluster samples using the univariate distribution of each gene and select for the subsequent analyses the p' genes, in which pdfCluster identifies two or more clusters; (ii) reduce dimensionality by selecting the first p" principal components; (iii) apply pdfCluster in the p"dimensional space. It is straightforward to see that this algorithm falls within the general framework defined in the Background Section.
As for step (i), i.e., gene filtering, we consider a gene relevant if its values in one category (healthy, say) are different from the ones in the other category (unhealthy, say) or categories. From another point of view, this means that the samples representing the healthy subjects are separated from the unhealthy ones, or, more simply, the samples are in different clusters. In this way, it seems reasonable to apply a cluster method to each gene, and retain as relevant those genes for which the method identifies distinct clusters. In a nonparametric framework, we can apply pdfCluster to each gene, taking advantage of the selfdetection number of clusters feature. We select only the genes for which the method detects two or more clusters (see Figure 1). We consider a clustering technique to define informative genes over the traditional variancebased approaches, because small overall variance does not necessarily imply a single cluster, and high overall variance is not always an indication of two of more clusters. Moreover, variancebased filters depend on the choice of an arbitrary threshold, which can be difficult to choose, since one typically does not know which portion of the genes is responsible for the clustering of the samples. As for step (ii), i.e., dimensionality reduction, considered if the selected genes are still too many, we propose to keep the first principal components, as in [11]. The principal component analysis is a very simple procedure which reduces the dimension of a data set of a large number of interrelated variables, preserving as much as possible of the data set variation. Since it has no requirements about the data distribution, it is consistent with our nonparametric strategy.
Figure 1. Example of gene selection in the filtering step of our procedure. Gene1 is crucial in cluster definition, while gene2 is not. The univariate distributions of the genes reflect this, as one can see from the Gaussian kernel density estimation reported along the axes.
In order to compare our approach to Mclust, we carried out a procedure analogous to the one described here, but using the normalmixture model both in step (i) and (iii). Note that this procedure differs from the one in McLachlan et al. [1], because they use a mixture of factor analyzers to select genes and reduce dimensionality.
Computational issues
The further dimensionality reduction in our step (ii) is necessary since pdfCluster, in order to compute the Delaunay triangulation, exploits the Quickhull algorithm [13]. Barber et al. [13] state that the Quickhull algorithm for finding the convex hull of a set of n points in ℝ^{p }requires at most O(n log n) operations if p ≤ 3, and O(n^{m}/m!) where m = ⌊p/2⌋ for p > 3. Azzalini and Torelli [11] observed that the computing time increases less than quadratically in n for any fixed p, but it increases more than exponentially in p for fixed n. Our experience is that for p = 1, the algorithm is very fast (less than one minute to run 20,000 times with an Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66 GHz). Therefore, there are no computationally related problems for step (i). Moreover, with p < 10 it takes a reasonable time (e.g. 11 mins in our machine with p = 9) to complete the procedure.
Number of principal components
In order to choose the number of principal components, we carried out a small simulation study (data not shown): we found that, with n = 100, pdfCluster performs at best, in terms of misclassification error, with 34 dimensions, while with p = 5 the misclassification error starts to grow. This is probably due to the extreme dispersion of the observations in higher dimensional spaces (the wellknown curse of dimensionality). The performances of pdfCluster are slightly better with 4 components, but the improvement does not justify the increased computational time (recall that the order of the number of operations needed to compute the Quickhull algorithm massively changes between p ≤ 3 and p > 3). Thus, for the subsequent analyses, we will retain 3 principal components.
Simulated data
In this Section, we evaluate our proposal by means of simulated data. For simulating data with structure similar to that of real microarray experiments, we use two schemes, i.e., the GammaGamma (GG) model [14] and the NormalUniform (NU) model [15].
In GG model we simulate data with two clusters (e.g. case/control), such that the majority of genes are equally expressed between the groups and a small fraction of them (5%) is differentially expressed. This mimics a classical experiment in which some diseased subjects are compared to healthy controls.
In NU model we simulate data with three clusters: cluster 1 consists of 40 samples with 150 upregulated and 50 downregulated genes; cluster 2 consists of 40 samples with 50 downregulated genes; cluster 3 consists of 20 samples with neither up nor downregulated genes. Note that cluster 2 and 3 are "closer" to each other than to cluster 1 and that cluster 3 has smaller sample size. This mimics a more elaborate design, e.g. two different types of a specific disease versus a normal control.
GG model
Table 1 shows that both pdfCluster and Mclust provide results surprisingly accurate in correct cluster recognition, low error rate and high sensitivity/specificity: this could be explained by an extreme distance between the two groups in the original pdimensional space.
Table 1. Simulation results for GG model
More interesting is the very different behaviour in the choice of the relevant genes: pdfCluster is very good in recognizing them, with a very low error rate (about 8%), while Mclust shows a very high error rate (about 78%). We simulated a relatively small number of "marker" genes; pdfCluster correctly discards the majority of genes as nonrelevant in the determination of the clusters, while Mclust seems to be too sensitive to outliers, erroneously capturing differences due to random noise.
NU model
As expected, Table 2 shows that in this model both pdfCluster and Mclust lead to higher classification errors than in GG model. Also in the gene filtering step, both methods have difficulties in finding the relevant genes.
Table 2. Simulation results for NU model
Mclust is able to recognize three clusters in 39% and two clusters in 34% of the simulations; pdfCluster recognizes three clusters in 19% and two clusters in 47% of the simulations. On the other hand, the mean error rate of the final classification is 0.135 for pdfCluster while for Mclust is 0.227. This is probably due to the fact that the cases in which pdfCluster correctly recognizes three clusters are those with the most separated clusters among the ones recognized by Mclust: obviously, in the less separated clusters cases, it is more difficult to allocate the samples.
Finally, it is worth noting that pdfCluster outperforms Mclust according to the gene selection error rate ("RG" row): as in previous simulation study, pdfCluster works better in recognizing which genes are effectively responsible for the determination of the clusters.
Sample Size
One issue with microarray data is often the low sample size. In order to evaluate its effect on the performance of our approach, we simulated data from the NU model, varying the sample size n. For different values of n, namely n = 10, 20, 50, 100, 200, we simulated B = 1,000 samples in a setting similar to the previous Section, i.e., 40% of the observations forming cluster 1, 40% forming cluster 2 and 20% cluster 3.
Table 3 shows the misclassification error rate for both pdfCluster and Mclust. Mclust performs badly for low and moderate sample size (n ≤ 50), reaching results comparable to that of pdfCluster only with a high number of observations (n = 200), which is rare in microarray studies. On the other hand, pdfCluster behavior is stable across different sample sizes, yielding good results even when n ≤ 20.
Table 3. Sample size
Real data
Along with simulations, we consider two benchmarking real datasets, studied before by several authors [1,1620], which we will refer to as the Colon data and the Leukaemia data (see Method Section for details on the datasets).
Colon data
As described above, we analyze the dataset, following three steps: (i) gene filtering, (ii) dimensionality reduction, (iii) clustering of samples. Namely, the first step of the procedure consists in applying the cluster algorithm to the univariate distribution of each gene. The genes that show two or more clusters are considered for the further steps.
In the first step, the pdfCluster algorithm is able to recognize 84 genes, which discriminate data into two or more groups. We proceed by considering the first three principal components of this reduced datamatrix. The procedure finds three clusters, summarized in Table 4 which clearly correspond to biologically meaningful groups. The first cluster consists of tumor tissues (with 3 misclassified samples), while clusters 2 and 3 comprise normal tissues (with 5 misclassified). It is worth noting that six out of the eight misallocated samples (tumor tissues 30, 33 and 36 and normal tissues 48, 58 and 60) are found to be misclassified in several previous analyses, including [1,17]. As stated, for instance, in [17], these six samples are likely to be wrongly labeled. Furthermore, Getz et al. [19] reported that there was a change in the protocol during the experiments: tumor samples 111 and normal samples 4151 were collected within the first protocol, while tumor samples 1240 and normal samples 5262 were collected within the second. Although for the tumor samples our approach did not recognize any difference between the protocols, cluster 2 and cluster 3 split normal tissues in two groups according to the protocols.
Table 4. Clusters found in Colon data
In the first step, Mclust is able to find 369 discriminant genes. We consider the first three principal components of this subspace for clustering. The procedure finds two clusters, with a rather high misclassification error (see Table 5). We also apply the kmeans algorithm to the entire dataset. The results of the three approaches are shown in Table 5. It can be seen that kmeans, exploited in the original pdimensional space, does not perform well. Moreover, pdfCluster outperforms (in terms of error rate) Mclust, if one considers cluster 2 and 3 together as the normal samples.
Table 5. Confusion matrices for Colon data
As stated before, McLachlan et al. [1] studied the same microarray dataset. They selected 446 relevant genes, achieving clusters that seem to recognize the change of protocol in the data structure, but fail to recognize the normal/tumor differences [1]. Nevertheless, they achieved results slightly better than ours (ER = 0.1) considering a particular subspace: they clustered genes in 20 groups and considered only the second group (consisting of 24 genes) to cluster data [1]. Although this approach leads to good results in this example, it seems difficult to reproduce the procedure in an unsupervised setting.
Leukaemia data
As stated in [18], the Leukaemia dataset presents two different problems: an easier one, consisting of separating ALL from AML (twoclass problem, hereafter) and a harder one, consisting also of recognizing the differences in Bcell and Tcell subclasses (threeclass problem).
Again, we consider the strategy previously described. In the filtering step, pdfCluster recognizes 313 discriminant genes. Note that the higher number of genes selected with respect to Colon data is consistent with the higher difficulty of the problem. We proceed by considering the first three principal components of this subspace. The pdfCluster algorithm finds two clusters, which clearly represent ALL and AML samples, with 4 AML samples classified as ALL and 5 ALL samples classified as AML, leading to a misclassification error rate of 0.125 (Table 6): pdfCluster is able to solve the twoclass problem, but it misses the threeclass problem.
Table 6. Confusion matrices for Leukaemia data
In the first step, Mclust fails to select relevant genes, recognizing 3,119 out of 3,892 genes as discriminant among the groups. Based on the first three principal components of the subspace spanned by these genes, Mclust clusters samples in four groups. We could interpret the merged clusters 12 as the ALL Bcell class, and cluster 4 as the AML class, while cluster 3 interpretation is less clear (Table 6). Although Mclust is able to find more than two clusters, it fails to distinguish between Bcell and Tcell classes, leading to hardly interpretable clusters.
The Leukaemia dataset has been studied by McLachlan et al. [1] as well. The authors found 2,015 relevant genes after the variable selection step. For the twoclass problem, their results were very good (only one sample misallocated), but they failed to solve the threeclass problem.
It should be noted that, unlike our algorithm, the procedure used in [1] needs prior specification of the number of clusters, which is not desirable in an unsupervised learning, especially in cancer tissue classification, where one of the main goals is to find new subclasses of tumors.
Conclusions
Modelbased approaches to clustering of data have received increasing attention in the last few years, as they provide a sound mathematicalbased method. Unfortunately, in microarray applications, the high dimensionality of the space makes the clustering of samples in the whole space unfeasible within a modelbased framework.
Here, we have discussed a general strategy for the clustering of microarray expression data, based on gene filtering and dimensionality reduction as preliminary steps in the cluster analysis.
We have discussed a nonparametric density estimationbased algorithm within this framework, showing promising results both in simulated data and in two real applications, with surprisingly good computational performances.
In our simulation experiments, we have found that pdfCluster leads to slightly better performances than Mclust. Moreover, the gene filtering step is much more effective using pdfCluster than using Mclust both in simulated and in real datasets. Here, "effective" means good results in terms of both dimension reduction (e.g. in Leukaemia data pdfCluster selected 313 genes versus the 3,119 selected by Mclust) and of correct selection (e.g. in GG model the gene selection error rates are 0.08 and 0.77, respectively).
Here we have used pdfCluster in order to select relevant genes. We underlined the assets of this chioce, but it is clear that any unsupervised technique able to discard the irrelevant genes can be used. Similarly, the choice of principal component analysis in the dimensionality reduction step is only one among several possible choices. Since there are no guarantees that the first principal components preserve the cluster structure in the reduction of original dimension of data [21], future efforts could be made in trying different approaches, such as the projection pursuit [21,22] or the principal curves [23]. Nevertheless, in our case the principal component analysis gives good results and provides a low dimensional dataset on which it is feasible to apply a modelbased technique such as pdfCluster.
All the statistical analyses and simulations have been performed with R [24] and with a public domain implementation of the "Quickhull" algorithm [13] available at http://www.qhull.org/ webcite.
The datasets used are both freely available as Bioconductor [25] packages ("colonCA" for Colon data and "golubEsets" for Leukaemia data).
Methods
Simulation models
GG model
The samples are assumed to be independently generated from Gamma distributions with a constant shape parameter α and genespecific random scale λ_{i}, i = 1, ..., p; λ_{i }is assumed to have a Gamma distribution with shape hyperparameter α_{0 }and scale hyperparameter ν. The genes are generated to be either "equally expressed" (i.e. one group) or "differentially expressed" (i.e. two groups) among the samples. We generated n = 100 samples and p = 2,000 genes, each with probability 0.05 of being differentially expressed. We fixed parameter values as suggested by [26]. We applied our algorithm to the data matrix obtained, selecting a number of relevant genes and using the first three principal components as input for the pdfCluster algorithm. We repeated this procedure B = 5,000 times.
NU model
The model deals with kclass classification of samples, for general k. It is based on a mixture of Normal and Uniform distributions. We exploit the model to simulate gene expressions for a threeclass problem, similar to that of the leukaemia data.
Let us denote with x_{ji }the measured intensity of gene j in sample i, j = 1, ..., p, i = 1, ..., n. We define three categories from which x_{ji }can arise and use e_{ji }to represent them. (i) e_{ji }= 1, i.e., gene j has abnormally low expression in sample i (downregulation); (ii) e_{ji }= 0, i.e., gene j has normal expression in sample i; (iii) e_{ji }= 1, i.e., gene j has abnormally high expression in sample i (upregulation). For each gene j,
Following [15], we use a Uniform distribution for f_{1,j }and f_{1, j }and a Normal distribution for f_{0,j}. More specifically,
where μ_{j }represents the geneeffect and α_{i }the sampleeffect for the normal expression level (see [15] for details). The authors justify the choice of the distributions arguing that, for normally expressed genes, the differences in observed values are due mainly to noise introduced in the experimental stage, while the Uniform distribution may reflect the failure of a biological mechanism that controls the expression level.
We simulated data from the model in a hierarchical framework, with the following initial parameter values:
where denotes the Gamma and ℰ the Exponential distribution. We simulated B = 5,000 datasets of n = 100 samples, p = 1,000 genes and m = 3 clusters defined as follows: cluster 1 consists of 40 samples with 150 upregulated and 50 downregulated genes; cluster 2 consists of 40 samples with 50 downregulated genes; cluster 3 consists of 20 samples with neither up nor downregulated genes.
Real data
Colon data
Alon et al. [16] used Affymetrix oligonucleotide arrays to measure the expression of 6,500 human genes in 40 tumor and 22 normal colon tissue samples. They focused on the subset of 2,000 genes with highest minimal intensity across the samples: the raw expression values of these 2,000 genes comprise our dataset. Following notation in [1], we named 140 the tumor samples and 4162 the normal samples. Before clustering the tissues, we preprocessed the raw intensities taking the logarithm and applying the quantile normalization [27], which is a standard choice for singlechannel microarray technology.
Leukaemia data
Golub et al. [20] studied the gene expression of two types of acute leukaemias, acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML). Gene expression levels were measured using Affymetrix oligonucleotide arrays containing 6,817 human genes. The dataset comprises 47 cases of ALL (38 Bcell and 9 Tcell) and 25 cases of AML. The classification of samples is more difficult in this example than in Colon data because it is much harder to classify between subclasses of the same plasticity than to distinguish between healthy and cancer tissues. Moreover, we have a typical hierarchical structure, since Bcell and Tcell are subclasses of the ALL class and are harder to separate than AML and ALL. Following [18], three preprocessing steps are applied to the intensity matrix: (a) thresholding, floor of 100 and ceiling of 16,000; (b) filtering, exclusion of genes with max/min ≤ 5 or (max  min) ≤ 500; (c) base 10 log transformation. This procedure left us with 3,892 genes.
Evaluation criteria
Both in simulated and in real data, we evaluate the performances of the methods by calculating the error rate (proportion of misclassified samples, ER), the sensitivity (SE) and the specificity (SP). Moreover, in the simulation studies, we record the frequency with which each method finds the correct number of clusters (CC), and we evaluate the performance of the methods in selecting discriminant genes, considering the error rate in the classification of relevant genes (RG), knowing a priori which genes have been generated to have different values among the groups.
Since cluster 2 and 3 of the NormalUniform model have been simulated to be close to each other, in this model we also consider the number of times in which each method is able to recognize two clusters (cluster 1 versus clusters 23) or three clusters.
The pdfCluster algorithm: an overview
In the literature, nonparametric cluster analyses based on mode identification have already been presented. See [6,7,2830]. The pdfCluster algorithm [11] starts from a quite simple idea, introduced by Hartigan in 1975 [31], who stated:
Clusters may be thought of as regions of high density separated from other such regions by regions of low density.
These regions are achieved by "cutting" the density function computed out of observations by a level c, that varies through the algorithm.
More formally, consider a pdimensional space, . Let x_{1}, ..., x_{n }be a vector of pdimensional observations, , for i = 1, ..., n. Starting from this vector, using a method of nonparametric density estimation, we can obtain , i.e. the empirical version of the density f(x).
There is not a specific method for the nonparametric density estimation related to pdfCluster, since the only restriction is that for all i = 1, ..., n. This restriction is not limiting, because almost all estimation techniques satisfy it. Following [11], we choose a kernel method with Gaussian kernel and constant smoothing parameter h = (h_{1}, ..., h_{p})^{⊤}, with , where s_{j }is the estimated standard deviation of the jth variable. This choice is related to the minimization of the asymptotic integrated mean square error [11]. As suggested by Azzalini and Torelli [11] we slightly shrink h toward zero, using a shrinkage factor of 3/4.
Cutting the computed at a level , they obtain m subspaces ℳ_{k}, k = 1, ..., m, of the sample space . Dropping the observations not belonging to , they select only those observations x_{i }such that . The observations belonging to the same ℳ_{k }are connected by the Delaunay triangulation (see, e.g., [32]) to form the "cluster cores". Finally, the unallocated observations are allocated by a classification method, based on nonparametric density estimation too: if x_{0 }is the unallocated observation, the estimated density based on the data already assigned to group k is computed, and x_{0 }is assigned to the group with highest ratio Finally, it is important to notice that pdfCluster selects by itself the number of clusters.
Authors' contributions
RDB and DR contributed equally to the paper. Both authors read and approved the final manuscript.
Acknowledgements
We are grateful to Prof. Adelchi Azzalini for the valuable advice and suggestions. We also would like to thank Dr. Giovanna Menardi, who kindly provided us with an updated version of the pdfCluster routines and Prof. Monica Chiogna and Dr. Chiara Romualdi for the useful comments. The work was funded by University of Padova, grant 094285 (Davide Risso) and by MIUR, grant 2008MRFM2 H (Riccardo De Bin).
References

McLachlan GJ, Bean RW, Peel D: A mixture modelbased approach to the clustering of microarray expression data.
Bioinformatics 2002, 18:413422. PubMed Abstract  Publisher Full Text

Kerr G, Ruskin H, Crane M, Doolan P: Techniques for clustering gene expression data.
Computers in Biology and Medicine 2008, 38:283293. PubMed Abstract  Publisher Full Text

Slonim D: From patterns to pathways: gene expression data analysis comes of age.
Nature genetics 2002, 32:502508. PubMed Abstract  Publisher Full Text

Madeira S, Oliveira A: Biclustering algorithms for biological data analysis: a survey.
IEEE Transactions on computational Biology and Bioinformatics 2004, 2445. PubMed Abstract  Publisher Full Text

Li J, Ray S, Lindsay BG: A nonparametric statistical approach to clustering via mode identification.

Fraley C, Raftery AE: Modelbased clustering, discriminant analysis, and density estimation.
Journal of the American Statistical Association 2002, 97:611631. Publisher Full Text

Bourgon R, Gentleman R, Huber W: Independent filtering increases detection power for highthroughput experiments.
Proceedings of the National Academy of Sciences 2010, 107(21):9546. Publisher Full Text

Tritchler D, Parkhomenko E, Beyene J: Filtering Genes for Cluster and Network Analysis.
BMC Bioinformatics 2009, 10:193. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Johnstone IM, Lu AY: On Consistency and Sparsity for Principal Components Analysis in High Dimensions.
Journal of the American Statistical Association 2009, 104:682693. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Azzalini A, Torelli N: Clustering via nonparametric density estimation.
Statistics and Computing 2007, 17:7180. Publisher Full Text

Fraley C, Raftery AE: MCLUST Version 3 for R: model mixture modeling and modelbased clustering. In Tech. rep., no. 504. Department of Statistics, Univerisity of Washington; 2006.

Barber CB, Dobkin DP, Huhdanpaa H: The Quickhull algorithm for convex hulls.
ACM Transactions of Mathematical Software 2006, 22:469483. Publisher Full Text

Kendziorski C, Newton MA, Lan H, Gould MN: On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression pro les.
Statistics in Medicine 2003, 22:38993914. PubMed Abstract  Publisher Full Text

Garrett ES, Parmigiani G: POE: statistical methods for qualitative analysis of gene expression. In The Analysis of Gene Expression Data. Edited by Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. Springer; 2003:362387. Publisher Full Text

Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proceedings of the National Academy of Sciences of the United States of America 1999, 96:67456750. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chow ML, Moler EJ, Mian IS: Identifying marker genes in transcription pro ling data using a mixture of feature relevance experts.
Physiological Genomics 2001, 5:99111. PubMed Abstract

Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data.
Journal of the American Statistical Association 2002, 97:7787. Publisher Full Text

Getz G, Levine E, Domany E: Coupled twoway clustering analysis of gene microarray data.
Proceedings of the National Academy of Sciences of the United States of America 2000, 97:1207912084. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 1999, 286:531537. PubMed Abstract  Publisher Full Text

Menardi G, Torelli N: Preserving the clustering structure by a projection pursuit approach. In Data Analysis and classification. Edited by Palumbo F, Lauro CN, Greenacre MJ. Springer; 2010:171178. Publisher Full Text

Friedman J: Exploratory projection pursuit.
Journal of the American Statistical Association 1987, 82:249266. Publisher Full Text

Hastie T, Stuetzle W: Principal curves.
Journal of the American Statistical Association 1989, 84:502516. Publisher Full Text

R Development Core Team: [http://www.Rproject.org] webcite
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2009.

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics.
Genome Biology 2004, 5:R80. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chiogna M, Massa MS, Risso D, Romualdi C: A comparison on effects of normalisations in the detection of differentially expressed genes.
BMC Bioinformatics 2009, 10:61. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bolstad B, Irizarry R, Astrand M, Speed T: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.
Bioinformatics 2003, 19:185193. PubMed Abstract  Publisher Full Text

Banfield JD, Raftery AE: Modelbased Gaussian and nonGaussian clustering.
Biometrics 1993, 49:803821. Publisher Full Text

Li J, Zha H: Twoway Poisson mixture models for simultaneous document classification and word clustering.

Banerjee A, Dhillon IS, Ghosh J, Sra S: Clustering on the unit hypersphere using von MisesFisher distributions.

Hartigan JA: Clustering Algorithms. New York, John Wiley & Sons; 1975.

de Berg M, Cheong O, van Kreveld M, Overmars M: Computational Geometry: Algorithms and Applications. Heidelberg, Springer; 2008.