Abstract
Background
The search for cluster structure in microarray datasets is a base problem for the socalled "omic sciences". A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped in the form of compact clouds of points, forming arbitrary shapes or paths embedded in a highdimensional space, as could be the case of some gene expression datasets.
Results
In this work we introduce the Penalized kNearestNeighborGraph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a twostep procedure: first it constructs the kNearestNeighborGraph of the dataset of interest using a low kvalue and then it adds edges with a highly penalized weight for connecting the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric.
Conclusions
In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwisedistance based clustering methods, to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method, they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their highdimensional data.
Background
The introduction of microarrays and other highthroughput technologies over the last years has changed fundamentally the biological and biomedical research. DNA microarrays, in particular, allow the simultaneous monitoring of thousands of genes on different experimental conditions or time points. Several problems can be faced with this technology. For example, it can be used for the identification of differentially expressed genes [1], which could highlight possible gene targets for more detailed molecular studies or drug treatments. Another application is to assign samples to known classes (class prediction) [2], using genetic profiles to improve, for example, the diagnosis of cancer patients. Both are supervised applications, in which all samples belong to a previously known class. It is very common, also, to use microarray data for (nonsupervised) exploratory analysis. A common application is to use a clustering algorithm to group the different genes. As stated by Richards et al. [3], clustering in this way can help summarizing datasets, reducing from thousands of genes to a small number of clusters. It can also help find systemic effects, as looking for small changes in expression levels across many genes in a cluster could be a better strategy for finding the causes of complex, polygenic disorders than looking for large changes in single genes [4]. Clustering genes can be seen as a "classical" clustering problem, with thousands of examples (genes) measured over a small number of variables (tissues, experimental conditions, etc.). The other common nonsupervised analysis is to cluster the samples (tissues/diseases/patients). The goal in this second situation is to find groups of samples sharing similar gene expression patterns. Seminal works in this area are Golub et al. [2] and Alizadeh et al. [5], both aimed at the discovery of new cancer subtypes. However, clustering samples is very different to clustering genes. In this case the analysis is made over socalled "wide datasets", with a few samples measured over thousands of features (genes). Dealing with high dimensional spaces is a known challenge for clustering procedures, as they usually fail to handle manifoldstructured data, i.e. data that form lowdimensional, arbitrary shapes or paths through a highdimensional space.
Clustering is a key component of pattern analysis methods, aiming at finding hidden structures in a set of data. Informally, the objective is to organize the samples in a way that "similar" objects are grouped together. It is a very active field of research [69], with applications that cover diverse problems, from image segmentation in computer vision [10] to characterizing customer groups based on purchasing patterns [11] and, as mentioned, the analysis of microarray expression data [12].
The problem of finding clusters in a dataset can be divided into three stages: i) measuring the similarity of the objects under analysis, ii) grouping the objects according to these similarities, and iii) evaluating the "goodness" of the clustering solution. The last stage has received little attention until recent years, when a growing interest in the problem can be noticed [1315]. The second stage (finding clusters efficiently given a set of similarities between objects) has been widely studied in the literature [16] and several clustering algorithms have been introduced. They are usually divided into hierarchical and partitional methods [17]. Hierarchical clustering (HC) algorithms find successive clusters using previously defined ones, in an agglomerative ("bottomup") or divisive ("topdown") way [18]. The result of this process is a binary tree, called a dendrogram [19]. HC has been extensively applied to microarray data [2,5,2022]. De Souto et al. [12] noted that more than 90% of published clustering applications to microarray data use HC. Partitional clustering algorithms find all the clusters simultaneously as a division of the data and do not impose a hierarchical structure. One of the most widely used approaches is the KMeans [23] algorithm (or its related version PAM [24]) that, starting from k (usually random) clusters, searches iteratively for a locally optimal solution of the clustering problem. Partitional methods have also been successfully applied to microarray data [2528]. Recently, Frey and Dueck [8] proposed the innovative and computationally efficient Affinity Propagation (AP) algorithm. According to this method, each data point is viewed as a node in a network. Nodes exchange messages until a set of clustercenters emerges as a solution. The algorithm shares characteristics with both hierarchical and partitioning methods. To the best of our knowledge, the work of Michele Leone and Weigt [29] is the only published application of AP to microarray data.
Several clustering methods targeted to microarray data have been proposed [3036], but, as discussed by De Souto et al. [12], most researchers still rely on the traditional HC method with Pearson's correlation or Euclidean metric, mostly because of its ease of use and availability. According to this, instead of developing new algorithms appropriate for highdimensional microarray problems, we propose here to obtain better solutions going back to stage one of clustering, i.e., by finding better ways to measure similarities between datapoints. These improved distances can be used by any base clustering method to produce better clustering results. In some sense, our view shares the spirit of kernel methods [37], looking for solutions to new problems by using appropriate new metrics together with wellknown pattern analysis algorithms.
In the last years, several methods for characterizing the nonlinear manifold where a highdimensional dataset may lie were developed, like ISOMAP [38], Locally Linear Embedding [39] or Laplacian Eigenmaps [40]. Basically, they all look for local neighborhood relations that can be used to produce low dimensional projections of the data at hand. In this paper we discuss a new strategy to evaluate similarities in manifold spaces that easily extends the application of any clustering algorithm to these cases. Following ISOMAP, we first create the knearestneighborgraph (knngraph) of the data, using a low kvalue. If the graph is disconnected, which is expected in clustering problems, we add a number of edges (following different strategies that will be discussed later) in order to create a connected graph. The key point of our method is that the added edges have a highly penalized length. We then apply an appropriate algorithm to measure interpoint distances along the connected graph and use these measures as (dis)similarities. We call the method the PKNNG metric (for Penalized KNearestNeighborGraph based metric). The PKNNG metric can be applied to any base measure of similarity (Euclidean, Pearson's correlation, Manhattan, etc.) and the resulting distances can be clustered with any of the usual methods (HC or Kmeans, for example).
There are some methods proposed in the recent literature that can handle arbitrary manifolds. In an early attempt to use manifold projection in clustering, Polito and Perona [41] showed how, in theory, LLE can naturally produce clusters. An interesting method (Path based clustering) based on graph theory was developed by Fisher and Buhmann in a series of papers [4244]. They start by assuming that, if two points belong to the same cluster, there should be a path between them, going over other points in the same cluster, such that all distances in that path are small. They consider then that the length of a given path can be defined as the maximum distance between two points in the path. Following, they define the distance between two points as the minimum length among all the paths that connect the two points. We call this metric the Path Based Metric (PBM). The idea of measuring distances based on "neighbor of a neighbor" relations is the same we use. The main differences are that we use a knearestneighbor graph, we consider all the edges in the path when we measure distances and we give a less extreme value to the maximum edge. PBM and the PKNNG metrics are basically heuristics, based on simple notions. There are also principled methods to find good metrics [45], learning the metric actively from the data and from examples of similar and dissimilar points provided by the final user. In principle the method can learn nonlinear metrics, but it is not designed for data lying in manifolds. The wellknown Single Linkage Hierarchical clustering [18] is efficient and has been widely used. This method is equivalent to finding the Minimum Spanning Tree (MST) of the dataset, and thus can also be considered as based on graphs.
Several more general but very efficient clustering methods have been introduced in the last years. We selected three of them to compare with our proposal of using a simple clustering method as PAM or HC with the new PKNNG metric. The Spectral Clustering algorithm [6,46] has recently received increasing attention and is considered to be effective for arbitrary manifolds using an appropriate kernel. The Evidence Accumulation Clustering algorithm (EAC) method of Fred and Jain [47] is based on the innovative idea of producing several different clustering solutions and extracting from them the information to produce the final clustering. The main idea is that if a pair of points is usually clustered together, under different conditions, then they could be assigned to the same cluster with high confidence. Recently, Kim et al. [48] have presented the MultiK clustering methods, which is based on the same general idea of ensemble clustering.
In the next Section we show results using some artificial problems in highdimensional spaces and discuss the application of the PKNNG metric to microarray datasets. Later, in the Methods Section, we explain in more detail the method and introduce and evaluate, in controlled experiments with other artificial datasets, the different schemes we use to connect the subgraphs and the diverse penalization functions considered in this work.
Results and Discussion
Evaluation on artificial datasets
In a first series of experiments we used artificial datasets to evaluate the behavior of the new metric in controlled situations, in which we change the difficulty of the clustering problem by setting, for example, the dimensionality of the input space or the distance between the clusters. In this and all other experiments in this work we consider that we know in advance the right number of clusters for each dataset (we discussed in the Introduction that we do not analyze the problem of finding the number of clusters in this work).
Experimental settings
We evaluated the proposed metric using five synthetic noisy datasets that simulate microarray samples:
A1
The first dataset is taken from Kim et al. [48]. It is a 3 clusters problem in a 300 genes space, where each cluster has 50 samples. As explained by the authors, the 300 genes are divided into ten groups or blocks, each one with 30 genes. In each block for a given cluster all the 30 dimensional samples were commonly drawn from a Normal distribution N(α_{30}, I_{30}), where α_{30 }is a 30dimensional vector with all components equal to α, which is randomly chosen from {0.5, 0, 0.5} in each block and I_{30 }is the identity matrix in a 30 dimensional space. As explained by Kim et al., this dataset represents gene sets with coexpression patterns that are commonly up or down regulated under specific experimental conditions.
A2
The previous dataset is a simple clustering problem, where all the clusters are well separated spheres. In this second artificial problem we start to consider the more complicated case of elongated clusters. This is a two clusters problem in a 100 genes space, where each cluster has 25 samples drawn from a Normal distribution. Each cluster has an elongated shape in the 100 dimensional space, where the principal direction (taken at random) explains 10% of the total standard deviation in the problem, and all the other dimensions have a similar deviation (the remaining 90% of the total standard deviation divided equally among the 99 directions). The principal axes of both clusters are parallel, separated by one and half times the deviation in the principal direction. This dataset simulates a problem in which all genes are highly correlated, but have a shift in their expression level for the two experimental conditions.
A3
This is also a two clusters problem, generated as in A2, where the centers of each cluster are in the same position as in A2 (but separated by two times the deviation in the principal direction), but each of the two clusters has a different random direction for its principal axis. This dataset simulates a problem in which all genes are still correlated, but the correlation matrix is different for each experimental condition, which leads to a better separation when using correlation as base metric.
A4
This is the three clusters version of problem A3, where each of the three clusters has a different random direction for its principal axis. The separation among neighbor clusters is equal to the deviation in the principal direction.
A5
This last dataset is similar to problem A4, three elongated clusters with different correlation matrices and the same separation, but this time there are 200 genes that express differentially for the three conditions, and there are also 200 noisy genes (sampled from a centered Normal distribution with the same deviation used in all nonprincipal directions).
We considered two base metrics to evaluate similarities: Euclidean and Pearson's correlation. We evaluated distances with the PKNNG metric using 5 neighbors, MinSpan connection and exponential penalization. We first verified on reduced experiments with the artificial problems that the choice of the number of neighbors is not critical to the final clustering results. In the Methods Sections we discuss in particular the low dependence of the results with the connection and penalization methods, using other three artificial problems.
For comparison with the PKNNG metric we also used the PBM and RBF metrics. The PBM is a graphbased metric that we already described in the Introduction. It has no free parameters. The RBF or Gaussian metric is based on the RBF inner product 〈x, z〉 = exp(x  z)^{2}/σ^{2}), using the property of a kernel vector space that relates kernel to distance, x  z^{2 }= 〈x, x〉 + 〈z, z〉  2〈x, z〉. This metric has a free parameter, σ, that acts as a global scale for the solution. We used two different procedures to set the value of σ. In the first case we set σ to the median of the base distance (Euclidean or correlation) among all pairs of points in the dataset. This follows the default strategy for model selection under the RBF kernel in the Kernlab R package [49]. We call this metric RBFmean. As a second strategy, we used the procedure suggested by Ng et al [46] and implemented in the spectral clustering method in the same R package. In this case the criterion is to select the σ that gives the minimum withincluster sum of square distances in the projected space for the spectral clustering method. We call this metric the RBFmin.
In all the experiments in this Subsection we use two of the most wellknown clustering methods: Hierarchical clustering (HC) with average linkage and PAM, an appropriate version of Kmeans. For each evaluation we applied the following procedure: First we created a realization of the dataset. Then we measured distances in the dataset using the base, PBM, PKNNG, RBFmean and RBFmin metric and applied HC and PAM to all distances. We evaluated the goodness of the solutions using the corrected Rand (cRand) index [50], also known as Adjusted Rand index (ARI), comparing the assigned cluster with the true label. For each dataset under study we produced 100 evaluations following this procedure and analyzed the corresponding distributions of cRand values.
Results
Figure 1 shows the results of the analysis of the dependence of PKNNG with k, the number of neighbors considered in the graph, over the five artificial datasets using two different base metrics (Euclidean and Pearson's correlation) and two clustering methods (PAM and HCaverage). For all the figures in this Section, the results are shown as boxplots of the distribution of cRand values corresponding to 100 experiments. For each distribution, the bottom and top of the box show the lower and upper quartiles, respectively, and the whiskers represent the lowest or highest datum still within 1.5 IQR of the corresponding quartile. Is is clear from the figure that the results are almost independent of k. In only 3 out of the 20 cases there is some dependence on k, all of them using the HCaverage method. Figures 2, 3, 4 and 5 show the results of the comparison of the five metrics. In Figure 2 we use PAM clustering with distances based on the Euclidean metric. It is clear that in all cases the clustering results with the new PKNNG metric are clearly superior to those obtained with the plain Euclidean metric. Also, PBM shows a very poor performance, always similar to a random clustering. The PBM metric was designed to work in cases in which there are clear separations among the clusters, which is not the case in our artificial problems when using the euclidean base metric. When the clusters are not completely separated, PBM's base idea that only the longest edge counts for measuring distances in the graph leads to wrong results. PKNNG also usually gives a high weight to long edges in the graph, because in most cases they are the penalized ones, but it also considers the contribution of all other edges in the path connecting two points. In this sense, our metric can be viewed as a softened version of PBM. Both versions of the RBF metric show equivalent results, always similar to the base Euclidean metric in this case. Finally, all five datasets seem to be equally difficult using these settings.
Figure 1. PKNNG's dependence on k: Results on artificial datasets. Evaluation of different number of neighbors (k) for the PKNNG metric. We use five artificial datasets that simulate gene expression problems and consider two clustering methods (PAM, top row, and HC, bottom row) and two base metrics (Euclidean, left column, and Pearson's correlation, right column).
Figure 2. Results on artificial datasets: Comparison with other graphbased metrics using PAM and the Euclidean base metric. Evaluation of the PKNNG metric on five artificial datasets that simulate gene expression problems. For each dataset we show the results of the PAM clustering method using different metrics: plain Euclidean, PKNNG, PBM, RBFmean and RBFmin.
Figure 3. Results on artificial datasets: Comparison with other graphbased metrics using HC (Av) and the Euclidean base metric. Evaluation of the PKNNG metric on five artificial datasets that simulate gene expression problems. For each dataset we show the results of the HC clustering method (Average Linkage) using different metrics: plain Euclidean, PKNNG, PBM, RBFmean and RBFmin.
Figure 4. Results on artificial datasets: Comparison with other graphbased metrics using PAM and the Correlation base metric. Evaluation of the PKNNG metric on five artificial datasets that simulate gene expression problems. For each dataset we show the results of the PAM clustering method using different metrics: plain Pearson's correlation, PKNNG, PBM, RBFmean and RBFmin.
Figure 5. Results on artificial datasets: Comparison with other graphbased metrics using HC (Av) and the Correlation base metric. Evaluation of the PKNNG metric on five artificial datasets that simulate gene expression problems. For each dataset we show the results of the HC clustering method (Average Linkage) using different metrics: plain Pearson's correlation, PKNNG, PBM, RBFmean and RBFmin.
In Figure 3 we show the results using the same base metric but with the HC clustering method. The results are very different this time, which highlights the importance of evaluating metrics with diverse clustering methods. For all metrics but the new PKNNG, HC tends to create one very small cluster in all problems (in fact, we had to force HC to discard clusters with less than 4 datapoints to obtain the nonrandom results in the figure). The PBM metric show the same poor results as with PAM. In the A1 dataset the results are similar to PAM (Figure 2), but in the other four cases there is a big drop on performance for all metrics but the new PKNNG.
We also applied all methods using as base metric the Pearson's correlation. There are clear differences comparing with the Euclidean base metric. In Figure 4 we show the results for the PAM clustering method. RBFmin shows a very poor performance in this case, due to a bad automatic selection of the σ parameter. PBM, on the other side, is better suited for this setup, in which the clusters are more separated. In fact, PBM produces the best results for the A5 dataset. Overall, PKNNG is still the best method but the differences are lower in this case. All methods show bad results for the A2 dataset. This is simply due to the fact that, as the two clusters are colinear, there is no separation between them using Pearson's correlation. This is a clear limit for the class of metrics that we consider in this work, which are improvements of a given base metric and work by enhancing the separation among certain datapoints. The clusters in problems A3, A4 and A5, which are elongated but no colinear, can be more easily separated by Pearson's correlation. The results with HC clustering, Figure 5 are qualitatively similar, but there are some differences in performance, in particular for the PBM metric in this case, which again shows the importance of evaluating metrics with diverse clustering methods.
Overall, considering the complete results on our artificial problems, PKNNG seems to be the more stable and efficient metric of the five compared in this work. It is interesting to note that the PKNNG metric produce accurate results both for elongated manifolds (A2 to A5) and for typical compact cloud data (A1).
Evaluation on gene expression datasets
In this second series of experiments we discuss the use of the PKNNG metric on real gene expression datasets.
Experimental settings
We included eight publicly available gene expression datasets in this analysis. Their main characteristics are summarized in Table 1. In the first seven datasets the objective is to cluster the different samples based on the information provided by the expression levels of a subset of genes. In these seven cases the samples have known classes corresponding to different tissues/phenotypes, measured by laboratory analysis (golden rule). In five of these datasets (ALB, LEU, BCLP, CNS and CGM) we used the reduced versions of Monti et al. [15]. In particular, for the CGM dataset we performed a second gene selection reducing the number of input genes to 1000, as in all the other datasets, considering the genes with the highest standard deviation. For the ALI dataset we used the version provided by De Souto et al. [12] and for the THY dataset we used the original version. In both cases we selected the subset of 1000 genes with the highest standard deviation. The last dataset (Y) is the only case in which the objective is to find four functional classes of genes based on information about different experimental conditions. In this case we used as input the original version of the dataset [20] and as golden rule the 4 functional classes selected by BenHur et al. [13] (one of the classes is in fact the union of two different functional classes). We included this dataset in our experimentation because it shows an intermediate gene clustering situation, with a fairly balanced examples/variables ratio (n ~ p) and a few functional classes. Overall, we selected datasets with diverse difficulties, regarding the number of classes, the balance among them and the examples/variables (n/p) ratio. Following Monti et al. [15], we normalized all datasets by row to mean zero and standard deviation one (for each sample we normalize genes for the first seven datasets and experimental conditions for the last one).
Table 1. Gene Expression datasets
We used the same experimental settings as with the artificial datasets. The only difference is in the sampling procedure. In this case, for each evaluation we took a sample with 95% of the examples in the dataset. Then we measured distances in the sample using all the metrics and applied the clustering method (HC or PAM) to all distances. For each dataset under evaluation we produced 100 evaluations following this procedure. Again, we evaluated the goodness of the clustering solutions using the cRand index, but this time comparing the assigned clusters with the original biological classes (golden rule). We also verified that there in no strong dependence on the value of k. We show the corresponding results in the additional file 1: Other evaluations.
Additional file 1. Other evaluations. Evaluation of: i) dependence on k for public gene expression datasets and ii) the use of HCcomplete linkage with our five metrics.
Format: PDF Size: 39KB Download file
This file can be viewed with: Adobe Acrobat Reader
We also applied three stateoftheart clustering algorithms, described in the Introduction, to all eight gene expression datasets: i) The spectral clustering method with the RBFmin metric, ii) The EAC method with average linkage in its second step and iii) the recently introduced MultiK clustering method. We also considered a fourth clustering method in this comparison, ModelBased Clustering (MBC) [51], which showed good results in geneexpression problems [52]. For this method we used a covariance structure consisting of a constant diagonal matrix for each cluster, but allowing a different constant for each cluster, which is the most complex structure that the method could use given the low number of samples in our gene expression datasets. In all cases we used the same protocol described in the previous paragraph for sampling and evaluation (in fact we used exactly the same 100 samples with 95% of the examples each one).
Results
In Figure 6 we show a comparison of the five metrics using the PAM clustering algorithm and the Euclidean base metric on the eight gene expression datasets considered in this work. PKNNG shows the more consistent behavior. In seven out of the eight datasets it is equivalent or superior to the other metrics. PBM and RBF results are much more problemdependent. In Figure 7 we show the same experiment but using HC as clustering method. The differences in favor of PKNNG are clearer in this case. The combination of HC and the PKNNG metric shows the biggest improvements in accuracy over HCPlain, in particular for the ALB, ALI, LEU and THY datasets. Smaller improvements can be observed for the CNS, BCLP and Y datasets. On the two figures it is notorious that in several cases PBM or RBF show results that are lower than the base metric. In Figures 8 and 9 we show the corresponding experiments using Pearson's correlation as base metric. Apart from the PBM metric, which always shows a bad performance, the remaining four metrics show a more stable result with this base metric. PKNNG show only small improvements over the plain metric in this case. When using the PAM method, PKNNG is the only metric that finds the correct solution for the ALB dataset. It also shows interesting results for the LEU and Y datasets. The ALI dataset with these settings is the only case in all our experiments in which another metric (RBFmin) clearly outperforms PKNNG.
Figure 6. Results on gene expression datasets: Comparison with other metrics using PAM and the Euclidean base metric. Evaluation of the PKNNG metric on eight gene expression datasets. For each dataset we show the results of the PAM clustering method using different metrics: plain Euclidean, PKNNG, PBM, RBFmean and RBFmin.
Figure 7. Results on gene expression datasets: Comparison with other metrics using HC (Av) and the Euclidean base metric. Evaluation of the PKNNG metric on eight gene expression datasets. For each dataset we show the results of the HC clustering method (Average Linkage) using different metrics: plain Euclidean, PKNNG, PBM, RBFmean and RBFmin.
Figure 8. Results on gene expression datasets: Comparison with other metrics using PAM and the Correlation base metric. Evaluation of the PKNNG metric on eight gene expression datasets. For each dataset we show the results of the PAM clustering method using different metrics: plain Correlation, PKNNG, PBM, RBFmean and RBFmin.
Figure 9. Results on gene expression datasets: Comparison with other metrics using HC (Av) and the Correlation base metric. Evaluation of the PKNNG metric on eight gene expression datasets. For each dataset we show the results of the HC clustering method (Average Linkage) using different metrics: plain Correlation, PKNNG, PBM, RBFmean and RBFmin.
In Figures 10 and 11 we show the comparison of our proposal (to use the PKNNG metric plus a simple clustering method like HC or PAM) with four stateoftheart clustering algorithms for the Euclidean and Pearson's correlation base metrics. In this case we also included other version of HC, the complete linkage method. For the Euclidean base metric, Figure 10 in six out of the eight datasets one of our proposals shows the best results (or equivalent to the best method) in extracting the biological information related to the original classes/tissues. Only for the ALB and LEU datasets some other method clearly outperforms our proposal. The same analysis is valid when using correlation as base metric (Figure 11). The EAC and MultiK methods outperforms all other methods in the ALI and LEU datasets and Spectral clustering shows a good result in the CNS dataset. Making a global evaluation, considering all datasets and base metrics, there are no clear winners or losers, all the methods are competitive in some situations. As we stated before, we are comparing efficient methods in these experiments and, in consequence, which one works better is highly dependent on the dataset and the base metric.
Figure 10. Results on gene expression datasets: Comparison with other stateoftheart clustering methods (Euclidean base metric). Evaluation of the PKNNG metric on eight gene expression datasets using the base Euclidean metric. For each dataset we compare our proposal (the PKNNG metric plus PAM or HC) with other four methods: ModelBased Clustering (MBC), Spectral clustering (Spectral), Evidence accumulation with average linkage (EACav) and the MultiK method.
Figure 11. Results on gene expression datasets: Comparison with other stateoftheart clustering methods (Correlation base metric) Evaluation of the PKNNG metric on eight gene expression datasets using Pearson's correlation as base metric. For each dataset we compare our proposal (the PKNNG metric plus PAM or HC) with other four methods: ModelBased Clustering (MBC), Spectral clustering (Spectral), Evidence accumulation with average linkage (EACav) and the MultiK method.
For completeness, in the Additional file 2: Evaluation with different number of clusters we show the results of changing the number of selected clusters in the gene expression datasets with all metrics and clustering methods. Those figures show that in almost all cases cRand results (for all the metrics and methods discussed in this work) have a simple dependence with the number of cluster, showing a peak at the right place for each problem and a smooth decay after that point. Of course, the results in those figures cannot be used to select the right number of clusters for each dataset, as the cRand measures are based in a previous knowledge of the golden rules.
Additional file 2. Evaluation with different number of clusters. Evaluation of the effect of changing the number of selected clusters in the gene expression datasets with all metrics and clustering methods.
Format: PDF Size: 178KB Download file
This file can be viewed with: Adobe Acrobat Reader
Conclusions
In this work we have discussed the Penalized kNearestNeighborGraph based metric and have shown that it is a useful tool for clustering arbitrary manifolds. The PKNNG metric is based on a twostep method: first it constructs the kNearestNeighborGraph of the dataset using a low kvalue (from 3 to 7), and then it uses penalized weights for connecting the subgraphs produced by the first step. In the Methods section we clearly show that the key factor in the good performance of the PKNNG metric is the use of highly penalized weights at the second step.
We evaluated our proposal using five artificial datasets with different difficulties (that simulate gene expression problems) and eight public gene expression datasets. We showed first that our new metric is superior to other graphbased metric previously introduced and to the base metrics (Euclidean and Pearson's correlation) in all cases under analysis. In second term we showed that our proposal produced an equivalent performance to other stateoftheart clustering methods on all the wide gene expression datasets evaluated in this work.
The use of the PKNNG metric can improve the performance of commonly used pairwisedistance based clustering methods, to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method, they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their highdimensional data.
Methods: The PKNNG metric
The evaluation of similarities with the PKNNG metric is a twostep process. First we search the original dataset space for locally dense (connected) structures using the knngraph. In the ideal case the process should end with a connected subgraph corresponding to each cluster but in the real case, when working with finite noisy samples, there are usually too many separated structures, typically more than one for each real cluster. In the second step we add penalized edges to the graph in order to fully connect it, and use an appropriate algorithm to measure distances in the (now) connected graph.
First step: knngraphs
Among the several algorithms for discovering low dimensional manifolds recently introduced, ISOMAP has strong theoretical properties and is also easy to understand. We follow the main ISOMAP idea to search for locally connected structures. As explained by Tenenbaum, de Silva and Langford [38], in a curved manifold the geodesic distance between neighboring points can be approximated by the Euclidean input space distance. For distant points, in contrast, geodesic distances are better approximated as a path of short segments connecting neighboring points. To this end, we construct the knngraph of the data, i.e. the graph with one vertex per observed example, arcs between each vertex and its k nearest neighbors, and with weights equal to the Euclidean distance between them. As we look for dense subgraphs, at the end of the process we eliminate all the outliers from the graphs. We consider that an arc is an outlier if it is not reciprocal (i.e. one of the vertex is not a knn of the other) and the length of the arc is an outlier of its own distribution (i.e. if its length is larger than the 3rd quartile plus 1.5 times the interquartile distance of the distribution of the lengths of all the edges in the graph). As isolated points usually produce not reciprocal connections in kconnectedgraphs, with this definition we are basically deleting long connections from these kind of points. In Figure 12 we show a toy example of this process. Panel (a) shows the original data with the corresponding knngraph. The isolated point at the bottom left corner is an outlier, and the edge connecting it with the nearest subgraph has been eliminated. In general, the use of a low number of neighbors (3 to 7 in all our cases, as in the original work [38]) produces graphs that can follow the curved structure of any data without adding "shortcuts" between geodesically distant points (a known issue with ISOMAP [53]), but usually paying the price of producing disconnected components [53], which we connect in the next step.
Figure 12. Connection schemes. A toy dataset illustrating the different connection schemes evaluated in this work. a) The original data with the knngraph. Note the disconnected outlier at the bottomright. b) The MinSpan scheme. c) The AllSubGraphs scheme. Note that all MinSpan added edges are included. d) The AllEdges scheme. e) The Medoids scheme. See text for details on each method.
Second step: connecting subgraphs
As stated previously, when working with real data the first step will usually generate separated structures, typically more than one for each real cluster. In the ideal case, each sub raph should correspond to a unique cluster and we could leave them disconnected and interpret those distances as equal to infinite, as done for example by Brito et al. [54]. In the real case, however, some subgraphs correspond to the same cluster or two cluster could lie in the same subgraph. Therefore, we need to evaluate also geodesic distances between points lying in different subgraphs. With that purpose, we add to the graph a number of edges in order to fully connect it.
Connection schemes
Which edges to add in order to connect the graph is an interesting problem by itself. We evaluated in this work four different schemes that cover most of the simple possibilities:
MinSpan
In this first scheme we add to the graph the minimum number of edges, each of them of minimum length (the minimum spanning set), which fully connect the graph. On panel (b) of Figure 12 we show the result of this connection strategy on our toy example.
AllSubGraphs
Here we connect each subgraph to all other subgraphs using minimum length edges. Figure 12 panel (c), shows this strategy on the toy example.
AllEdges
In this simple scheme we add to the graph all remaining edges (of course, with a penalized weight). On panel (d) of Figure 12 we show the corresponding graph.
Medoids
In this last strategy we first find the medoid of each subgraph, and then add edges connecting each medoid to all remaining medoids. Figure 12 panel (e), shows this strategy applied to our toy example.
The idea behind MinSpan is to add the shortest available edges trying to follow as much as possible the structure of the manifold. MinSpan basically produces onedimensional structures. We showed in a previous work [55] that this effect can introduce some instability for distant points, but it does not affect the performance of clustering. AllSubGraphs is an extension of MinSpan and AllEdges can be viewed as an extension of AllSubGraphs. At each step of this chain we add more edges, which increases the connectivity of the graph, reducing the potential instability of MinSpan. On the other hand, by adding edges we increase the probability of introducing "shortcuts" in the manifold, which can reduce the overall performance of the method. The Medoids scheme is basically introduced to have a different scheme to compare with. As it does not include the minimum spanning set, it is always forced to use other paths in the graph and, consequently, it is not expected to be able to follow a curved manifold. Neighboring points belonging to different subgraphs can be completely separated by this connection scheme.
Once we have a connected graph, we can compute geodesic distances between all points as minimumlength paths in the graph using computationally efficient algorithms like Floyd or Dijkstra [56]. This is the most time consuming stage in the computation of the metric, with a complexity of O(n^{2}logn), where n is the number of points in the dataset. It is interesting to note that the number of dimensions in the problem only enters in the initial computation of the distance among all points, which is a mandatory step for any clustering method. The burden that we are adding is only related to the number of points in the datasets, and is lower than the complexity of most hierarchical methods. Furthermore, in this work we analyze gene expression datasets, for which n is usually a low number.
In order to compare these four connection schemes in controlled situations we used three artificial datasets with different characteristics:
Twomoons
The first dataset corresponds to points uniformly sampled from two arcs of circumference (two clusters), with Gaussian noise of a fixed amplitude added to the radial direction.
Threespirals
The second artificial dataset was generated by sampling uniformly from three equally separated spirals, with added Gaussian noise proportional to the radial distance. The result is a three clusters problem, each one of them having a nonuniform density, which can confuse some algorithms.
Threerings
On this problem the central cluster corresponds to a uniform sampling of a circle, which is surrounded by two rings, also sampled uniformly but with constant Gaussian noise added to the radial component. We split both middle and outerrings in halves (adding a small gap), to create a more difficult five clusters problem. This third dataset has clusters with different (but uniform) densities.
Similar datasets were used by other author with the same goal [4648]. All datasets were generated in a (2dimensional) plane. For each one of them we used three different noise levels. An example of the artificial clustering problems is shown in Figure 13. In order to evaluate the performance of the PKNNG metric at handling lowdimensional manifolds in highdimensional spaces, we embedded our three datasets in 2D, 3D and 10D spaces, generating the following four settings:
Figure 13. Artificial datasets. Samples of the artificial datasets used in this work. From top to bottom, rows correspond to the Twomoons (2 clusters), Threespirals (3 clusters) and Threerings (5 clusters) datasets. From left to right, columns correspond to low, medium and high noise levels, respectively.
2D
In this first case we kept the three datasets in the original 2D space.
3D
To start increasing the difficulty of the clustering problems, we coiled the original plane to form a swissroll, producing a nonlinear embedding of the original clusters into a 3D space. In Figure 14 we show an example of the resulting problem.
Figure 14. Swissroll embedding. Example of the embedding of our 2D artificial clustering problems in a 3D swissroll.
3Dnoise
In this third setting we added Gaussian noise to the previous 3D embedding in order to drift the points away from the surface of the swissroll.
10dnoise
As a last and more difficult setting, we took the 3D coiled data and added 7 extra dimensions to the problem. We then applied a random rotation in the 10D space, and finally added Gaussian noise in all 10 dimensions.
In all cases we measured base similarities with the Euclidean metric. For these artificial datasets the quality of the solutions was evaluated in terms of the clustering accuracy, i.e. the percentage of the dataset assigned to the right cluster. For each case under evaluation (dataset + noise level + embedding) we produced 100 different realizations of the dataset and computed the mean clustering accuracy.
We used only two of these embeddings (3Dnoise and 10d) to compare the four connection schemes here (we show the results of the four embeddings in the additional file 3: Full Figures). In Figure 15 we show the corresponding results using the PAM clustering algorithm and the exponential penalization (discussed in the next Subsection). We repeated the full experiment using the HC clustering algorithm, finding completely equivalent results (data not shown). At each panel we also include the results of using the standard Euclidean metric as a baseline reference.
Additional file 3. Full Figures. Complete versions of Figure 15 and 16, including the four embeddings discussed in the Methods section.
Format: PDF Size: 45KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 15. Comparison of the different connection schemes. Comparison of the four different connection schemes evaluated in this work: MinSpan, AllSubGraphs, AllEdges and Medoids. Columns of subfigures correspond, from left to right, to the Twoarcs, Threespirals and Threerings datasets. Rows correspond to the two more difficult embeddings; from top to bottom: 3dnoise and 10dnoise. Each subfigure shows the mean clustering accuracy of the diverse methods at three different noise levels.
The qualitative results are similar in all the situations under analysis. It is clear from the figure that three connection schemes (MinSpan, AllSubGraphs and AllEdges) have very similar performances, clearly superior to the baseline methodology. The Medoids scheme, as expected, does not show a real improvement over the plain Euclidean metric. Comparing the three efficient schemes, MinSpan seems to be slightly superior to AllSubGraphs and AllEdges. Based in these results we selected the Minspan connection scheme in the general evaluation with real and artificial datasets (Results and Discussion Section).
Weighting schemes
The key point of the PKNNG metric is that we penalize the weight of these new connecting edges. The logic behind penalizing is simple: the edges added by the different connection schemes lie on very low density sections of the space and going through those regions when measuring distances is opposite to the basic idea that the graph connects high density regions. Then, we penalize the added edges in order to clearly differentiate paths that need to go through low density regions from those who only use high density regions.
We found that a very effective penalization is the use of an exponential factor of the form:
where w is the graph weight corresponding to the added edge, d is the Euclidean distance between the points being connected by that edge and μ is the mean edge weight in the original graph. Using this metric we can connect subgraphs corresponding to the same cluster with a relatively small cost, because connections in the same order of magnitude of μ will get a low penalization. On the other hand, edges connecting distant subgraphs will be strongly penalized. In the additional file 4: Evaluation of Penalization Functions we show results of experiments with several penalization functions, including linear (w = αd, with α ∈ ℜ^{+}), power (w = d (d/μ)^{k}, with k ∈ N ^{+}) and other exponential functions, and different definitions of μ. Those results show that the functional form of the penalization is not critical to the method, given that it highly penalizes long edges.
Additional file 4. Evaluation of Penalization Functions. Discussion and evaluation of other penalization functions.
Format: PDF Size: 77KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 16 shows an experiment performed to assess the real influence of the penalization on the PKNNG metric. In this case we included in the comparison two connection schemes, namely MinSpan and AllEdges, under two different settings: i) in the (normal) exponentially penalized version (as used in all previous experiments), and ii) in a "plain" version in which we eliminated the penalization and used for the added connecting edges the standard Euclidean distance. Both settings used exactly the same graphs, the only difference being the high penalization of some weights in the first case. It is easy to see that the "plain" version of AllEdges is simply the standard Euclidean metric, our baseline in all experiments. For completeness, we also include in this case the results obtained measuring similarities with the knngraph with the minimum kvalue that produces a fully connected graph (we call this method minkconnectedgraph). In Figure 16 we show the results obtained with the PAM clustering algorithm applied to the five settings described in this paragraph for the 10D embedding (in the additional file 3: Full Figures we show the complete figure with the four different embeddings). Analyzing the figure, the two methods that use "plain", nonfullyconnected Euclidean graphs (the MinSpanPlain and the minkconnectedgraph) show in most cases a better performance than the standard Euclidean metric, but, more importantly, the two penalized metrics are in all cases clearly superior to the corresponding "plain" methods. These results suggest that the key factor of our new metric is the high penalization of the added edges.
Figure 16. Penalization. Comparison of two connection schemes (MinSpan and AllEdges) in their "plain" and exponentially penalized forms, using the 10D embedding. Columns of subfigures are the same as in Figure 15.
Finally, at this point we can consider the problem of outlier datapoints. After the First Step, our method left all outlier datapoints disconnected to any subgraph. If we apply the Second Step as it was described in the last paragraphs, we will end with penalized connections to outlier datapoints. In this case, most clustering algorithms will create individual clusters for each outlier. The advantage of this solution is that outliers are clearly identified, but, on the other hand, the number of clusters is artificially increased. Most researchers prefer another treatment for outliers, to add them to the nearest cluster. This can be easily done with our metric, by taking into account the number of datapoints in a subgraph before penalizing a connection. If the edge connects to a single datapoint, we just add an edge with no penalization. As clustering method usually separate clusters by the penalized edges, in most cases outliers will be assigned to the nearest cluster. We used this setting in all the experiments in this work.
Availability
An R implementation of the PKNNG metric can be download from http://www.cifasisconicet.gov.ar/granitto/nng_0.0.1.tgz webcite or requested to the contact author. All dataset in this work can also be requested to the contact author.
Authors' contributions
AB implemented all methods and performed all statistical analysis, PMG wrote the manuscript, both authors devised and developed the method and read and approved the manuscript.
Acknowledgements
We acknowledge partial support for this project from ANPCyT (grants PICT 237, 643 and 2226). We thank Pablo F. Verdes, Monica Larese and Lucas Uzal for useful comments on this and on previous manuscripts.
References

Tibshirani R, Chu G, Tusher VG: Significance analysis of microarrays applied to the ionizing radiation response.
Proceedings of the National Academy of Sciences USA 2001, 98(9):51165121. Publisher Full Text

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 1999, 286(5439):531537. PubMed Abstract  Publisher Full Text

Richards A, Holmans P, O'Donovan M, Owen M, Jones L: A comparison of four clustering methods for brain expression microarray data.
BMC Bioinformatics 2008, 9:490. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Dettling M, Gabrielson E, Parmigiani G: Searching for differentially expressed gene combinations.
Genome Biology 2005, 6(10):R88. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403(6769):503511. PubMed Abstract  Publisher Full Text

Filippone M, Camastra F, Masulli F, Rovetta S: A survey of kernel and spectral methods for clustering.
Pattern Recognition 2008, 41:176190. Publisher Full Text

Franti P, Virmajoki O, Hautamaki V: Fast Agglomerative Clustering Using a kNearest Neighbor Graph.
IEEE Transactions on Pattern Analysis and Machine Intelligence 2006, 28:18751881. PubMed Abstract  Publisher Full Text

Frey BJ, Dueck D: Clustering by passing messages between data points.
Science 2007, 315:972976. PubMed Abstract  Publisher Full Text

Shi J, Malik J: Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22:888905. Publisher Full Text

Arabie P, Hubert L: Cluster Analysis in Marketing Research. In Advanced methods in marketing research. Oxford: Blackwell; 1994:160189.

de Souto M, Costa I, de Araujo D, Ludermir T, Schliep A: Clustering cancer gene expression data: a comparative study.
BMC Bioinformatics 2008, 9:497. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

BenHur A, Elissee A, Guyon I: A Stability Based Method for Discovering Structure in Clustered Data. In Pacific Symposium on Biocomputing Lihue: 37 January 2002; Hawaii. Edited by Altman RB, k Dunker A, Hunter L, Lauderdale K, Klein TE. World Scientific; 2002:617.

Tibshirani R, Walther G, Hastie T: Estimating the Number of Clusters in a Dataset Via the Gap Statistic.
Journal of the Royal Statistical Society B 2003, 63:411423. Publisher Full Text

Monti S, Tamayo P, Mesirov JP, Golub TR: Consensus Clustering: A ResamplingBased Method for Class Discovery and Visualization of Gene Expression Microarray Data.
Machine Learning 2003, 52(12):91118. Publisher Full Text

Xu R, Wunsch D II: Survey of Clustering Algorithms.
IEEE Transactions on Neural Networks 2005, 16:645678. PubMed Abstract  Publisher Full Text

Jain AK, Murti MN, Flynn PJ: Data Clustering: A Review.
ACM Computing Surveys 1999, 31:264323. Publisher Full Text

Sneath PHA, Sokal RR: Numerical Taxonomy. San Francisco: W.H. Freeman and Company; 1973.

King B: Stepwise clustering procedures.
Journal of the American Statistical Association 1967, 69:86101. Publisher Full Text

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proceedings of the National Academy of Sciences USA 1998, 95:1486314868. Publisher Full Text

Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen J, Marcussen N, HamiltonDutoit S, Wolf H, Orntoft T: Identifying distinct classes of bladder carcinoma using microarrays.
Nature Genetics 2003, 33:9096. PubMed Abstract  Publisher Full Text

Risinger J, Maxwell G, Chandramouli G, Jazaeri A, Aprelikova O, Patterson T, Berchuck A, Barrett J: Microarray analysis reveals distinct gene expression profiles among different histologic types of endometrial cancer.
Cancer Research 2003, 63:611. PubMed Abstract  Publisher Full Text

McQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability Statistical Lab. University of California: 21 June  18 July 1965 and 27 December 1965 and 7 January 1966; California. Edited by Cam LML, Neyman J. University of California Press; 1967:281297.

Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, New Jersey: John Wiley & Sons; 1990.

Shai R, Shi T, Kremen TJ, Horvath S, Liau LM, Cloughesy TF, Mischel PS, Nelson SF: Gene expression profiling identifies molecular subtypes of gliomas.
Oncogene 2003, 22(31):49184923. PubMed Abstract  Publisher Full Text

Calza S, Hall P, Auer G, Bjohle J, Klaar S, Kronenwett U, Liu E, Miller L, Ploner A, Smeds J, Bergh J, Pawitan Y: Intrinsic molecular signature of breast cancer in a populationbased cohort of 412 patients.
Breast Cancer Research 2006, 8(4):R34. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Waite R, Paccanaro A, Papakonstantinopoulou A, Hurst J, Saqi M, Littler E, Curtis M: Clustering of Pseudomonas aeruginosa transcriptomes from planktonic cultures, developing and mature biofilms reveals distinct expression profiles.
BMC Genomics 2006, 7:162. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Soukas A, Cohen P, Socci ND, Friedman JMH: Leptinspecific patterns of gene expression in white adipose tissue.
Genes and Development 2000, 14:963980. PubMed Abstract  PubMed Central Full Text

Michele Leone S, Weigt M: Clustering by softconstraint affinity propagation: applications to geneexpression data.
Bioinformatics 2007, 23(20):27082715. PubMed Abstract  Publisher Full Text

Heyer L, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes.
Genome Research 1999, 9:11061115. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li L, Weinberg C, Darden T, Pedersen L: Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method.
Bioinformatics 2001, 17:11311142. PubMed Abstract  Publisher Full Text

McLachlan G, Bean R, Peel D: A mixture modelbased approach to the clustering of microarray expression data.
Bioinformatics 2002, 18(3):413422. PubMed Abstract  Publisher Full Text

Liu L, Hawkins DM, Ghosh S, Young SS: Robust singular value decomposition analysis of microarray data.
Proceedings of the National Academy of Sciences USA 2003, 100(23):1316713172. Publisher Full Text

Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization.
Proceedings of the National Academy of Sciences USA 2004, 101(12):41644169. Publisher Full Text

Tjaden B: An approach for clustering gene expression data with error information.
BMC Bioinformatics 2006, 7:17. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Marttinen P, Myllykangas S, Corander J: Bayesian clustering and feature selection for cancer tissue samples.
BMC Bioinformatics 2009, 10:90. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

ShaweTaylor J, Cristianini N: Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press; 2004.

Tenenbaum J, de Silva V, Langford J: A global geometric framework for nonlinear dimensionality reduction.
Science 2000, 290:23192323. PubMed Abstract  Publisher Full Text

Roweis S, Saul L: Nonlinear dimensionality reduction by locally linear embedding.
Science 2000, 290:23232326. PubMed Abstract  Publisher Full Text

Belkin M, Niyogi P: Laplacian eigenmaps and spectral techniques for embedding and clustering.
Advances in Neural Information Processing Systems 2002, 14:585591.

Polito M, Perona P: Grouping and Dimensionality reduction by Locally Linear Embedding. In Advances in Neural Information Processing Systems, 14 38 December 2001; Vancouver, British Columbia, Canada. Edited by Dietterich TG, Becker S, Ghahramani Z. MIT Press; 2001:12551262.

Fischer B, Buhmann JM: Path Based Pairwise Data Clustering with Application to Texture Segmentation. In Energy Minimization Methods in Computer Vision and Pattern Recognition, 35 September 2001;Sophia Antipolis, France. Edited by Figueiredo MAT, Zerubia J, Jain AK. Springer; 2001:235250. Publisher Full Text

Fischer B, Buhmann JM: Pathbased clustering for grouping of smooth curves and texture segmentation.
IEEE Transactions on Pattern Analysis Machine Intelligence 2003, 25:513518.

Fischer B, Buhmann JM: Bagging for Pathbased clustering.
IEEE Transactions on Pattern Analysis Machine Intelligence 2003, 25:14111415. Publisher Full Text

Xing EP, Ng A, Jordan MI, Russell S: Distance metric learning, with application to clustering with sideinformation. In Advances in Neural Information Processing Systems 15 914 December 2002; Vancouver, British Columbia, Canada. Edited by Becker S, Thrun S, Obermayer K. MIT Press; 2002:505512.

Ng A, Jordan M, Weiss Y: On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14 38 December 2001; Vancouver, British Columbia, Canada. Edited by Dietterich TG, Becker S, Ghahramani Z. MIT Press; 2001:849856.

Fred A, Jain AK: Combining Multiple Clusterings Using Evidence Accumulation.
IEEE Trans. Pattern Anal. Mach. Intell 2005, 27(6):835850. PubMed Abstract  Publisher Full Text

Kim EY, Kim SY, Ashlock D, Nam D: MULTIK: accurate classification of microarray subtypes using ensemble kmeans clustering.
BMC Bioinformatics 2009, 10:260. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Karatzoglou A, Smola A, Hornik K, Zeileis A: kernlab  An S4 Package for Kernel Methods in R.

Milligan G, Cooper M: A study of comparability of external criteria for hierarchical cluster analysis.
Multivariate Behavioral Research 1986, 21:441458. Publisher Full Text

Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17(10):977987. PubMed Abstract  Publisher Full Text

Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng G: Evaluation and comparison of gene clustering methods in microarray analysis.
Bioinformatics 2006, 22(19):2405. PubMed Abstract  Publisher Full Text

Mekuz N, Tsotsos JK: Parameterless Isomap with Adaptive Neighborhood. In Pattern Recognition, 28th DAGM Symposium, LNCS 4174. Edited by Franke K, Muller KR, Nickolay B, Schafer R. Springer; 2006:364373.

Brito M, Chavez E, Quiroz A, Yukich J: Connectivity of the mutual knearestneighbor graph in clustering and outlier detection.

Baya A, Granitto PM: ISOMAP based metrics for Clustering.
Inteligencia Artificial 2008, 12(37):1523. Publisher Full Text

Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms, Second Edition. MIT Press and McGrawHill; 2001.

Yeoh EJ, Ross ME, Shurtle SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling.
Cancer Cell 2002, 1:133143. PubMed Abstract  Publisher Full Text

Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Largescale Analysis of the Human and Mouse Transcriptomes.
Proceedings of the National Academy of Sciences USA 2002, 99:44654470. Publisher Full Text

Pomeroy S, Tamayo P, Gaasenbeek M, Angelo LMSM, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Gene ExpressionBased Classification and Outcome Prediction of Central Nervous System Embryonal Tumors.
Nature 2002, 415:436442. PubMed Abstract  Publisher Full Text

Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures.
Proceedings of the National Academy of Sciences USA 2001, 98:1514915154. Publisher Full Text