Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series

Open Access Research

A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis

Raffaele Giancarlo1, Giosué Lo Bosco1*, Luca Pinello2 and Filippo Utro3

Author affiliations

1 Dipartimento di Matematica ed Informatica, Universitá di Palermo, Via Archirafi 34, 90123 Palermo, Italy

2 Department of Biostatistics at Dana-Farber Cancer Institute and Harvard School of Public Health, 44 Binney Street, Boston, Massachusetts 02115, USA

3 Computational Genomics Group, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598, USA

For all author emails, please log on.

Citation and License

BMC Bioinformatics 2013, 14(Suppl 1):S6  doi:10.1186/1471-2105-14-S1-S6


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/S1/S6


Published:14 January 2013

© 2013 Giancarlo et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research.

Results

A procedure is proposed for the assessment of the discriminative ability of a distance function. That is, the evaluation of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows to map distances and clustering solutions as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which shows similar performances with respect to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. In fact, it allows not only to quantify the merit of each clustering solution but also to quantify the agglomerative or divisive errors due to the algorithm.

Conclusions

The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, we have that the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation. since it requires that a dataset must have a substantial number of features to be reliable. Nevertheless, taking into account such a fact, together with results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, it results that K-means, Average Link, and Complete link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the same format of a "feature matrix". In particular it can be used to cluster ChIP-seq data.

Background

Recently, medical and biological research has been deeply influenced by the advent of high throughput technologies such as microarrays and RNA-seq platforms. They enable the acquisition of data that are fundamental for research in several areas of the biological sciences such as understanding biological systems and diagnosis (e.g. [1]). A fundamental aspect of microarray data analysis consists of clustering gene expression data [2,3]. However, its application to post-genomic data has revealed to be rather ad hoc. That is the reason why there is hardly any consensus on the best distance function and clustering algorithm to be used for the different types of post-genomic data. As a consequence, the common practice is to use several different clustering techniques to analyse a dataset, and to resort to visual inspection and prior biological knowledge to select what is considered the most "appropriate" result. Clearly, this data analysis is highly subjective, implying obvious risks. Those observations have motivated Handl et al. [4] to write a seminal paper with the intent to show to both bioinformatics researchers and end-users some of the fundamental aspects of the clustering methodology. The main message of that paper is that clustering can be considered as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm and (3) choice of a methodology to estimate the statistical significance of clustering solutions. Points (2) and (3) lead into two well established and rich research areas in data analysis ranging from statistics to Computer Science. Although computational methods for the analysis of microarray data have witnessed an exponential growth, a few contributions have been given in trying to assess their merits [5]. As a result, the need for a comprehensive evaluation of the whole analysis process for microarray data is being recognized and a few benchmarking studies have appeared [6-8]. Unfortunately, point (1) has been hardly investigated regarding this new type of data and very few results on this topic are available (see [2,9,10] and references therein).

In this paper, we address point (1) by introducing a new qualitative and quantitative method to describe and assess the discriminative ability of a distance function alone and in conjunction with a clustering algorithm. Moreover, the methodology is also able to give indications about the bias of clustering algorithms with respect to distances. It is worth recalling that very little is known about this latter point, one of the difficulties being a fair comparison between the performance of a distance function and a clustering algorithm measured in terms of their classification ability. This point is discussed in detail in the Methods section. The overall methodology that is introduced here makes use of the ROC plane and the ROC curve [11] in order to define the new external clustering validation index BMI and the new CROC curve. The net effect is the delivery of a methodology that rigorously uses external knowledge in order to assess the performance of a distance function while granting a fair comparison with clustering solutions generated by a clustering algorithm. It is worth mentioning that previous approaches to this problem presented the shortcomings of being based only on internal indices [10], i.e., homogeneity and separation: Indeed, it is well known that external validation is more accurate than the internal one [12]. The remainder of this paper is organized as follows. The experimental set-up we have used and the results are presented in the next section. Then, some conclusions and directions of future research are offered next. Finally, the Methods section describes in detail the new methodology to assess the intrinsic separation ability of three distance functions, and its use in conjunction with clustering algorithms.

Results and discussion

Experimental setup

Datasets

Technically speaking, a gold solution GS for a dataset is a partition of the data in a number of classes known a priori. Membership of a class is established by assigning the appropriate class label to each element. This means that the partition of the dataset in classes is based on some external knowledge that leaves no ambiguity on the actual number of classes and on their composition in terms of class memberships. Moreover, is also important to state that there exist two main kinds of gold solution datasets, i.e., (i) the ones for which an priori division in to classes of the dataset is known; (ii) and the ones for which high quality partitions have been inferred by analyzing the data. Dudoit and Fridlyand [13] elegantly make clear that difference in a related study and we closely follow their approach here.

Each dataset is a matrix, in which each row corresponds to an element to be clustered and each column to an experimental condition. The nine datasets, together with the acronyms used in this paper, are reported next. For conciseness, we mention only some relevant facts about them. The interested reader can find additional information in Dudoit and Fridlyand [13] for the Lymphoma and NCI60 datasets, Di Gesú et al. [14] for the CNS Rat, Leukemia and Yeast datasets and in Monti et al. [15], for the remaining ones.

CNS Rat: It is a 112 × 17 data matrix, obtained from the expression levels of 112 genes during a rat's central nervous system development. The dataset was studied by Wen et al. [16] and they suggested a partition of the genes into six classes, four of which are composed of biologically, functionally-related genes. This partition is taken as the gold solution, which is the same one used for the validation of FOM [17].

Gaussian3: It is a 60 × 600 data matrix. It is generated by having 200 distinctive features out of the 600 assigned to each cluster. There is a partition into three classes and that is taken as the gold solution. The data simulates a pattern whereby a distinct set of 200 genes is up-regulated in one of the three clusters, and down-regulated in the remaining two.

Gaussian5: It is a 500 × 2 data matrix. It represents the union of observations from 5 bivariate Gaussians, 4 of which are centered at the corners of the square of side length λ, with the 5th Gaussian centered at (λ/2, λ/2). A total of 250 samples, 50 per class, were generated, where two values of λ are used, namely, λ = 2 and λ = 3, to investigate different levels of overlapping between clusters. There is a partition into five classes and that is taken as the gold solution.

Leukemia: It is a 38 × 100 data matrix, where each row corresponds to a patient with acute leukemia and each column to a gene. The original microarray experiment consists of a 72 × 6817 matrix, due to Golub et al. [18]. In order to obtain the current dataset, Handl et al. [4] extracted from it a 38 × 6817 matrix, corresponding to the learning set in the study of Golub et al. and, via preprocessing steps, they reduced it to the current dimension by excluding genes that exhibited no significant variation across samples. The interested reader can find details of the extraction process in Handl et al.. For this dataset, there is a partition into three classes and that is taken as the gold solution. It is also worthy of mention that Leukemia has become a benchmark standard in the cancer classification community [19].

Lymphoma: It is a 80 × 100 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Alizadeh et al. [20] on the three most common adult lymphoma tumors. There is a partition into three classes and it is taken as the gold solution. The dataset has been obtained from the original microarray experiments, consisting of an 80 × 4682 data matrix, following the same preprocessing steps detailed in Dudoit and Fridlyand [13].

NCI60: It is a 57 × 200 data matrix, where each row corresponds to a cell line and each column to a gene. This dataset originates from a microarray study in gene expression variation among the sixty cell lines of the National Cancer Institute anti-cancer drug screen [21], which consists of a 61 × 5244 data matrix. There is a partition of the dataset into eight classes, for a total of 57 cell lines, and it is taken as the gold solution. The dataset has been obtained from the original microarray experiments as described by Dudoit and Fridlyand [13].

Novartis: It is a 103 × 1000 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Su et al. [22] on four distinct cancer types. There is a partition into four classes and we take that as the gold solution.

Simulated6: It is a 60 × 600 data matrix. It consists of a 600-gene by 60-sample dataset. It can be partitioned into 6 classes with 8, 12, 10, 15, 5, and 10 samples respectively, each marked by 50 distinct genes uniquely up-regulated for that class. In addition, a list of 300 noise genes (i.e., genes having the same distribution within all clusters) are included. In particular, such genes are generated with decreasing differential expression and increasing variation, following the same distribution. Finally, the first block of 50 genes of the list is assigned to cluster 1, the second block to cluster 2 and so on. This partition into 6 classes is taken as the gold solution.

Yeast: It is a 698 × 72 data matrix, studied by Spellman et al. [23] whose analysis suggests a partition of the genes into five functionally-related classes, which is taken as the gold solution and which has been used by Shamir and Sharan for a case study on the performance of clustering algorithms [24].

Distances

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13">View MathML</a> be a set. A function <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M14">View MathML</a> is a distance (or dissimilarity) on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13">View MathML</a> if, ∀ x, y ∈ <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13">View MathML</a>, it satisfies the following three conditions:

1. δ(x, y) ≥ 0 (non-negativity);

2. δ(x, y) = δ(y, x) (symmetry);

3. δ(x, x) = 0;

In the case of microarray data, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13">View MathML</a> = <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M15">View MathML</a>, i.e. each data point <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1">View MathML</a> is a vector in m-dimensional space. Note that a dataset X is a finite subset of <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M13">View MathML</a>, |X| = n. One can categorize distance functions according to three broad classes: geometric, correlation-based and information-based. Functions in the first class capture the concept of physical distance between two objects. They are strongly influenced by the magnitude of change in the measured components of vectors <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2">View MathML</a>, making them sensitive to noise and outliers. Functions in the second class capture dependencies between the coordinates of two vectors. In particular, they usually have the benefit of capturing positive, negative and linear relationships between two vectors. Functions in the third class are defined via well known quantities in information theory such as entropy and mutual information [25]. They have the advantage of capturing statistical dependencies between two discrete data points, even if they are not linear. Unfortunately, when one tries to apply them to points in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M15">View MathML</a>, a suitable discretization process must be carried out, known as binning, which usually poses some non-trivial challenges. For our experiments, we have considered the Euclidean distance, the Pearson correlation and Mutual Information since they are excellent representatives of the three categories described above. Indeed, they have been shown to be the most suitable for microarray data [9]. For the convenience of the reader, they are defined in the Methods section.

In what follows, we refer to distance and dissimilarity functions with the generic term distance functions.

Algorithms and hardware

In our experiments, we have chosen K-means among Partitional Methods, and Average Link, Complete Link and Single Link among the Hierarchical Methods clustering algorithms. The details of those algorithms are not reported here and the interested reader can find a detailed description of them in [26]. Of course, each of the above mentioned algorithms has already been used for data analysis of microarray data, e.g. [14,27-29]. All experiments were performed on several state-of-the-art PCs.

Evaluating the performance of distance functions via the BMI index and the CROC curve

In order to shed light on the proper choice of a distance function for clustering of microarray data, one needs to address the following points:

(A) Assessment of the intrinsic separation ability of a distance. That is, how well a distance discriminates independently of its use within a clustering algorithm.

(B) Assessment of the predictive clustering algorithm ability of a distance. That is, which distance function grants the best performance when used within a clustering algorithm.

(C) The interplay between (A) and (B).

Points (A) and (B) have been studied before (see [9] and references therein) with some useful insights. Unfortunately, very little is known about (C), one of the difficulties being a fair comparison between the performance of a distance function and a clustering algorithm measured in terms of classification ability (the technical details regarding this point are in the Methods section). We address this latter problem by introducing the BMI and the CROC curve. Technically, the idea is to map clustering solutions on the ROC plane and then to design a procedure that allows a fair comparison of distance functions and clustering algorithms via that mapping. Although the full details are given in the Methods section, it results convenient here to outline our approach. To this end, we need to introduce some terminology and recall some definitions. Given a clustering solution C = {c1, ...,cr}, it can be represented by a binary matrix J, referred to as connectivity matrix, where each entry of J is defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M3">View MathML</a>

(1)

Note that an important property of the connectivity matrix is transitivity, i.e. ∀i, j, k such that J(i, j) = J(j, k) = 1, then J(i, k) = 1. This is straightforward since a clustering solution is a partition of the dataset, so that two different clusters have always empty intersection. A useful tool to assess the performance of a classifier, not necessarily binary, is the confusion matrix, which is a matrix where each row represents the instances in a predicted class, while each column represents the instances in an actual class. In the case of a binary classification, the 2 × 2 confusion matrix stores the number of elements of class 0 classified as 0, denoted T0, and the number of elements of class 0 classified as 1, denoted F1. One defines T1and F0 analogously. In this context, the Sensitivity TPR and Specificity TNR are defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M4">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M5">View MathML</a>

A ROC plane is a plane where y = TPR and x = FPR = 1 - TNR, and it is useful to measure a classification in terms of TPR and FPR rates, once having established to represent with 0 the positive class. Note that, since a classifier assigns data items to classes, the TPR represents the percentage of item pairs correctly assigned to different classes, while the FPR is the percentage of item pairs incorrectly assigned to different classes. In the ROC plane, it is possible to define the ROC curve, which is a two-dimensional visualization of TPR versus FPR for increasing threshold values. Indeed, the area under this curve (AUC for short) is defined in the range [0,1], where a value of 0.5 corresponds to the performance of a classifier with a random assignment rule, while the closer is AUC to one, the better is the performance of the classifier. The CROC curve of a distance is a transformation of the ROC curve in which each point corresponds to a proper clustering solution.

We address point (C) by:

(C.1) showing how to map a clustering solution into the ROC plane (see subsection Clustering solutions, ROC plane and the BMI index)

(C.2) introducing a distance between a clustering solution and GS (see subsection A procedure to compare distance functions and clustering algorithms via ROC analysis);

(C.3) showing how (C.1) and (C.2) can be used to fairly compare the intrinsic ability of distance functions and of a clustering algorithms to identify "structure" in a dataset (see subsection A procedure to compare distance functions and clustering algorithms via ROC analysis).

The BMI takes values in the range [0,1]. Moreover, the closer the value of the index is to zero, the better the agreement between a partition and the GS. Therefore, in contrast to the most popular external validation indices [27], which need to be maximized, BMI needs to be minimized.

Results

The BMI can be regarded as the core of the methodology, and here it has been used in multiple ways: to assess a distance, to assess an algorithm, or purely as an external validation index. In the following, we give the result details for each one of these uses.

Values on Table 1 show the BMI computed for the three considered distances on each of the 9 datasets. In this case, the BMI refers to the best clustering solution produced, i.e. the closest point to (0,1) on the CROC curve (see the paragraph The BMI index and the CROC curve in the methods section for details). Therefore, the lower the value of the BMI for a distance, the better its intrinsic discrimination ability is. An analysis of the results in Table 1 shows that the Euclidean distance d2 and Pearson correlation distance dr have a good intrinsic discrimination ability when computed on most of the datasets (see the bold values in the Table 1), except on CNS-RAT and Yeast, in agreement with [9] and [30]. The same behavior is not confirmed by the mutual information distance dMI, which does not seem to offer the same flexibility and versatility as the other two distances. Taking into account such a fact, together with the results presented in [10], one receives an indication that MI may be superior to the considered distances only in conjunction with clustering algorithms specifically designed for its use. That is, although theoretically superior to other distances in terms of its ability to capture statistical dependency, it does not seem to offer the same flexibility and versatility as the other two distances considered here.

Table 1. BMI-values

Figures 1-3 show the partitions in the ROC plane for each considered algorithm, and the CROC for each distance. As it is discussed next, the availability of both BMI and CROC allows a detailed analysis of the interaction between a distance function and a clustering algorithm and represents one of the main contributions of this research. Indeed, we have that K-means, Average Link, and Complete link clustering algorithms are in most cases able to improve the intrinsic separation ability of a distance function with respect to clustering. This can be observed by looking at the gray area in the corresponding figures since it represents the set of points which have a better performance with respect to the best distance point for BMI. Thus, we can assert that an algorithm improves the intrinsic separation ability of a distance when its BMI falls inside the grey region. In particular, Complete link is the best performer, falling 19/27 times into the gray region, followed by K-means (16/27) and Average link (12/27). It is worth pointing out that the intrinsic separation ability of dMI is not improved by Average Link and that Single Link is the worst performer (3/27). Moreover, the mapping of a clustering solution s to a point Ps allows to observe the agglomerative or divisive behaviour of the clustering algorithm that has produced s (see subsection Clustering solutions, ROC plane and the BMI index for more details about the methodology). Results show a more divisive behavior of Complete Link for all the considered distances, and of K-means in the case of Pearson and MI distances.

thumbnailFigure 1. CROC Euclidean. The CROC curve and plot of the clustering solutions for each dataset in the case of the Euclidean distance. Each subfigure is referred to a dataset. The markers show TPR versus FPR of each clustering solution. The area in gray represents the set of points which have a better performance with respect to the best distance point for BMI, while the dotted line represents set of points with the same performance. Complete link, K-means and Average link fall into the gray region for most of the dataset.

thumbnailFigure 2. CROC Pearson. The CROC curve and plot of the clustering solutions for each dataset in the case of Pearson Correlation. Each subplot is referred to a dataset. The markers show TPR versus FPR of each clustering solution. The area in gray represents the set of points which have a better performance with respect to the best distance point for BMI, while the dotted line represents set of points with the same performance. As in Figure 1, Complete link, K-means and Average link fall into the gray region for most of the dataset.

thumbnailFigure 3. CROC Mutual Information. The CROC curve and plot of the clustering solutions for each dataset in the case of Pearson Correlation. Each subplot is referred to a dataset. The markers show TPR versus FPR of each clustering solution. The area in gray represents the set of points which have a better performance with respect to the best distance point for BMI, while the dotted line represents set of points with the same performance. Complete link and K-means fall into the gray region for most of the dataset, but this is not the case of Average Link.

The BMI can be regarded also as an external validation index and it can be used to asses the performance of clustering algorithms. Therefore, in order to validate this new index we compare it with the state of the art indices in the literature. Indeed, we have considered the set <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16">View MathML</a> of all the clustering solutions corresponding to the points in the CROC for each considered distance d, and each dataset X. Then, for each element in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16">View MathML</a> we have computed the BMI, the Adjusted Rand (RA), the Fowlkes and Mallows (FM) and the F values. For the convenience of the reader, the latter three indices are defined in the Methods section. Figures 4-6 plot the values of BMI, RA, FM, F for each clustering solutions in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16">View MathML</a>. Each figure refers to a distance, and each sub-figure to a dataset. The x-axis represents the number of clusters of each clustering solution in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M16">View MathML</a>. The Figures seem to suggest a strong anti-correlation between the curves of BMI and the other indices. In order to quantify this correlation, we have computed the Pearson correlation between the curves of BMI and those latter indices. The results are in Tables 2 - 4 and show that the BMI is on most of the datasets highly anti-correlated with the three other indices. In particular, BMI shows 13/27 times a strong anti-correlation (bold values in Tables 2) and 10/27 times a moderate anti-correlation with RA, 13/27 times a strong anti-correlation (bold values in Tables 3) and 12/27 a moderate anti-correlation with F, and 11/27 times a strong anti-correlation (bold values in Tables 4) and 13/27 times a moderate anti-correlation with FM. This demonstrates that the BMI values are congruent with other indices, so its use as an external validity index is fully justified.

thumbnailFigure 4. BMI and RA,F,FM Euclidean. Plots of the BMI, RA, F, FM for all the clustering solution corresponding to the points in the CROC of the Euclidean distance. Each subplot refers to a dataset.

thumbnailFigure 5. BMI and RA,F,FM Pearson. Plots of the BMI, RA, F, FM for all the clustering solution corresponding to the points in the CROC of the Pearson Correlation Distance. Each subplot refers to a dataset.

thumbnailFigure 6. BMI and RA,F,FM MI. Plots of the BMI, RA, F, FM for all the clustering solution corresponding to the points in the CROC of the MI distance. Each subplot refers to a dataset. For Gaussian5, the dMI is not computed due to the small number of features and the corresponding subplot has been removed from the figure.

Table 2. BMI vs RA

Table 3. BMI vs F

Table 4. BMI vs FM

Conclusions

In this paper we have presented a procedure to asses the discriminative ability of a distance for data clustering. Such procedure is based on the BMI, a new external validation index that has the versatility to be used to asses a distance, to asses an algorithm, or purely as an external validation index. We have applied the overall methodology on 9 datasets, in the case of the Euclidean, Pearson and Mutual Information distances. Some of the computed results agree with other state of the art external validation indices, but with respect to them our procedure is more informative since it can shed light on the bias of the clustering algorithm with respect to a distance. An important thing to stress about the proposed methodology is that, although it was validated in the context of gene expression data obtained by microarray technologies, it is worth pointing out that the proposed methodology is "generic", i.e, it can be applied to other kind of data (e.g. RNA-seq). As a future direction of investigation, we intend to extend the proposed methodology to study a very challenging problem in the field of data analysis, i.e., the quantification of the intrinsic complexity of a dataset, defined as: the difficulty for a clustering algorithm to find the correct partition of a dataset.

Methods

Definition of distance functions

We now formally define the distances used in this paper.

The Euclidean distance belongs to the geometric class of distances, and it is defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M6">View MathML</a>

(2)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1">View MathML</a> = (x1,...,xm), <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2">View MathML</a> = (y1,...,ym).

The Pearson distance dr is a correlation-based distance:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M7">View MathML</a>

(3)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2">View MathML</a> are the sample means of <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M2">View MathML</a>.

The Mutual Information distance dMI is an information-based distance so defined

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M8">View MathML</a>

(4)

where pi = P(X = xi) and pj = P(Y = yj ) are the marginal probability mass functions (p.m.f. for short) and pij = P(X = xi,Y = yj ) the joint p.m.f. When dealing with such a distance, the problem is the estimation of the marginal and joint p.m.f., which involves a discretization of the data values, usually done by using binning and histogram based procedures [31].

Definition of external indices

Recall from [27] that an external index measures how well a clustering solution computed by an algorithm agrees with the gold solution for a given dataset. Formally, let C = {c1,...,cr} be the partition of the items in dataset X into r clusters, corresponding to the gold solution for that dataset. Let P = {p1,...,pt} be an analogous partition, possibly produced by a clustering algorithm.

An external index measures the level of agreement of the two partitions. External indices are usually defined via a r × t contingency table T, where Tij represents the number of items in both ci and pj, 1 ≤ i r and 1 ≤ j t.

For our experiment we have used the RA, FM and F indices. We report their formulas next, pointing out that additional details about them can be found in [27].

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M9">View MathML</a>

(5)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M10">View MathML</a>

(6)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M11">View MathML</a>

(7)

where Ti. = |ci| and T.j = |pj |.

Note that there is a little difference in the range of values of the three indices: while the FM and the F indices can assume a value in the range [0,1], the RA may be negative [32]. All three indices need to be maximized, that is, for each of them, the closer the index is to one, the better the agreement between the two partitions.

The BMI index and the CROC curve

The ROC plane can be used to estimate the similarity between a reference partition and a generic one as follows. The reference partition is mapped to the point (0, 1) in the ROC plane, corresponding to perfect classification. Analogously, the generic partition is mapped to a point in the ROC plane, depending on the number of "misclassified" elements with respect to the reference partition. Then, a distance measure between such a point and (0, 1) gives an indication about the similarity of the partitions. The BMI is the Euclidean distance between those two points. Moreover, the mapping of partitions into the ROC plane at the base of the BMI can be used to assess the intrinsic discriminative ability of a distance function for clustering by generating the CROC curve and by considering the closest point to (0, 1) on this curve as a clustering solution associated to the distance function. Then, the BMI between this point and (0,1) gives the required estimate. In the following subsections, we give details about the BMI and the CROC.

Clustering solutions, ROC plane and the BMI index

Given a gold solution GS, it is possible to map a clustering solution s into the ROC plane as follows:

1. Compute the connectivity matrix Js for s.

2. Starting from Js, compute the confusion matrix with respect to GS.

3. Use that confusion matrix to compute TPR and FPR for s. Those two variables naturally identify a point into the ROC plane, associated to s.

A few remarks are in order. The above approach naturally leads to measure a clustering solution in terms of TPR and FPR rates. As anticipated, the point into the ROC plane associated with GS is PGS = (0,1).

Given a clustering solution s, let Ps = (x, y) be the point in the ROC plane corresponding to it.

The performance of s is proportional to the proximity of Ps to PGS, as we now explain. Let Em be the Misclassification error rate defined as the sum between FPR (x) and False negative rate (FNR = 1 - y). That is, Em is the L1 metric (d1) computed between Ps and PGS, i.e., d1(PGS, Ps) = |x + 1 - y|. Then, the closer Ps and PGS are with respect to d1, the better the clustering solution with respect to Em.

It is worth pointing out that Ps gives a measure of the agglomerative and divisive tendency of a generic clustering algorithm. Indeed, the greater the x value, the more divisive the clustering algorithm is. Analogously, the smaller the y value, the more agglomerative the clustering algorithm is. Indeed, we can actually devise an index that measures such a tendency.

Let Eb be the Balancing error rate defined as the measures of how much FPR and FNR are unbalanced. The BMI for a generic clustering solution is:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M17">View MathML</a>

(8)

where the weights α and β allow to tune the importance between balance and misclassification.

Among all the possible weight combinations, a natural choice for BMI is to set α and β in order to take into an equal account the misclassification error rate Em and the balancing error rate Eb. This corresponds to the setting α = β = 0.5, and it is of interest and relevance here to notice that in this case BMI corresponds to d2(PGS,Ps). That is, the L2 (Euclidean) metric between the points PGS and Ps. This means that the closer Ps and PGS are with respect to d2, the better the clustering solution, in equal measure (α = β = 0.5) between misclassification error rate Em and balancing error rate Eb.

Operationally, once fixed α = β = 0.5, if one wants to compute the BMI of a clustering algorithm producing a clustering solution with x = FPR and y = TPR, respectively, one needs only to compute the Euclidean distance between the points Ps and PGS in the ROC plane. It is obvious that such a technique can also be used to compare the performance of several clustering solutions by considering the Euclidean distances between the associated points into the ROC plane and PGS.

A procedure to compare distance functions and clustering algorithms via ROC analysis

We recall from [9] that starting from a distance matrix D and a gold solution GS, it is possible to derive a ROC curve into the ROC plane, as we now briefly outline. Given a threshold value ∈ [0,1] and the distance matrix D, let Ibe a matrix in which each entry is defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/S1/S6/mathml/M12">View MathML</a>

(9)

Therefore, considering all the points corresponding to different threshold values, we obtain the ROC curve for the distance function d. If each point in the ROC curve corresponded to a proper partition of the items, i.e., a clustering solution, we could use it to address point (C) (see Evaluating the performance of distance function via the BMI index and the CROC curve). Unfortunately, that is not the case, as we now argue. In fact, Icannot be considered as a connectivity matrix of a dataset partition since it could not satisfy the transitive property (see formula 1). This issue always occurs when three generic items xi, xj, xk lie on a straight line at distances D(i, j) = D(j, k) = , involving I(i, j) = 1, I(j, k) = 1 but I(i, k) = 0. Therefore, in order to properly compare a distance function with a clustering algorithm, via ROC analysis, we need to convert the matrix Iinto a matrix Jrepresenting a connectivity matrix of a clustering solution. This can be done in several ways: here we have adopted an approach based on the connected components induced by the matrix I. Intuitively, the process is the following: if Idoes not correspond to a partition, i.e., at least two sets a and b have non-empty intersection, then they are merged into a new set c = a b. This allows to transform the ROC curve associated to a distance function into a new curve in which each point corresponds to a proper clustering solution. We refer to this curve as the CROC of a distance. Recall from [11] that the AUC represents also the probability that a random pair of elements belonging to different classes will be correctly ranked. By analogy, in our case, the AUC under the CROC represents the probability of the following event: two couples (x, x′) and (y, z) such that x, x′ belong to the same cluster while y, z to different clusters satisfy the relation d(y, z) >d(x, x′), where d represents a generic distance. Using the CROC curve, one can find the best clustering solution associated to a distance function with respect to BMI, as the closest point P to PGS into the CROC curve (see the green dot marker in Figures 1 - 3 as an example). Note that the value of the area under the CROC increases while the BMI of the best clustering solution decreases, thus we expect a low value of the BMI for a distance which has a good intrinsic discrimination ability. One can now fairly compare a distance function and a clustering solution produced by an algorithm, in terms of their classification ability:

1. Compute the ROC curve for a distance function d.

2. Calculate the CROC curve starting from the ROC curve computed in the previous point.

3. Find the best point into the CROC curve, i.e., the point with the lowest value of BMI, and mark it.

4. Map one or more clustering solutions in the ROC plane (as described in subsection Clustering solutions, ROC plane and the BMI index) and mark the corresponding points.

5. Rank the performance of each marked points in the ROC plane, as described in subsection Clustering solutions, ROC plane and the BMI index.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors participated in the design of the methods and of the related experimental methodology. LP and FU implemented all of the algorithms and performed the experiments. RG and GL coordinated the research and wrote the report. All authors have read and approved the manuscript.

Declarations

The publication costs for this article were funded by the corresponding author's institution

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1 webcite.

References

  1. Stanford Microarray DataBase [http://smd.stanford.edu/] webcite

  2. D'haeseleer P: How does gene expression cluster work.

    Nat Biotechnol 2005, 23:1499-1501. PubMed Abstract | Publisher Full Text OpenURL

  3. Speed TP: Statistical analysis of gene expression microarray data. Chapman & Hall/CRC; 2003. OpenURL

  4. Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis.

    Bioinformatics 2005, 21:3201-3212. PubMed Abstract | Publisher Full Text OpenURL

  5. Mehta T, Tanik M, Allison D: Towards sound epistemological foundations of statistical methods for high-dimensional biology.

    Nature genetics 2004, 36:943-947. PubMed Abstract | Publisher Full Text OpenURL

  6. Freyhult E, Landfors M, Önskog J, Hvidsten T, Rydén P: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering.

    BMC Bioinformatics 2010, 11:503. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  7. Giancarlo R, Scaturro D, Utro F: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer.

    BMC Bioinformatics 2008, 9:462. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  8. Giancarlo R, Utro F: Speeding up the Consensus Clustering methodology for microarray data analysis.

    Algorithms for Molecular Biology 2011, 6:1. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  9. Giancarlo R, Lo Bosco G, Pinello L: Distance functions, clustering algorithms and microarray data analysis.

    Lecture Notes in Computer Science, Volume 6073 2010. OpenURL

  10. Priness I, Maimon O, Ben-Gal I: Evaluation of gene-expression clustering via mutual information distance measure.

    BMC Bioinformatics 2007, 8:1-12. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. JA Hanley BM: The meaning and use of the area under a receiver operating characteristic (ROC) curve.

    Radiology 1982, 143:29-36. PubMed Abstract | Publisher Full Text OpenURL

  12. Jain A, Dubes R: Algorithms for Clustering Data. Englewood Cliffs: Prentice-Hall 1988; OpenURL

  13. Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset.

    Genome Biology 2002, 3:RESEARCH0036. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Di Gesú V, Giancarlo R, Lo Bosco G, Raimondi A, Scaturro D: Genclust: a genetic algorithm for clustering gene expression data.

    BMC Bioinformatics 2005, 6:289. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  15. Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.

    Machine Learning 2003, 52:91-118. Publisher Full Text OpenURL

  16. Wen X, Fuhrman S, Michaels GS, Carr GS, Smith DB, Barker JL, Somogyi R: Large scale temporal gene expression mapping of central nervous system development.

    Proc Natl Acad Sci USA 1998, 95:334-339. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data.

    Bioinformatics 2001, 17:309-318. PubMed Abstract | Publisher Full Text OpenURL

  18. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeeck M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

    Science 1999, 286:531-537. PubMed Abstract | Publisher Full Text OpenURL

  19. Brunet JP, Tamayo P, Golub T, Mesirov J: Metagenes and molecular pattern discovery using matrix factorization.

    Proc Natl Acad Sci USA 2004, 101:4164-4169. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson JJ, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.

    Nature 2000, 403:503-511. PubMed Abstract | Publisher Full Text OpenURL

  21. NCI 60 cancer microarray project [http://genome-www.stanford.edu/NCI60] webcite

  22. Su A, Cooke M, Ching K, Hakak Y, Walker J, Wiltshire T, Orth A, Vega R, Sapinoso L, Moqrich A, Patapoutian A, Hampton G, Schultz P, Hogenesch J: Large-scale analysis of the human and mouse transcriptomes.

    Proc Natl Acad Sci USA 2002, 99:4465-4470. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  23. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

    Mol Biol Cell 1998, 9:3273-3297. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Shamir R, Sharan R: Algorithmic approaches to clustering gene expression data.

    In Current Topics in Computational Biology Edited by Jiang T, Smith T, Xu Y, Zhang MQ, Cambridge, Ma.: MIT Press. 2003, 120-161. OpenURL

  25. Cover TM, Thomas JA: Elements of Information Theory. New York City: Wiley-Interscience; 1991. OpenURL

  26. Jain AK, Murty MN, Flynn PJ: Data clustering: a review.

    ACM Computing Surveys 1999, 31:264-323. Publisher Full Text OpenURL

  27. Giancarlo R, Scaturro D, Utro F: A tutorial on computational cluster analysis with applications to pattern discovery in microarray data.

    Mathematics in Computer Science 2008, 1:655-672. Publisher Full Text OpenURL

  28. Giancarlo R, Utro F: Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis.

    Theoretical Computer Science 2012, 428:58-79. OpenURL

  29. Xu Y, Olman V, Xu D: Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning tree.

    Bioinformatics 2002, 18:536-545. PubMed Abstract | Publisher Full Text OpenURL

  30. Giancarlo R, Lo Bosco G, Pinello L, Utro F: The three steps of clustering in the post-genomic era: a synopsis. In Computational Intelligence Methods for Bioinformatics and Biostatistics, Volume 6685 of Lecture Notes in Computer Science. Edited by Rizzo R, Lisboa P. Springer Berlin/Heidelberg; 2011:13-30. OpenURL

  31. BWSilverman: Density Estimation for Statistics and Data Analysis (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Chapman and Hall/CRC; 1986. PubMed Abstract | Publisher Full Text OpenURL

  32. Yeung KY: Cluster analysis of gene expression data. PhD thesis. University of Washington; 2001. OpenURL