Dipartimento di Matematica ed Informatica, Università di Palermo, Via Archirafi 34, 90123 Palermo, Italy
Department of Biostatistics at Dana-Farber Cancer Institute and Harvard School of Public Health, 44 Binney Street, Boston, Massachusetts 02115, USA
Computational Genomics Group, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598, USA
Abstract
Background
Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from statistics to computer science. Following Handl
Results
A procedure is proposed for the assessment of the discriminative ability of a distance function, that is, of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as BMI.
Conclusions
The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance, Pearson correlation and Mutual Information.
Background
Recently, medical and biological research has been deeply influenced by the advent of high-throughput technologies such as microarrays and RNA-seq platforms. They enable the acquisition of data that are fundamental for research in several areas of the biological sciences, such as understanding biological systems and diagnosis (e.g.
In this paper, we address point (1) by introducing a new qualitative and quantitative method to describe and assess the discriminative ability of a distance function alone and in conjunction with a clustering algorithm. Moreover, the methodology is also able to give indications about the bias of clustering algorithms with respect to distances. It is worth recalling that very little is known about this latter point, one of the difficulties being a fair comparison between the performance of a distance function and a clustering algorithm measured in terms of their classification ability. This point is discussed in detail in the
Results and discussion
Experimental setup
Datasets
Technically speaking, a
Each dataset is a matrix, in which each row corresponds to an element to be clustered and each column to an experimental condition. The nine datasets, together with the acronyms used in this paper, are reported next. For conciseness, we mention only some relevant facts about them. The interested reader can find additional information in Dudoit and Fridlyand
CNS Rat: It is a 112 × 17 data matrix, obtained from the expression levels of 112 genes during a rat's central nervous system development. The dataset was studied by Wen
Gaussian3: It is a 60 × 600 data matrix. It is generated by having 200 distinctive features out of the 600 assigned to each cluster. There is a partition into three classes and that is taken as the gold solution. The data simulates a pattern whereby a distinct set of 200 genes is upregulated in one of the three clusters, and downregulated in the remaining two.
Gaussian5: It is a 500 × 2 data matrix. It represents the union of observations from 5 bivariate Gaussians, 4 of which are centered at the corners of the square of side length λ, with the 5th Gaussian centered at (λ/2, λ/2). A total of 250 samples, 50 per class, were generated, where two values of λ are used, namely, λ = 2 and λ = 3, to investigate different levels of overlapping between clusters. There is a partition into five classes and that is taken as the gold solution.
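A minimal sketch of this generation scheme, assuming unit-variance Gaussians and using the 50-samples-per-class figure given above; the random seed and the exact variance are illustrative assumptions, not the authors' original script:

```python
import numpy as np

def gaussian5(side=2.0, per_class=50, seed=0):
    """Union of 5 bivariate Gaussians: 4 at the corners of a square of
    side length `side`, the 5th at its centre (side/2, side/2)."""
    rng = np.random.default_rng(seed)
    centers = [(0.0, 0.0), (side, 0.0), (0.0, side), (side, side),
               (side / 2, side / 2)]
    data, labels = [], []
    for k, c in enumerate(centers):
        # unit-variance samples around each centre (an assumption here)
        data.append(rng.normal(loc=c, scale=1.0, size=(per_class, 2)))
        labels += [k] * per_class
    return np.vstack(data), np.array(labels)

X, y = gaussian5(side=2.0)
```

Running the generator with side length 2 versus 3 changes only the centre spacing, which is how the two overlap levels described above are obtained.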
Leukemia: It is a 38 × 100 data matrix, where each row corresponds to a patient with acute leukemia and each column to a gene. The original microarray experiment consists of a 72 × 6817 matrix, due to Golub
Lymphoma: It is a 80 × 100 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Alizadeh
NCI60: It is a 57 × 200 data matrix, where each row corresponds to a cell line and each column to a gene. This dataset originates from a microarray study in gene expression variation among the sixty cell lines of the National Cancer Institute anticancer drug screen
Novartis: It is a 103 × 1000 data matrix, where each row corresponds to a tissue sample and each column to a gene. The dataset comes from the study of Su
Simulated6: It is a 60 × 600 data matrix, i.e., a 600-gene by 60-sample dataset. It can be partitioned into 6 classes with 8, 12, 10, 15, 5, and 10 samples respectively, each marked by 50 distinct genes uniquely upregulated for that class. In addition, a list of 300 noise genes (i.e., genes having the same distribution within all clusters) is included. In particular, such genes are generated with decreasing differential expression and increasing variation, following the same distribution. Finally, the first block of 50 genes of the list is assigned to cluster 1, the second block to cluster 2, and so on. This partition into 6 classes is taken as the gold solution.
Yeast: It is a 698 × 72 data matrix, studied by Spellman
Distances
Let
1.
2.
3.
In the case of microarray data,
In what follows, we refer to distance and dissimilarity functions with the generic term distance functions.
Algorithms and hardware
In our experiments, we have chosen K-means among
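For concreteness, a minimal Lloyd-style K-means in the Euclidean case is sketched below; the deterministic initialisation and the stopping rule are illustrative assumptions, not the setup actually used in the experiments.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd iteration for K-means with Euclidean distance."""
    # simple deterministic initialisation (for reproducibility of the sketch)
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(n_iter):
        # assign each row to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Any distance function can in principle replace the Euclidean norm in the assignment step, which is precisely why the choice of distance matters for the clustering results studied here.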
Evaluating the performance of distance functions via the BMI index and the CROC curve
In order to shed light on the proper choice of a distance function for clustering of microarray data, one needs to address the following points:
(A) Assessment of the intrinsic separation ability of a distance. That is, how well a distance discriminates independently of its use within a clustering algorithm.
(B) Assessment of the predictive clustering algorithm ability of a distance. That is, which distance function grants the best performance when used within a clustering algorithm.
(C) The interplay between (A) and (B).
Points (A) and (B) have been studied before (see
Note that an important property of the connectivity matrix C is transitivity, i.e. ∀ i, j, k: C(i, j) = 1 and C(j, k) = 1 imply C(i, k) = 1.
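The connectivity matrix of a partition, and the transitivity property just mentioned, can be sketched as follows (a minimal illustration, not tied to any of the datasets above):

```python
import numpy as np

def connectivity_matrix(labels):
    """C[i, j] = 1 iff items i and j are assigned to the same cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

C = connectivity_matrix([0, 0, 1, 1, 1])
# Transitivity check: wherever the boolean product C @ C is positive
# (a path i -> j -> k exists), the direct entry must already be 1.
assert np.all(C[(C @ C) > 0] == 1)
```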
A ROC plane is a plane in which the x-axis reports the false positive rate and the y-axis the true positive rate of a binary classification.
We address point (C) by:
(C.1) showing how to map a clustering solution into the ROC plane (see subsection
(C.2) introducing a distance between a clustering solution and GS (see subsection
(C.3) showing how (C.1) and (C.2) can be used to fairly compare the intrinsic ability of distance functions and of clustering algorithms to identify "structure" in a dataset (see subsection
The
Results
The
Values in Table
BMI values

Dataset      Euclidean   Pearson   Mutual Information
CNS Rat      0.6804      0.6875    0.6692
Gaussian3    0.7170      0         0.7102
Gaussian5    0.2358      0.5424    n/a
Leukemia     0.3498      0.2559    0.3000
Lymphoma     0.3509      0.3385    0.7028
NCI60        0.4699      0.4699    0.5643
Novartis     0.4260      0.4240    0.4183
Simulated6   0.5022      0.8150    0.7456
Yeast        0.6647      0.6750    0.6677
The
Figures
CROC Euclidean
CROC Euclidean. The CROC curve and plot of the clustering solutions for each dataset in the case of the Euclidean distance. Each subfigure refers to a dataset. The markers show
CROC Pearson
CROC Pearson. The CROC curve and plot of the clustering solutions for each dataset in the case of Pearson Correlation. Each subplot refers to a dataset. The markers show
CROC Mutual Information
CROC Mutual Information. The CROC curve and plot of the clustering solutions for each dataset in the case of Mutual Information. Each subplot refers to a dataset. The markers show
The
Dataset      Euclidean   Pearson   Mutual Information
CNS Rat      0.1397      0.2682    0.3476
Gaussian3    0.2207      0.997     0.3336
Gaussian5    0.9918      1         n/a
Leukemia     0.9512      0.9830    0.9754
Lymphoma     0.9498      0.9465    0.3409
NCI60        0.6241      0.6060    0.6485
Novartis     0.8998      0.8787    0.8750
Simulated6   0.9249      0.9800    0.4720
Yeast        0.5121      0.5106    0.6246
The Pearson correlation between the
Dataset      Euclidean   Pearson   Mutual Information
CNS Rat      0.4590      0.1684    0.5910
Gaussian3    0.5031      0.8714    0.5371
Gaussian5    0.5518      1         n/a
Leukemia     0.8155      0.8246    0.8068
Lymphoma     0.6329      0.5915    0.5896
NCI60        0.86139     0.8533    0.8529
Novartis     0.793199    0.7194    0.8283
Simulated6   0.9419      0.9373    0.4966
Yeast        0.4808      0.3151    0.5448
The Pearson correlation between the
Dataset      Euclidean   Pearson   Mutual Information
CNS Rat      0.3408      0.0507    0.5335
Gaussian3    0.4027      0.91840   0.4383
Gaussian5    0.6297      1         n/a
Leukemia     0.8701      0.8628    0.8453
Lymphoma     0.6969      0.6624    0.5338
NCI60        0.6801      0.6429    0.7264
Novartis     0.8194      0.7661    0.8230
Simulated6   0.9280      0.9255    0.4584
Yeast        0.4183      0.2481    0.4887
The Pearson correlation between the
Conclusions
In this paper we have presented a procedure to assess the discriminative ability of a distance for data clustering. Such a procedure is based on the
Methods
Definition of distance functions
We now formally define the distances used in this paper.
The
where
The
where
The
where
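As an illustration only, the three distance families can be sketched as below. The Pearson and Mutual Information versions use common textbook formulations (1 − r, and a histogram MI estimate turned into a dissimilarity); the normalisation details of the paper's exact definitions may differ.

```python
import numpy as np

def euclidean(x, y):
    """Standard Euclidean distance between two expression profiles."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def pearson_dissimilarity(x, y):
    """1 - r: zero for perfectly positively correlated profiles."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

def mi_dissimilarity(x, y, bins=10):
    """Histogram estimate of mutual information, normalised by the joint
    entropy and turned into a dissimilarity in [0, 1].  The bin count and
    the normalisation are illustrative choices."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hxy = -np.sum(pxy[nz] * np.log(pxy[nz]))
    return 1.0 - (mi / hxy if hxy > 0 else 0.0)
```

Note how the three functions react differently to a linear rescaling of a profile: the Euclidean distance changes, while the Pearson and MI dissimilarities are essentially invariant, which is one reason the choice among them matters for expression data.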
Definition of external indices
Recall from
An external index measures the level of agreement of the two partitions. External indices are usually defined via a
For our experiment we have used the
where
Note that there is a slight difference in the range of values of the three indices: while the
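For illustration, here is a classical pair-counting external index (the Rand index) computed directly from the pairwise agreement counts. It is shown only as an example of the construction and is not claimed to be one of the three indices used in the experiments.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree: both put the
    pair in the same cluster, or both put it in different clusters."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
        total += 1
    return agree / total
```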
The BMI index and the CROC curve
The ROC plane can be used to estimate the similarity between a reference partition and a generic one as follows. The reference partition is mapped to the point (0, 1) in the ROC plane, corresponding to perfect classification. Analogously, the generic partition is mapped to a point in the ROC plane, depending on the number of "misclassified" elements with respect to the reference partition. Then, a distance measure between such a point and (0, 1) gives an indication about the similarity of the partitions. The
Clustering solutions, ROC plane and the BMI index
Given a gold solution GS, it is possible to map a clustering solution
1. Compute the connectivity matrix
2. Starting from
3. Use that confusion matrix to compute
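The three steps above can be sketched as follows (variable names are illustrative): build the connectivity matrices of the gold solution and of the clustering solution, derive the pairwise confusion counts, and map the solution to a (FPR, TPR) point in the ROC plane.

```python
import numpy as np

def roc_point(gold_labels, cluster_labels):
    """Map a clustering solution to a point (FPR, TPR) in the ROC plane,
    using the gold solution GS as the reference partition."""
    g = np.asarray(gold_labels)
    c = np.asarray(cluster_labels)
    G = g[:, None] == g[None, :]          # connectivity matrix of GS
    C = c[:, None] == c[None, :]          # connectivity matrix of the solution
    iu = np.triu_indices(len(g), k=1)     # count each unordered pair once
    G, C = G[iu], C[iu]
    tp = np.sum(C & G)                    # pairs correctly put together
    fp = np.sum(C & ~G)                   # pairs wrongly put together
    fn = np.sum(~C & G)                   # pairs wrongly separated
    tn = np.sum(~C & ~G)                  # pairs correctly separated
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr
```

A perfect solution lands exactly at (0, 1), while the degenerate one-cluster solution lands at (1, 1), which is what makes the ROC plane a convenient common ground for comparing solutions.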
A few remarks are in order. The above approach naturally leads to measuring a clustering solution in terms of
Given a clustering solution
The performance of
It is worth pointing out that
Let
where the weights
Among all the possible weight combinations, a natural choice for
Operationally, once fixed
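A hedged sketch of one plausible instantiation of such a weighted distance to the perfect-classification corner (0, 1) of the ROC plane; the functional form and the equal-weight default below are assumptions made for illustration, not the paper's exact BMI definition.

```python
def weighted_roc_distance(fpr, tpr, w_fp=0.5, w_fn=0.5):
    """Weighted penalty of a ROC-plane point (fpr, tpr) with respect to the
    perfect point (0, 1): a combination of the false positive rate and the
    false negative rate (1 - tpr).  Equal weights are an assumed default."""
    return w_fp * fpr + w_fn * (1.0 - tpr)
```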
A procedure to compare distance functions and clustering algorithms via ROC analysis
We recall from
Therefore, considering all the points corresponding to different threshold values, we obtain the ROC curve for the distance function
1. Compute the ROC curve for a distance function
2. Calculate the CROC curve starting from the ROC curve computed in the previous point.
3. Find the best point on the CROC curve, i.e., the point with the lowest value of
4. Map one or more clustering solutions in the ROC plane (as described in subsection
5. Rank the performance of each marked point in the ROC plane, as described in subsection
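Step 1 of the procedure above can be sketched as follows (the threshold grid and variable names are illustrative): each threshold on the pairwise distance matrix induces a 0/1 "same cluster" decision per pair, which is compared against the gold solution to give one (FPR, TPR) point of the distance function's ROC curve.

```python
import numpy as np

def distance_roc_curve(D, gold_labels, n_thresholds=50):
    """ROC curve of a distance function: threshold the pairwise distance
    matrix D at increasing values and score each induced pairwise
    classification against the gold solution."""
    g = np.asarray(gold_labels)
    iu = np.triu_indices(len(g), k=1)
    d = D[iu]                                  # each unordered pair once
    same_gold = (g[:, None] == g[None, :])[iu]
    curve = []
    for t in np.linspace(d.min(), d.max(), n_thresholds):
        pred_same = d <= t                     # "close enough" => same cluster
        tp = np.sum(pred_same & same_gold)
        fp = np.sum(pred_same & ~same_gold)
        fn = np.sum(~pred_same & same_gold)
        tn = np.sum(~pred_same & ~same_gold)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        curve.append((fpr, tpr))
    return curve
```

A distance that separates the gold classes well produces thresholds whose points hug the (0, 1) corner, which is exactly the intrinsic separation ability that point (A) asks about.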
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors participated in the design of the methods and of the related experimental methodology. LP and FU implemented all of the algorithms and performed the experiments. RG and GL coordinated the research and wrote the report. All authors have read and approved the manuscript.
Declarations
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of