Department of Industrial Engineering, Tel Aviv University, Israel

Abstract

Background

The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare clustering solutions obtained with the Mutual Information (MI) measure versus those obtained with the well-known Euclidean distance and Pearson correlation coefficient.

Results

Relying on several public gene expression datasets, we evaluate the homogeneity and separation scores of different clustering solutions. It was found that the use of the MI measure yields a more significant differentiation among erroneous clustering solutions. The proposed measure was also used to analyze the performance of several known clustering algorithms. A comparative study of these algorithms reveals that their "best solutions" are ranked almost oppositely when using different distance measures, despite the observed correspondence between these measures when analyzing the averaged scores of groups of solutions.

Conclusion

In view of the results, further attention should be paid to the selection of a proper distance measure for analyzing the clustering of gene expression data.

Background

In recent years, DNA microarray technology has become a vital scientific tool for global analysis of genes and their networks. The new technology allows simultaneous profiling of the expression levels of thousands of genes in a single experiment. At the same time, the successful implementation of microarray technology has required new methods for analyzing such large-scale datasets. Clustering is a central analysis method for gene-expression data that has been implemented extensively in various works and applications

Many clustering algorithms depend heavily on 'similarity' or 'distance' measures (although not necessarily a distance function that satisfies all mathematical conditions of a metric) that quantify the degree of association between expression profiles. The definition of the distance measure is a key factor for a successful identification of the relationships between genes and networks

Despite the crucial influence of the similarity measure upon the clustering results, relatively few publications in the bioinformatics literature address this subject. Many publications focus on efforts to optimize and justify the implemented biological processes and the clustering algorithms, while the similarity measures are often selected by default

In addition to the Euclidean distance, another widely used measure for analyzing and clustering gene expression data is the Pearson correlation coefficient

Within the large body of research on gene expression clustering, there are few publications that systematically explore the appropriateness of chosen similarity measures. Herzel and Grosse (1995)

Most of the above papers, with the exception of Daub et al (2004)

This paper proposes a procedure to evaluate the MI between gene expression patterns. Then, using several public gene expression datasets, it compares the MI measure against both the Euclidean distance and the Pearson correlation. The comparison includes a consistency examination over clustering solutions of different quality in terms of the number of errors. The evaluation is carried out using normalized homogeneity and separation scores that provide a uniform scale for the examination. The results of the first experiment clearly show that the MI outperforms the conventional measures by yielding a more significant differentiation among clustering solutions. Next, the paper employs the MI measure to evaluate the solutions of four recognized clustering algorithms over a yeast cell-cycle database

The remainder of the paper is organized as follows. The Results section describes two experiments: the first compares the robustness of the distance measures, and the second evaluates the solutions of known clustering algorithms by both the MI-based scores and the Pearson-correlation-based scores. The Discussion and Conclusion sections follow the Results section. The Methods section addresses the compared distance measures and their implementation in clustering; the assessment of the quality of the clustering solutions; and the compared clustering algorithms.

Results

Experiment 1: Robustness of compared distance measures

The underlying idea in this experiment was to evaluate the performance of the three distance measures based on clustering solutions with a known number of clustering errors. Given a dataset with a true two-cluster solution, we generated several erroneous solutions, each having a different number of errors. Clustering errors were generated by transferring samples from their true cluster to the erroneous one.

In the next stage, the generated clustering solutions were grouped by their quality level, i.e., by the number of errors with respect to the true solution. The average homogeneity and separation scores were calculated for each group based on each of the three similarity measures. Finally, the "robustness" of each similarity measure was defined and evaluated according to conformance with the following two criteria.
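The error-generation step above can be sketched in a few lines of Python (a minimal illustration; the function name and the cluster sizes are our own, not taken from the paper):

```python
import random

def make_erroneous_solution(true_labels, n_errors, n_clusters=2, seed=None):
    """Create a clustering solution with exactly n_errors errors by
    transferring randomly chosen samples from their true cluster to a
    wrong one."""
    rng = random.Random(seed)
    labels = list(true_labels)
    for i in rng.sample(range(len(labels)), n_errors):
        wrong_clusters = [c for c in range(n_clusters) if c != labels[i]]
        labels[i] = rng.choice(wrong_clusters)
    return labels

# Hypothetical true two-cluster solution, e.g. 28 lung vs. 23 colon samples:
true_solution = [0] * 28 + [1] * 23
erroneous = make_erroneous_solution(true_solution, n_errors=10, seed=1)
n_errors = sum(t != s for t, s in zip(true_solution, erroneous))  # exactly 10
```

Generating many such solutions for each error count yields the groups of quality levels described next.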

• A monotonic relationship between the obtained scores and the quality of clustering solutions. The smaller the number of errors in a solution, the better its homogeneity and separation scores should be, and vice versa.

• Statistically-significant differentiation between clustering solutions of different quality levels. The average homogeneity and separation scores for each similarity measure and for each group were evaluated empirically from the experiments. Accordingly, it is expected that scores of groups of different quality will differ significantly from each other. Such a differentiation assures that the scores of high-quality clustering solutions will not get "mixed up" with the scores of low-quality clustering solutions; hence our use of the term "robustness".

The first criterion is mainly affected by the trend of the averaged scores of clustering solutions as their quality changes. The second criterion is more rigorous in a sense, and tries to establish a statistically-significant differentiation between groups of clustering solutions based on the scores' mean values and the scores' standard deviations. In fact, a similarity measure that complies with the second criterion guarantees a high power of the statistical test. In our case, this criterion decreases the Type II statistical error, i.e., the probability of accepting the false null hypothesis that two clustering solutions of different quality belong to the same quality group.

Datasets

In this part of the experiment we used four public gene-expression datasets that are listed in Table

Datasets used for Experiment 1

| # | No. of tissues (samples) | No. of genes | Min H. ratio | Min S. ratio |
|---|--------------------------|--------------|--------------|--------------|
| 1 | 28 lung cancer / 23 colon^{1} cancer | 1 K | 1.47 | 1.65 |
| 2 | 26 breast cancer / 28 lung cancer | 1 K | 1.42 | 1.83 |
| 3 | 26 breast cancer / 23 colon^{1} cancer | 1 K | 1.12 | 1.32 |
| 4 | 40 colon^{2} cancer / 22 normal colon | 2 K | 1.02 | 0.93 |

List of the datasets used in Experiment 1, including their sizes. The last two columns give the ratio of the homogeneity (H) and separation (S) Z-scores between the MI measure and the best score among the Pearson correlation and the Euclidean distance. These ratios are calculated based on 10-error clustering solutions.

Experimental results

The average homogeneity and separation scores for each group were normalized to provide a uniform scale for the comparison of the three similarity measures. The normalization of the average scores was calculated with respect to the mean and the variance values of the group of single-error solutions. Thus, the obtained Z-scores reflect the difference in the average values in terms of the number of standard deviations of the single-error group of solutions. Note that the homogeneity and separation scores of the single-error groups of solutions were approximately normally distributed for almost all the distance measures and all the datasets. For example, in Figure
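The Z-score normalization described above can be sketched as follows (the grouped scores are hypothetical; the single-error group serves as the reference):

```python
import statistics

def z_scores_by_group(group_scores, reference_group=1):
    """Normalize each quality group's average score by the mean and
    standard deviation of the reference (single-error) group."""
    ref = group_scores[reference_group]
    mu, sigma = statistics.mean(ref), statistics.stdev(ref)
    return {g: (statistics.mean(s) - mu) / sigma
            for g, s in group_scores.items()}

# Hypothetical homogeneity scores keyed by number of clustering errors:
scores = {1: [0.52, 0.50, 0.54, 0.48],
          5: [0.42, 0.40, 0.44],
          10: [0.30, 0.28, 0.32]}
z = z_scores_by_group(scores)   # z[1] == 0; z[5] and z[10] are negative
```

The resulting Z-scores express how far each group's average lies from the single-error group, in units of that group's standard deviation.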

Normal-shaped frequency distribution of MI-based separation scores

**Normal-shaped frequency distribution of MI-based separation scores**. The approximately normal frequency distribution of the MI-based separation scores of the single-error group of solutions for dataset 1, which contains 1000 genes from 51 sampled tissues.

In the first experimental stage we found that all three similarity measures fully meet our first criterion of "robustness". Namely, for all the distance measures and for all the datasets, the scores demonstrated a consistent capability to evaluate the quality of the clustering solutions. The results in Figures

Homogeneity Z-scores for dataset 1

**Homogeneity Z-scores for dataset 1**. Normalized homogeneity Z-scores of clustering solutions with different numbers of errors based on the Pearson correlation, the Euclidean distance and the MI measures for dataset 1 that contains 1000 genes from 51 sampled tissues.

Separation Z-scores for dataset 1

**Separation Z-scores for dataset 1**. Normalized separation Z-scores of clustering solutions with different numbers of errors based on the Pearson correlation, the Euclidean distance and the MI measures for dataset 1 that contains 1000 genes from 51 sampled tissues.

Homogeneity Z-scores for dataset 2

**Homogeneity Z-scores for dataset 2**. Normalized homogeneity Z-scores of clustering solutions with different numbers of errors based on the Pearson correlation, the Euclidean distance and the MI measures for dataset 2 that contains 1000 genes from 54 sampled tissues.

Separation Z-scores for dataset 2

**Separation Z-scores for dataset 2**. Normalized separation Z-scores of clustering solutions with different numbers of errors based on the Pearson correlation, the Euclidean distance and the MI measures for dataset 2 that contains 1000 genes from 54 sampled tissues.

In terms of the second criterion, the results from all four datasets (two datasets are depicted in Figures

Despite the intrinsic tradeoff between the homogeneity and the separation scores, the MI-based scores clearly dominate the other distance measures for all numbers of errors. Similar results were obtained for all datasets (see additional file

Used software and parameters for comparison of clustering algorithms. The file contains the following two sections. Section 1: A comparison study of the


Experiment 2: Comparison of known clustering algorithms by the MI measure

In the above experiment we compared the robustness of the MI measure to that of the Pearson correlation and the Euclidean distance. Within the examined datasets, it was found that the MI measure is statistically superior to the conventional measures in the detection of clustering errors. Note that the comparison has been performed with respect to known final clustering solutions

Evidently, the use of different clustering algorithms often results in different solutions. There is a large body of research that compares the performance of different clustering algorithms with respect to gene-expression levels (e.g.,

This experiment compares four clustering algorithms that have been widely applied to gene-expression patterns.

The four compared algorithms are the sIB, the Click, the K-means and the SOM algorithms.

The Yeast cell-cycle dataset

The study of the algorithms is based on their clustering solutions over the known dataset of yeast cell cycle

Using periodicity and correlation algorithms (e.g., Pearson correlation), the authors identified 800 genes as being periodically regulated. Note that although the yeast cell-cycle dataset has been extensively analyzed previously, its "correct" clustering is unknown. Spellman et al. (1998)

The preparation of the dataset was performed in a similar manner to Gat-Viks et al. (2003)

Experimental results

The compared algorithms were tested with respect to the yeast cell-cycle dataset to obtain their best clustering solutions with 5, 6 and 7 clusters. The comparison between the algorithms was performed between solutions with the same number of clusters. Clustering solutions were graded by the MI-based homogeneity and separation scores (see Methods section). This grading method is conventionally used to determine the quality of a clustering solution when the true solution is unknown

The comparison results for 5 and 7 clusters are given in Figure

MI-based scores of clustering solutions with 5 clusters

**MI-based scores of clustering solutions with 5 clusters**. Efficiency frontiers for solutions with 5 clusters, obtained by the sIB, Click, K-means and SOM algorithms.

MI-based scores of clustering solutions with 7 clusters

**MI-based scores of clustering solutions with 7 clusters**. Efficiency frontiers for solutions with 7 clusters, obtained by the sIB, Click, K-means and SOM algorithms.

In general, we found that the sIB algorithm obtained the best MI-based scores.

When considering

Comparative analysis of the sIB solutions

| No. of clusters | Better H. scores | Better S. scores | Better H/S scores | Dominant solution |
|-----------------|------------------|------------------|-------------------|-------------------|
| 5 | 80% | 30% | 30% | Yes |
| 6 | 100% | 30% | 30% | Yes |
| 7 | 100% | 48% | N/A | Yes |

Comparative analysis of the clustering solutions obtained in Experiment 2 (not limited to the solutions on the efficiency frontier). The first column lists the number of clusters in the solutions. The second and third columns give, respectively, the percentages of the sIB solutions that obtain better MI-based homogeneity and MI-based separation scores with respect to the solutions of the other clustering algorithms (Click, K-means and SOM). The fourth column lists the percentages of the sIB solutions that obtain better combined MI-based homogeneity and separation scores with respect to the solutions of the other clustering algorithms. The last column indicates whether a sIB solution dominates all the other solutions.

Note that when scoring the different algorithms with respect to the Euclidean or the Pearson measures, the (relative) ranking can be totally different. For example, Figure

Pearson correlation based scores of solutions with 5 clusters

**Pearson correlation based scores of solutions with 5 clusters**. Efficiency frontiers for solutions with 5 clusters, obtained by the sIB, Click, K-means and SOM algorithms.

Pearson correlation based scores of solutions with 7 clusters

**Pearson correlation based scores of solutions with 7 clusters**. Efficiency frontiers for solutions with 7 clusters, obtained by the sIB, Click, K-means and SOM algorithms.

Discussion

This paper presents two related experiments. In the first experiment, which is based on known clustering solutions, we show the statistical superiority of the average MI-based measure independently of the selected clustering algorithm. In the second experiment, we show that the use of different distance measures can yield very different results when evaluating the solutions of known clustering algorithms. This is particularly true when looking at the score of the "best clustering solution" rather than the averaged score of a group of solutions, as we did in Experiment 1. This important fact is often overlooked in the literature. The essential question "which distance measure to use?" remains without a definitive answer, yet, in view of the first experiment, we propose to further investigate it by integrating the MI-based distance measure into known clustering algorithms.

The statistical superiority of the MI-based score in the first experiment can be attributed to several appealing properties of this measure, as elaborated in the Methods section. The first property is the use of MI as a generalized measure of correlation between variables, a property which seems valuable for gene expression data

We can indicate two possible reasons why the sIB algorithm obtained better MI-based scores.

The second possible reason for the superiority of the sIB algorithm lies in its underlying optimization procedure.

Conclusion

In this work we analyze the performance of the Mutual Information (MI) distance measure for clustering of gene-expression data. In comparison to the Pearson correlation and the Euclidean distance, the MI measure is known to have some advantages: it is a generalized measure of statistical dependence in the data, and it is reasonably immune to missing data and outliers. In this work we show that the average MI measure also yields a higher power of test among different clustering solutions; thus, this measure is potentially more robust for differentiating erroneous clustering solutions. A comparative study of known clustering algorithms reveals that their best solutions are ranked almost oppositely when using different distance measures. In view of these results, further attention should be paid to the selection of a proper distance measure for the evaluation of clustering of gene expression data. One future direction is the integration of the MI measure into known clustering algorithms. Another potential research direction is to implement the continuous MI estimation approach of Daub et al. (2004)

Methods

Similarity measures

This section discusses three similarity measures and their properties: the Euclidean distance, the Pearson correlation coefficient and the Mutual Information. Throughout, consider two expression patterns **x** = (x_{1},..., x_{n}) and **y** = (y_{1},..., y_{n}).

Euclidean Distance and Pearson Correlation

The Euclidean distance between two expression profiles **x** and **y** is given by

D(**x**, **y**) = [Σ_{i=1..n} (x_{i} - y_{i})^{2}]^{1/2}

It measures similarity according to positive linear correlation between expression profiles, which may identify similar or identical regulation

Numerous biological studies (e.g.,

The Pearson correlation coefficient between two expression patterns is given by

r(**x**, **y**) = Σ_{i=1..n} (x_{i} - x̄)(y_{i} - ȳ) / ([Σ_{i=1..n} (x_{i} - x̄)^{2}]^{1/2} [Σ_{i=1..n} (y_{i} - ȳ)^{2}]^{1/2})

where x̄ and ȳ denote the mean values of the patterns **x** and **y**, respectively.

The Pearson correlation reflects the degree of linear relationship between two patterns. It ranges from -1 to +1, where the extreme values reflect a perfect negative or positive linear relationship between the patterns, respectively. A zero correlation value implies that there is no linear relationship between the two patterns, yet it gives no indication regarding nonlinear relationships that might exist between the patterns.

The correlation coefficient is invariant under positive linear transformations (scaling and shifting) of the data. Accordingly, two expression profiles that have "identical" shapes with different magnitudes will obtain a correlation value of 1. The ability to measure (dis)similarities according to positive and negative correlations can help to identify control processes that antagonistically regulate downstream pathways
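The contrast between the two measures, including the scale-invariance of the correlation, can be illustrated with a short sketch (plain Python; the helper names are ours):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # identical shape, twice the magnitude
r = pearson(x, y)           # 1.0: the scale difference is invisible to Pearson
d = euclidean(x, y)         # positive: the scale difference dominates Euclidean
```

The two profiles are judged perfectly similar by the correlation but clearly dissimilar by the Euclidean distance.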

Gene expression measurements, like other empirical measurements, suffer from noise effects. Variations in the measurements might come from many sources: intrachip defects, variation within a single lot of chips, variation within an experiment, and biological variation for a particular gene

Both the Pearson correlation and the Euclidean distance require complete gene expression profiles as input. However, gene-expression microarray experiments often generate datasets with missing expression values. Therefore, another source of uncertainty when implementing these measures is the need to use methods for estimating missing data, such as

Mutual Information

Given two discrete random variables X and Y taking values x_{i} ∈ A_{x} and y_{j} ∈ A_{y}, with probability distribution functions p(x_{i}) ≡ Pr(X = x_{i}) and p(y_{j}) ≡ Pr(Y = y_{j}), the Mutual Information between two expression patterns, represented by the random variables X and Y, is defined as

MI(X; Y) = Σ_{i}Σ_{j} p(x_{i}, y_{j}) log [p(x_{i}, y_{j}) / (p(x_{i}) p(y_{j}))]

The MI is always non-negative. It equals zero if and only if the two random variables are statistically independent.
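A discrete MI estimate from two already-discretized patterns can be sketched as follows (our own minimal implementation, in bits):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete MI (in bits) estimated from the empirical joint
    distribution of two discretized expression patterns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), with probabilities as counts/n
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [1, 1, 0, 0, 1, 1, 0, 0]   # deterministic function of a -> MI = 1 bit
c = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of a under the empirical counts
mi_dependent = mutual_information(a, b)     # 1.0
mi_independent = mutual_information(a, c)   # 0.0
```

Note that the fully dependent pair reaches the marginal entropy (1 bit for a balanced binary pattern), while the independent pair scores zero.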

Another important feature of the MI is its robustness with respect to missing expression values. In fact, the MI can be estimated from datasets of different sizes. This is advantageous in analyzing expression datasets that contain a certain amount (up to 25%) of missing values

The MI between a pair of expression patterns is upper bounded by their marginal entropies. Accordingly, the MI measure exhibits a low value if the marginal entropies are low, even if the patterns are completely correlated. Therefore, there is a need to normalize the MI measure, giving a high score for highly correlated sequences, independent of their marginal entropies. There are several ways to carry out such normalization. Michaels et al. (1998)
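One common way to normalize (an assumption on our part, not necessarily the variant adopted here or by Michaels et al.) is to divide the MI by the larger of the two marginal entropies, which bounds the score in [0, 1]:

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy (in bits) of a discretized pattern."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def normalized_mi(xs, ys):
    """MI divided by the larger marginal entropy, so that perfectly
    correlated patterns score 1 regardless of their marginal entropies."""
    hx, hy = entropy(xs), entropy(ys)
    mi = hx + hy - entropy(list(zip(xs, ys)))   # MI = H(X) + H(Y) - H(X,Y)
    hmax = max(hx, hy)
    return mi / hmax if hmax > 0 else 0.0

a = [0, 0, 1, 1]
nmi_dep = normalized_mi(a, [1 - v for v in a])   # 1.0: fully dependent
nmi_ind = normalized_mi(a, [0, 1, 0, 1])         # 0.0: empirically independent
```

This variant assigns a high score to highly correlated sequences independently of their marginal entropies, as required above.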

As noted above, the use of the discrete form of the MI measure requires the discretization of the continuous expression values. The most straightforward and commonly used discretization technique is to use a histogram-based procedure

In particular, the number of bins can be bounded from below by B_{l} = ⌊1 + log_{2}(N)⌋ (Sturges' rule), where N is the number of observations, and from above by B_{u}.
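A histogram-based discretization with a Sturges-style bin count (our reading of the garbled formula above; treat the exact bin rule as an assumption) can be sketched as:

```python
import math

def discretize(profile, n_bins=None):
    """Equal-width histogram discretization of a continuous expression
    profile. By default the number of bins is floor(1 + log2(N))
    (Sturges' rule), where N is the number of observations."""
    n = len(profile)
    if n_bins is None:
        n_bins = math.floor(1 + math.log2(n))
    lo, hi = min(profile), max(profile)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant profile
    return [min(int((v - lo) / width), n_bins - 1) for v in profile]

profile = [0.1, 0.4, 0.35, 2.2, 1.8, 0.9, 1.1, 0.05]
bins = discretize(profile)   # 8 observations -> floor(1 + log2(8)) = 4 bins
```

The resulting integer bin labels are exactly the kind of discretized patterns the MI estimator above consumes.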

Assessment of clustering quality

The homogeneity and the separation functions are often used to determine the quality of a clustering solution when the true solution is unknown

Consider a set of N elements {e_{1}, e_{2}, ..., e_{N}} divided into k clusters. Denote by cl(e_{i}) the cluster of element e_{i} and by E(e_{i}) the expression pattern of element e_{i}. The homogeneity of a clustering solution is defined as the average similarity between each element and the centroid of its cluster:

H = (1/N) Σ_{i=1..N} S(E(e_{i}), ct(cl(e_{i})))

where S denotes the similarity measure. Denoting the clusters by C_{1},...,C_{k} and their centroids by ct_{1},...,ct_{k}, the separation is defined as the weighted average similarity between centroids of different clusters:

Sep = Σ_{i≠j} N_{i} N_{j} S(ct_{i}, ct_{j}) / Σ_{i≠j} N_{i} N_{j}

where N_{i}, N_{j} are the number of elements in clusters C_{i}, C_{j} respectively. The homogeneity and the separation are typically conflicting functions: usually the better the homogeneity of a solution, the worse its separation, and vice versa.
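Given a similarity function, the homogeneity and separation computations can be sketched as follows (the 1-D patterns and the toy similarity are our own illustrative assumptions):

```python
def homogeneity(patterns, labels, centroids, sim):
    """Average similarity between each element's expression pattern and
    the centroid of its cluster (higher is better)."""
    return sum(sim(p, centroids[l]) for p, l in zip(patterns, labels)) / len(patterns)

def separation(sizes, centroids, sim):
    """Size-weighted average similarity between centroids of different
    clusters (for a similarity measure, lower means better separated)."""
    num = den = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            w = sizes[i] * sizes[j]
            num += w * sim(centroids[i], centroids[j])
            den += w
    return num / den

def sim(x, y):
    # Toy similarity for 1-D "profiles": negative absolute difference.
    return -abs(x[0] - y[0])

patterns = [[0.0], [0.1], [1.0], [1.1]]
labels = [0, 0, 1, 1]
centroids = [[0.05], [1.05]]
h = homogeneity(patterns, labels, centroids, sim)   # about -0.05
s = separation([2, 2], centroids, sim)              # -1.0
```

In this toy two-cluster example the clusters are both tight (high homogeneity) and far apart (low centroid similarity), illustrating the two scores' complementary roles.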

Known Clustering Algorithms

The underlying concepts and the parameters of each of the clustering algorithms are given in an additional file

Authors' contributions

IP participated in the design of the study, produced the necessary data and software, carried out the analyses and drafted the manuscript. IBG conceived the study, designed and supervised the work, helped to analyze the results and drafted the manuscript. OM helped in the research coordination and review. The authors read and approved the final manuscript.