Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA

Abstract

Background

Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see whether the genes in the same cluster can be functionally correlated. While past successes of such analyses have been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species.

Results

In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency.

Conclusion

Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information can then be used to select the most suitable algorithm from a class of clustering algorithms for the given data set.

Background

The primary purpose of this paper is to introduce two new external indices for measuring the performance of a clustering algorithm for the specific purpose of grouping genes using their expression profiles.

Clustering of genes on the basis of expression profiles is a frequently, if not always, performed operation in analyzing the results of a microarray or SAGE study. It is often taken as a first step in understanding how a class of genes act in concert during a biological process. The statistics and machine learning literature provides a huge choice of clustering tools for such unsupervised learning operations. Not only do multiple algorithms exist, but even a single algorithm may rely on various user-selectable tuning parameters, such as the desired number of clusters, threshold values for forming a new cluster, or initial values. Naturally, the results may be quite varied (see, e.g.,

Past evaluations of clustering algorithms have been of a general (non-biological) nature. For example, a good clustering algorithm ideally should produce groups with distinct non-overlapping boundaries, although a perfect separation cannot typically be achieved in practice. Figure of merit measures (FOM, hereafter)

Although popular statistical clustering algorithms (e.g., UPGMA) have often been reported to successfully produce clusters of functionally similar genes, it is important to make that requirement a part of the evaluation strategy in selecting one from a list of competing clustering algorithms. Some attempts in this direction have been made in recent years (e.g.,

In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We also provide R code with some simple illustrations for computing these indices [see

R code for BHI and BSI. The file contains R code for calculating the performance indices for clustering algorithms introduced in this paper.


We use publicly available GO

Results

We first consider the breast cancer data. This data set consists of the eleven-dimensional expression profiles of 258 significant genes over four normal and seven DCIS samples. Based on the size of the data set, we judge that a number of clusters between four and ten might be appropriate. Thus, both the biological homogeneity index (BHI) and the biological stability index (BSI) were computed for each clustering algorithm in this range of cluster numbers. As described in the Methods section, we used eleven functional classes for this study. Figure


BHI for various clustering algorithms applied to the normal and DCIS samples in breast cancer data. The thick black line is the 95th percentile of BHI values under random clustering.

Three of the seven clustering algorithms were used with two choices of dissimilarity measures. These are indicated by the line types, with solid lines corresponding to one minus the Pearson's correlation coefficient as a dissimilarity measure and dashed lines corresponding to Euclidean distance. In the rest of the paper, the term correlation refers to the Pearson's correlation coefficient. The plot of BHI reveals that UPGMA with the correlation measure produces the most biologically homogeneous clusters for this data set, and the results are statistically significant when the number of clusters is between six and ten. We also computed p-values under a non-uniform resampling that maintains the same cluster sizes (on average) as produced by a given clustering algorithm. This is easily accomplished by drawing a random sample with probability proportional to the original cluster sizes instead of a simple random sample in Step 2 of the statistical scoring algorithm. Note, however, that it is computationally expensive, since separate resampling needs to be done for each

The biological stability index (BSI) is plotted in Figure


BSI for various clustering algorithms applied to the normal and DCIS samples in breast cancer data.

Next we report the results for the sporulation data set. As stated in the Methods section, we used two different sets of functional classes for biological validation. For the details, we refer to Figures


BHI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FatiGO. The thick black line is the 95th percentile of BHI values under random clustering.


BSI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FatiGO.


BHI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FunCat. The thick black line is the 95th percentile of BHI values under random clustering.


BSI for various clustering algorithms applied to the positively expressed genes in yeast sporulation data with functional classes from FunCat.

Model-based clustering selected only six clusters even when a larger maximum number of clusters was specified. The biological stability index, on the other hand, was high for UPGMA and Fanny (Euclidean) but low for K-Means and Fanny (correlation). Thus, considering everything, Fanny (Euclidean) seems to be the optimal algorithm for the yeast data set. Other good overall performers were Diana (correlation) and SOTA.

Discussions and conclusion

Historically, validation measures for clustering algorithms have been based on the data themselves. They measure the extent of a clustering algorithm's ability to find similarity structures hidden in the data. However, for clustering biological data such as gene expression profiles, it is reasonable to consider external measures that employ existing biological knowledge (which can be taken as the "ground truth"). As argued by

The two indices introduced here are useful in quantifying how well an unsupervised clustering groups genes with similar biological functions, given a reference collection of relevant functional classes. These indices are preferable to internal indices when there is substantial existing biological knowledge about the genome under consideration (e.g., as reflected by the proportion of annotated genes).

As mentioned in the background section, the stability aspect was absent in existing external indices based on biological information. In our earlier work

Past studies have often concluded that clustering of gene expression profiles (typically via UPGMA with correlation similarity) groups functionally similar genes together. This conclusion is often reached by inspecting a handful of handpicked genes. Such conclusions are inherently incomplete unless one can quantify the agreement between the clusters produced from the expression profiles and the biological classes, because it is likely that many biologically unrelated genes will be grouped together as well.

The proposed indices are easy to interpret and easy to implement. They are also useful in identifying the optimal clustering algorithm for a given data set in its ability to cluster biologically similar genes. As illustrated in this paper, no single clustering algorithm is likely to be the winner in all data sets. The approach introduced here will be even more useful as the gene ontology databases grow with time.

As shown with the illustrated data sets, the biological indices can also guide us to determine the number of clusters to be used in a clustering routine. Once an optimal algorithm is determined one may choose

Methods

Suppose _{1},.....,_{F }be _{i }⊂

Biological homogeneity index

Consider two annotated genes

where _{j}, _{j }= _{j }∩ _{j}, and where for a set

This is a simple measure that is easy to interpret and implement once the reference collection of functional classes is in place. It also works with overlapping functional classes. The measure can be thought of as the average proportion of gene pairs with matched functional classes that are statistically clustered together based on their expression profiles.
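To make the verbal description of the homogeneity index concrete, the following Python sketch computes a score of this kind. It is a minimal illustration under stated assumptions, not the R code accompanying the paper: the function name `bhi`, the dictionary-based inputs, and the convention that a cluster with fewer than two annotated genes contributes zero are all choices made here. Two genes are counted as matched when they share at least one functional class, which is one way of accommodating overlapping classes.

```python
from itertools import combinations


def bhi(clusters, annotations):
    """Average, over clusters, of the proportion of annotated gene pairs
    in a cluster that share at least one functional class.

    clusters:    dict mapping gene -> cluster label (hypothetical input format)
    annotations: dict mapping gene -> set of functional class names;
                 genes that are missing or have an empty set are unannotated
    """
    annotated_by_cluster = {}
    for gene, label in clusters.items():
        if annotations.get(gene):  # keep annotated genes only
            annotated_by_cluster.setdefault(label, []).append(gene)

    labels = set(clusters.values())
    total = 0.0
    for label in labels:
        genes = annotated_by_cluster.get(label, [])
        n = len(genes)
        if n < 2:
            continue  # assumption: such a cluster contributes zero
        matched = sum(
            1 for x, y in combinations(genes, 2)
            if annotations[x] & annotations[y]  # any shared class counts
        )
        total += matched / (n * (n - 1) / 2)  # proportion of matched pairs
    return total / len(labels)


if __name__ == "__main__":
    clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
    annotations = {
        "g1": {"A"}, "g2": {"A", "B"}, "g3": {"B"},
        "g4": {"C"}, "g5": {"C"},
    }
    print(bhi(clusters, annotations))
```

In the toy example, two of the three pairs in the first cluster share a class and the single pair in the second cluster matches, so the score is the average of 2/3 and 1.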

Biological stability index

Next we capture the stability of a clustering algorithm by inspecting the consistency of the biological results produced when the expression profile is reduced by one observational unit. This stability measure is unrelated to the one introduced by

In a microarray or SAGE study, each gene has an expression profile that can be thought of as a multivariate data value in ℜ^{p}, for some ^{p-1 }obtained by deleting the observations at the ^{g,i}, denote the cluster containing gene ^{g,0 }be the cluster containing gene

A successful clustering is characterized by high values of both of these indices. The following subsection describes how to attribute a
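The leave-one-observation-out idea behind the stability index can be sketched in Python as follows. This is only one plausible formalization and should be read as an assumption rather than as the definition used in the paper: for every ordered pair of distinct genes in the same functional class, it takes the fraction of the full-data cluster of the first gene that also lies in the reduced-data cluster of the second gene, and averages this over deleted observations, gene pairs, and functional classes. The function name `bsi`, the input formats, and the choice to skip classes with fewer than two clustered genes are hypothetical. The clusterings themselves are assumed to have been computed beforehand by whichever algorithm is under evaluation.

```python
from itertools import permutations


def _groups(labels):
    """Map each cluster label to the set of genes assigned to it."""
    groups = {}
    for gene, label in labels.items():
        groups.setdefault(label, set()).add(gene)
    return groups


def bsi(full_labels, reduced_labels, classes):
    """Assumed overlap-based stability score (see the lead-in for caveats).

    full_labels:    dict gene -> cluster label from the full expression data
    reduced_labels: list of dicts gene -> cluster label, one per deleted
                    observation (same genes as in full_labels)
    classes:        dict functional class name -> set of genes
    """
    full_groups = _groups(full_labels)
    reduced_groups = [_groups(labels) for labels in reduced_labels]

    class_scores = []
    for members in classes.values():
        genes = [g for g in members if g in full_labels]
        n = len(genes)
        if n < 2:
            continue  # assumption: classes without a gene pair are skipped
        total = 0.0
        for x, y in permutations(genes, 2):
            cluster_x_full = full_groups[full_labels[x]]
            for labels_i, groups_i in zip(reduced_labels, reduced_groups):
                cluster_y_reduced = groups_i[labels_i[y]]
                total += len(cluster_x_full & cluster_y_reduced) / len(cluster_x_full)
        class_scores.append(total / (n * (n - 1) * len(reduced_labels)))
    return sum(class_scores) / len(class_scores) if class_scores else 0.0


if __name__ == "__main__":
    full = {"a": 0, "b": 0, "c": 1, "d": 1}
    classes = {"X": {"a", "b"}, "Y": {"c", "d"}}
    print(bsi(full, [full, full], classes))  # unchanged clusterings
    print(bsi(full, [{"a": 0, "b": 1, "c": 0, "d": 1}], classes))
```

Under this formalization, a clustering that is unchanged by every deletion and keeps functionally related genes together scores 1, while reshuffled clusters lower the score.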

Statistical scoring

By comparing with "random clustering", we can compute the observed level of significance or

Step 1. Compute a performance measure

Step 2. Compute the same performance measure _{obs }corresponding to a random clustering algorithm that ignores the data and assigns genes to clusters randomly and independently. This can easily be done by generating (^{g,0 }and ^{g,i}, 1 ≤

Step 3. Repeat Step 2 a large number of times, say

Step 4. Compute the p-value as the proportion of times the performance measure under random cluster assignment equals or exceeds the value obtained using the clustering algorithm under consideration

This proportion estimates the probability of obtaining a value as high as _{obs }just by chance (i.e., by "random clustering"). A 95% upper limit of the distribution of
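The four steps above can be sketched in Python as a simple Monte Carlo procedure. This is an illustrative sketch, not the authors' implementation: the function name `random_clustering_pvalue`, its parameters, and the convention that the performance measure is passed in as a function of a single gene-to-cluster assignment are all assumptions made here. That convention suits a homogeneity-type index; a stability-type index would additionally need random assignments for each reduced data set. The optional `weights` argument is meant to mimic the non-uniform resampling mentioned in the Results section, in which cluster labels are drawn with probability proportional to the original cluster sizes.

```python
import random


def random_clustering_pvalue(measure, genes, k, observed,
                             n_rep=1000, weights=None, seed=0):
    """Monte Carlo p-value of an observed index against random clustering.

    measure:  function taking a dict gene -> cluster label and returning
              the performance index (Step 1 supplies `observed` the same way)
    genes:    list of gene identifiers
    k:        number of clusters
    observed: index value obtained from the clustering algorithm
    n_rep:    number of random clusterings (Step 3)
    weights:  optional relative cluster sizes for non-uniform label draws;
              None gives equal probability to every cluster
    """
    rng = random.Random(seed)
    labels = list(range(k))
    count = 0
    for _ in range(n_rep):
        # Step 2: assign genes to clusters randomly and independently,
        # ignoring the expression data.
        draws = rng.choices(labels, weights=weights, k=len(genes))
        assignment = dict(zip(genes, draws))
        if measure(assignment) >= observed:
            count += 1
    # Step 4: proportion of random values that equal or exceed the observed one.
    return count / n_rep


if __name__ == "__main__":
    genes = ["g%d" % i for i in range(20)]

    def share_in_cluster_zero(assignment):
        # Toy measure for demonstration only.
        return sum(1 for label in assignment.values() if label == 0) / len(assignment)

    print(random_clustering_pvalue(share_in_cluster_zero, genes, 4, observed=0.5))
```

A small p-value indicates that random assignments rarely reach the observed index value. The same simulated values could also be sorted to read off an upper percentile for plotting.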

Range of

In general, users have the flexibility to investigate the performance of a clustering algorithm over a range of cluster numbers of their choosing. Some clustering algorithms, such as Fanny or model-based clustering, use a data-based selection of the total number of hard clusters even if a larger number of clusters is requested by the user. For others, this choice is subjective. Often, the biologists conducting the microarray experiment will make this call. For our illustration with the yeast data we have selected a range of

Human breast cancer progression data

We illustrate our methods using the expression profiles of 258 genes (SAGE tags) that were judged to be significantly differentially expressed at the 5% significance level between four normal and seven ductal carcinoma in situ (DCIS) samples

For constructing the functional classes, we have used a publicly available web-tool called AmiGO

Yeast sporulation data

As a second illustrative data set, we use a well-known data set collected by

We used two separate web-based tools, both based on the GO ontology, to annotate these ORFs. The resulting functional classifications were different, although they had some GO terms in common. We wanted to see whether the end comparison of the clustering algorithms is sensitive to the choice of the biological classes. To this end, we compared two different sets of functional classes, both based on biological processes, for the same set of yeast ORFs.

For the first set of functional classes we mined the yeast genome database using the FatiGO webtool

The next set of functional classes was obtained using the web-based GO mining tool FunCat

The clustering algorithms

We consider the following well-known clustering algorithms, which represent the wide spectrum of clustering techniques available in the statistical pattern recognition and machine learning literature. We evaluate these algorithms using the two biological performance measures BHI and BSI. One minus correlation was taken as the dissimilarity measure for the "distance"-based algorithms. In addition, for UPGMA, Diana, and Fanny, we also considered the standard Euclidean distance between expression vectors as a dissimilarity measure. Thus, overall, ten clustering schemes were subjected to this comparative evaluation.
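The two dissimilarity measures mentioned above can be written out in a few lines of Python. This is a minimal sketch for illustration only (the function names are hypothetical, and the analyses in the paper were carried out in R). It is included to show why the choice can matter: one minus the Pearson correlation compares the shapes of two expression profiles, whereas Euclidean distance also reflects differences in magnitude.

```python
import math


def correlation_dissimilarity(x, y):
    """One minus the Pearson correlation between two expression profiles.

    Assumes both profiles have the same length and neither is constant
    (a constant profile has zero variance and the ratio is undefined).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def euclidean_distance(x, y):
    """Standard Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    low = [1.0, 2.0, 3.0]
    high = [2.0, 4.0, 6.0]  # same shape as `low`, twice the magnitude
    print(correlation_dissimilarity(low, high))
    print(euclidean_distance(low, high))
```

In the example, the two profiles are perfectly correlated, so the correlation-based dissimilarity is essentially zero even though their Euclidean distance is not. A distance-based algorithm can therefore group the same genes differently under the two measures.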

UPGMA

This is perhaps the most commonly used clustering method with microarray data sets. This is an agglomerative hierarchical clustering algorithm

K-means

K-means

Diana

This is also a hierarchical algorithm which is divisive in nature

Fanny

This algorithm produces a fuzzy cluster

SOM

Clustering by self-organizing maps

Model based clustering

Under this scheme

SOTA

Self-organising tree algorithm or SOTA has received a great deal of attention in recent years and was used to cluster microarray gene expression data in

UPGMA (hclust) and K-Means are available in the base distribution of R. Diana and Fanny are available in the library "cluster". Model based clustering is available in the R-package mclust. For SOM, we have used an R code written by Niels Waller and Janine Illian

Authors' contributions

Susmita Datta: Development of statistical methods, identification of data sets and biological commentary; Somnath Datta: Development of statistical methods and computing. Both authors approved the final manuscript.

Acknowledgements

We thank the reviewers for their constructive comments. This research was supported by a grant (H98230-06-1-0062) from the National Security Agency. We thank Joaquín Dopazo and Jaime Huerta Cepas for sharing their R-package for SOTA with us. Help from our graduate students Vasyl Pihur and Mourad Atlas is also acknowledged.