Abstract
Background
Data clustering is a powerful technique for identifying data with similar characteristics, such as genes with similar expression patterns. However, not all implementations of clustering algorithms yield the same performance or the same clusters.
Results
In this paper, we study two implementations of a general method for data clustering: k-means clustering. Our experiments compare the running times and distance efficiency of Lloyd's K-means Clustering and Progressive Greedy K-means Clustering.
Conclusion
Based on our implementation, Lloyd's K-means Clustering algorithm is more efficient, not only in processing time but also in terms of mean squared-difference (MSD). This analysis was performed on both a gene expression level sample and randomly generated datasets in three-dimensional space. However, other circumstances may dictate a different choice in some situations.
Background
Researchers are inundated with data with little obvious information readily accessible; this is especially true in the many disciplines of the life sciences. These data may be very confusing and perplexing to biologists when viewed as a whole. To make these data more meaningful and to derive important biological understanding from these data, researchers have access to many different data processing techniques. One popular and meaningful approach is to cluster data into groups, where each group aggregates data with similar biological characteristics.
Data clustering is a very powerful technique in many application areas. Not only may the clusters have meaning themselves, but clustering allows for efficient data management techniques in that data that is grouped in the same manner will usually be accessed together. Access to data within a cluster may predict that other data in that cluster will be accessed soon; this can lead to optimized storage strategies which perform much better than if the data were randomly stored.
An easy abstraction for clustering data is based on multidimensional proximity relationships. While there may be other relationships among the data items, we focus on a distance relationship between data so that a meaningful analytical conclusion can be drawn from simple comparisons. Using proximity relationships, data are clustered in such a way that the squared-error distortion is minimized both globally and locally. The effectiveness of the algorithms analyzed is measured against this criterion. The mean squared-error distortion is defined as

$$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_{i}, X)^{2}$$

where V = {v_{1}, v_{2},..., v_{n}} is the set of data points, X = {x_{1}, x_{2},..., x_{k}} is the set of k cluster centers, d(v_{i}, X) is the distance from the point v_{i} to the closest cluster center in X, and n is the total number of points [1].
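As an illustration of this criterion, the following minimal Python sketch computes the average of the squared distances from each point to its closest center. It is not the authors' C implementation; the function names and the toy points and centers are hypothetical and chosen only for this example.

```python
import math

def nearest_center_distance(point, centers):
    """Distance from one point to the closest center, i.e. d(v_i, X)."""
    return min(math.dist(point, c) for c in centers)

def mean_squared_error_distortion(points, centers):
    """Average squared distance from each point to its closest center."""
    total = sum(nearest_center_distance(p, centers) ** 2 for p in points)
    return total / len(points)

# Hypothetical toy data (not taken from the paper).
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
centers = [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
distortion = mean_squared_error_distortion(points, centers)
```

Here the first two points each lie at distance 1 from the first center and the third point coincides with the second center, so the distortion is (1 + 1 + 0) / 3.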
Various algorithms exist to implement clustering in terms of proximity measures. The running time of these algorithms can vary with the quality of the clusters they produce. In this article, we focus on two widely used k-means clustering algorithms. A k-means clustering algorithm can be formally defined as a function that receives as input a set of points in multidimensional space and a number, k, of desired centers or cluster representatives; one area of active research is the issue of optimally "seeding" the algorithm with the proper value of k and the starting locations of the k cluster centers. With this input, the algorithm outputs a partition of the points into k sets, each with a defined center that minimizes the cumulative distance from the points in that set to the center, over all possible choices of each set.
We have implemented two versions of the k-means clustering algorithm: Lloyd's K-means Clustering and Progressive Greedy K-means Clustering. The former is relatively fast and fairly straightforward. The latter is a more conservative approach that can run for a much longer time but can sometimes yield better results in terms of distance measures.
We first describe these algorithms, then examine them and discuss some experimental results. These results are analyzed based on the running time of the algorithms and the mean squared-error distortion, and are compared in terms of complexity and efficiency.
Methods
Algorithm description: Lloyd's K-means Clustering algorithm
Lloyd's K-means Clustering algorithm was designed by S. P. Lloyd [2]. Given a number k, it separates the given data into k clusters, each with a center that acts as a representative. Each iteration assigns every point to its closest center and then resets the centers; iterations repeat until the centers no longer move. The algorithm is as follows [1]:
1. Assign each data point to the cluster C_{i} corresponding to the closest cluster representative x_{i} (1 ≤ i ≤ k).
2. After the assignments of all n data points, compute new cluster representatives according to the center of gravity of each cluster.
While Lloyd's algorithm often converges to a local minimum of the squared-error distortion rather than the global minimum [1], it is the faster of the two algorithms discussed in this paper.
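The two steps listed above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' C implementation: the function names, the iteration cap `max_iterations`, and the choice to leave the center of an empty cluster unchanged are assumptions made here, and the toy points are hypothetical.

```python
import math

def centroid(cluster):
    """Center of gravity: coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def lloyd_kmeans(points, centers, max_iterations=100):
    """Alternate assignment and center updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iterations):
        # Step 1: assign each point to the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each center as the center of gravity of its cluster.
        # Assumption of this sketch: an empty cluster keeps its old center.
        new_centers = [centroid(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break  # the centers did not move, so the algorithm has converged
        centers = new_centers
    return centers, clusters

# Hypothetical toy data (not taken from the paper).
points = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (10.0, 0.0, 0.0), (10.0, 2.0, 0.0)]
initial_centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
final_centers, final_clusters = lloyd_kmeans(points, initial_centers)
```

Because the result depends on the initial centers, different seeds may lead to different local minima, which is consistent with the remark above about convergence.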
We implemented this algorithm in C using two primary structures for the points: an array of points that is dynamically allocated when the user specifies the input points, and arrays for each of the k centers. Each center's array in turn contains arrays (one for each dimension of the multidimensional space) holding the points assigned to that center; for our analysis, we used three-dimensional points.
Algorithm description: Progressive Greedy K-means Clustering algorithm
The Progressive Greedy K-means Clustering algorithm is similar to Lloyd's in that it searches for the best center of gravity for each point, but it assigns points to centers using a different technique. In each iteration, Lloyd's algorithm reassigns every point to its closest center and then readjusts the centers accordingly. The Progressive Greedy approach does not act upon every point in each iteration; rather, only the point that would benefit most from moving to another cluster is reassigned. Every iteration of the Progressive Greedy algorithm calculates the "cost" of every point in terms of a Euclidean distance (in three-dimensional space), i.e.,

$$d(p, C_{i}) = \sqrt{(x_{p} - x_{i})^{2} + (y_{p} - y_{i})^{2} + (z_{p} - z_{i})^{2}}$$

Each point p = (x_{p}, y_{p}, z_{p}) has a cost associated with it in terms of the current center C_{i} = (x_{i}, y_{i}, z_{i}) to which it belongs. The point is a candidate to be moved if the Euclidean distance cost can be reduced by moving that point from its cluster C_{i} to another cluster C_{j} = (x_{j}, y_{j}, z_{j}) that has a closer center. In other words, a point is a candidate to be moved from C_{i} to C_{j} if

$$d(p, C_{i}) - d(p, C_{j})$$

is greater than 0. Once all the candidates are calculated, the point with the largest difference is moved. If no point has a difference value greater than 0, the algorithm is finished.
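The candidate test for a single point can be illustrated with a short Python sketch: the gain of a move is the Euclidean distance to the point's current center minus the distance to the other center, and only a positive gain makes the point a candidate. The function name `best_move` and the toy centers are hypothetical and are not part of the authors' C code.

```python
import math

def best_move(point, current, centers):
    """Return (gain, target) for the most beneficial move of one point.

    The gain is the distance to the current center minus the distance to
    the target center. A result of (0.0, current) means that the point is
    not a candidate to be moved.
    """
    current_cost = math.dist(point, centers[current])
    best_gain, best_target = 0.0, current
    for j, center in enumerate(centers):
        if j == current:
            continue
        gain = current_cost - math.dist(point, center)
        if gain > best_gain:
            best_gain, best_target = gain, j
    return best_gain, best_target

# Hypothetical toy centers (not taken from the paper).
centers = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
# This point is at distance 3 from center 0 and distance 1 from center 1.
gain, target = best_move((3.0, 0.0, 0.0), 0, centers)
# This point is already closest to its current center, so it stays.
stay_gain, stay_target = best_move((1.0, 0.0, 0.0), 0, centers)
```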
Each iteration of the Progressive Greedy K-means Clustering algorithm does the following:
1. Calculate the cost of each point with respect to its current cluster center and the cost of moving it to each of the other cluster centers. For every point, store the best change if the new cost is less than the cost of its current cluster center.
2. If there is a point with a best change, move it. If there is more than one, pick the one point that sees the greatest improvement when moved.
3. If no point can be improved by a move, the algorithm is finished.
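The three steps above might be put together as in the following Python sketch. It is a hedged illustration rather than the authors' C implementation: the text does not state when the centers are recomputed, so recomputing the two affected centers after every move is an assumption of this sketch, as are the function names, the `max_moves` guard, and the hypothetical toy data.

```python
import math

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def progressive_greedy_kmeans(points, assignment, k, max_moves=10_000):
    """Move one point per iteration: the one with the largest cost reduction."""
    assignment = list(assignment)

    def center_of(i):
        members = [p for p, a in zip(points, assignment) if a == i]
        return centroid(members) if members else None

    centers = [center_of(i) for i in range(k)]
    for _ in range(max_moves):
        best_gain, best_point, best_target = 0.0, None, None
        # Step 1: compare each point's current cost with every other center.
        for idx, p in enumerate(points):
            current_cost = math.dist(p, centers[assignment[idx]])
            for j in range(k):
                if j == assignment[idx] or centers[j] is None:
                    continue
                gain = current_cost - math.dist(p, centers[j])
                if gain > best_gain:
                    best_gain, best_point, best_target = gain, idx, j
        # Step 3: no point benefits from a move, so the algorithm is finished.
        if best_point is None:
            break
        # Step 2: move the single best point, then refresh the two centers
        # (assumption of this sketch).
        source = assignment[best_point]
        assignment[best_point] = best_target
        centers[source] = center_of(source)
        centers[best_target] = center_of(best_target)
    return assignment, centers

# Hypothetical toy data: the third point starts in the "wrong" cluster.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (9.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
final_assignment, final_centers = progressive_greedy_kmeans(points, [0, 0, 0, 1], k=2)
```

Each iteration scans every point against every center but moves only one point, which gives some intuition for why this approach can take much longer than Lloyd's algorithm on large inputs.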
The Progressive Greedy K-means Clustering algorithm is slower, but the extra time is spent in an attempt to further reduce the squared-error distortion mentioned earlier.
The implementation of Progressive Greedy K-means Clustering uses the same C data structures as were used for Lloyd's.
Results
Analysis of biological data
Eisen et al. [3] were among the first groups to apply the clustering approach to the analysis of gene expression data.
We applied both clustering algorithms to the analysis of microarray data. The clustering algorithms classified gene expression data into clusters such that functionally related genes are grouped together. In the following example [1], the expression information of ten genes is recorded at three different times (see Table 1). The distance matrix of the ten genes was calculated based on the Euclidean distance in three-dimensional space. The clustering algorithms grouped the gene expression data into clusters satisfying the following two conditions [1]:
• within a cluster, any two genes should be highly similar to each other (i.e., the distance between them should be small; this condition is called homogeneity), and
• any two genes from different clusters should be very different from each other (i.e., the distance between them should be large; this condition is called separation).
Table 1. Expression levels of ten genes at three different times.
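One way to inspect the homogeneity and separation conditions is to build the pairwise Euclidean distance matrix and compare the largest within-cluster distance with the smallest between-cluster distance. The Python sketch below does this under stated assumptions: the function names are hypothetical, and the four expression profiles are invented toy values, not the gene expression data of Table 1.

```python
import math
from itertools import combinations

def distance_matrix(profiles):
    """Pairwise Euclidean distances between expression profiles."""
    return [[math.dist(p, q) for q in profiles] for p in profiles]

def homogeneity_and_separation(dist, labels):
    """Return (largest within-cluster distance, smallest between-cluster distance)."""
    pairs = list(combinations(range(len(labels)), 2))
    within = [dist[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [dist[i][j] for i, j in pairs if labels[i] != labels[j]]
    return max(within, default=0.0), min(between, default=math.inf)

# Hypothetical toy profiles at three time points (not the data of Table 1).
profiles = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
labels = [0, 0, 1, 1]
dist = distance_matrix(profiles)
max_within, min_between = homogeneity_and_separation(dist, labels)
```

A clustering with good homogeneity and separation would show a small first value and a large second value.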
Both algorithms yielded the same three clusters of the ten genes as follows: {g_{1}, g_{6}, g_{7}}, {g_{3}, g_{5}, g_{8}}, and {g_{2}, g_{4}, g_{9}, g_{10}}. Tables 2 and 3, respectively, show the running time comparisons and mean squared-distance comparisons of the two clustering algorithms applied to these biological data.
Table 2. Running time comparison in seconds for different k values.
Table 3. MSD comparisons for different k values (actual values).
Analysis of a randomly generated data set
We used computer-generated random points to test the two clustering algorithms; presumably, such data contain few natural clusters and should therefore present close to a "worst case" for the clustering algorithms. Figures 1 to 4 show the running time comparisons of various runs using different values of k and different numbers of points. Each individual value in these figures is the mean time of multiple runs and is expressed in seconds, though what is important here is the relative size of these values.
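A timing experiment of this kind could be set up roughly as in the Python sketch below. The text does not specify the distribution of the random points or the number of runs, so uniform sampling in the unit cube and `runs=5` are assumptions of this sketch; `placeholder_clustering` is a hypothetical stand-in for either clustering algorithm.

```python
import random
import time

def random_points(n, dimensions=3, seed=None):
    """Generate n points uniformly in the unit cube (assumed distribution)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dimensions)) for _ in range(n)]

def mean_running_time(cluster_fn, points, k, runs=5):
    """Average wall-clock seconds of cluster_fn(points, k) over several runs."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        cluster_fn(points, k)
        total += time.perf_counter() - start
    return total / runs

def placeholder_clustering(points, k):
    """Hypothetical stand-in for a real algorithm: point i goes to cluster i mod k."""
    return [i % k for i in range(len(points))]

points = random_points(1000, seed=0)
elapsed = mean_running_time(placeholder_clustering, points, k=3)
```

Fixing the seed makes the generated data set reproducible, so that both algorithms can be timed on identical input.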
Figure 1. Running time comparison when k = 3.
Figure 2. Running time comparison when k = 4.
Figure 3. Running time comparison when k = 5 (excludes the running times of the Progressive Greedy algorithm when the number of points exceeds 10,000).
Figure 4. Running time comparison when k = 10 (excludes the running times of the Progressive Greedy algorithm when the number of points exceeds 10,000).
A comparison of mean squared differences is shown in Tables 4 and 5 for different numbers of points and k values of 5 and 10, respectively. In these tables, the maximum and minimum local cluster mean squares are shown alongside the global average MSD.
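The three quantities reported per run (global average MSD, maximum local cluster mean square, and minimum local cluster mean square) could be computed as in the following Python sketch. The exact definition used for the local values is not given in the text, so taking the mean squared distance of each cluster's points to that cluster's center is an assumption here; the function name and toy data are hypothetical.

```python
import math

def cluster_msd_summary(points, assignment, centers):
    """Return (global MSD, maximum local MSD, minimum local MSD).

    Each point's squared distance is measured to the center of the cluster
    it is assigned to (assumption of this sketch).
    """
    squared = {}
    for p, a in zip(points, assignment):
        squared.setdefault(a, []).append(math.dist(p, centers[a]) ** 2)
    local = [sum(values) / len(values) for values in squared.values()]
    global_msd = sum(sum(values) for values in squared.values()) / len(points)
    return global_msd, max(local), min(local)

# Hypothetical toy data (not taken from Tables 4 and 5).
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 4.0, 0.0)]
assignment = [0, 0, 1, 1]
centers = [(1.0, 0.0, 0.0), (10.0, 2.0, 0.0)]
global_msd, max_local, min_local = cluster_msd_summary(points, assignment, centers)
```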
Conclusion
The advantage of Lloyd's K-means Clustering algorithm over the Progressive Greedy K-means Clustering algorithm is clear from the above comparisons. Based on our implementation, Lloyd's K-means Clustering algorithm is more efficient, not only in processing time but also in terms of mean squared-difference. For very large data sets, Lloyd's algorithm is clearly faster. When the number of points exceeds 10,000, the Progressive Greedy K-means Clustering algorithm needs optimization even to be able to handle the very large floating-point values associated with finding the mean squared-difference. Without optimization, Progressive Greedy K-means Clustering would not run without generating floating-point exception errors. We therefore conclude that Lloyd's K-means Clustering algorithm seems to be the better algorithm. However, other circumstances may dictate a different choice in some situations.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GAW carried out the k-means clustering algorithm design and implementation. XH participated in the design and applications of the algorithms. Both authors have read and approved the final manuscript.
Acknowledgements
The authors would like to thank Steven F. Jennings for comments on the preliminary version of this work. This publication was made possible in part by NIH Grant #P20 RR16460 from the IDeA Networks of Biomedical Research Excellence (INBRE) Program of the National Center for Research Resources.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S6.
References
1. Jones NC, Pevzner PA: An Introduction to Bioinformatics Algorithms. The MIT Press; 2004.
2. Lloyd SP: Least squares quantization in PCM [Pulse-Code Modulation]. IEEE Transactions on Information Theory 1982, 28:129-137.
3. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998, 95:14863-14868.