Abstract
Background
The traditional (unweighted) k-means algorithm is one of the most popular clustering methods for analyzing gene expression data. However, it suffers from three major shortcomings: it is sensitive to the initial partition, its result is prone to local minima, and it is applicable only to data with spherical-shaped clusters. The last shortcoming means that we must assume that gene expression levels at different conditions are independently distributed with equal variances. In practice, this assumption does not hold.
Results
In this paper, we propose a genetic weighted k-means algorithm (denoted GWKMA), which solves the first two problems and partially remedies the third. GWKMA is a hybridization of a genetic algorithm (GA) and a weighted k-means algorithm (WKMA). In GWKMA, each individual is encoded by a partitioning table that uniquely determines a clustering, and three genetic operators (selection, crossover, mutation) and a WKM operator derived from WKMA are employed. The superiority of GWKMA over the k-means is illustrated on a synthetic dataset and two real-life gene expression datasets.
Conclusion
The proposed algorithm has general application to clustering large-scale biological data such as gene expression data and peptide mass spectral data.
Background
Clustering is defined as a process of partitioning a set of objects (patterns) into a set of disjoint groups (clusters). Its goal is to reduce the amount of data by categorizing or grouping similar data items together and thereby obtain useful information. Clustering methods can be divided into two basic types: hierarchical and partitional clustering [1]. Within each type there exists a wealth of subtypes and different algorithms. Hierarchical clustering proceeds successively either by merging smaller clusters into larger ones (bottom-up) or by splitting larger clusters into smaller ones (top-down). Hierarchical clustering methods differ in the rules used to decide which two small clusters are merged or which large cluster is split. The final result of the algorithm is a binary tree of clusters called a dendrogram, which shows how the clusters are related to each other. By cutting the dendrogram at a desired level, a clustering of the objects in a dataset into disjoint groups is obtained.
On the other hand, partitional clustering – k-means, for example – attempts to directly divide a dataset into a number of disjoint groups. All partitional clustering algorithms need as input the number of clusters and a cost (criterion) function that defines the quality of a partition. The partitional clustering method aims at optimizing the cost function so as to minimize the dissimilarity of the objects within each cluster while maximizing the dissimilarity between different clusters. In general, partitional clustering algorithms are iterative and hill-climbing, and thus they are sensitive to the choice of the initial partition. Furthermore, since the associated cost functions are nonlinear and multimodal, these algorithms usually converge to a local minimum. Algorithms based on combinatorial optimization, such as integer programming, dynamic programming, and branch-and-bound methods, are too expensive, since the number of partitions of n objects into k clusters is O(k^{n}).
Genetic algorithms (GA) [2], inspired by the natural evolution of genes, offer heuristic solutions to some optimization problems. The algorithm typically starts with a set of (randomly generated) solutions called the population and creates successive new generations of the population by genetic operations such as natural selection, crossover, and mutation. Natural selection is performed based on the fitness (related to the cost function) of each individual: the better an individual's fitness, the more chances it has to survive into the next generation. Crossover is performed according to a crossover rule, and mutation changes an individual with a user-specified mutation probability. The intuition underlying the approach is that each new population will be better than the previous one. Indeed, it has been proved [3] that a canonical GA converges to the global optimum with probability 1.
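As an illustration of the canonical GA loop described above, here is a minimal Python sketch. The function names, the toy fitness-proportional selection, and the overall structure are ours and not from the paper; the problem-specific callables are assumptions supplied by the caller.

```python
import random

def genetic_algorithm(fitness, init_population, crossover, mutate,
                      generations=50, pm=0.1, seed=0):
    """Canonical GA sketch: selection, crossover, mutation over generations.

    `fitness`, `crossover`, and `mutate` are problem-specific callables
    (illustrative interface, not the paper's); `pm` is the mutation probability.
    """
    rng = random.Random(seed)
    pop = list(init_population)
    for _ in range(generations):
        # Fitness-proportional (roulette-wheel) selection of parents.
        weights = [fitness(ind) for ind in pop]
        parents = rng.choices(pop, weights=weights, k=len(pop))
        # Pair consecutive parents and recombine.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            children.extend(crossover(a, b, rng))
        # Mutate each child with probability pm.
        pop = [mutate(c, rng) if rng.random() < pm else c for c in children]
    return max(pop, key=fitness)
```

A caller would plug in an encoding-specific crossover and mutation; the loop itself stays the same.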
A GA is highly dependent on the coding of the solutions (individuals). In the context of weighted k-means, a natural representation of a solution is a pair of variables (partitional string, cluster centroids). The partitional string describes, for each object, the index of the cluster to which it belongs. The cluster centroids are representative objects of the clusters, and their attributes are found by averaging the corresponding attributes over the objects in a particular cluster. These two variables depend on each other: if one of them is given, the other can be uniquely constructed. Since the cluster centroids are generally real numbers, they may be difficult to encode. Directly encoding partitional strings, on the other hand, is simple.
Genetic algorithms have previously been considered for clustering problems [4-11]. Often, genetic algorithms were not hybridized with k-means algorithms [5,6,9,11], and thus their rates of convergence were very slow. On the other hand, when GAs were hybridized with k-means algorithms [7,8,10], the resultant algorithms inherited some drawbacks of unweighted k-means algorithms, for example, that the resultant clusters are spherical-shaped. Furthermore, if the inherent structure of the clusters in the data is not spherical, such algorithms cannot give correct results.
In this paper, we propose a genetic weighted k-means algorithm (GWKMA). This is a hybrid approach combining a GA with the weighted k-means algorithm (WKMA) [12,13], and it partially remedies the drawbacks of previous attempts [4-11]. GWKMA encodes the solutions by partitional strings and employs three genetic operators – natural selection, crossover, and mutation – and one WKM operator derived from the weighted k-means algorithm (WKMA).
Methods
WKMA
In a general sense, a k-partitioning algorithm takes as input a set D = {x_{1}, x_{2},⋯, x_{n}} of n objects and an integer K, and outputs a partition of D into exactly K disjoint subsets D_{1},⋯, D_{K}. Denote such a partition by Δ. Each of the subsets is a cluster, with objects in the same cluster being somehow more similar to each other than they are to objects in other clusters. One way to make the determination of Δ a well-defined problem is to define a cost function which measures the clustering quality of any partition of a dataset.
In this paper, each attribute of an object (gene) is expressed as a real number, and thus each object may be described by a real row vector of dimension d, where d is the number of attributes of an object. Assume that all objects in the dataset have the same number of attributes, i.e., no missing data. Let {x_{i}, i = 1,⋯, n} be a dataset of n objects, and let x_{ij} denote the jth attribute of object x_{i}. X = (x_{ij}) is called the attribute matrix of the object set D. For a predefined number K of clusters, the cost function of a weighted k-means clustering may be defined by

J_{G}(Δ) = Σ_{k=1}^{K} Σ_{x_{i}∈D_{k}} (x_{i} − m_{k}) G (x_{i} − m_{k})^{T},   (1)

where

m_{k} = (1/n_{k}) Σ_{x_{i}∈D_{k}} x_{i},   (2)

m_{k} and n_{k} are the mean and the number of objects in D_{k}, respectively, and G is a symmetric positive-definite weight matrix. The objective of a weighted k-means algorithm is to find an optimal partition Δ* and a symmetric positive-definite matrix G* such that

J_{G*}(Δ*) = min_{Δ, G} J_{G}(Δ).   (3)

Obviously, for a given partition Δ, the value of J_{G}(Δ) changes when the weight matrix G is rescaled. Therefore the weight matrix must be normalized; in this study the determinant of G is set to 1, i.e.,

det(G) = 1.   (4)

For fixed G = I in equation (1), condition (4) is satisfied automatically, and equations (1) and (3) become the cost function and optimization objective of the traditional k-means algorithm, respectively.
For a fixed partition, we wish to determine G such that the cost function (1) is minimized under the normalization condition (4). To do so, we form the Lagrangian function

L(G, λ) = J_{G}(Δ) − λ(det(G) − 1)   (5)

and calculate its derivative with respect to G,

∂L/∂G = W − λ det(G) G^{−1},   (6)

where

W = Σ_{k=1}^{K} W_{k},  W_{k} = Σ_{x_{i}∈D_{k}} (x_{i} − m_{k})^{T}(x_{i} − m_{k}),

and W_{k} is the within-group scatter matrix of cluster k (k = 1,⋯, K). Equating the derivative to zero and using the auxiliary condition (4) give λ = (det(W))^{1/d}. Finally, we have

G = (det(W))^{1/d} W^{−1}.   (7)

Note that W depends on the partition Δ. To avoid ambiguity, denote the W induced by Δ as W(Δ). Substituting (7) into (1) leads to J(Δ) = d(det(W(Δ)))^{1/d}. As d is a constant for a given dataset, the cost function of weighted k-means clustering reduces to

J(Δ) = det(W(Δ)).   (8)

Thus the objective of a weighted k-means algorithm is simplified to finding an optimal partition Δ_{o} which minimizes

J(Δ_{o}) = min_{Δ} det(W(Δ)).   (9)
There are O(k^{n}) different partitions of n objects into exactly k clusters [1]. It is impractical to use an exhaustive search when clustering a large gene expression dataset. To overcome this problem, a heuristic approach is usually considered. The basic idea of the heuristic approach is to randomly select an initial partition and then move objects between groups if such moves make J significantly smaller.
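For concreteness, the cost being minimized – the determinant of the pooled within-cluster scatter matrix – can be computed directly from a partition. The following is a minimal NumPy sketch; the function names are ours, not the paper's.

```python
import numpy as np

def within_scatter(X, labels, K):
    """Pooled within-cluster scatter W(Δ) = Σ_k Σ_{x∈D_k} (x − m_k)^T (x − m_k).

    X is an n × d data matrix; labels[i] in {0,…,K−1} gives the cluster of x_i.
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue                       # empty clusters contribute nothing
        dev = Xk - Xk.mean(axis=0)         # deviations from the cluster mean
        W += dev.T @ dev
    return W

def wkm_cost(X, labels, K):
    """Weighted k-means cost of a partition: det of the within-cluster scatter
    (smaller is better)."""
    return float(np.linalg.det(within_scatter(X, labels, K)))
```

A tighter partition (objects close to their cluster means) yields a smaller determinant, which is exactly the quantity the heuristic search tries to reduce.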
Now consider how the cost function J changes when an object x currently in cluster D_{i} is tentatively moved to a different cluster D_{j}. Let Δ = (D_{1},⋯, D_{k}), Δ′ = (D_{1},⋯, D_{i}\{x},⋯, D_{k}), and Δ″ = (D_{1},⋯, D_{i}\{x},⋯, D_{j} ∪ {x},⋯, D_{k}) (i ≠ j). Obviously the condition for successfully moving x from D_{i} into D_{j} is

J(Δ″) < J(Δ).   (10)

From the definitions, it follows that

W(Δ) = W(Δ′) + (n_{i}/(n_{i} − 1))(x − m_{i})^{T}(x − m_{i}),
W(Δ″) = W(Δ′) + (n_{j}/(n_{j} + 1))(x − m_{j})^{T}(x − m_{j}).

Condition (10) is reduced to

(n_{j}/(n_{j} + 1))(x − m_{j}) W(Δ′)^{−1} (x − m_{j})^{T} < (n_{i}/(n_{i} − 1))(x − m_{i}) W(Δ′)^{−1} (x − m_{i})^{T},   (11)
since det(A + βy′y) = det(A)(1 + βyA^{−1}y′) for any d × d invertible matrix A, any d-dimensional row vector y, and any number β.
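The rank-one determinant identity quoted above (the matrix determinant lemma) can be checked numerically. A small self-checking NumPy sketch, with arbitrary test values of our own choosing:

```python
import numpy as np

# Check the rank-one determinant update used in the text:
# det(A + β·yᵀy) = det(A)·(1 + β·y A⁻¹ yᵀ), for invertible A and row vector y.
rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)          # symmetric positive definite, hence invertible
y = rng.normal(size=(1, d))      # row vector, matching the text's convention
beta = 0.7

lhs = np.linalg.det(A + beta * y.T @ y)
rhs = np.linalg.det(A) * (1 + beta * (y @ np.linalg.inv(A) @ y.T).item())
assert np.isclose(lhs, rhs)
```

This identity is what makes the transfer test cheap: the determinant after a tentative move is obtained from the current one without recomputing a full determinant.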
If a reassignment is profitable, the greatest decrease in the cost function is obtained by selecting the cluster D_{j} for which (n_{j}/(n_{j} + 1))(x − m_{j}) W(Δ′)^{−1} (x − m_{j})^{T} is minimal. This leads to the iteratively optimal weighted k-means algorithm (WKMA) shown in Figure 1.
Figure 1. Iteratively optimal weighted k-means algorithm (WKMA).
GWKMA
As the WKMA is sensitive to initial partitions and its result is prone to local minima, this paper proposes a genetic weighted k-means algorithm (GWKMA), shown in Figure 2. The GWKMA is a hybridization of a GA and WKMA, comprising the three genetic operators of a general GA and a WKM operator derived from WKMA. In the following we specify in detail the encoding and the selection, crossover, mutation, and WKM operators.
Figure 2. Genetic weighted k-means algorithm (GWKMA).
Encoding
In the literature [5,6,8], solutions (individuals) are encoded by the centers of the clusters. Note that the cluster centers are real vectors for general clustering tasks, and encoding real numbers in a GA is difficult and may degrade the accuracy of the solutions.
We use a partitional string to express a solution to a clustering. A partitional string is an integer string over the set {1,⋯, K}, in which each position corresponds to an object and the number at a position denotes the cluster to which the corresponding object is assigned. Thus, the search space consists of all integer strings s_{Δ} of length n over the set {1,⋯, K}. A population is expressed by a set of partitional strings representing its individuals (solutions), denoted by Δ*. One may impose additional conditions to refine the search space. For example, to avoid a singular clustering, one may require that each element of the set {1,⋯, K} appears at least once in s_{Δ}.
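As a concrete illustration, a minimal Python sketch of this encoding follows; the function names are ours, not the paper's. Labels run over {1,…,K}, and the generator enforces the non-singularity constraint mentioned above.

```python
import random

def random_partition_string(n, K, seed=None):
    """Random partitional string of length n over {1,…,K} in which every
    cluster label appears at least once (no singular clustering)."""
    rng = random.Random(seed)
    s = [rng.randint(1, K) for _ in range(n)]
    # Force each of the K labels to appear at one distinct random position.
    positions = rng.sample(range(n), K)
    for k, pos in zip(range(1, K + 1), positions):
        s[pos] = k
    return s

def clusters_from_string(s, K):
    """Recover the partition D_1,…,D_K (as lists of object indices) from a
    partitional string."""
    return [[i for i, label in enumerate(s) if label == k]
            for k in range(1, K + 1)]
```

The two functions show the one-to-one relationship stated in the text: a string uniquely determines a partition, and a partition uniquely determines a string.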
The advantage of encoding the centers of clusters is that the resultant clusters from GA clustering are convex. GWKMA instead encodes the solutions (individuals) as integer strings (partitional strings). This simplifies the encoding in the GA and does not degrade the accuracy of the solutions. Since GWKMA includes the weighted k-means operator, the resultant clusters from GWKMA are still convex.
Selection operator
Selection(Δ*, X, K, N). For convenience of manipulation, GWKMA always assigns the best individual found so far to position 1 in the population and copies it to the next population. The selection operator selects (N − 1)/2 individuals from the previous population according to the probability distribution given by

p_{i} = F(s_{Δi}) / Σ_{j=1}^{N} F(s_{Δj}),

where N (an odd positive integer) is the number of individuals in a population, s_{Δi} is the partitional string of individual i, and F(s_{Δi}) is the fitness value of individual i in the current population. Fitness here is defined as

F(s_{Δ}) = TJ − J(s_{Δ}),

where J(s_{Δ}) is calculated by (8) and TJ is the determinant of the total scatter matrix of the dataset,

TJ = det(T),  T = Σ_{i=1}^{n} (x_{i} − m)^{T}(x_{i} − m),

with m the mean of all n objects. Since the within-group scatter W(Δ) never exceeds the total scatter T (in the positive semi-definite ordering), TJ ≥ J(s_{Δ}) for any s_{Δ} in the problem. Note that, together with the retained best individual, there are (N − 1)/2 + 1 selected individuals.
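Fitness-proportional (roulette-wheel) selection of this kind can be sketched as follows. For simplicity, this sketch takes the upper bound on the cost to be the maximum cost in the current population, an assumption of ours; the paper instead derives its bound TJ from the data.

```python
import numpy as np

def select(costs, n_select, rng=None):
    """Fitness-proportional selection over individuals with costs J(s_Δ).

    Fitness is (bound − cost), so lower-cost individuals are more likely to be
    drawn. Using the population's max cost as the bound is a simplification.
    Returns the indices of the selected individuals (with replacement).
    """
    rng = rng or np.random.default_rng(0)
    costs = np.asarray(costs, dtype=float)
    fitness = costs.max() - costs + 1e-12   # non-negative; lower cost → fitter
    p = fitness / fitness.sum()             # selection probabilities p_i
    return rng.choice(len(costs), size=n_select, p=p)
```

Drawing many samples makes the bias visible: individuals with smaller cost are selected far more often than those with larger cost.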
Crossover operator
Δ* = Crossover(·, N). The intention of the crossover operation is to create new (and hopefully better) individuals from two selected parent individuals. In GWKMA, one parent is always the first individual, i.e., the best individual found so far, and the other is one of the (N − 1)/2 selected individuals from the parent population other than the first. For simplicity, the crossover operator adopts the single-point crossover method. Note that after the crossover operation population Δ* has N individuals.
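A single-point crossover on partitional strings can be sketched as follows (function name ours; the cut point is drawn uniformly from the interior positions):

```python
import random

def single_point_crossover(parent1, parent2, rng=None):
    """Single-point crossover of two equal-length partitional strings.

    A cut point is chosen uniformly in 1..len−1, and the tails beyond it are
    swapped, producing two children.
    """
    rng = rng or random.Random(0)
    point = rng.randrange(1, len(parent1))   # cut strictly inside the string
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

Because both parents are valid label strings over {1,…,K}, the children are too, though a child may leave some cluster empty, which is why the mutation step below includes a repair.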
Mutation operator
Δ* = Mutation(Δ*, Pm, K, N). Each position in a coding string is selected with a user-set mutation probability Pm, and the number at a selected position is replaced by another integer drawn uniformly at random from the set {1,⋯, K}. In other work [14], such a mutation depends on the distance of the corresponding object from the cluster centroids; such a complex strategy is unnecessary here because the WKM operator will be used. To avoid a singular partition (one containing an empty cluster), the mutation operator also randomly assigns one object to any cluster which is empty after all genetic operations.
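The mutation operator, including the empty-cluster repair, might look like the following sketch (names ours). One caveat of this simple repair, noted in a comment: reassigning an object to an empty cluster can in principle empty another cluster, which a full implementation would re-check.

```python
import random

def mutate(s, K, pm, rng=None):
    """Positionwise mutation of a partitional string with probability pm,
    followed by a repair that assigns one random object to each empty cluster."""
    rng = rng or random.Random(0)
    s = list(s)
    for i in range(len(s)):
        if rng.random() < pm:
            # Replace with a *different* label, uniformly at random.
            s[i] = rng.choice([k for k in range(1, K + 1) if k != s[i]])
    # Repair singular partitions: give every empty cluster one object.
    # (Naive: the reassignment could itself empty a small cluster; a full
    # implementation would iterate until no cluster is empty.)
    for k in range(1, K + 1):
        if k not in s:
            s[rng.randrange(len(s))] = k
    return s
```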
WKM operator
[Δ*, J(Δ*)] = WKM(Δ*, X, K, N): The WKM operator is obtained by calling WKMA for each individual s_{Δ} in population Δ*. It is sufficient to run the repeat loop in WKMA only a few times. Note that J(Δ*) is an N-dimensional vector, each component of which is the cost-function value of one individual in population Δ*. In other works [7-9], several different k-means operators were employed whose functions are similar to that of WKMA; however, those k-means algorithms are neither iteratively optimal nor weighted.
Evaluation
The term "evaluation of a clustering method" usually refers to the ability of a given method to recover the true clusters in a dataset. There have been several attempts to evaluate clustering methods on theoretical grounds [14,15]. Since a clustering result can be considered a partition of objects into a number of groups, evaluating a clustering method requires a measure of agreement between two partitions of the same dataset. In the clustering literature, measures of agreement between partitions are referred to as external indices. Several such indices have been described [15,16]. This paper adopts the adjusted Rand index (ARI).
Consider two partitions of N objects: the r-cluster partition U = {u_{1},⋯, u_{r}} and the s-cluster partition V = {v_{1},⋯, v_{s}}. One may construct a contingency table (Table 1), where entry n_{ij} is the number of objects in both cluster u_{i} and cluster v_{j}, i = 1,⋯, r, j = 1,⋯, s. Let a_{i} and b_{j} denote the sum of row i (i = 1,⋯, r) and the sum of column j (j = 1,⋯, s) of the contingency table, respectively, and let C(N, 2) = N(N − 1)/2 be the number of pairs of N objects, where C(m, 2) = m(m − 1)/2 denotes the number of pairs chosen from m items. Based on the contingency table of the two partitions, the ARI is defined as [16,17]:

ARI = [Σ_{ij} C(n_{ij}, 2) − Σ_{i} C(a_{i}, 2) Σ_{j} C(b_{j}, 2) / C(N, 2)] / [(Σ_{i} C(a_{i}, 2) + Σ_{j} C(b_{j}, 2))/2 − Σ_{i} C(a_{i}, 2) Σ_{j} C(b_{j}, 2) / C(N, 2)].
Table 1. Contingency table for two partitions of n objects
The ARI is an adjusted form of the Rand index [18]: its value is 1 when the two partitions match perfectly, and its expected value is 0 when the two partitions are selected at random. Accordingly, a large ARI value indicates that the two partitions are highly in agreement. To investigate the sensitivity of partitional clustering methods to initial partitions, the clustering method is run with numerous different initial partitions, and the average ARI (AARI) over all pairs of resultant clusterings is then calculated. This AARI indicates the sensitivity of the clustering method to initial partitions: the larger the AARI, the less sensitive (better) the clustering method is to initial partitions.
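The ARI computation from a contingency table can be sketched as follows, using the standard Hubert–Arabie form (function name ours):

```python
from math import comb
import numpy as np

def adjusted_rand_index(labels_u, labels_v):
    """Adjusted Rand index between two labelings of the same objects,
    computed from their contingency table (Hubert & Arabie form)."""
    u_ids, u = np.unique(labels_u, return_inverse=True)
    v_ids, v = np.unique(labels_v, return_inverse=True)
    table = np.zeros((len(u_ids), len(v_ids)), dtype=int)
    for i, j in zip(u, v):
        table[i, j] += 1                    # n_ij: objects in u_i and v_j
    n = len(labels_u)
    sum_ij = sum(comb(int(x), 2) for x in table.ravel())
    sum_a = sum(comb(int(x), 2) for x in table.sum(axis=1))   # row sums a_i
    sum_b = sum(comb(int(x), 2) for x in table.sum(axis=0))   # col sums b_j
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant to a relabeling of the clusters, which is exactly what is needed when comparing clusterings produced from different initial partitions.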
To evaluate the quality of the clusters, we propose a measure of internal consistency based on the singular value decomposition (SVD) of each cluster. Suppose we are given a partition of our n × m dataset X into K disjoint clusters, where m is the number of time points and n is the number of genes. For the jth cluster (j = 1,⋯, K) we have an n_{j} × m matrix X_{j} of microarray measurements, where the rows are genes, the columns are time points, and n_{j} is the number of genes in the jth cluster. Using the SVD, we decompose X_{j} = U_{j}S_{j}V_{j}^{T}, where U_{j} and V_{j} are orthogonal matrices and S_{j} is a diagonal matrix whose entries describe the importance of the columns of U_{j} and V_{j}. The matrix X_{j}V_{j} = U_{j}S_{j} contains the projections of the rows (genes) of X_{j} onto the basis V_{j}, and the entries of S_{j} (singular values) give the relative importance of the rows of V_{j}. If the first entry of S_{j} is much larger than the second, then most of the information in the rows of X_{j} is captured by a single dimension. We thus define the internal consistency of the jth cluster as the ratio of the first and second singular values in S_{j}; it provides a measure of how well a single dimension can describe all genes in the cluster. We evaluate the quality of a clustering with K clusters by the average internal consistency (AICo) of the K clusters; a high AICo value indicates good cluster quality.
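The SVD-based internal consistency and the AICo described above can be computed as in this NumPy sketch. It assumes each cluster has at least two genes, so that a second singular value exists; the function names are ours.

```python
import numpy as np

def internal_consistency(Xj):
    """Ratio of the first to second singular value of a cluster's n_j × m
    expression matrix. Higher values mean a single dimension captures more
    of the cluster's variation. Assumes Xj has at least two rows."""
    s = np.linalg.svd(Xj, compute_uv=False)   # singular values, descending
    return s[0] / s[1]

def average_internal_consistency(X, labels):
    """AICo: mean internal consistency over the clusters of a partition."""
    return float(np.mean([internal_consistency(X[labels == k])
                          for k in np.unique(labels)]))
```

For clusters whose profiles all follow one common shape up to scaling, the second singular value is near zero and the ratio is large, matching the intended interpretation.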
Results
This section uses a synthetic dataset and two real-life gene expression datasets to investigate the performance of the GWKMA in terms of AARI and AICo, compared with the widely used k-means.
Synthetic dataset (SYN)
A synthetic dataset is generated by the sine-function model of cyclic gene behaviour employed by Yeung et al. [19]. Let x_{ij} be the simulated expression level of gene i at time point j, modeled by x_{ij} = λ_{j}·φ(i, j)·(1 + α_{ij}), where φ(i, j) = sin(2πj/8 − w_{k(i)} + ε_{ij}). λ_{j} is the amplitude control at time j, chosen according to the standard normal distribution. φ(i, j) models the cyclic behaviour of genes, with each cycle spanning eight time points. Different clusters are represented by different phase shifts, and w_{k(i)} is the phase shift for gene i in cluster k, chosen according to the uniform distribution on the interval [0, 2π]. The random variable ε_{ij} represents gene synchronization noise, chosen from a normal distribution with mean zero and standard deviation 0.3, and α_{ij} represents the measurement error of gene i at time j, chosen from a normal distribution with mean zero and standard deviation 0.4. Using this model, a synthetic dataset was generated consisting of expression levels of 600 genes at 12 time points. These 600 genes belong to six clusters, each containing 100 genes.
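A NumPy sketch of this generative model follows; the function and parameter names are ours, and the random seed is arbitrary.

```python
import numpy as np

def synthetic_dataset(n_genes=600, n_clusters=6, n_times=12, seed=0):
    """Sine-model synthetic expression data in the spirit of Yeung et al.:
    x_ij = λ_j · sin(2πj/8 − w_k(i) + ε_ij) · (1 + α_ij)."""
    assert n_genes % n_clusters == 0        # equal-sized clusters, as above
    rng = np.random.default_rng(seed)
    per = n_genes // n_clusters
    labels = np.repeat(np.arange(n_clusters), per)
    lam = rng.standard_normal(n_times)                 # amplitude per time point
    w = rng.uniform(0, 2 * np.pi, n_clusters)          # phase shift per cluster
    j = np.arange(1, n_times + 1)
    eps = rng.normal(0, 0.3, (n_genes, n_times))       # synchronization noise
    alpha = rng.normal(0, 0.4, (n_genes, n_times))     # measurement error
    X = lam * np.sin(2 * np.pi * j / 8 - w[labels][:, None] + eps) * (1 + alpha)
    return X, labels
```

Because the true cluster labels are known by construction, a dataset of this kind is directly usable for computing the ARI of a recovered clustering.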
Two real-life datasets
The first real-life dataset is a subset of gene expression profiles over 11 time points collected during the process of bacterial cell division [20]; it contains the 431 gene expression profiles with standard deviation greater than 0.5 and no missing data points, and is denoted BAC in this paper. The second dataset is a subset of gene expression profiles over 7 time points collected during the developmental program of sporulation in budding yeast [21]; it contains the 529 gene expression profiles with standard deviation greater than 1.0 and no missing data points, and is denoted SPO. Both original datasets are publicly available from the Stanford Microarray Database [22] at http://genome-www5.stanford.edu/.
In the experiments conducted in this study, the number of generations is set to GEN = 15, the population size to N = 21, and the mutation probability to Pm = 0.10. The Matlab™ software package was used to conduct the experiments. Both AARI and AICo are computed over a range of cluster numbers and over a number of runs of both GWKMA and the traditional k-means.
The AICo values for cluster numbers from 2 to 10, calculated from the results of 5 runs of both the GWKMA (solid lines) and the k-means (dashed lines), are depicted in the upper panel of Figure 3. The AICo values for the GWKMA are greater than 1.8, while those for the k-means are less than 1.6, indicating that the quality of clustering from the GWKMA is higher than that from the k-means. The AARI values for cluster numbers from 2 to 10, calculated from the same 5 runs of both algorithms, are depicted in the lower panel of Figure 3. The AARI values for the GWKMA are greater than those for the k-means for all clusterings except the one with k = 8, which means that the GWKMA is less sensitive to initial partitions than the k-means.
Figure 3. Comparison of the GWKMA (solid lines) to the k-means (dashed lines) over a range of cluster numbers on dataset SYN.
Figures 4 and 5 depict comparisons of the GWKMA and the k-means in terms of AICo (upper panels) and AARI (lower panels) for the two real-life gene expression datasets. Before clustering, the two datasets were normalized by shifting the median of each gene expression profile to zero. Figures 4 and 5 show the same outcome for the real-life gene expression data as for the synthetic dataset; that is, the GWKMA outperforms the k-means in terms of AARI and AICo.
Figure 4. Comparison of the GWKMA (solid lines) to the k-means (dashed lines) over a range of cluster numbers on dataset BAC.
Figure 5. Comparison of the GWKMA (solid lines) to the k-means (dashed lines) over a range of cluster numbers on dataset SPO.
The superior quality of clustering from the GWKMA can be explained as follows. The kmeans method assumes that 1) all attributes (data at time points) of objects (genes) are independent and 2) the standard deviations of all attributes over all objects are equal.
In practice, these two assumptions do not hold. For example, we calculated the sample covariance matrix of dataset SPO, shown as matrix S in Figure 6. The elements on the main diagonal of S are not equal; they range from 0.23 to 4.44, indicating that assumption 2) for the k-means is invalid. In many data analyses, gene expression data are normalized so that the standard deviation of each attribute over all objects is 1; in that case assumption 2) holds, but assumption 1) is still invalid. For example, most off-diagonal elements of matrix S in Figure 6 are far from zero, meaning that most attributes in dataset SPO are correlated and thus not independent.
Figure 6. The sample covariance matrix of dataset SPO.
Conclusion
In this study, a genetic weighted k-means algorithm (GWKMA) is proposed, which is a hybrid of the weighted k-means algorithm and a genetic algorithm. GWKMA was run on one synthetic and two real-life gene expression datasets. The results of the computational experiments show that the GWKMA performs better than the k-means in terms of cluster quality (AICo) and sensitivity to initial partitions (AARI).
In real-life datasets, the assumptions underlying the k-means are typically not satisfied. The weighted k-means does not require these assumptions; however, like the k-means, it is sensitive to initial partitions. The proposed GWKMA possesses the merits of both the genetic algorithm and the weighted k-means algorithm, and thus overcomes the disadvantages of the k-means and the weighted k-means. In addition, the proposed algorithm is generic and has potential application to clustering other large-scale biological data, such as peptide mass spectral data.
Competing interests
The author declares that they have no competing interests.
Acknowledgements
This work is supported by Natural Sciences and Engineering Research Council of Canada (NSERC). The author benefited greatly from discussions with Drs W.J. Zhang and A.J. Kusalik at the early stage of this research.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S6.
References

1. Hartigan J: Clustering Algorithms. New York: Wiley Press; 1975.
2. Reeves CR, Rowe JE: Genetic Algorithms: Principles and Perspectives: A Guide to GA Theory. Boston: Kluwer Academic Publishers; 2003.
3. Rudolph G: Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks 1994, 5:86-101.
4. Krishna K, Murty MM: Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 1999, 29:433-439.
5. Hall LO, Ozyurt IB, Bezdek JC: Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation 1999, 3:103-112.
6. Maulik U, Bandyopadhyay S: Genetic algorithm-based clustering technique. Pattern Recognition 2000, 33:1455-1456.
7. Franti P, Kivijärvi J, Kaukoranta T, Nevalainen O: Genetic algorithms for large-scale clustering problems. The Computer Journal 1997, 40:547-554.
8. Scheunders P: A genetic c-means clustering algorithm applied to color image quantization. Pattern Recognition 1997, 30:859-866.
9. Al-Sultan KS, Khan MM: Computational experience on four algorithms for the hard clustering problem. Pattern Recognition Letters 1996, 17:295-308.
10. Wu FX, Zhang WJ, Kusalik AJ: A genetic k-means clustering algorithm applied to gene expression data. Proceedings of the Sixteenth Canadian Conference on Artificial Intelligence, Halifax, Canada 2003, 520-526.
11. Tseng LY, Yang SB: A genetic clustering algorithm for data with non-spherical-shape clusters. Pattern Recognition 2000, 33:1251-1259.
12. Duda RO, Hart PE, Stork DG: Pattern Classification. New York: Wiley Press; 2001.
13. Spath H: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. West Sussex: Ellis Horwood Limited; 1975.
14. Estivill-Castro V: Why so many clustering algorithms – a position paper. SIGKDD Explorations 2002, 4:65-73.
15. Theodoridis S, Koutroumbas K: Pattern Recognition. San Diego: Academic Press; 1999.
16. Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002, 3(7):RESEARCH0036.
17. Hubert L, Arabie P: Comparing partitions. Journal of Classification 1985, 2:193-218.
18. Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66:846-850.
19. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17:977-987.
20. Laub MT, McAdams HH, Feldblyum T, Fraser CM, Shapiro L: Global analysis of the genetic network controlling a bacterial cell cycle. Science 2000, 290:2144-2148.
21. Chu S, DeRisi J, Eisen MB, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282:699-705.
22. Sherlock G, Hernandez-Boussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA, Eisen MB, Spellman PT, Brown PO, Botstein D, Cherry JM: The Stanford Microarray Database. Nucleic Acids Research 2001, 29:152-155.