Open Access Methodology article

Multiconstrained gene clustering based on generalized projections

Jia Zeng1,2*, Shanfeng Zhu3,4, Alan Wee-Chung Liew5 and Hong Yan6,7

Author Affiliations

1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China

2 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

3 School of Computer Science and Technology, Fudan University, Shanghai 200433, China

4 Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China

5 School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Queensland, Australia

6 Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong

7 School of Electronic and Information Engineering, University of Sydney, NSW 2006, Australia

BMC Bioinformatics 2010, 11:164  doi:10.1186/1471-2105-11-164


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/11/164


Received: 8 July 2009
Accepted: 31 March 2010
Published: 31 March 2010

© 2010 Zeng et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem.

Results

We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL), which accounts for genes that have more than one function and hence belong to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods.

Conclusions

The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. In addition, the proposed GLL is an effective performance measure for soft clustering solutions.

Background

Computationally annotating gene functions is a fundamental issue in bioinformatics. Microarray gene expression data have been used widely to study the cell cycle system, genetic regulatory interactions, development at the molecular level, and genes that act in response to a certain infectious disease. To determine gene functions, a basic approach is gene clustering using gene expression data, based on the assumption that genes with similar expression patterns should share similar functions in the process. Typical gene clustering methods include hierarchical clustering [1], the k-means algorithm [2], self-organizing maps [3], the fuzzy c-means algorithm [4], and hidden Markov models [5]. However, gene clustering regularized by only the single constraint of gene expression is not enough to obtain biologically reliable clusters, because microarray data are often noisy, contain missing values, and have uncertain temporal dependencies in time-series data [6,7]. Therefore, other constraints besides gene expression data should be incorporated for robust and reliable gene clustering.

Multiconstrained gene clustering (MGC) methods have recently attracted considerable interest [8-13]. The basic idea is that multiple constraints such as Gene Ontology (GO) and metabolic network structures can prevent gene clustering from falling into the locally optimal solution space constrained by noisy gene expression data alone. One key problem is how to combine multiple constraints to find a consistent clustering solution. Current MGC methods adopt a linear combination strategy to integrate multiple constraints of the same nature into a single new constraint, so that standard clustering algorithms for single-constrained gene clustering problems can be used, e.g., hierarchical clustering [8], Gaussian mixture models [9], k-medoids [10], and iterative conditional modes (ICM) for Markov random fields [12]. More specifically, they build a distance matrix of gene expression data as the first constraint, and then build another distance matrix based on either metabolic pathways [8,12,14] or GO annotations [9,10] as the second constraint. These two distance matrices are added linearly to form the new distance matrix for gene clustering. This linear combination strategy has also been used to incorporate different constraints in document clustering [15,16]. Despite good clustering performance, there are two major problems yet to be solved. The first is that these MGC methods can only combine constraints of the same nature, i.e., all constraints have to be represented as distance matrices. If one constraint is a similarity matrix, we need to transform it into a distance matrix so that we can add it to the other distance matrices. Such a transformation may distort the original constraint and lose information. Even if we have two distance matrices, the distance values may be on different scales and cannot be added directly. The second problem lies in the linear combination of the constraint matrices. In most cases, the desired combined constraint does not necessarily have a simple linear relationship with the original constraints. In addition, the weights for the linear combination often need a reasonable justification in practice. Another MGC strategy is the GO-guided fuzzy c-means (FCM) algorithm [13], which uses GO annotations to initialize and update the cluster probability of each gene.

To overcome the above problems, we propose a novel MGC method within the generalized projection framework. This framework generalizes the projection onto convex sets (POCS) technique, which has found many applications in image reconstruction [17] and microarray missing value imputation [18]. Theoretically, POCS provides a flexible framework to integrate multiple constraints for an optimal solution. It first transforms each constraint into a corresponding convex set, and then uses an iteratively convergent procedure to find a solution in the intersection of all sets. POCS can integrate constraints of different natures, such as different similarity matrices. Indeed, it often handles different constraints in the frequency and spatial domains in image reconstruction problems. Another advantage is that the original constraints remain intact. The clustering result is iteratively projected onto the solution set that satisfies each constraint, and the final result may lie in the intersection set that satisfies a nonlinear combination of the original constraints. Without loss of generality, in this paper we consider two major types of constraints: the gene expression similarity [8] and the GO-based semantic similarity [19]. POCS produces a regularized clustering result that may be more reliable than those solely dependent on either the gene expression similarity or the GO semantic similarity, because expression data are often short and noisy, while GO terms may be inaccurate and mis-annotated. Because in most cases the solution set is nonconvex, we adopt generalized projections, which follow a procedure similar to POCS. To minimize the distance between the candidate solution and the constraint set, we design the generalized projector based on a method similar to the relaxation labeling (RL) algorithm [20,21], which has been used for approximate inference in Markov random fields [22,23].

Genes usually have multiple functions and can be assigned to more than one group. Traditional gene clustering algorithms often use a hard clustering strategy that assigns each gene to only one group. Recent MGC methods relax this limitation and allow genes to be assigned to several groups [9,10,13]. To take this situation into account, we use a soft clustering strategy in which genes are assigned to all clusters with different probabilities. Based on soft clustering results, we propose a new performance measure, "gene log likelihood" (GLL), to measure the distance between the predicted clustering result and the reference clusters. This measure has also been widely applied to evaluating word clustering performance in topic modeling problems [24]. To confirm its effectiveness, we evaluate the POCS-based MGC method on the yeast gene expression dataset, and compare the clustering results with recent MGC methods such as k-medoids [10], ICM [12] and FCM [13]. Experimental results demonstrate that the POCS-based MGC can enhance the overall clustering performance by a large margin.

This paper is organized as follows. In the next section we propose the POCS-based MGC method and the RL-based generalized projector, which minimizes the distance between the clustering solution and the corresponding constrained solution set. To account for genes in multiple clusters, we also propose GLL for calculating the distance between the predicted soft clustering results and the reference gene clusters. The results section shows comparative experimental results on different yeast expression datasets. The POCS-based MGC algorithm always converges to the optimal solution in practice. Finally, we draw conclusions and envision future work.

Methods

Gene clustering is a labeling problem, in which a set of cluster labels are assigned to genes for annotating gene functions. Given I genes and K clusters, the soft clustering solution is a matrix X = (xik), 1 ≤ i ≤ I, 1 ≤ k ≤ K, where xik ∈ [0, 1] and Σk xik = 1. The element xik is the probability that the ith gene is associated with the kth cluster label. For each gene we use a probability vector xi = (xi1, . . ., xik, . . ., xiK) to represent its cluster labeling configuration. From this perspective, the clustering solution X is the cluster labeling configuration of I genes over K clusters. We may also use the winner-take-all strategy to obtain the hard clustering solution X*, in which the ith gene belongs to only the cluster with the highest probability, i.e., k* = arg maxk xik and xik* = 1.
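As a minimal sketch of this representation, the following Python snippet builds a soft clustering matrix whose rows sum to one and derives the hard solution X* by the winner-take-all rule. The function names and the numeric scores are illustrative assumptions, not part of the original method description.

```python
def normalize_rows(scores):
    """Scale each gene's non-negative scores so that they sum to 1."""
    soft = []
    for row in scores:
        total = sum(row)
        soft.append([value / total for value in row])
    return soft


def winner_take_all(soft):
    """Return the hard solution X*: a one-hot row at each gene's arg max cluster."""
    hard = []
    for row in soft:
        best = max(range(len(row)), key=lambda k: row[k])
        hard.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return hard


# Three genes and K = 3 clusters; the raw scores are made-up numbers.
X = normalize_rows([[2.0, 1.0, 1.0], [0.5, 0.5, 3.0], [1.0, 4.0, 0.0]])
X_star = winner_take_all(X)
```

Each row of `X` plays the role of the probability vector xi, and each row of `X_star` contains a single one at position k*.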

Gene expression constraint

Based on microarray gene expression profiles, we can build the first constraint using the similarity matrix for gene clustering. The metric can be Pearson's correlation coefficient or the Euclidean distance [8-10], or the more complex type-2 fuzzy hidden Markov model-based sequence similarity [25]. Because Pearson's correlation coefficient is suitable for time-series gene expression data [26], we adopt it for calculating the similarity between two genes' log-ratio transformed profiles [8], i.e., the logarithm of the ratio between each sample point in the profile and a control measurement. More specifically, given two genes' transformed profiles gi(m) and gi'(m) of length M, the correlation coefficient vii' is

\[ v_{ii'} = \frac{1}{M} \sum_{m=1}^{M} \left( \frac{g_i(m) - \mu_i}{\sigma_i} \right) \left( \frac{g_{i'}(m) - \mu_{i'}}{\sigma_{i'}} \right), \]

where μi and σi denote the mean and standard deviation of the transformed profile of the ith gene, respectively. The correlation coefficient lies in [-1, 1], where a higher value corresponds to a higher similarity between two genes' profiles. Here we consider anti-correlated genes as the most dissimilar because correlated genes are often involved in similar reaction steps and share similar functions. Therefore, the Pearson's correlation coefficient matrix constrains the first clustering solution set C1 = {Xe}, which contains many locally optimal clustering solutions satisfying this expression-based similarity constraint.
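The expression constraint can be illustrated with a short, standard-library Python sketch that computes Pearson's correlation coefficient between every pair of profiles. It assumes the profiles are already log-ratio transformed, contain no missing values, and are not constant (a constant profile has zero standard deviation); the function names and the toy profiles are hypothetical.

```python
import math


def pearson(profile_a, profile_b):
    """Pearson's correlation of two equal-length profiles (population std)."""
    length = len(profile_a)
    mu_a = sum(profile_a) / length
    mu_b = sum(profile_b) / length
    sd_a = math.sqrt(sum((g - mu_a) ** 2 for g in profile_a) / length)
    sd_b = math.sqrt(sum((g - mu_b) ** 2 for g in profile_b) / length)
    return sum(
        ((a - mu_a) / sd_a) * ((b - mu_b) / sd_b)
        for a, b in zip(profile_a, profile_b)
    ) / length


def similarity_matrix(profiles):
    """Symmetric matrix of pairwise correlations, one row per gene."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


# Toy profiles: gene 1 rises with gene 0, gene 2 is anti-correlated with it.
profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5], [4.0, 3.0, 2.0, 1.0]]
V = similarity_matrix(profiles)
```

In this toy example `V[0][1]` is close to 1 and `V[0][2]` is close to -1, matching the convention that anti-correlated genes are treated as most dissimilar.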

GO constraint

As an important source of biological knowledge, the Gene Ontology (GO) provides a consistent description of genes and gene products by a controlled and structured vocabulary, which includes three major categories: biological process (BP), molecular function (MF), and cellular component (CC). The GO terms are organized in the form of a directed acyclic graph (DAG) with two major semantic relations such as "is-a" and "part-of", where "A is-a B" means A is a subclass of B, and "C part-of D" means C is always part of D. Generally, simply identifying the shared GO annotations of gene products for their functional relationship has the following limitations. First, two quite different GO annotations can be closely related through their common ancestors in the DAG so as to have a higher semantic similarity. Second, the shared GO terms may be too general to describe the functional association of annotated gene products. Recently, the GO-based semantic similarity measures have been applied to searching semantically similar proteins [27], clustering gene expression data and assessing cluster validity [19,28,29], developing new human regulatory pathway modeling tools [30], validating protein interaction data [31], validating functional annotation of expression-based clusters [32], and enabling the identification of functionally related gene products independent of homology [33].

The GO-based semantic similarity measures assume that the more information two GO terms share, the more similar they are. In this paper we adopt a recent GO-based semantic measure proposed by Wang et al. [19], in which the similarity between two GO terms SGO(cm, cn) is calculated according to the graph structural information encoded in the GO. This semantic measure between annotated GO terms for genes has been demonstrated to be better than the classic Resnik's measure in clustering gene products. If c is a GO term, Tc is the set of GO terms including term c and all its ancestors, and Ec is the set of edges connecting all terms in Tc, the S-value of any term t in the graph DAGc = (c, Tc, Ec) related to the term c, Sc(t), is defined as

\[ S_c(t) = \begin{cases} 1, & t = c, \\ \max \{ w_e \, S_c(t') \mid t' \in \text{children of } t \}, & t \neq c, \end{cases} \]

where we is the semantic contribution factor for the edge e ∈ Ec linking the term t with its child term t'. Here we use we = 0.8 for the "is-a" relation and we = 0.6 for the "part-of" relation, as suggested in [19]. After obtaining the S-values for all terms in DAGc, the semantic value of the term c, SV(c), is

\[ SV(c) = \sum_{t \in T_c} S_c(t). \]

Given two GO terms c1 and c2 as well as their graphs DAGc1 and DAGc2, the semantic similarity SGO(c1, c2) is

\[ S_{GO}(c_1, c_2) = \frac{\sum_{t \in T_{c_1} \cap T_{c_2}} \left( S_{c_1}(t) + S_{c_2}(t) \right)}{SV(c_1) + SV(c_2)}, \]

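where Sc1(t) is the S-value of GO term t related to term c1, and Sc2(t) is the S-value of GO term t related to term c2. One gene may be annotated by many GO terms. Given two genes annotated by several GO terms, GOi = {ci1, . . ., cim, . . ., ciM} and GOi' = {ci'1, . . ., ci'n, . . ., ci'N}, the functional similarity between the genes is

\[ Sim(GO_i, GO_{i'}) = \frac{\sum_{m=1}^{M} Sim(c_{im}, GO_{i'}) + \sum_{n=1}^{N} Sim(c_{i'n}, GO_i)}{M + N}, \qquad Sim(c, GO) = \max_{c' \in GO} S_{GO}(c, c'). \]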

Note that the functional similarity between two GO term sets GOi and GOi' considers the hierarchical structure of GO terms c based on the S-value. Because the GO contains three main vocabularies, BP, MF and CC, the GO similarity value between genes can be calculated in a joint manner as

where BPsim, MFsim and CCsim denote the similarity values of the corresponding GO terms within the same type. The similarity value lies in [0, 1], where a higher value corresponds to a higher similarity. As a result, the GO-based semantic similarity constrains the second clustering solution set C2 = {Xg}, which contains many locally optimal clustering solutions satisfying this GO-based similarity constraint.
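The term-level part of this measure can be sketched in a few lines of Python. The sketch assumes the S-value recursion of the measure in [19] (the term itself has S-value 1, and an ancestor takes the largest weighted S-value among its children), with weights 0.8 for "is-a" and 0.6 for "part-of". The tiny DAG and its term names are hypothetical, and the gene-level and BP/MF/CC combination steps are not covered.

```python
IS_A, PART_OF = 0.8, 0.6

# Hypothetical DAG: each term maps to a list of (parent, edge weight) pairs.
PARENTS = {
    "A": [("root", IS_A)],
    "B": [("root", IS_A)],
    "C": [("A", PART_OF), ("B", IS_A)],
}


def s_values(term, parents):
    """Return {t: S_term(t)} for `term` and all of its ancestors."""
    values = {term: 1.0}
    pending = [term]
    while pending:
        child = pending.pop()
        for parent, weight in parents.get(child, []):
            candidate = weight * values[child]
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate  # keep the strongest contribution
                pending.append(parent)      # re-propagate the improved value
    return values


def term_similarity(c1, c2, parents):
    """Shared S-values of two terms divided by the sum of their semantic values."""
    s1, s2 = s_values(c1, parents), s_values(c2, parents)
    shared = set(s1) & set(s2)
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


sim_a_c = term_similarity("A", "C", PARENTS)
```

For term "C", the root receives max(0.8 × 0.6, 0.8 × 0.8) = 0.64, so the contribution through the "is-a" path wins over the "part-of" path.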

Generalized projections

Although the gene expression and GO-based semantic similarities may yield highly correlated clustering solutions, there is still a large amount of complementary information between their final clustering results [34]. The gene expression and GO constrained solution sets C1 = {Xe} and C2 = {Xg} may not each contain a single globally optimal solution, and even if they contain such a solution, we are unlikely to be able to find it since the optimization procedures are highly nonlinear. So, we consider C1 and C2 as sets of all locally optimal solutions under different constraints. When both constraints are satisfied, we eliminate many unreasonable locally optimal solutions and obtain an improved clustering performance. Our objective is to find the biologically consistent clustering solution X ∈ C1 ∩ C2 using the POCS procedure [17]. Note that directly adding the two constraints with a weight w ∈ [0, 1], i.e., forming a weighted sum of the two similarity matrices, to produce a new constraint for gene clustering is not suitable because the constraints are of different natures. In contrast, the POCS framework decomposes the optimization procedure into different projections and solves the problem efficiently.

input: X0, Pn, wn, 1 ≤ n ≤ N, M.

output: XM.

begin

   for m ← 1 to M do

     Xm ← Σn wn PnXm-1;

     // PnXm-1 is described in Algorithm 2.

   end

end

Algorithm 1: The simultaneous projection.

Within the POCS framework [17], each constraint on the solution is formulated as a corresponding closed convex set, Cn, 1 ≤ n ≤ N, in the Hilbert space H. The optimal solution X is included in the intersection set C0 of all convex sets Cn,

\[ C_0 = \bigcap_{n=1}^{N} C_n. \qquad (1) \]

Figure 1. (A) The consistent problem in Eq. (2), where the intersection set C0 is nonempty. The circle is the initial solution. The thick black point is the consistent solution in the intersection of two sets for gene expression and GO constraints, respectively. POCS ensures that the initial solution will converge to the consistent solution after enough projections represented by the arrows. (B) The inconsistent problem in Eq. (4), where the intersection set C0 is empty. After enough simultaneous projections represented by the arrows, the thick black dot is the approximate solution such that a weighted set distance from gene expression and GO constraints is minimized.

If C0 is nonempty (Figure 1A), the successive projections onto the convex sets,

\[ X^{m} = P_N P_{N-1} \cdots P_1 X^{m-1}, \qquad (2) \]

will converge to a consistent solution in C0 for any random initial value X0, where Xm, 1 ≤ m ≤ M, is the solution at the mth iteration. Eq. (2) shows that the current solution Xm-1 is projected onto each set or constraint Cn, 1 ≤ n ≤ N, through the projector Pn successively in order to find the next better solution Xm, until it converges to the consistent solution X in the intersection of all sets. Figure 1A shows the projection process for the consistent problem in Eq. (2), where the thick black dot represents a consistent solution in the intersection of two sets C1 and C2 for the gene expression and GO constraints, respectively. The generalized projector Pn transforms Xm-1 into the solution PnXm-1 within the set Cn that minimizes the distance to Xm-1,

\[ P_n X^{m-1} = \arg\min_{X \in C_n} \| X - X^{m-1} \|, \qquad (3) \]

where ||·|| denotes the norm in the Hilbert space H. Indeed, Eq. (3) indicates that we need to transform the current clustering solution Xm-1 into a more suitable clustering solution based on the similarity or distance matrix for the set Cn. If C0 is empty (Figure 1B), the POCS algorithm uses simultaneous projections,

\[ X^{m} = \sum_{n=1}^{N} w_n P_n X^{m-1}, \qquad (4) \]

where wn is the weight on the projections, satisfying Σn wn = 1 and wn ≥ 0 for all n. The simultaneous projections converge weakly to a solution such that a weighted set distance function is minimized. Note that the simultaneous projections only linearly combine the solutions projected onto all constraint sets, which is more reasonable than the strategy that linearly combines constraints and then finds a solution under the new constraint. Figure 1B shows the simultaneous projections for the inconsistent problem in Eq. (4), where the thick black dot is an approximately best solution minimizing the weighted set distance from the gene expression constraint C1 and the GO constraint C2.

In practice, both C1 and C2 are often nonconvex. A set is convex if and only if λXa + (1 - λ)Xb is in the set whenever Xa and Xb are in the set and 0 ≤ λ ≤ 1. The constraint sets contain many locally optimal clustering "solutions", and the interpolation of the solutions, i.e., the weighted sum λXa + (1 - λ)Xb, has no mathematical meaning. Thus, we cannot use the classic POCS procedure (2). Nevertheless, we can still use the generalized projections (3) to solve the problem within the POCS framework [17, Chapter 5], which does not require the sets to be convex. In practice it is difficult to minimize the distance functions (3) under both constraints at the same time, so we do it iteratively based on generalized projections. The generalized projector iteratively minimizes the distance function (3), and terminates if the distance cannot decrease in the next step. From the regularization point of view, the solution is regularized under different constraints simultaneously, and the final solution is a linear combination of each regularized solution in Eq. (4). The simultaneous projection weights wn can be fixed empirically according to prior knowledge. To summarize, Algorithm 1 shows the simultaneous projection algorithm.
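The loop of Algorithm 1 can be illustrated on a deliberately simple geometric problem. The sketch below is not the gene clustering projector: it assumes two convex sets in the plane (a unit disk and a half-plane, both chosen only for illustration) whose exact projectors are easy to write, and applies the weighted update of Eq. (4) to them. All function names are hypothetical.

```python
import math


def project_onto_disk(point, radius=1.0):
    """Nearest point of the disk ||x|| <= radius (a convex set)."""
    norm = math.hypot(point[0], point[1])
    if norm <= radius:
        return point
    return (point[0] * radius / norm, point[1] * radius / norm)


def project_onto_half_plane(point, bound=0.5):
    """Nearest point of the half-plane x >= bound (a convex set)."""
    return (max(point[0], bound), point[1])


def simultaneous_projection(x0, projectors, weights, num_iterations):
    """Iterate X^m = sum_n w_n P_n X^{m-1}, the update described by Eq. (4)."""
    x = x0
    for _ in range(num_iterations):
        projected = [project(x) for project in projectors]
        x = tuple(
            sum(w * p[d] for w, p in zip(weights, projected))
            for d in range(len(x))
        )
    return x


solution = simultaneous_projection(
    x0=(-3.0, 2.0),
    projectors=[project_onto_disk, project_onto_half_plane],
    weights=[0.5, 0.5],
    num_iterations=200,
)
```

Because the two toy sets intersect, the iterate approaches a point that satisfies both constraints. In the gene clustering setting, the projectors would instead be the constraint-specific generalized projectors, and the sets need not be convex.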

input: X1, the constraint similarity values for all gene pairs (i, i'), 1 ≤ i, i' ≤ I, 1 ≤ k ≤ K, J.

output: XJ, 1 ≤ i ≤ I, 1 ≤ k ≤ K.

begin

   for j ← 1 to J do

     for i ← 1 to I do

       for k ← 1 to K do

         ;

         ;

       end

     end

   end

end

Algorithm 2: The relaxation labeling projector.

Now we design the generalized projector based on the iterative RL algorithm [20,21,23], which can find the soft cluster label for each gene under a certain constraint. Given the clustering solution X and the similarity matrix of a constraint, minimizing (3) is equivalent to maximizing the corresponding gain function,

(5)

where i' ∈ ∂i ranges over the set of neighbors of the ith gene, and the exponential term increases with the similarity between two genes under the given constraint. The neighborhood system ∂i is defined as the ten nearest genes i', i.e., those with the top similarity values to gene i. The gain function rewards genes that have a high similarity value for also having similar soft cluster labeling configurations. The RL algorithm iteratively updates the initial X1 by the gradient of the gain function (5) until j reaches the fixed maximum number J, as shown in Algorithm 2. The value of J is determined experimentally to ensure that the gain function is maximized. That is, after J iterations, the RL algorithm converges to a local maximum of the gain function in terms of XJ. Meanwhile, the distance function (3) is also minimized by XJ, which plays the role of the projected solution in (3). Algorithm 2 shows the projection of X1 onto the set satisfying one constraint. Note that J is the number of iterations of the RL-based projector in Algorithm 2, while M is the number of iterations in the simultaneous projection in Algorithm 1. The RL-based projector is fast, and in practice J = 5 is enough.
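To make the idea of an RL-style projector concrete, here is a hedged Python sketch. It does not reproduce the exact update of Algorithm 2. It assumes a classical relaxation labeling rule in which the support for label k at gene i is the sum of exp(similarity) times the label-k probability of each neighbor, followed by a multiplicative update and row normalization. The function names, the three-gene similarity matrix, and the initial labeling are all illustrative assumptions.

```python
import math


def nearest_neighbors(similarity, i, size):
    """Indices of the `size` genes most similar to gene i (excluding i)."""
    others = [j for j in range(len(similarity)) if j != i]
    others.sort(key=lambda j: similarity[i][j], reverse=True)
    return others[:size]


def rl_project(x_init, similarity, num_neighbors=10, num_iterations=5):
    """Assumed relaxation labeling update, applied for J synchronous sweeps."""
    x = [row[:] for row in x_init]
    neighbors = [
        nearest_neighbors(similarity, i, num_neighbors) for i in range(len(x))
    ]
    for _ in range(num_iterations):
        updated = []
        for i, row in enumerate(x):
            # Support for label k: similar neighbors that already favor k.
            support = [
                sum(math.exp(similarity[i][j]) * x[j][k] for j in neighbors[i])
                for k in range(len(row))
            ]
            unnormalized = [row[k] * support[k] for k in range(len(row))]
            total = sum(unnormalized)
            updated.append([value / total for value in unnormalized])
        x = updated  # all genes are updated from the previous sweep
    return x


similarity = [[1.0, 0.9, -0.8], [0.9, 1.0, -0.7], [-0.8, -0.7, 1.0]]
X1 = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
one_sweep = rl_project(X1, similarity, num_iterations=1)
X_J = rl_project(X1, similarity, num_iterations=5)
```

After one sweep, the undecided gene 1 moves toward the cluster favored by its most similar neighbor, gene 0. Under this assumed rule all weights are positive, so many sweeps tend to pull the labelings together, which is one reason a small J is a natural choice in such a scheme.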

Gene log likelihood

If we have a reference gene clustering solution Y, we can calculate the distance between the predicted clustering solution X and the standard reference Y for the performance evaluation. The reference clustering solution is a matrix, Y = (yiw), 1 ≤ i ≤ I, 1 ≤ w ≤ W, where yiw = 1 denotes that the ith gene belongs to the wth cluster. In most cases the number of reference clusters W does not equal the predicted number of clusters K. Because a gene may belong to multiple clusters due to multiple functions, the vector yi = (yi1, . . ., yiw, . . ., yiW) may contain multiple ones for the ith gene.

Based on the hard clustering solution X*, we may quantify the distance between X* and Y by normalized mutual information (NMI), which has been widely used in a lot of applications to measure the performance of clustering methods [12,19]. In information theory, the mutual information is defined as a quantity to measure the amount of information shared between two random variables. If one set of clusters is more consistent with the other set of clusters, the mutual information between two sets of cluster labels becomes larger. Generally, the mutual information is normalized because the range of the mutual information measures depends on the size of given sets of clusters. NMI is calculated as

where I is the number of genes, nw is the number of genes in the wth reference cluster, nk is the number of genes in the kth predicted cluster, and nwk is the number of genes in both the wth reference cluster and the kth predicted cluster. If two sets of clusters are identical, the NMI between them reaches the maximum value of one.
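For hard clusterings, NMI can be computed directly from the counts nw, nk and nwk. The sketch below assumes one common normalization, the geometric mean of the two cluster entropies; other normalizations exist, and the one used in the experiments may differ. The function name and the label lists are hypothetical.

```python
import math
from collections import Counter


def nmi(reference, predicted):
    """NMI of two hard labelings, normalized by sqrt(H(reference) * H(predicted))."""
    total = len(reference)
    n_w = Counter(reference)                  # genes per reference cluster
    n_k = Counter(predicted)                  # genes per predicted cluster
    n_wk = Counter(zip(reference, predicted))  # genes shared by a cluster pair
    mutual = sum(
        count * math.log(total * count / (n_w[w] * n_k[k]))
        for (w, k), count in n_wk.items()
    )
    h_w = sum(count * math.log(count / total) for count in n_w.values())
    h_k = sum(count * math.log(count / total) for count in n_k.values())
    if h_w * h_k == 0.0:
        return 0.0  # one labeling has a single cluster, so NMI is undefined
    return mutual / math.sqrt(h_w * h_k)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that describe the same partition under different label names and yields 1, while the second compares statistically independent partitions and yields 0.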

However, NMI cannot be used if one gene may be in multiple clusters. So, we propose a new performance measure referred to as gene log likelihood (GLL) log P(Y|X) for gene clustering, which measures the likelihood in predicting a single gene in the reference cluster Y based on X. GLL has a simple meaning that the ith gene in the wth reference cluster Y is predicted with a likelihood proportional to the product of the likelihood that the wth cluster is generated by the kth cluster and the likelihood that the ith gene is generated by the kth cluster in X. Higher values are better, indicating the obtained clustering solution X has a higher likelihood to generate the reference gene clusters Y. Specifically we calculate GLL as follows,

(6)

where xi = (xi1, . . ., xik, . . ., xiK) is the probability distribution over K clusters of the ith gene, i ∈ w denotes the set of all genes in the wth reference cluster with yiw = 1, and pw = (pw1, . . ., pwk, . . ., pwK) is the probability distribution of the wth reference cluster over K predicted clusters. Empirically, this probability pwk can be estimated by

(7)

(8)

where we assume that the genes are conditionally independent in the generative process. Indeed, this is a standard performance measure for word clustering in text mining [24], where it indicates the empirical likelihood of predicting a single word in a document.

Results and Discussion

Datasets

To calculate the gene expression constraint, we select four microarray time-series datasets [35] that monitor genome-wide mRNA levels for 6178 yeast Saccharomyces cerevisiae open reading frames simultaneously under several different methods of synchronization: the alpha, cdc15, cdc28 and elu datasets. We also add the Hughes dataset [36], widely used in gene clustering [9,10], because it contains 300 time points and only a small number of missing values. The missing values in the microarray data are interpolated by the POCS-based reconstruction method [18], which uses multiple constraints such as synchronization loss. To calculate the GO constraint, the GO (version 20080225) and annotation (version 1.1384) databases of yeast are downloaded from the official GO website. The yeast annotation file includes 6345 gene products annotated with 77152 GO terms.

To evaluate MGC methods for gene clustering, we generate two different sets of reference gene clusters with true cluster labels from KEGG [37] and SGD (Saccharomyces Genome Database, http://www.yeastgenome.org/), referred to as KEGG clusters [12] and SGD clusters [19], respectively. The KEGG pathway maps are generally classified into six major categories including metabolism. We use ten subcategories under the metabolism category as KEGG clusters, which include a total of 531 genes. Note that a gene can be in more than one cluster. Table 1 lists the KEGG clusters and the number of genes in the corresponding cluster. We also use the gene annotation and classification information in yeast biochemical pathways as SGD clusters. There are 142 pathways involving 835 genes, among which only 26 pathways contain more than 10 genes, and a gene can be in more than one pathway. Table 2 summarizes the list of pathway clusters and the number of genes in the corresponding cluster. We use two different sets of reference clusters because gene clusters vary with the partitioning criteria. If the clusters predicted by the POCS-based method are close to both sets of reference clusters, we may safely conclude that this method is robust in annotating gene functions under different conditions.

Table 1. 10 reference gene clusters from KEGG

Table 2. 26 reference gene clusters from yeast biochemical pathways

Comparative results

The POCS-based MGC method requires two key parameters, the number of simultaneous projections M and the weight on projections wn, in Algorithm 1. Because we have two constraints, the weight for the GO-based constraint is w, and thus the weight for the gene expression constraint is 1 - w. Through experiments on the alpha dataset, we can determine proper M and w for desirable gene clustering performance. The parameters M and w are adjusted so that we can obtain the desirable result within the POCS framework. It is possible that another iterative method can estimate the parameters better. However, in many cases, such a better-performing method is a supervised learning procedure using reference gene clusters, and can be incorporated into the POCS procedure to achieve an even better performance or robustness. That is, POCS is useful for combining information from different sources if we can formulate corresponding constraint sets and projections.

To determine M, we randomly initialize the clustering solution and set the weight w = 0.5. Figure 2 shows the GLL values on the KEGG and SGD reference clusters when 10 projections are used. For different numbers of clusters K = 10, 15, 20, 25, we see that the GLL values do not increase significantly after two or three projections. So, we believe that M = 3 is enough to produce desirable clustering results in this task. From this experiment, we also see that Algorithm 1 converges quickly after a few projections. Then, we fix M = 3 and tune the weight w ∈ [0, 1]. With only M = 3 projections in practice, POCS does not increase the computational cost very much, which makes this algorithm attractive for combining more constraints for gene clustering.

Figure 2. GLL of the alpha dataset on the KEGG and SGD when 10 projections are used.

Figure 3 shows the GLL values on the KEGG and SGD reference clusters as the weight increases in steps of 0.1. We observe that the performance depends strongly on the projection weights. If we use the KEGG reference clusters, we find that the weight w = 0.7 produces the highest GLL value on average. Neither the gene expression constraint alone (w = 0) nor the GO constraint alone (w = 1) ensures the best clustering result. We see that the GO constraint can produce more reliable clustering results than the gene expression constraint, because GO annotations are based on biologists' prior knowledge, which is more reliable than gene expression data. Furthermore, we often assume that anti-correlated genes are not within the same cluster, but in some cases this assumption is not true. However, when the weight w increases, the final performance does not always increase, and w = 0.5 produces a local minimum of the GLL value. After that, the GLL value continues to increase to the next local maximum. The SGD reference clusters reconfirm that the GO-based constraint is more reliable. There, the best clustering performance on average often occurs when w = 0.9. Therefore, as a compromise between the two sets of reference clusters, we adopt the weight w = 0.8 for the simultaneous projection in all our experiments.

Figure 3. GLL of the alpha dataset on the KEGG and SGD when different weights w are used.

As far as Figure 3 is concerned, one major reason why GO information is more reliable for clustering is that the reference gene clusters from KEGG and SGD (Tables 1 and 2) are partly correlated with GO annotations. Therefore, we need to delete a certain fraction of GO annotations when performing clustering, and use only the gene expression constraint to predict the functions of the affected genes, which are then compared with the reference gene clusters. In this paper, we adopt the cross-validation procedure [10] to validate the POCS-based MGC method. More specifically, we perform a five-fold cross-validation by deleting 20% of the GO constraints from the datasets in turn. We examine whether the POCS-based MGC method can predict the functions of those 20% of genes without GO constraints, as compared to the reference KEGG and SGD gene clusters. We report the average prediction performance over the five folds.
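The fold construction behind such a five-fold procedure can be sketched as follows. This is only an assumed bookkeeping scheme: it shuffles gene identifiers with a fixed seed, splits them into five folds, and for each fold lists which genes keep their GO constraint and which lose it. The identifiers, the seed, and the function name are hypothetical, and how the GO constraint is actually removed for a gene is not specified here.

```python
import random


def make_folds(gene_ids, num_folds=5, seed=0):
    """Shuffle the genes reproducibly and deal them into `num_folds` folds."""
    shuffled = list(gene_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[fold::num_folds] for fold in range(num_folds)]


genes = [f"gene{i}" for i in range(10)]  # hypothetical gene identifiers
folds = make_folds(genes)

# For each fold: genes that keep GO constraints, and the 20% that lose them.
splits = [
    (sorted(set(genes) - set(held_out)), sorted(held_out))
    for held_out in folds
]
```

Each gene loses its GO constraint in exactly one of the five runs, so averaging the per-fold scores covers every gene once.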

After we fix M = 3 and w = 0.8, we compare our POCS-based MGC method with three state-of-the-art MGC methods: k-medoids [10], ICM [12] and FCM [13]. Both k-medoids and ICM first linearly combine the two constraints, and then use the k-medoids and ICM algorithms, respectively, to partition the genes into different clusters. For k-medoids, we empirically set the linear combination weight of the GO constraint to w = 0.9, which produces desirable clustering results in terms of GLL on average. For the ICM algorithm [12], we choose the best recommended parameter w = 0.2, which is biased toward the gene expression constraint. On the other hand, FCM uses GO annotations to initialize X0, and uses both the initial X0 and the gene expression values to update X0 until it converges to a new clustering solution XM. We use the best suggested weight w = 0.8 for FCM [13], which is biased toward the GO constraint for soft clustering.
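To make the role of the weight w concrete, the following sketch shows one simple way to combine two constraints linearly. It assumes that each constraint is expressed as a pairwise dissimilarity matrix of the same shape and that the combination is convex; the exact formulations used in [10] and [12] may differ, and the function name `combine` is hypothetical.

```python
def combine(d_expr, d_go, w):
    """Convex combination of two same-shaped dissimilarity matrices.

    w is the weight of the GO constraint: w = 0 uses expression only,
    w = 1 uses GO only (assumed convention, matching the text above).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [[(1.0 - w) * e + w * g for e, g in zip(row_e, row_g)]
            for row_e, row_g in zip(d_expr, d_go)]

# Toy 2 x 2 matrices (assumption, not data from the paper).
d_expr = [[0.0, 1.0], [1.0, 0.0]]
d_go = [[0.0, 0.5], [0.5, 0.0]]
d_half = combine(d_expr, d_go, 0.5)
```

Under this convention, w = 0.9 for k-medoids leans almost entirely on GO, while w = 0.2 for ICM leans on the expression data.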

Tables 3, 4, 5 and 6 show the average clustering performance and standard deviation in terms of GLL and NMI, based on the soft clustering solution X and the hard clustering solution X*, respectively. POCS produces the highest GLL value among all MGC methods, which means that its soft clustering solution is the most likely to generate both the KEGG and SGD reference clusters. The k-medoids algorithm performs the worst, partly because it easily falls into a locally optimal clustering solution. ICM uses an iterative procedure to find a better clustering solution under the combined constraint, but it is biased toward the unreliable gene expression constraint. FCM performs slightly better than ICM, partly because it is biased toward the more reliable GO constraint. Compared with FCM, POCS significantly increases the GLL value, by around 15% on both the KEGG and SGD reference clusters. Another observation is that the Hughes dataset has the highest GLL value, partly because it contains much longer gene expression profiles than the alpha, cdc15, cdc28 and elu datasets; longer gene expression profiles are more reliable for gene clustering. The NMI values are consistent with the GLL values: if the soft clustering solution has a higher GLL value, the corresponding hard clustering solution obtained by the winner-take-all strategy also has a higher NMI value. Thus, GLL is a suitable performance measure for the soft clustering solution, where a higher GLL value corresponds to a better soft clustering solution. However, the GLL value varies much more than the NMI value, mainly because the soft clustering solution space is larger than that of hard clustering. In some cases, the difference in NMI values between POCS and FCM is small. Thus, we need to examine the statistical significance of the difference in NMI values between POCS and FCM. Table 7 shows the p-values of the pairwise t-test [38] over all five microarray datasets, which indicate that the NMI value of POCS is higher than the corresponding FCM result with a statistical significance of more than 99% for all datasets.
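The winner-take-all conversion and the NMI comparison mentioned above can be sketched as follows. This is an illustration under stated assumptions: the soft solution is taken to be a list of per-gene membership rows, each gene has a single reference label, and NMI is normalized by the geometric mean of the two entropies. The normalization used in the paper is not restated in this section, so that choice is an assumption, as are the function names.

```python
import math
from collections import Counter

def winner_take_all(soft):
    """Assign each gene to the cluster with the largest membership.

    Ties go to the lowest cluster index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in soft]

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Mutual information of a and b, divided by sqrt(H(a) * H(b))."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0

# Toy soft solution for four genes and two clusters (assumption).
soft = [[0.7, 0.3], [0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]
hard = winner_take_all(soft)
reference = [1, 0, 1, 0]   # same partition with permuted labels
score = nmi(hard, reference)
```

Because NMI ignores how the clusters are numbered, a hard solution that matches the reference up to a relabeling scores 1, which is why it is suited to comparing clusterings against reference clusters.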

Table 3. Five-fold cross-validation of the GLL values on KEGG clusters

Table 4. Five-fold cross-validation of the NMI values on KEGG clusters

Table 5. Five-fold cross-validation of the GLL values on SGD clusters

Table 6. Five-fold cross-validation of the NMI values on SGD clusters

Table 7. P-values of pairwise t-test of POCS and FCM
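The test statistic behind p-values such as those in Table 7 can be sketched as follows, assuming that "pairwise t-test" refers to the standard paired t-test on matched NMI values (for example, one pair per cross-validation fold). The function name `paired_t` and the numbers are purely illustrative and are not taken from the tables.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative values only (assumption): scores of two methods on
# the same three folds.
t_stat, dof = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Converting the statistic to a p-value requires the Student t distribution with the returned degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) performs both steps.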

To further confirm the effectiveness of the POCS-based MGC method, we show two clustering examples. First, the gene YPR145W is involved in two KEGG pathways, "Amino acid metabolism" and "Energy metabolism", in Table 1. All other MGC algorithms misclassify this gene into a single cluster, but our POCS algorithm successfully classifies it into two clusters with probabilities 0.7 and 0.3. This example confirms the effectiveness of our method for identifying genes with multiple functions. Second, we examine the gene YJL052W, which is involved in two SGD pathways, "glycolysis" and "gluconeogenesis", in Table 2. We compute the p-values between each gene function in GO and the cluster (alpha dataset when K = 10) containing the gene YJL052W using the Gene Ontology Term Finder (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.pl). We then rank the gene functions according to their p-values, and the top function is assigned to the gene cluster. We find that the top function is "glycolysis" with a p-value of 3.12e-41, which is consistent with one of the SGD pathways in which YJL052W is involved. This example further confirms that the discovered clusters indeed reflect the true biological functions in terms of pathways.
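The ranking step in the second example can be sketched as follows. The sketch assumes a hypergeometric enrichment test, which is a common choice for GO term enrichment; the exact statistic and any multiple-testing correction applied by the Gene Ontology Term Finder are not stated in this section. The function names, term labels and counts are hypothetical.

```python
from math import comb

def enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background, K: background genes with the term,
    n: genes in the cluster,   k: cluster genes with the term.
    """
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def rank_terms(term_counts, N, n):
    """Sort terms by ascending p-value; term_counts maps term -> (K, k)."""
    return sorted((enrichment_p(N, K, n, k), term)
                  for term, (K, k) in term_counts.items())

# Toy counts (assumption): 10 background genes, a cluster of 5 genes.
ranked = rank_terms({"term_A": (5, 5), "term_B": (5, 1)}, N=10, n=5)
top_p, top_term = ranked[0]
```

The term with the smallest p-value is the one whose over-representation in the cluster is least likely to arise by chance, and it is this term that is assigned to the cluster.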

Conclusion

This paper presents a novel MGC method within the generalized POCS framework, which successfully combines two constraints of different nature for gene clustering. In addition, we propose the GLL to measure soft clustering performance. Experimental results of five-fold cross-validation on different microarray datasets show that the POCS-based MGC method is competitive with or superior to other state-of-the-art MGC methods based on the KEGG and SGD reference gene clusters. In the future, we aim to incorporate more constraints, such as DNA sequence features and gene network structures, to further improve gene clustering performance. For example, the structural profiles of DNA sequences play important roles in key genetic processes such as transcription [39], replication [40], protein-DNA recognition [41], and tissue specificity [42]. We may use the similarity between the structural profiles of DNA sequences as a new constraint for gene clustering. We may also develop more efficient supervised learning strategies to automatically determine the weights of the simultaneous projections in Algorithm 1. For example, we may choose decision trees [43] or ensemble learning methods [44] to learn the weights of different constraints from training data, and apply these weights to the clustering of unknown genes for function prediction.

Authors' contributions

JZ developed this methodology, carried out experiments and drafted the manuscript. ZSF and AWL provided useful comments on methodology and helped revise this manuscript. HY initiated the project and participated in project design and helped revise the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Great thanks are due to Xiao-Qin Cao and Xiao-Yu Zhao for their assistance in code implementation. This work is supported by the Hong Kong Research Grants Council (Project CityU 122607). This work is also supported by the National Natural Science Foundation of China (No. 60903076) and the Shanghai Committee of Science and Technology, China (No. 08DZ2271800 and 09DZ2272800).

References

  1. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 1998, 95(25):14863-8.

  2. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281-5.

  3. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci 1999, 96(6):2907-12.

  4. Dembélé D, Kastner P: Fuzzy C-means method for clustering microarray data. Bioinformatics 2003, 19:973-980.

  5. Schliep A, Schönhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19:i255-i263.

  6. Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci 2001, 98(16):8961-5.

  7. Bar-Joseph Z: Analyzing time series gene expression data. Bioinformatics 2004, 20:2493-2503.

  8. Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological networks and gene expression data. Bioinformatics 2002, 18(Suppl 1):S145-54.

  9. Pan W: Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006, 22(7):795-801.

  10. Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 2006, 22(10):1259-1268.

  11. Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A, Mosser J: Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets. BMC Bioinformatics 2006, 7:241.

  12. Shiga M, Takigawa I, Mamitsuka H: Annotating gene function by combining expression data with a modular gene network. Bioinformatics 2007, 23(13):i468-i478.

  13. Tari L, Baral C, Kim S: Fuzzy c-means clustering with prior biological knowledge. J Biomed Inform 2009, 42:74-81.

  14. Tritchler D, Parkhomenko E, Beyene J: Filtering genes for cluster and network analysis. BMC Bioinformatics 2009, 10:193.

  15. Zhu S, Zeng J, Mamitsuka H: Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics 2009, 25(15):1944-1951.

  16. Zhu S, Takigawa I, Zeng J, Mamitsuka H: Field independent probabilistic model for clustering multi-field documents. Information Processing & Management 2009, 45:555-570.

  17. Stark H, Yang Y: Vector space projections: a numerical approach to signal and image processing, neural nets, and optics. New York: Wiley; 1998.

  18. Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res 2006, 34:1608-1619.

  19. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23:1274-1281.

  20. Zeng J, Liu ZQ: Markov Random Field-based Statistical Character Structure Modeling for Handwritten Chinese Character Recognition. IEEE Trans Pattern Anal Mach Intell 2008, 30(5):767-780.

  21. Zeng J, Liu ZQ: Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition. IEEE Trans Fuzzy Syst 2008, 16(3):747-760.

  22. Feng W, Liu ZQ: Region-Level Image Authentication Using Bayesian Structural Content Abstraction. IEEE Trans Image Process 2008, 17(12):2413-2424.

  23. Zeng J, Feng W, Xie L, Liu ZQ: Cascade Markov random fields for stroke extraction of Chinese characters. Inf Sci 2010, 180:301-311.

  24. Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. J Mach Learn Res 2003, 3(4-5):993-1022.

  25. Zeng J, Liu ZQ: Type-2 Fuzzy Hidden Markov Models and Their Application to Speech Recognition. IEEE Trans Fuzzy Syst 2006, 14(3):454-467.

  26. Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci 2002, 99:9121-9126.

  27. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19:1275-1283.

  28. Adryan B, Schuh R: Gene-Ontology-based clustering of gene expression data. Bioinformatics 2004, 20:2851-2852.

  29. Bolshakova N, Azuaje F, Cunningham P: A knowledge-driven approach to cluster validity assessment. Bioinformatics 2005, 21:2546-2547.

  30. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22:967-973.

  31. Wolting C, McGlade CJ, Tritchler D: Cluster analysis of protein array results via similarity of Gene Ontology annotation. BMC Bioinformatics 2006, 7:338.

  32. Steuer R, Humburg P, Selbig J: Validation and functional annotation of expression-based clusters based on gene ontology. BMC Bioinformatics 2006, 7:380.

  33. Schlicker A, Domingues FS, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7:302.

  34. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinform 2005, 2:330-338.

  35. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9:3273-3297.

  36. Hughes TR, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102:109-26.

  37. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34:D354-D357.

  38. Kreyszig E: Introductory Mathematical Statistics. New York: John Wiley & Sons; 1970.

  39. Cao XQ, Zeng J, Yan H: Structural property of regulatory elements in human promoters. Phys Rev E 2008, 77:041908.

  40. Cao XQ, Zeng J, Yan H: Structural properties of replication origins in yeast DNA sequences. Phys Biol 2008, 5:036012.

  41. Cao XQ, Zeng J, Yan H: Physical signals for protein-DNA recognition. Phys Biol 2009, 6:036012.

  42. Zeng J, Cao XQ, Zhao H, Yan H: Finding human promoter groups based on DNA physical properties. Phys Rev E 2009, 80:041917.

  43. Zeng J, Zhao XY, Cao XQ, Yan H: SCS: Signal, context and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform 2010, in press.

  44. Zeng J, Zhu S, Yan H: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 2009, 10(5):498-508.