Abstract
Background
Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem.
Results
We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints from different nature without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCSbased gene clustering method outperforms current stateoftheart MGC methods.
Conclusions
The POCSbased MGC method can successfully combine multiple constraints from different nature for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions.
Background
Computational annotating gene functions is a fundamental issue in bioinformatics. Microarray gene expression data have been used widely to study the cell cycle system, genetic regulatory interactions, development at the molecular level, and genes that act in response to a certain infectious disease. To determine gene functions, a basic approach is gene clustering using gene expression data based on the assumption that genes with similar expression patterns should share similar functions in the process. Typical gene clustering methods include hierarchical clustering [1], the kmeans algorithm [2], selforganizing maps [3], the fuzzy cmeans algorithm [4], and hidden Markov models [5]. However, gene clustering regularized by only single constraint of gene expression is not enough to obtain biologically reliable clusters, because microarray data are often noisy, contain missing values, and have uncertain temporal dependencies in timeseries data [6,7]. Therefore, other constraints besides gene expression data should be incorporated for the robust and reliable gene clustering.
Recent multiconstrained gene clustering (MGC) methods have attracted much more interests [813]. The basic idea is that multiple constraints such as Gene Ontology (GO) and metabolic network structures can prevent gene clustering from falling into the locally optimal solution space constrained by noisy gene expression data alone. One key problem is how to combine multiple pieces of constraints to find a consistent clustering solution. Current MGC methods adopt a linear combination strategy to integrate multiple constraints of the same nature into a single new constraint, so that standard clustering algorithms for singleconstrained gene clustering problems can be used, e.g., hierarchical clustering [8], Gaussian mixture models [9], kmedoids [10], and iterative conditional modes (ICM) for Markov random fields [12]. More specifically, they build a distance matrix of gene expression data as the first constraint, and then build another distance matrix based on either metabolic pathway [8,12,14] or GO annotations [9,10] as the second constraint. These two constraints of distance matrices are added linearly to form the new distance matrix for gene clustering. This linear combination strategy has also been used to incorporate different constraints in document clustering [15,16]. Despite good clustering performance, there are two major problems yet to be solved. The first is that these MGC methods can only combine constraints of the same nature, i.e., all constraints have to be represented as distance matrices. If one constraint is a similarity matrix, we need to transform it into a distance matrix so that we can add it up to other distance matrices. Such transformation may distort the original constraint with information loss. Even if we have two distance matrices, the distance values may be in different scales and cannot be added directly. The second problem lies in the linear combination of the constraint matrices. In most cases, the desired combined constraint does not necessarily have a simple linear relationship with all other original constraints. In addition, the weights for the linear combination often need a reasonable justification in practice. Another MGC strategy is the GOguided fuzzy cmeans (FCM) algorithm [13], which uses GO annotations to initialize and update the cluster probability of each gene.
To overcome above problems, we propose a novel MGC method within the generalized projection framework, which is a generalization of the projection onto convex sets (POCS) technique, which has found many applications in image reconstruction [17] and microarray missing value imputation [18]. Theoretically, POCS provides a flexible framework to integrate multiple pieces of constraints for an optimal solution. It first transforms each constraint into a corresponding convex set, and then uses an iteratively convergent procedure to find a solution in the intersection of all sets. POCS can integrate constraints from different nature such as different similarity matrices. Indeed, it often handles different constraints in frequency and spatial domains in image reconstruction problems. Another advantage is that the original constraints remain intact. The clustering result is projected onto the solution set that satisfies each constraint iteratively and the final result may lie in the intersection set that satisfies a nonlinear combination of the original constraints. Without loss of generality, in this paper we consider two major types of constraints: the gene expression similarity [8] and the GObased semantic similarity [19]. POCS produces a regularized clustering result that may be more reliable than those solely dependent on either the gene expression similarity or the GO semantic similarity due to the fact that expression data are often short and noisy, while GO terms may be inaccurate and misannotated. Because in most cases the solution set is nonconvex, we adopt the generalized projections similar to the POCS procedure. To minimizes the distance between the candidate solution and the constraint set, we design the generalized projector based on a method similar to the relaxation labeling (RL) algorithm [20,21], which has been used for the approximate inference for Markov random fields [22,23]
Usually genes have multiple functions and can be assigned into more than one group. Traditional gene clustering algorithms often use a hard clustering strategy that assigns genes into only one group. Recent MGC methods relax this limitation and allows genes to be assigned into several groups [9,10,13]. To take this situation into account, we use a soft clustering strategy in which genes are assigned to all clusters with different probabilities. Based on soft clustering results, we propose a new performance measure "gene log likelihood" (GLL) to measure the distance between the predicted clustering result and the reference clusters. This measure has also been widely applied to evaluating word clustering performance in topic modeling problems [24]. To confirm the effectiveness, we evaluate the POCSbased MGC method on the yeast gene expression dataset, and compare the clustering results with recent MGC methods such as kmedoids [10], ICM [12] and FCM [13]. Experimental results demonstrate that the POCSbased MGC can enhance the overall clustering performance by a large margin.
This paper is organized as follows. In the next section we propose the POCSbased MGC method and the RLbased generalized projector to minimize the distance between clustering solution to the corresponding constrained solution set. To account for genes in multiple clusters, we also propose GLL for calculating the distance between the predicted soft clustering results and the reference gene clusters. The result section shows comparative experimental results on different yeast expression datasets. The POCSbased MGC algorithm always converges to the optimal solution in practice. Finally, we draw conclusions and envision future work.
Methods
Gene clustering is a labeling problem, in which a set of cluster labels are assigned to genes for annotating gene functions. Given I genes and K clusters, the soft clustering solution is a matrix X = (x_{ik}), 1 ≤ i ≤ I, 1 ≤ k ≤ K, where x_{ik}∈ [0, 1] and Σ_{k }x_{ik }= 1. The element x_{ik }is the probability that the ith gene is associated with the kth cluster label. For each gene we use a probability vector x_{i }= (x_{i1}, . . ., x_{ik}, . . ., x_{iK}) to represent its cluster labeling configuration. From this perspective, the clustering solution X is the cluster labeling configuration of I genes over K clusters. We may also use the winnertakeall strategy to figure out the hard clustering solution X*, in which the ith gene belongs to only one cluster with the highest probability, i.e., k* = arg max_{k }x_{ik }and x_{ik* }= 1.
Gene expression constraint
Based on microarray gene expression profiles, we can build the first constraint using the similarity matrix for gene clustering. The metric can be the Pearson's correlation coefficient and Euclidean distance [810], or the more complex type2 fuzzy hidden Markov modelbased sequence similarity [25]. Because the Pearson's correlation coefficient is suitable for timeseries gene expression data [26], we adopt it for calculating the similarity between two genes' logratio transformed profiles [8], i.e., the logarithm of the ratio between each sample point in the profile and a control measurement. More specifically, given two genes' transformed profiles g_{i}(m) and g_{i'}(m) in length M, the correlation coefficient v_{ii' }is
where μ_{i }and σ_{i }denote mean and standard deviation of the transformed profile of the ith gene respectively. The correlation coefficient value ∈ [1, 1], where the higher value corresponds to the higher similarity between two genes' profiles. Here we consider the anticorrelated gens as most dissimilar because the correlated genes often involve in similar reaction steps and share similar functions. Therefore, the Pearson's correlation coefficient matrix constrains the first clustering solution set C_{1 }= {X_{e}}, which contains many locally optimal clustering solutions satisfying .
GO constraint
As an important source of biological knowledge, the Gene Ontology (GO) provides a consistent description of genes and gene products by a controlled and structured vocabulary, which includes three major categories: biological process (BP), molecular function (MF), and cellular component (CC). The GO terms are organized in the form of a directed acyclic graph (DAG) with two major semantic relations such as "isa" and "partof", where "A isa B" means A is a subclass of B, and "C partof D" means C is always part of D. Generally, simply identifying the shared GO annotations of gene products for their functional relationship has the following limitations. First, two quite different GO annotations can be closely related through their common ancestors in the DAG so as to have a higher semantic similarity. Second, the shared GO terms may be too general to describe the functional association of annotated gene products. Recently, the GObased semantic similarity measures have been applied to searching semantically similar proteins [27], clustering gene expression data and assessing cluster validity [19,28,29], developing new human regulatory pathway modeling tools [30], validating protein interaction data [31], validating functional annotation of expressionbased clusters [32], and enabling the identification of functionally related gene products independent of homology [33].
The GObased semantic similarity measures assume that the more information two GO terms share, the more similar they are. In this paper we adopt a recent GObased semantic measure proposed by Wang et al. [19], in which the similarity between two GO terms S_{GO}(c_{m}, c_{n}) is calculated according to the graph structural information encoded in the GO. This semantic measure between annotated GO terms for genes has been demonstrated to be better than the classic Resnik's measure in clustering gene products. If c is a GO term, is the set of GO terms including term c and all its ancestors, and E_{c }is the set of edges connecting all terms in , the Svalue of any term t in the graph DAG_{c }= (c, , E_{c}) related to the term c, S_{c}(t), is defined as,
where w_{e }is the semantic contribution factor for edge e ∈ E_{c }linking the term t with its child term t'. Here we use w_{e }= 0.8 for "isa" relation and w_{e }= 0.6 for "partof" relation as suggested in [19]. After obtaining all Svalues for all terms in the DAG_{c}, the semantic value of the term c, SV (c), is
Given two GO terms c_{1 }and c_{2 }as well as their graphs and , the semantic similarity S_{GO}(c_{1}, c_{2}) is
where (t) is the Svalue of GO term t related to term c_{1}, and (t) is the Svalue of GO term t related to term c_{2}. One gene may be annotated by many GO terms. Given two genes annotated by several GO terms, GO_{i }= {c_{i1}, . . ., c_{im}, . . ., c_{iM}} and GO_{i' }= {c_{i'1}, . . ., c_{i'n}, . . ., c_{i'N}}, the functional similarity between genes,
Note that the functional similarity between two GO term sets GO_{i }and G_{Oi' }considers the hierarchical structure of GO terms c based on the Svalue. Because the GO contains three main vocabularies, BP, MF and CC, the GO similarity value between genes can be calculated in a joint manner as
where BPsim, MFsim and CCsim denote the similarity values of the corresponding GO terms within the same type. The similarity value ∈ [0, 1], where the higher value corresponds to the higher similarity. As a result, the GObased semantic similarity constrains the second clustering solution set C_{2 }= {X_{g}}, which contains many locally optimal clustering solutions satisfying .
Generalized projections
Although the gene expression and GObased semantic similarity may achieve a clustering solution with a high correlation, there is still a large amount of complementary information between their final clustering results [34]. Both gene expression and GO constrained solution sets C_{1 }= {X_{e}} and C_{2 }= {X_{g}} may not contain a single globally optimal solution, and even they contain such a solution, we are unlikely able to find it since the optimization procedures are highly nonlinear. So, we consider C_{1 }and C_{2 }as sets of all locally optimal solutions under different constraints. When both constraints are satisfied, we eliminate many unreasonable locally optimal solutions and obtain an improved clustering performance. Our objective is to find the biologically consistent clustering solution X^{†}∈ C_{1 }∩ C_{2 }using the POCS procedure [17]. Note that direct adding two constraints and based on the weight w ∈ [0, 1], i.e., , to produce the new constraint for gene clustering is not suitable because the constraints are from different nature. In contrast, the POCS framework decomposes the optimization procedure into different projections and solves the problem efficiently.
input: X_{0}, P_{n}, w_{n}, 1 ≤ n ≤ N, M.
output: X_{M}.
begin
for m ← 1 to M do
;
// P_{n}X_{m1 }is described in Algorithm 2.
end
end
Algorithm 1: The simultaneous projection.
Within the POCS framework [17], each constraint on the solution is formulated as a corresponding closed convex set, C_{n}, 1 ≤ n ≤ N, in the Hilbert space H. The optimal solution X^{† }is included in the intersection set C_{0 }of all convex sets C_{n},
If C_{0 }is nonempty in Figure 1A, the successive projections onto the convex sets,
Figure 1. (A) The consistent problem in Eq. (2), where the intersection set C_{0 }is nonempty. The circle is the initial solution. The thick black point is the consistent solution in the intersection of two sets for gene expression and GO constraints, respectively. POCS ensures that the initial solution will converge to the consistent solution after enough projections represented by the arrows. (B) The inconsistent problem in Eq. (4), where the intersection set C_{0 }is empty. After enough simultaneous projections represented by the arrows, the thick black dot is the approximate solution such that a weighted set distance from gene expression and GO constraints is minimized.
will converge to a consistent solution in C_{0 }for any random initial value X_{0}, where X_{m}, 1 ≤ m ≤ M is the solution at the mth iteration. Eq. (2) shows that the current solution X_{m1 }is projected to each set or constraint C_{n}, 1 ≤ n ≤ N through the projector P_{n }successively in order to find the next better solution X_{m }until it converges to the consistent solution X^{† }in the intersection of all sets. Figure 1A shows the projection process for the consistent problem in Eq. (2), where the thick black dot represents a consistent solution in the intersection of two sets C_{1 }and C_{2 }for the gene expression and GO constraints, respectively. The generalized projector P_{n }transforms X_{m1 }into a solution within the set C_{n }that minimizes the distance between X_{m1 }and ,
where ·  denotes the norm in the Hilbert space H. Indeed, Eq. (3) indicates that we need to transforms the current clustering solution X_{m1 }into a more suitable clustering solution based on the similarity or distance matrix for the set C_{n}. If C_{0 }is empty in Figure 1B, the POCS algorithm uses simultaneous projections,
where w_{n }is the weight on the projections satisfying and w_{n }≥ 0 for all n. The simultaneous projections converge weakly to a solution such that a weighted set distance function is minimized. Note that the simultaneous projections only linearly combine the solutions projected onto all constraint sets, which is more reasonable than the strategy that linearly combines constraints and then finds a solution under the new constraint. Figure 1B shows the simultaneous projections for the inconsistent problem in Eq. (4), where the thick black dot is an approximately best solution minimizing the weighted set distance from gene expression constraint C_{1 }and GO constraint C_{2}, respectively.
In practice, both C_{1 }and C_{2 }are often nonconvex. A set is convex if and only if λX_{a }+ (1  λ)X_{b }is in the set when X_{a }and X_{b }are in the set for 0 ≤ λ ≤ 1. The constraint sets contain many locally optimal clustering "solutions" and the interpolation of the solutions, i.e., the weighted sum λX_{a }+ (1 λ)X_{b}, has no mathematical meaning. Thus, we cannot use the classic POCS procedure (2). Nevertheless, we can still use the generalized projections (3) to solve the problem within the POCS framework [[17], Chapter 5], which do not require the sets be convex. In practice it is difficult to minimize the distance functions (3) under both constraints at the same time, so we do it iteratively based on generalized projections. The generalized projector iteratively minimizes the distance function (3), and will terminate if the distance in the next step cannot decrease. From the regularization point of view, the solution is regularized under different constraints simultaneously, and the final solution is a linear combination of each regularized solution in Eq. (4). The simultaneous projection weights w_{n }can be fixed empirically according to prior knowledge. To summarize, Algorithm 1 shows the simultaneous projection algorithm.
input: , 1 ≤ i, i' ≤ I, 1 ≤ k ≤ K, J.
output: , 1 ≤ i ≤ I, 1 ≤ k ≤ K.
begin
for j ← 1 to J do
for i ← 1 to I do
for k ← 1 to K do
;
;
end
end
end
end
Algorithm 2: The relaxation labeling projector.
Now we design the generalized projector based on the iterative RL algorithm [20,21,23], which can find the soft cluster label for each gene under a certain constraint. Given the clustering solution X and the constraint , minimizing (3) is equivalent to maximizing the corresponding gain function,
where i' ∈ ∂_{i }is a set of neighbors of the ith gene, and the term exp() increases with the similarity between two genes according to the constraint . The neighborhood system ∂_{i }is defined as the ten nearest genes i' with top similarity values . The term encourages that if the genes have a high similarity value they also have a high similarity value in soft cluster labeling configurations. The RL algorithm iteratively updates the initial X^{1 }by the gradient of the gain function (5) until j reaches the fixed maximum number J as shown in Algorithm 2. The value of J is determined experimentally to ensure that the gain function is maximized. That is, after J iterations, the RL algorithm converges to the local maximum of the gain function in terms of X^{J}. In the meanwhile, the distance function (3) is also minimized by X^{J}, where X^{J }is equivalent to in (3). Algorithm 2 shows the projection of X^{1 }satisfying one constraint . Note that J is the number of iterations of the RLbased projector in Algorithm 2, while M is the number of iterations in the simultaneous projection in Algorithm 1. The RLbased projector is a fast algorithm and practically J = 5 is enough.
Gene log likelihood
If we have a reference gene clustering solution Y, we can calculate the distance between the predicted clustering solution X and the standard reference Y for the performance evaluation. The reference clustering solution is a matrix, Y = (y_{iw}), 1 ≤ i ≤ I, 1 ≤ w ≤ W, where y_{iw }= 1 denotes that the ith gene belongs to the wth cluster. The number of reference clusters W may not equal to the predicted number of clusters K in most cases. Because a gene may belong to multiple clusters due to multiple functions, the vector y_{i }= (y_{i1}, . . ., y_{iw}, . . ., y_{iW}) may contain multiple ones for the ith gene.
Based on the hard clustering solution X*, we may quantify the distance between X* and Y by normalized mutual information (NMI), which has been widely used in a lot of applications to measure the performance of clustering methods [12,19]. In information theory, the mutual information is defined as a quantity to measure the amount of information shared between two random variables. If one set of clusters is more consistent with the other set of clusters, the mutual information between two sets of cluster labels becomes larger. Generally, the mutual information is normalized because the range of the mutual information measures depends on the size of given sets of clusters. NMI is calculated as
where I is the number of genes, n_{w }is the number of genes in the wth reference cluster, n_{k }is the number of genes in the kth reference cluster, and n_{wk }is the number of genes in both wth reference cluster and kth predicted cluster. If two sets of clusters are identical, NMI between them reaches the maximum value of one.
However, NMI cannot be used if one gene may be in multiple clusters. So, we propose a new performance measure referred to as gene log likelihood (GLL) log P(YX) for gene clustering, which measures the likelihood in predicting a single gene in the reference cluster Y based on X. GLL has a simple meaning that the ith gene in the wth reference cluster Y is predicted with a likelihood proportional to the product of the likelihood that the wth cluster is generated by the kth cluster and the likelihood that the ith gene is generated by the kth cluster in X. Higher values are better, indicating the obtained clustering solution X has a higher likelihood to generate the reference gene clusters Y. Specifically we calculate GLL as follows,
where x_{i }= (x_{i1}, . . ., x_{ik}, . . ., x_{iK}) is the probability distribution over K clusters of the ith gene, i ∈ w denotes the set of all genes in the wth reference cluster with y_{iw }= 1, and p_{w }= (p_{w1}, . . ., p_{wk}, . . ., p_{wK}) is the probability distribution of the wth reference cluster over K predicted clusters. Empirically, this probability p_{wk }can be estimated by
where we assume that the genes are conditionally independent in the generative process. Indeed, this is a standard performance measure for word clustering in the text mining [24], which indicates the empirical likelihood in predicting a single word in a document.
Results and Discussion
Datasets
To calculate the gene expression constraint, we select four microarray timeseries datasets [35], monitoring genomewide mRNA levels for 6178 yeast Saccharomyces cerevisiae open reading frames simultaneously using several different methods of synchronization including four datasets: alpha, cdc15, cdc28 and elu datasets. Also we add the Hughes dataset [36] widely used in gene clustering [9,10], because it contains 300 time points while a small number of missing values. The missing values in the microarray data are interpolated by the POCSbased reconstruction method [18], which uses multiple constraints such as synchronization loss. To calculate the GO constraint, the GO (version 20080225) and annotation (version 1.1384) databases of yeast are downloaded from the GO official website. The yeast annotation file includes 6345 gene products annotated with 77152 GO terms.
To evaluate MGC methods for gene clustering, we generate two different sets of reference gene clusters with true cluster labels from KEGG [37] and SGD (Saccharomyces Genome Database) http://www.yeastgenome.org/ webcite referred to as KEGG clusters [12] and SGD clusters [19], respectively. The KEGG pathway maps are generally classified into six major categories including metabolism. We use ten subcategories under the metabolism category as KEGG clusters, which includes a total of 531 genes. Note that a gene can be in more than one cluster. Table 1 lists the KEGG clusters and the number of genes in the corresponding cluster. We also use the gene annotation and classification information in yeast biochemical pathways as SGD clusters. There are 142 pathways involved with 835 genes, among which only 26 pathways contain more than 10 genes, where a gene can be in more than one pathway. Table 2 summarizes the list of pathway clusters and the number of genes in the corresponding cluster. The reason why we use two different sets of reference clusters lie in the fact that gene clusters are variable depending on the different partitioning criteria. If the predicted clusters by the POCSbased method are close to both reference clusters, we may make a safe conclusion that this method is robust to annotate gene functions under different conditions.
Table 1. 10 reference gene clusters from KEGG
Table 2. 26 reference gene clusters from yeast biochemical pathways
Comparative results
The POCSbased MGC method requires two key parameters, the number of simultaneous projections M and the weight on projections w_{n}, in Algorithm 1. Because we have two constraints, the weight for the GObased constraint is w, and thus the weight for the gene expression constraint is 1  w. Through experiments on the alpha dataset, we can determine proper M and w for desirable gene clustering performance. The parameters M and w are adjusted so that we can obtain the desirable result within the POCS framework. It is possible that another iterative method can estimate the parameters better. However, in many cases, such a betterperforming method is a supervised learning procedure using reference gene clusters, and can be incorporated into the POCS procedure to achieve an even better performance or robustness. That is, POCS is useful for combining information from different sources if we can formulate corresponding constraint sets and projections.
To determine M, we randomly initialize the clustering solution, and the weight w = 0.5. Figure 2 shows the GLL values on the KEGG and SGD reference clusters when 10 projections are used. From different number of clusters K = 10, 15, 20, 25, we see that all GLL values do not increase significantly after two or three projections. So, we believe that M = 3 is enough to produce desirable clustering results in this task. From this experiment, we also see that Algorithm 1 converges quickly after a few projections. Then, we fix M = 3 and tune the weight w ∈ [0, 1]. By using M = 3 projections in practice, POCS does not increase the computational cost very much, which makes this algorithm very attractive in combining more constraints for gene clustering.
Figure 2. GLL of the alpha dataset on the KEGG and SGD when 10 projections are used.
Figure 3 shows the GLL values on the KEGG and SGD reference clusters by increasing the weight at the step 0.1. We observe that the performance highly depends on different projection weights. If we use KEGG reference clusters, we find that weight w = 0.7 can produce higher GLL value on average. The gene expression constraint alone w = 0 does not ensure the best clustering result, while the GO constraint alone w = 1 does not ensure the best clustering result either. We see that the GO constraint can produce more reliable clustering result than the gene expression constraint, because the GO annotation is based on prior knowledge of biologists more reliable than gene expression data. Furthermore, we often assume that anticorrelated genes are not within the same cluster, but in some cases this assumption is not true. However, when the weight w increases, the final performance does not always increase and w = 0.5 produces a local minimum of the GLL value. After that, the GLL value continue to increase to the next local maximum of the GLL value. The SGD reference cluster reconfirms that the GObased constraint is more reliable. The best clustering performance occurs often when w = 0.9 on average. Therefore, we adopt the weight w = 0.8 for the simultaneous projection in all our experiments.
Figure 3. GLL of the alpha dataset on the KEGG and SGD when different weights w are used.
As far as Figure 3 is concerned, one major reason why GO information is more reliable for clustering is that the reference gene clusters from KEGG and SGD (Tables 1 and 2) are partly correlated with GO annotations. Therefore, we need to delete a certain fraction of GO annotations when perform clustering, and use only the gene expression constraint to predict the new gene functions compared with reference gene clusters. In this paper, we adopt the crossvalidation procedure [10] to validate the POCSbased MGC method. More specifically, we perform a fivefold crossvalidation by deleting 20% GO constraints from the datasets in turn. We shall examine whether the POCSbased MGC clustering method can predict the functions for those 20% genes without GO constraints as compared to reference KEGG and SGD gene clusters. We repot the average prediction performance for the fivefold crossvalidation.
After we fix M = 3 and w = 0.8, we compare our POCSbased MGC method with three stateoftheart MGC methods: kmedoids [10], ICM [12] and FCM [13]. Both kmedoids and ICM first linearly combine two constraints and , and then use the ICM and kmedoids algorithms to partition the genes into different clusters. We empirically determine the linear combination weight of the GO constraint w = 0.9 for kmedoids, which can produce the desirable clustering results in terms of GLL on average. For the ICM algorithm [12], we choose the best recommended parameter w = 0.2, which is biased toward the gene expression constraint. On the other hand, FCM uses GO annotations to initialize X_{0}, and uses both initial X_{0 }and gene expression values to update X_{0 }until it converges to a new clustering solution X_{M }. We use the best suggested weight w = 0.8 for FCM [13], which is biased toward the GO constraint for soft clustering.
Tables 3, 4, 5 and 6 show the average clustering performance and standard deviation in terms of GLL and NMI based on soft clustering solution X and the hard clustering solution X*, respectively. We see that the POCS produces the highest GLL value among all MGC methods, which means that its soft clustering solution is the most likely to generate both KEGG and SGD reference clusters. The kmedoids algorithm performs the worst, partly because it is easy to fall into the local optimal clustering solution. ICM uses an iterative procedure to find a better clustering solution by the combined constraint, but it is biased to the unreliable gene expression constraint. FCM performs slightly better than ICM partly because it is biased to the more reliable GO constraint. Compared with FCM, POCS significantly increases the GLL value around 15% on both KEGG and SGD reference clusters. Another observation is that the Hughes dataset has the highest GLL value, partly because it contains much longer gene expression profiles than alpha, cdc15, cdc28 and elu datasets. The longer gene expression profiles are more reliable for gene clustering. The NMI values are consistent with the GLL values, where if the soft clustering solution has a higher GLL value the corresponding hard clustering solution by the winnertakeall strategy also has a higher NMI value. Thus, the performance measure GLL can best account for this soft clustering solution, where the higher GLL value corresponds to better soft clustering solution. However, we observe that the GLL value varies much more than the NMI value, mainly because the soft clustering solution space is larger than that of the hard clustering. In some cases, the difference of NMI values between POCS and FCM is not significant. Thus, we need to examine the statistical significance in the difference of NMI values between POCS and FCM. Table 7 shows the pvalues of pairwise ttest [38] over all five microarray datasets, which indicates that the NMI value of POCS is higher than the corresponding FCM results with a statistical significance of more than 99% for all datasets.
Table 3. Fivefold crossvalidation of the GLL values on KEGG clusters
Table 4. Fivefold crossvalidation of the NMI values on KEGG clusters
Table 5. Fivefold crossvalidation of the GLL values on SGD clusters
Table 6. Fivefold crossvalidation of the NMI values on SGD clusters
Table 7. Pvalues of pairwise ttest of POCS and FCM
To further confirm the effectiveness of POCSbased MGC method, we show two clustering examples. First, the gene YPR145W involves two KEGG pathways "Amino acid metabolism" and "Energy metabolism" in Table 1. All other MGC algorithms misclassify this gene into a single cluster, but our POCS algorithm successfully classify it into two clusters with probabilities 0.7 and 0.3. This example confirms the effectiveness of our method for identifying genes in multiple functions. Second, we examine the gene YJL052W involving two SGD pathways "glycolysis" and "gluconeogenesis" in Table 2. We compute the pvalues between each gene function in GO and the cluster (alpha dataset when K = 10) containing the gene YJL052W using Gene Ontology Term Finder http://db.yeastgenome.org/cgibin/GO/goTermFinder.pl webcite. We then rank the gene functions according to their pvalues, and the top function is assigned to the gene cluster. We find that the top function is "glycolysis" with the pvalue 3.12e  41, which is consistent with one of SGD pathways in which YJL052W involves. This example further confirms that the discovered clusters indeed reflect the true biological functions in terms of pathways.
Conclusion
This paper presents a novel MGC method within the generalized POCS framework, which successfully combines two constraints from different nature for gene clustering. In addition, we also propose the GLL to measure the soft clustering performance. Experimental results of fivefold crossvalidation on different microarray datasets show that the POCSbased MGC method is competitive or superior to other stateoftheart MGC methods based on KEGG and SGD reference gene clusters. In the future, we aim to incorporate more constraints such as DNA sequence features and gene network structures to improve gene clustering performance further. For example, the structural profiles of DNA sequences play important roles in key genetic processes such as transcription [39], replication [40], proteinDNA recognition [41], and tissue specificity [42]. We may use the similarity between structural profiles of DNA sequences as a new constraint for gene clustering. On the other hand, we may also develop more efficient supervised learning strategies to automatically determine the weights of simultaneous projections in Algorithm 1. For example, we may choose decision trees [43] or ensemble learning methods [44] to learn the weights of different constraints from training data, and apply these weights to clustering unknown genes for function prediction.
Authors' contributions
JZ developed this methodology, carried out experiments and drafted the manuscript. ZSF and AWL provided useful comments on methodology and helped revise this manuscript. HY initiated the project and participated in project design and helped revise the manuscript. All authors read and approved the final manuscript.
Acknowledgements
Great thanks are due to XiaoQin Cao and XiaoYu Zhao for their assistance in code implementation. This work is supported by the Hong Kong Research Grant Council (Project CityU 122607). This work is also supported by the National Nature Science Foundation of China (No. 60903076) and the Shanghai Committee of Science and Technology, China (No. 08DZ2271800 and 09DZ2272800).
References

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.

Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation.

Dembélé D, Kastner P: Fuzzy Cmeans method for clustering microarray data.

Schliep A, Schönhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data.

Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments.

Hanisch D, Zien A, Zimmer R, Lengauer T: Coclustering of biological networks and gene expression data.

Pan W: Incorporating gene functions as priors in modelbased clustering of microarray gene expression data.

Huang D, Pan W: Incorporating biological knowledge into distancebased clustering analysis of microarray gene expression data.

Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A, Mosser J: Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets.

Shiga M, Takigawa I, Mamitsuka H: Annotating gene function by combining expression data with a modular gene network.

Tari L, Baral C, Kim S: Fuzzy cmeans clustering with prior biological knowledge.

Tritchler D, Parkhomenko E, Beyene J: Filtering genes for cluster and network analysis.

Zhu S, Zeng J, Mamitsuka H: Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity.

Zhu S, Takigawa I, Zeng J, Mamitsuka H: Field independent probabilistic model for clustering multifield documents.

Stark H, Yang Y: Vector space projections: a numerical approach to signal and image processing, neural nets, and optics. New York: Wiley; 1998.

Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge.

Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms.

Zeng J, Liu ZQ: Markov Random Fieldbased Statistical Character Structure Modeling for Handwritten Chinese Character Recognition.

Zeng J, Liu ZQ: Type2 fuzzy Markov random fields and their application to handwritten Chinese character recognition.

Feng W, Liu ZQ: RegionLevel Image Authentication Using Bayesian Structural Content Abstraction.

Zeng J, Feng W, Xie L, Liu ZQ: Cascade Markov random fields for stroke extraction of Chinese characters.

Zeng J, Liu ZQ: Type2 Fuzzy Hidden Markov Models and Their Application to Speech Recognition.

Ramoni MF, Sebastianidagger P, Kohane IS: Cluster analysis of gene expression dynamics.

Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.

Adryan B, Schuh R: GeneOntologybased clustering of gene expression data.

Bolshakova N, Azuaje F, Cunningham P: A knowledgedriven approach to cluster validity assessment.

Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways.

Wolting C, McGlade CJ, Tritchler D: Cluster analysis of protein array results via similarity of Gene Ontology annotation.

Steuer R, Humburg P, Selbig J: Validation and functional annotation of expressionbased clusters based on gene ontology.

Schlicker A, Domingues FS, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology.

Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, MartinezCruz LA, Corrales FJ, Rubio A: Correlation between gene expression and GO semantic similarity.

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

Hughes TR, et al.: Functional discovery via a compendium of expression profiles.

Kanehisa M, Goto S, Hattori M, AokiKinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG.

Kreyszig E: Introductory Mathematical Statistics. New York: John Wiley & Sons; 1970.

Cao XQ, Zeng J, Yan H: Structural property of regulatory elements in human promoters.

Cao XQ, Zeng J, Yan H: Structural properties of replication origins in yeast DNA sequences.

Cao XQ, Zeng J, Yan H: Physical signals for proteinDNA recognition. Phys.

Zeng J, Cao XQ, Zhao H, Yan H: Finding human promoter groups based on DNA physical properties.

Zeng J, Zhao XY, Cao XQ, Yan H: SCS: Signal, context and structure features for genomewide human promoter recognition.

Zeng J, Zhu S, Yan H: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods.