
Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization

Abstract

Background

DNA microarray technology allows the measurement of expression levels of thousands of genes under tens to hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Thus, biclustering, which clusters genes and conditions simultaneously, is preferred over traditional clustering techniques for discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster described by the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.

Results

We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as the pCluster algorithm. By relaxing the homogeneity constraint, the proposed algorithm has polynomial-time complexity in the worst case instead of the exponential-time complexity of the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. A comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.

Conclusion

We have proposed a novel biclustering algorithm which works with PC plots for interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables optimum parameter determination in the biclustering algorithm so as to achieve the best result. In the future, we will extend the proposed algorithm to other bicluster models such as the coherent-evolution model.

Background

Gene expression matrix

Data from microarray experiments [2, 3] is frequently given as a large matrix showing expression levels of genes (rows) under different experimental conditions (columns). The so-called gene expression data can thus be written as a matrix of size m × n, where m is the number of genes and n is the number of experimental conditions. Typically m is much greater than n. For example, (m, n) is (6220, 15) and (4026, 47) respectively for the time-series yeast samples [4] and lymphoma specimens [5]. One of the challenges in microarray data analysis is to identify groupings of genes with similar behaviours/functions. Several clustering algorithms have been applied to DNA gene expression data to identify biologically relevant groupings based on similarity in expression profiles [6–10]. However, traditional clustering techniques are global in nature: expression patterns are grouped along either the entire row or the entire column [1, 11]. This implies that one would find groupings of genes that express similarly under all conditions, or groupings of conditions under which all genes exhibit similar behaviour. In practice, however, only a subset of genes is highly correlated under a subset of conditions. This requires simultaneous clustering along both the row and column directions, and is often called biclustering [11–16]. A bicluster often exhibits a certain kind of homogeneity, for example a constant level of expression throughout the whole bicluster (constant bicluster), a constant level of expression along either rows or columns (constant rows and constant columns), or rows/columns that are related by additions or multiplications [15], as shown in Figure 1. We have recently shown that the different bicluster patterns have a simple geometric interpretation as linear objects in a high-dimensional feature space [14, 15]. A comprehensive survey of different biclustering algorithms is given in references [11, 13, 16].

Figure 1

Examples of different biclusters. (A) A constant bicluster. (B) A constant row bicluster. (C) A constant column bicluster. (D) An additive-related bicluster. (E) A multiplicative-related bicluster. Note that C i denotes the i-th experimental condition.

Parallel coordinate plots

The parallel coordinate (PC) technique is a powerful method for visualizing and analyzing high-dimensional data in a two-dimensional setting [17, 18]. In this technique, each dimension is represented as a vertical axis, and the N axes are arranged parallel to each other. By giving up the orthogonal representation, the number of dimensions that can be visualized is no longer restricted to two [19–21]. Studies have found that geometric structure can still be preserved by the PC plot even though orthogonality is lost [17–21]. In a gene expression matrix, each gene is represented by a vector of conditions (i.e., a row) and each condition is considered as a vector of genes (i.e., a column). Since gene expression data always involves a large number of genes as well as a certain number of experimental conditions, the PC technique is well suited to their analysis. Moreover, visualization of gene expression data is an important problem for biological knowledge discovery [22]. Thus, PC plots have been studied for gene expression data visualization [23, 24]. Further details about visualization of biclusters using PC plots are provided in Additional file 1. In section "Methods", a new greedy algorithm for bicluster identification is presented, together with an interactive approach for determining the parameters of the proposed biclustering algorithm based on PC visualization.

Methods

Identification of biclusters from difference matrix

The biclusters given in Figure 1(A)–(D) can be described by an additive model in which each pair of rows has the same difference in all the related columns, or each pair of columns has the same difference in all the related rows. Thus, a difference matrix, each column of which represents the differences between a pair of columns in a data matrix, provides useful information for identifying additive-related biclusters. Consider the data in Figure 2; there are two biclusters: the first one (shown in blue) is a constant bicluster while the second one (shown in yellow) is an additive-related bicluster. As the rows in a bicluster are supposed to correlate in a subset of columns, the difference between every two columns is computed so as to identify this column subset. There are altogether 6(6 - 1)/2 = 15 distinct column pairs, as shown in the difference matrix in Figure 3. In the difference matrix, we can find features that are related to the biclusters. For example, consider column "C5-C3". There are only three distinct difference values: 0 (5 counts), 1 (1 count) and 2 (5 counts). This suggests the existence of three biclusters formed between "C5" and "C3":

Figure 2

An example of gene expression matrix with two embedded biclusters.

Figure 3

The difference matrix for the dataset shown in Figure 2.

  • the first bicluster is for rows R1, R3, R5, R9 and R11 in which the difference between "C5" and "C3" is zero, i.e., a constant bicluster;

  • the second bicluster is for rows R2, R4, R6, R8 and R10 in which the difference between "C5" and "C3" is two, i.e., an additive bicluster; and

  • the third bicluster involves row R7 only, thus it is not considered to be a valid bicluster.

Analyzing the distribution along the column direction in the difference matrix thus helps to identify possible biclusters. In the above example, we have two valid biclusters. Thus, C3 and C5 are merged to form two groups as shown in Figure 4. The analysis can be repeated for each of these two groups to find out whether any other columns can be merged with {C3, C5}, i.e., using either C3 or C5 as a reference, we check whether C1, C2, C4 and C6 can be merged with {C3, C5}. In particular, if C3 is used as a reference, the two difference matrices shown in Figure 5 are obtained. Note that their difference values can be read directly from the original difference matrix of Figure 3. By examining the first difference matrix in Figure 5, we see that two paired columns, "C1-C3" and "C2-C3", each show a single bicluster with a difference value equal to zero. This suggests that columns C1 and C2 can be merged with {C3, C5} for rows R1, R3, R5, R9 and R11. The second difference matrix also has a single cluster, with a difference value equal to 1 at paired column "C6-C3". Therefore, C6 can be merged with {C3, C5} for rows R2, R4, R6, R8 and R10. Thus, by this repeated bicluster-growing process of expanding the column set and refining the row set, we can identify possible biclusters embedded in the dataset. Also, note that the difference matrix needs to be calculated only once. This greatly reduces the computational complexity of our algorithm.
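
As an illustration of this step, the following Python sketch computes the column-pair difference matrix and groups the rows of one of its columns by their difference values; it is a simplified sketch (the function names, the sort-based grouping and the noise threshold eps are our own choices for clarity), not the implementation used in the paper.

    import numpy as np
    from itertools import combinations

    def difference_matrix(E):
        """One column per distinct column pair (i, j), i < j, holding E[:, j] - E[:, i]."""
        pairs = list(combinations(range(E.shape[1]), 2))
        D = np.column_stack([E[:, j] - E[:, i] for i, j in pairs])
        return D, pairs

    def group_rows(diff_col, eps, min_rows):
        """Group rows whose differences for one column pair lie within eps of each other."""
        order = np.argsort(diff_col)
        groups, current = [], [order[0]]
        for idx in order[1:]:
            if diff_col[idx] - diff_col[current[0]] < eps:
                current.append(idx)          # same candidate bicluster
            else:
                if len(current) >= min_rows:
                    groups.append(current)   # e.g. {R1, R3, R5, R9, R11} for "C5-C3"
                current = [idx]
        if len(current) >= min_rows:
            groups.append(current)
        return groups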

Figure 4

The two different groups formed by merging columns "C5" and "C3".

Figure 5

The difference matrix for the two different groups formed by merging columns "C5" and "C3".

Proposed algorithm for additive models

Additive-related biclusters can be found by progressively merging columns through studying the data distribution along each column in the difference matrix. If there is just one bicluster between two columns in the gene expression matrix, the distribution will have a single peak in one of the columns of the difference matrix. Related rows for this bicluster can then be identified. If there are multiple biclusters formed between two columns in the gene expression matrix, we can separate the rows into different groups by examining the distribution in the corresponding columns of the difference matrix. Therefore, by analyzing the distributions of difference values along columns of the difference matrix, peaks that correspond to different biclusters can be identified.

An overview of the procedure of our proposed biclustering algorithm is shown in the flow chart in Figure 6, while the details are described in the pseudo-code in Figure 7. There are four parameters in our algorithm: the noise threshold ε, the minimum number of rows N r , the minimum number of columns N c and the maximum bicluster overlap in percentage P o . The parameter ε specifies the noise tolerance as well as the homogeneity of the identified biclusters. On the other hand, N r and N c set the lower bounds of the number of rows and columns of the identified biclusters respectively. P o determines the maximum degree of overlap between identified biclusters; more specifically, no overlap exceeding P o percent in both the row and column dimensions simultaneously is allowed. In this paper, a bicluster with a subset of rows R and a subset of columns C is denoted by (R, C). At the beginning, the first-level difference matrix D1 is calculated for the input expression matrix E as described in line 4 in Figure 7. Suppose that E has m rows and n columns. There are altogether n(n - 1)/2 distinct column pairs, so the size of D1 is m × n(n - 1)/2. In order to derive possible biclusters, a simple clustering algorithm can be applied to identify clusters for each column (lines 6–12). Let X = {x1, x2, ..., x N } be a set of N expression values. By comparing x i with all values in X, a set of values S i similar to x i can be found as follows,

Figure 6

The flow chart of the proposed biclustering algorithm.

Figure 7

The pseudo-code of the proposed biclustering algorithm.

S i = {a ∈ X : |x i - a| < ε} (1)

where i = 1, 2, ..., N. Also, the set of indices Q i associated with the values in S i can be obtained. Q i can be expressed by

Q i = {p ∈ {1, 2, ..., N}: x p ∈ S i } (2)

As an example, given X = {1, 2, 9, 3} and ε = 2, we have S 2 = {1, 2, 3} and Q 2 = {1, 2, 4}. A clustering algorithm based on equation (1) would generate N clusters, but these clusters may be very close to each other and have large overlap. In order to reduce unnecessary clusters, we adopt a two-step clustering approach, presented in lines 56–84. In addition to the definitions in (1) and (2), let us denote the current collections of clusters and corresponding sets of indices by S and Q, which are both initially empty. In the first step (lines 58–75), for i = 1, 2, ..., N, x i and its associated cluster S i are tested for the following three conditions against each S j ∈ S:

  1. |x i - S̄ j | ≥ ε, where the overbar denotes the average of a set;

  2. |S i | ≥ N r , where |•| denotes the cardinality of a set; and

  3. |S̄ i - S̄ j | ≥ ε.

If the above three conditions are satisfied for all S j ∈ S, S i and Q i are added to S and Q respectively. In the second step (lines 76–82), the clusters are refined. Denote the sets of output clusters and the corresponding indices by S' and Q' respectively. First, S' and Q' are set to be empty. For each S j ∈ S, a new cluster S' j is derived as

S' j = {a ∈ X : |a - S̄ j | < ε} (3)

The corresponding set of indices Q' j is given by

Q' j = {p ∈ {1, 2, ..., N}:x p ∈ S' j } (4)

S' j and Q' j are added to S' and Q' respectively if |S' j | ≥ N r . For the first-level difference matrix D1, each Q' j contains the row indices of the cluster S' j . Each column of D1 consists of the difference values between columns i and j of the original expression matrix. Define U ij to be the collection of row index sets of the clusters found for column pair (i, j). After finding U ij for all distinct column pairs (i, j), the row index sets of the clusters and their associated column pairs are collected to form a list of possible biclusters L1, which can be expressed by

L1 = {(R ij , (i, j)): U ij ≠ ∅, R ij ∈ U ij , i = 1, 2, ..., n - 1 and j = i + 1, i + 2, ..., n} (5)

As one always tries to find the biggest bicluster, the possible biclusters in L1 are sorted by their number of rows in line 13 so that the bicluster with the largest number of rows is processed first.
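
A minimal Python sketch of the two-step clustering applied to one column of the difference matrix (equations (1)-(4) and the three conditions listed above); it is our interpretation of the pseudo-code in Figure 7, with hypothetical names, not the authors' C MEX implementation.

    import numpy as np

    def two_step_clustering(x, eps, n_r):
        """x: one column of the difference matrix; returns the refined index sets Q'_j."""
        centroids = []                              # averages of accepted clusters S_j
        for xi in x:
            S_i = x[np.abs(x - xi) < eps]           # equation (1)
            if len(S_i) < n_r:                      # condition 2
                continue
            mean_i = S_i.mean()
            # conditions 1 and 3: far enough from every accepted centroid
            if all(abs(xi - c) >= eps and abs(mean_i - c) >= eps for c in centroids):
                centroids.append(mean_i)
        clusters = []
        for c in centroids:                         # second step: refinement
            Q_j = np.where(np.abs(x - c) < eps)[0]  # equations (3) and (4)
            if len(Q_j) >= n_r:
                clusters.append(Q_j)
        return clusters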

Starting from the biggest bicluster l1 in the sorted list of possible biclusters L2, the second-level difference matrix D2 is formed as in line 24, in which one of the bicluster columns (column c k1 or c k2 ) is compared with all the remaining columns on the chosen rows (e.g. the difference matrices illustrated in Figure 3). Note that the second-level difference matrix D2 can be obtained directly from the first-level difference matrix D1. Before the D2 calculation, early termination can be introduced as presented in lines 17–23 as an optional step. In the early termination, the biclusters in L2 which significantly overlap with the identified biclusters are skipped, as they are unlikely to yield a well-distinguishable bicluster according to the given parameter P o . Similar to the clustering done for D1, clustering and sorting are performed for D2 as described in lines 26–33. As a result, a list of possible column segments H2 for growing the current bicluster is obtained. In lines 34–41, a possible bicluster (R, C) is constructed based on the row intersection with each column segment in H2. Initially, (R, C) is set to the current bicluster l k in L2. If the size of the row set R does not fall below the user-defined threshold N r after R intersects with a column segment, the column is included in C and R is updated. Otherwise, the process moves to the next column segment until the last one is examined. Finally, the bicluster is validated with respect to the given requirements on bicluster size and degree of overlap as depicted in lines 42–50. Only a valid bicluster is output (lines 51–53).
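
The bicluster-growing step (lines 34–41 of Figure 7) can be sketched as follows; the argument names are placeholders for the seed (R, C) and the sorted column segments H2 described above, and the function is a simplified illustration rather than the exact procedure.

    def grow_bicluster(seed_rows, seed_cols, segments, n_r):
        """Greedily extend a column-pair seed by row intersection.
        segments: list of (candidate_column, row_index_set) pairs obtained from
        the second-level difference matrix, sorted by decreasing size."""
        R, C = set(seed_rows), set(seed_cols)
        for col, rows in segments:
            candidate = R & set(rows)
            if len(candidate) >= n_r:    # keep the column only if enough rows survive
                R, C = candidate, C | {col}
        return R, C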

Relation to existing δ-pCluster approaches

The proposed algorithm identifies biclusters which are homogeneous in each column pair. In this section, we show that these biclusters can be expressed as δ-pClusters [25]. Hence, any sub-matrix of an identified bicluster has similar homogeneity to that bicluster and the outlier problem of the Cheng and Church algorithm [12] can be avoided. Denote a bicluster with a subset of rows U and a subset of columns V by B = (U, V). The bicluster B is a δ-pCluster if, for each 2 × 2 sub-matrix M, the following condition holds

|a ij - a in - (a mj - a mn )| ≤ δ (6)

where M = [a ij , a in ; a mj , a mn ] is the 2 × 2 sub-matrix, a ij denotes the value of the expression matrix at position (i, j), i, m ∈ U and j, n ∈ V. In our algorithm, the clustering (the second step) performed on the second-level difference matrix ensures that there exists a column k ∈ V such that

|a ij - a ik - l jk | < ε for all j ∈ V and some constant l jk (7)

where ε is the noise threshold parameter of the proposed algorithm. Hence, for any i, m ∈ U, we have

|a ij - a ik - (a mj - a mk )| = |a ij - a ik - l jk - (a mj - a mk - l jk )| ≤ |a ij - a ik - l jk | + |a mj - a mk - l jk | < 2ε (8)

where the last inequality follows from inequality (7). For a column n ∈ V with n ≠ k, using inequality (8), it is shown that

|a ij - a in - (a mj - a mn )| = |a ij - a ik - (a mj - a mk ) + a ik - a in - (a mk - a mn )| ≤ |a ij - a ik - (a mj - a mk )| + |(a ik - a in ) - (a mk - a mn )| < 4ε (9)

This means that the bicluster B is a δ-pCluster with δ = 4ε. Although the biclusters identified by our algorithm are δ-pClusters, it should be emphasized that our algorithm is not designed specifically for detecting δ-pClusters but rather is based on the clustering results in the difference matrix. Hence, there are some differences between our biclustering strategy and other δ-pCluster algorithms such as the pCluster algorithm [25] and the approach of S. Yoon et al. [13]. Specifically, our algorithm takes cluster density into account by considering cluster centroids. In contrast, the other two δ-pCluster based algorithms rely only on the inter-distances between elements in the difference matrix as defined by inequality (6), which results in an exponential-time complexity in the worst case. Our proposed algorithm can be regarded as a greedy version of these two algorithms. In particular, for each column-pair bicluster, our proposed algorithm derives a possible bicluster by greedily finding a larger column set through sequential intersection with other column-pair biclusters. The large column-pair biclusters usually contain the whole or a large part of the true gene set. At the same time, these simplifications significantly reduce the complexity from exponential time to polynomial time.
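
For reference, the δ-pCluster condition (6) can be verified for an identified bicluster by a brute-force check over all 2 × 2 sub-matrices; this sketch is only a sanity check of the bound δ = 4ε, not part of the proposed algorithm.

    import numpy as np
    from itertools import combinations

    def is_delta_pcluster(B, delta):
        """B: bicluster sub-matrix (rows in U, columns in V); checks inequality (6)."""
        for i, m in combinations(range(B.shape[0]), 2):
            for j, n in combinations(range(B.shape[1]), 2):
                if abs(B[i, j] - B[i, n] - (B[m, j] - B[m, n])) > delta:
                    return False
        return True

    # A bicluster detected with noise threshold eps should satisfy
    # is_delta_pcluster(B, 4 * eps) by inequality (9).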

Complexity estimation

In general, a biclustering problem is NP-complete [11]. However, we have adopted a simple clustering algorithm and bicluster-growing strategy to reduce the complexity. Given a matrix of size m × n, the complexity of obtaining the difference matrix is O(mn²). The simple clustering algorithm applied to each column requires operations on the order of O(m²) because it involves comparing the value of each element with the others and with the centroids of the found clusters. In addition, the total number of clusters found would not exceed m. Therefore, the complexity of obtaining clusters in the difference matrix is O(m²n²) and the number of clusters is at most mn(n-1)/2. The sorting of the clusters requires a complexity of O(mn² log mn²). After that, each identified cluster is used as a seed to construct a bicluster. In the bicluster-growing process, each seed is first checked for significant overlap with other identified biclusters for early termination. The overlap in rows can be checked by sorting followed by element-wise comparison, so the complexity is O(m log m). For columns, as a seed has only two columns, the complexity is O(n). Note that the number of identified biclusters is bounded by the number of seeds. Thus, the complexity for checking overlaps against all identified biclusters is O(mn²(n + m log m)). If the seed is valid, a sub-matrix of the difference matrix is extracted as the second-level difference matrix. This step requires no arithmetic operations due to data reuse. Clustering and sorting procedures are then performed on this second-level difference matrix. As the matrix has only n-1 columns, the clustering and sorting processes need operations on the order of O(m²n) and O(mn log(mn)), respectively. Note that there are at most (n-1)m clusters detected in the second-level difference matrix. In the bicluster construction, row intersection is performed; in total, the complexity is O(m²n log m). Finally, the newly identified bicluster is validated (i.e. filtered) with respect to the number of columns and the degree of overlap with other biclusters. The validation requires an additional complexity of O(mn²(m log m + n log n)). Among the operations for obtaining each bicluster from the first-level difference matrix, the validation step dominates, so the entire processing for bicluster formation from seeds is O(m²n⁴(m log m + n log n)). Since this cost dominates all other costs in the previous steps, our algorithm has a polynomial-time complexity of O(m²n⁴(m log m + n log n)). The above estimation gives the worst-case complexity, in which the validation process dominates. In practice, the number of biclusters is far less than mn(n - 1)/2. Moreover, some of the validation steps can be avoided through early termination of invalid biclusters. Elimination of invalid biclusters reduces the number of potential biclusters, and this in turn reduces the complexity of the validation step.

Modification for multiplicative models

As seen in Figure 1(E), a multiplicative-related bicluster is a bicluster in which any two rows are related by the same ratio in all the related columns, or any two columns are related by the same ratio in all the related rows. In order to adapt the proposed framework to multiplicative models, the difference matrix is replaced by a ratio matrix, which is in the form of c i /c j or c j /c i for all the n(n - 1)/2 distinct combinations of columns i and j, where c k represents the values in the k-th column. In practice, we select the column with the largest average magnitude as the denominator because the quotient is sensitive to noise when the divisor is small. Thus, the major change for detecting multiplicative-related biclusters is to replace the difference matrix by a ratio matrix. Note that the complexity for multiplicative models is essentially the same as that for additive models.
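
A sketch of the ratio-matrix construction, with the larger-magnitude column of each pair used as the denominator; the function name is hypothetical and nonzero entries are assumed (as for the yeast data described later).

    import numpy as np
    from itertools import combinations

    def ratio_matrix(E):
        """Column-pair ratios for multiplicative models; assumes nonzero entries."""
        cols, pairs = [], []
        for i, j in combinations(range(E.shape[1]), 2):
            # use the column with the larger average magnitude as the divisor
            if np.mean(np.abs(E[:, i])) >= np.mean(np.abs(E[:, j])):
                num, den = j, i
            else:
                num, den = i, j
            cols.append(E[:, num] / E[:, den])
            pairs.append((num, den))
        return np.column_stack(cols), pairs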

Interactive adjustment of noise threshold using PC plots

The setting of the noise threshold ε is important for the proposed algorithm as it balances the homogeneity requirement and the noise tolerance of the identified biclusters. The noise threshold is determined through visual inspection of the homogeneity of the detected biclusters in PC plots [26]. The PC visualization for a data matrix embedded with biclusters can be found in Additional file 1. Consider a noisy 100 × 10 dataset containing uniformly distributed values between -5 and 5, embedded with the 30 × 4 additive-related bicluster shown in Figure 8. Furthermore, additive Gaussian noise with a variance of 0.2, chosen empirically for clear demonstration, was introduced. In this example, we varied the noise threshold ε while fixing the minimum number of rows N r , the minimum number of columns N c and the maximum overlap with other biclusters P o at 20, 4 and 20%, respectively. Figure 9 shows a bicluster found by our algorithm when ε is set to 1.2. The four columns are found correctly; however, three rows are missed. Figure 9(A) shows the four columns with all the rows in the original data, while Figure 9(B) shows the differences between the last three columns and the first column. Figures 9(C) and 9(D) illustrate the inconsistency between the identified bicluster and the true bicluster using PC plots of expression values and difference values respectively, in which the three missed rows are displayed in blue. We can see that these three rows are missed because the noise threshold is not large enough. In practice, since we do not know the bicluster in advance, we should adopt an exploratory approach for setting the parameter ε. Starting with the current value of ε, we gradually increase it while visualizing the bicluster using the PC plot. Initially, we would see more and more related rows being included in the bicluster. Then, at some point, unrelated rows start to creep into the bicluster. When this is observed in the PC plot, we stop increasing the noise threshold. Using this procedure, we found that when ε is set to 1.5, all the rows are correctly detected. This example shows that the PC plot can be a powerful visualization and interactive tool that allows us to examine the biclusters found.
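
The visual inspection described above can be reproduced with a simple parallel-coordinate plot; this matplotlib sketch, with assumed row and column index arguments, merely overlays the identified bicluster rows (red) on all rows of the dataset (blue), as in Figure 9(A).

    import matplotlib.pyplot as plt

    def pc_plot(data, bicluster_rows, bicluster_cols):
        """Parallel-coordinate plot restricted to the bicluster columns."""
        cols = list(bicluster_cols)
        x = range(len(cols))
        for r in range(data.shape[0]):
            plt.plot(x, data[r, cols], color='blue', alpha=0.2)
        for r in bicluster_rows:
            plt.plot(x, data[r, cols], color='red')
        plt.xticks(x, ['C%d' % (c + 1) for c in cols])
        plt.ylabel('expression value')
        plt.show()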

Figure 8

The PC plot of the true additive-related bicluster.

Figure 9

The PC plots of the bicluster identified using a noise threshold of 1.2. The expression values of all rows in the four related columns and the differences between the last three columns and the first column are drawn using PC plots in Figures (A) and (B) respectively. In Figures (A) and (B), red shows rows from the identified bicluster while blue shows rows from the original dataset. Figures (C) and (D) illustrate the inconsistencies between the identified bicluster and the true bicluster in the four related columns and the column differences respectively. In Figures (C) and (D), red indicates rows of the true bicluster that are found by our algorithm while blue represents the three missed rows of the true bicluster.

Results and Discussion

Evaluation methods

We analyze the performance of our algorithm on both artificial datasets and a real dataset. For the artificial datasets, the bicluster information is known in advance, so the accuracy of bicluster discovery can be measured using the overall match score [16]. The overall match score of a set of biclusters M1 with respect to another set of biclusters M2 is defined as

S*(M 1 , M 2 ) = √[S U *(M 1 , M 2 ) × S V *(M 1 , M 2 )] (10)

where S U *(M 1 , M 2 ) and S V *(M 1 , M 2 ) are the gene and condition match scores respectively. S U *(M 1 , M 2 ) is calculated as

S U *(M 1 , M 2 ) = (1/|M 1 |) Σ (U 1 , V 1 )∈M 1 max (U 2 , V 2 )∈M 2 |U 1 ∩ U 2 | / |U 1 ∪ U 2 | (11)

where a bicluster with a subset of genes U i and a subset of conditions V i is denoted by (U i , V i ). S V *(M 1 , M 2 ) is defined similarly with U replaced by V. Let M be the set of detected biclusters and M t be the set of true biclusters embedded in the artificial expression dataset. The overall match score S*(M, M t ) quantifies the average relevance of the detected biclusters to the true biclusters. Conversely, S*(M t , M) measures the average recovery of the true biclusters in the detected biclusters. To unify the two measures into a single quantity for evaluation, their average is computed as the biclustering accuracy.
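
A sketch of the match scores (10)-(11) and the resulting biclustering accuracy; each bicluster is assumed to be given as a pair of Python sets (genes, conditions), which is our own convention for this sketch.

    import math

    def gene_match_score(M1, M2):
        """S*_U(M1, M2) as in equation (11); biclusters are (genes, conditions) set pairs."""
        total = 0.0
        for U1, _ in M1:
            total += max(len(U1 & U2) / len(U1 | U2) for U2, _ in M2)
        return total / len(M1)

    def condition_match_score(M1, M2):
        """S*_V(M1, M2): the same score computed on the condition sets."""
        return gene_match_score([(V, U) for U, V in M1], [(V, U) for U, V in M2])

    def overall_match_score(M1, M2):
        """Equation (10)."""
        return math.sqrt(gene_match_score(M1, M2) * condition_match_score(M1, M2))

    def biclustering_accuracy(M, M_true):
        """Average of relevance S*(M, M_true) and recovery S*(M_true, M)."""
        return 0.5 * (overall_match_score(M, M_true) + overall_match_score(M_true, M))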

The performance of the proposed algorithm on the artificial datasets has been compared with two existing algorithms with the additive model assumption, namely the Cheng and Church (C&C) algorithm [12] and the pCluster algorithm [25]. We considered the biclustering accuracy together with other measures such as the number of biclusters, bicluster size and processing time. The programs for both algorithms are publicly available [27, 28]. The proposed algorithm was implemented in a C MEX-file and run in Matlab 6.5. All the experiments were conducted on the Windows XP platform on a computer with a 2.4 GHz Intel Pentium 4 CPU and 512 MB RAM. For the identification of multiplicative-related biclusters, since the C&C algorithm and the pCluster algorithm are designed for additive models, a logarithm operation was applied to the expression data so that the multiplicative models become additive models. For comparison, we also applied the proposed algorithm for additive models to the logarithm values. Henceforth, the proposed algorithms for additive models and multiplicative models will be referred to as PA and PM respectively, while the proposed algorithm for additive models with the logarithm operation as pre-processing will be referred to as PAL.

The evaluation on the real dataset was performed on three aspects: biological, homogeneity and statistical assessment. In the biological assessment, we used the Gene Ontology (GO) annotations [29] to determine the functional enrichment of biclusters. The measure was the percentage of biclusters overrepresented in one or more GO annotations. A bicluster is said to be overrepresented in a functional category if it gives a small p-value. Suppose a bicluster B with k genes is identified in a gene expression matrix with a gene set S of size N, and that a functional category contains C genes in S, of which r appear in B. The p-value is defined as the probability of choosing k genes from S with r genes in that category [30], i.e.,

p-value(B) = C(C, r) C(N - C, k - r) / C(N, k) (12)

where C(a, b) denotes the binomial coefficient "a choose b".

In other words, the p-value is the probability of including the genes of a given category in a cluster by chance. Thus, an overrepresented bicluster is a group of genes that is very unlikely to have been obtained randomly. The annotations consist of three ontologies, namely biological process, cellular component and molecular function.
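
Equation (12) is the hypergeometric probability, which can be evaluated directly with scipy; the sketch below follows the definition in the text (the probability of exactly r annotated genes among the k genes of the bicluster).

    from scipy.stats import hypergeom

    def bicluster_p_value(N, C, k, r):
        """Equation (12): N genes in total, C of them in the category,
        k genes in the bicluster, r of which are annotated."""
        return hypergeom.pmf(r, N, C, k)

    # Example: 200-gene category in a 3000-gene matrix; a 100-gene bicluster
    # containing 30 annotated genes.
    print(bicluster_p_value(3000, 200, 100, 30))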

For the homogeneity aspect, mean squared residue score (MSRS) [12] and average correlation value (ACV) [31] were computed. For an m × n bicluster, the MSRS is defined as

MSRS = (1/(mn)) Σ i=1..m Σ j=1..n (a ij - ã i• - ã •j + ã)² (13)

where a ij is the value of the bicluster at position (i, j), ã i• is the average of the i-th row, ã •j is the average of the j-th column and ã is the overall average. ACV is defined by

ACV = max{ [Σ i=1..m Σ j=1..m |c_row ij | - m] / (m² - m), [Σ i=1..n Σ j=1..n |c_col ij | - n] / (n² - n) } (14)

where c_row ij is the correlation coefficient between rows i and j and c_col ij is the correlation coefficient between columns i and j. ACV is applicable to additive models as well as multiplicative models, but the MSRS is valid only for additive models. In order to measure the homogeneity of multiplicative-related biclusters, a logarithm was applied to the expression values before calculating MSRS values so that a multiplicative-related bicluster can be formulated using an additive model. To avoid confusion, the MSRS of the logarithm of the expression values is denoted by MSRSl. A bicluster with high homogeneity in expression levels should have a low MSRS/MSRSl value but a high ACV value. The minimum value of MSRS/MSRSl is zero while ACV has a maximum value of one.
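
The two homogeneity measures can be computed as in the following sketch (equations (13) and (14)), assuming the bicluster is passed as a numpy array; rows or columns with zero variance would need special handling in the correlation step.

    import numpy as np

    def msrs(B):
        """Mean squared residue score, equation (13)."""
        row_mean = B.mean(axis=1, keepdims=True)
        col_mean = B.mean(axis=0, keepdims=True)
        return float(np.mean((B - row_mean - col_mean + B.mean()) ** 2))

    def acv(B):
        """Average correlation value, equation (14)."""
        m, n = B.shape
        row_corr = np.abs(np.corrcoef(B))       # m x m correlations between rows
        col_corr = np.abs(np.corrcoef(B.T))     # n x n correlations between columns
        return max((row_corr.sum() - m) / (m * m - m),
                   (col_corr.sum() - n) / (n * n - n))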

The statistical properties of the biclustering results refer to quantities including the number of discovered biclusters and the bicluster size. Comparative studies were performed on the three aspects with several existing biclustering algorithms, namely C&C, the iterative signature algorithm (ISA) [32, 33], the order-preserving submatrix (OPSM) approach [1] and xMotifs [34], which are available in [27]. In addition, the computational complexity of the proposed algorithm and the other approaches is estimated using processing time, as done for the artificial datasets. Despite the dependence on factors such as programming language and parameter settings, a rough comparison of complexity can still be achieved.

Datasets

Two types of artificial datasets were considered, one for the additive models and the other for the multiplicative models. The first type of dataset TD1 had a size of 200 rows by 40 columns. Uniformly distributed random values were first generated. Then four biclusters were embedded. Their details are as follows:

  • bicluster A is a constant row bicluster of size 40 × 7;

  • bicluster B is a constant row bicluster of size 25 × 10;

  • bicluster C is a constant column bicluster of size 35 × 8; and

  • bicluster D has coherent values related by additions of size 40 × 8.

Biclusters A and B have two columns in common but in different rows; bicluster B overlaps with bicluster C in five rows and three columns; biclusters C and D have one column in common but in different rows. Finally, Gaussian noise with different standard deviation (s.d.) was added to the dataset. At each non-zero noise level, five expression matrices were generated. Figure 10 shows the dataset TD1 with 4 embedded biclusters before noise was added.
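
A dataset of this kind can be generated along the following lines; the background and bicluster value ranges are assumptions (the paper does not state them), and only biclusters A and B are embedded here for brevity.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.uniform(-10, 10, size=(200, 40))     # background values (assumed range)

    def embed_constant_row(data, rows, cols):
        """Constant-row bicluster: one value per row, repeated across its columns."""
        data[np.ix_(rows, cols)] = rng.uniform(-5, 5, size=(len(rows), 1))

    embed_constant_row(data, range(0, 40), range(0, 7))     # bicluster A, 40 x 7
    embed_constant_row(data, range(45, 70), range(5, 15))   # bicluster B, 25 x 10, sharing 2 columns with A
    # biclusters C and D would be embedded similarly
    noisy = data + rng.normal(0.0, 0.3, size=data.shape)    # Gaussian noise, s.d. 0.3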

Figure 10

The first type of dataset with four additive-related biclusters before noise is added.

The second type of dataset TD2 consists of 60 × 15 positive values embedded with two 25 × 7 multiplicative-related biclusters. The two biclusters overlap in two columns. A positive-biased Gaussian noise was added to the dataset so that all the values in the resultant datasets remained positive. The positive-valued dataset was essential for the Cheng and Church algorithm, the pCluster algorithm and our proposed algorithm for additive models with logarithm pre-processing (PAL) due to the use of the logarithm operation. It should be noted that the proposed algorithm for multiplicative-related biclusters (PM) can be applied to datasets with negative values because no logarithm operation is needed. Figure 11 shows the dataset TD2 with the two embedded multiplicative biclusters before noise was added. These two artificial datasets allowed us to test the performance of our algorithm in realistic situations, as real expression data often involves various types of biclusters with overlaps (i.e. regulatory complexity) and noise.

Figure 11

The second type of dataset with two multiplicative-related biclusters before noise is added.

The real dataset used was the yeast Saccharomyces cerevisiae cell-cycle dataset as used in [12], which contains 2884 genes and 17 conditions. The non-missing values were all non-negative. As multiplicative models were also investigated, the zero non-missing values were set to small positive values. The missing values were filled with positive, uniformly distributed random values to minimize their influence on our analysis.

Performance on artificial datasets

For the artificial datasets with additive-related biclusters, biclusters with at least 21 rows and 5 columns were identified. It was further required that no detected bicluster have more than 50% overlap with another bicluster simultaneously in the row and column dimensions. Since the Cheng and Church (C&C) algorithm and the pCluster algorithm cannot be directly configured to discover biclusters with all the given requirements, a post-filtering procedure was adopted to eliminate invalid biclusters. The post-filtering parameters are provided in Table 1 together with the parameters of the biclustering algorithms. Note that the parameters for noise tolerance (ε/δ) were determined for optimal performance under different noise levels. The biclustering accuracies are plotted against various noise levels in Figure 12. As can be seen, the proposed algorithm always has higher biclustering accuracy than C&C and the pCluster algorithm. For the expression datasets with noise of standard deviation at or below 0.1, we detected the four embedded biclusters perfectly. The pCluster algorithm did not attain perfect discovery even in the noise-free case because more than one maximal δ-pCluster (defined by equation (6)) exists for one or more column pairs due to column overlap between some biclusters in the datasets [35]. In noisier cases, such as when the noise s.d. is 0.5, the biclustering accuracy of our algorithm still has a high value of 0.89. In contrast, the accuracies of C&C and the pCluster algorithm are 0.70 and 0.26 respectively.

Table 1 Parameter settings for biclustering algorithms and post-filtering in the experiments on artificial datasets
Figure 12

Biclustering accuracy against noise level for additive models. The biclustering accuracies of the proposed algorithm, Cheng and Church algorithm and pCluster algorithm are represented by the curves 'Proposed', 'C&C' and 'pCluster' respectively.

Statistical properties of the biclustering results before filtering are given in Table 2. Unlike the pCluster algorithm, the number of biclusters identified by the proposed algorithm is insensitive to the noise level. On average, 6.6 biclusters were identified at the highest noise level, which is close to the true number of 4. For the pCluster algorithm, a large number of biclusters with high overlap were detected under noisy conditions. The post-filtering procedure was therefore necessary for the pCluster algorithm to extract the significant biclusters. The number of biclusters identified by C&C was 40, the same as that specified in its parameter setting. In fact, this parameter setting was necessary to acquire high biclustering accuracy. With respect to bicluster size, the proposed algorithm shows the closest agreement with those embedded in the datasets. The average numbers of rows and columns in the biclustering results are always around 34 and 7.8 respectively, while the actual average numbers of rows and columns are 35 and 8.25 respectively. The pCluster algorithm also produced good results. C&C gave the worst performance as it does not allow any constraints to be imposed on the bicluster dimensions. Therefore, the post-filtering procedure is essential for C&C to find the embedded biclusters.

Table 2 Statistical properties of biclustering results for the artificial datasets embedded with additive-related biclusters before post-filtering

For the datasets with two multiplicative-related biclusters, a bicluster was considered to be valid if its size is no smaller than 18 and 4 in the row and column dimensions respectively and its overlap with other valid biclusters is less than or equal to 25%. The settings for the biclustering algorithms and the post-filtering procedure are also included in Table 1. The biclustering accuracies of the proposed algorithms PAL and PM, together with C&C and the pCluster algorithm (applied on log values), at various noise levels are shown in Figure 13. At all noise levels, our two proposed algorithms outperform C&C and the pCluster algorithm. Both PAL and PM can exactly detect the true biclusters in the noise-free case while the other two algorithms fail to do so. In particular, the failure of perfect discovery in the pCluster algorithm can be attributed to the column overlap in the datasets. The performance of PM is slightly better than that of PAL in general. The biclustering accuracy decreases when the noise level increases, except in the case of C&C when the noise level changes from 0.4 to 0.5, probably because outliers are less likely to be included in biclusters at high noise levels. In terms of the statistical properties given in Table 3, the two proposed algorithms exhibit the closest match to the true embedded biclusters, with PM performing slightly better than PAL. Similar to the case of the additive models, the proposed algorithms return a more reasonable number of biclusters, with dimensions similar to those embedded, than the other two algorithms, without any post-filtering procedure.

Figure 13

Biclustering accuracy against noise level for multiplicative models. The biclustering accuracy of the proposed algorithms for multiplicative models is denoted by 'PM' while the proposed algorithms for additive models, Cheng and Church algorithm and pCluster algorithm on logarithm of expression data are labelled by 'PAL', 'C&C (log)' and 'pCluster (log)' respectively.

Table 3 Statistical properties of biclustering results for the artificial datasets embedded with multiplicative-related biclusters before post-filtering

In order to justify the efficiency of our proposed algorithms, the processing time for the artificial datasets with noise s.d. of 0.3 was measured and is provided in Table 4. The proposed algorithm PA required an average of 3.8 sec for the artificial datasets with additive-related biclusters. This is a substantial improvement over the pCluster algorithm, which needed 2716 sec to finish. The reduction in computational complexity is achieved by the bicluster-growing strategy in which similar patterns in column pairs are combined to form biclusters through row intersection. The proposed algorithm is also more efficient than C&C by a factor of 4.5. For the datasets embedded with multiplicative-related biclusters, the matrix sizes are smaller than those used in the additive-model experiments, so all the algorithms required less processing time. However, it can be seen that the proposed approach PM has the lowest computational complexity. The average processing time was 0.0232, 1 and 105.2 sec for PM, C&C and the pCluster algorithm respectively. In conclusion, the results on artificial datasets demonstrate that our proposed algorithms have high accuracy in detecting additive-related and multiplicative-related biclusters, even in the presence of overlap and noise contamination. The computational complexity of the proposed algorithms is lower than that of several biclustering algorithms with similar model assumptions.

Table 4 Average processing time for the artificial datasets

Performance on a real dataset

Experiments have been conducted on the yeast cell-cycle dataset using the proposed algorithms, the Cheng and Church (C&C) algorithm [12], the iterative signature algorithm (ISA) [32, 33], the order-preserving submatrix (OPSM) approach [1] and xMotifs [34]. Post-filtering was applied to the biclustering results in order to eliminate insignificant biclusters as well as to impose common constraints for comparison. The parameter settings of the various algorithms and the post-filtering are provided in Table 5. These values were selected based on the guidelines in [16] and our experimental work. The functional enrichment was studied over a number of upper bounds on the p-value, p0, and is illustrated in Figure 14. Compared with C&C, which has the same model assumption as the proposed algorithm for the additive model (PA), a higher percentage of functionally-enriched biclusters was identified by the proposed algorithm at p0 ≥ 5 × 10⁻⁴, 1 × 10⁻² and 5 × 10⁻³ in the biological process, cellular component and molecular function ontologies respectively. In particular, at p0 = 1 × 10⁻², the percentages of functionally-enriched biclusters found by PA are 96.0%, 88.0% and 80.0%, which correspond to improvements of 18.6%, 13.8% and 9.0% over C&C in the biological process, cellular component and molecular function ontologies, respectively. At the lowest value of p0 = 1 × 10⁻⁵, our proposed algorithm PA outperforms C&C in the cellular component ontology but not in the other two ontologies. However, the reduction in the percentage of functionally-enriched biclusters is less than 7.5% in the biological process ontology and 2.5% in the molecular function ontology, which is relatively small compared with the improvement at large values of p0. The homogeneity analysis provided in Table 6 shows that the biclusters identified by PA are more homogeneous than those of C&C, with the average MSRS lower by 142.4 and the average ACV higher by 0.0199. From the statistical results in Table 7, it can be seen that PA also avoids identifying very large biclusters, as in the case of C&C. The largest bicluster found using PA is 597 × 17 while that found using C&C is 1391 × 17.

Table 5 Parameter settings of algorithms and post-processing investigated in experiments based on the yeast dataset
Table 6 Homogeneity comparison of biclusters identified in the yeast cell-cycle dataset using various algorithms
Table 7 Statistical comparison of biclusters identified in the yeast cell-cycle dataset using various algorithms
Figure 14

Percentage of additive-related biclusters enriched with GO annotations of different ontologies. (A) Biological process ontology. (B) Cellular component ontology. (C) Molecular function ontology.

When the multiplicative model is concerned, i.e. the proposed algorithm for the multiplicative model (PM) and C&C applied on log values (C&C (log)), the functional enrichment drops in general. At first glance, PM gives poorer performance in terms of functional enrichment. Nonetheless, if the number of identified biclusters is also considered, PM actually outperforms C&C (log) by identifying more significant biclusters. The total number of biclusters identified by PM was 59, whereas C&C (log) found only 5 biclusters. In addition, the biclusters identified by PM exhibit higher homogeneity. The average values of MSRSl and ACV are 9.573 × 10⁻³ and 0.9219 for PM respectively. In comparison, the average values of MSRSl and ACV are 6.262 × 10⁻² and 0.5740 for C&C (log) respectively.

In addition to the C&C based algorithms, Figure 14 shows the comparative results of ISA, OPSM and xMotifs for different values of p0. Although OPSM shows a high percentage of functionally-enriched biclusters at large values of p0, only two biclusters are found, which is far fewer than expected. Thus, the proposed algorithms actually identify more functionally-enriched biclusters. Also, the percentage of functionally-enriched biclusters of OPSM drops to zero at low values of p0. At low values of p0, the results of ISA are the best in most cases. For p0 ≥ 5 × 10⁻⁴, however, the performance of the proposed algorithm PA is close to or even better than that of ISA. For both OPSM and ISA, the identified biclusters are less homogeneous in terms of average MSRS and ACV because their bicluster models are different from those studied in this paper. PA and PM show better performance than xMotifs in the percentage of functionally-enriched biclusters even though our algorithms have a lower average ACV. The reason is that xMotifs is designed to find biclusters with a coherent state in each gene, which is only a subclass of additive models. The homogeneity analysis suggests that the difference in biological relevance of the biclusters identified by various algorithms, such as the proposed algorithm PA and ISA, is not merely due to implementation architecture but also due to the model assumption.

In addition to identifying biologically-significant biclusters, the efficiency of the proposed algorithms is demonstrated by the processing times provided in Table 8. PA and PM require 0.72 and 1.35 sec respectively to finish. These results are the best and represent an improvement by a factor of at least 23.7 compared with the other algorithms. This implies that our algorithms have low computational complexity.

Table 8 Processing time for the yeast cell-cycle expression dataset using various biclustering algorithms

Details of the annotation results of the proposed algorithms PA and PM at p-value < 0.001 are shown in Tables 9, 10, 11 and Tables 12, 13, 14 respectively. In these tables, the Bonferroni correction of the p-value, which adjusts the probability of random annotation for multiple tests [30], is provided. Consideration of the corrected p-value is important when multiple terms are tested for annotation in a single bicluster. The 4-th additive-related bicluster identified by PA has the lowest p-value in all three ontologies. For the biological process ontology, 69 out of 201 genes are assigned to the category "translation" at a p-value of 6.92 × 10⁻⁴⁴. The annotation is also significant after multiple-test correction as it has a low corrected p-value of 8.51 × 10⁻⁴². For the cellular component ontology, 33 out of 201 genes are annotated with the category "cytosolic small ribosomal subunit (sensu Eukaryota)" at a p-value of 7.45 × 10⁻²⁹ (corrected p-value of 5.29 × 10⁻²⁷). For the molecular function ontology, 66 out of 201 genes are associated with the category "structural constituent of ribosome" at a p-value of 7.61 × 10⁻⁴⁷ (corrected p-value of 3.96 × 10⁻⁴⁵). For the multiplicative model, the 24-th bicluster found by PM exhibits the lowest p-value in all three ontologies. In fact, 141 genes are shared between the lowest-p-value biclusters identified by PA and PM, corresponding to 70.15% of the genes in the bicluster identified by PA. As a result, the 24-th bicluster identified by PM is annotated with similar categories to the 4-th bicluster found by PA in all three ontologies. The annotations are also overrepresented in the bicluster as found in the experiments using PA, except that the cellular component category with the lowest p-value is "cytosolic large ribosomal subunit (sensu Eukaryota)". The p-values are 6.16 × 10⁻⁶⁹, 1.11 × 10⁻⁴⁵ and 3.69 × 10⁻⁷⁵ (corrected p-values of 7.51 × 10⁻⁶⁷, 7.07 × 10⁻⁴⁴ and 1.88 × 10⁻⁷³), while out of 225 genes there are 91, 46 and 88 genes annotated in the biological process, cellular component and molecular function categories respectively.

Table 9 Annotations of biological process ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.
Table 10 Annotations of cellular component ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.
Table 11 Annotations of molecular function ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.
Table 12 Annotations of biological process ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.
Table 13 Annotations of cellular component ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.
Table 14 Annotations of molecular function ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.

The experiments on the real dataset show that our proposed algorithms PA and PM can efficiently identify biclusters with high biological relevance. Furthermore, PA consistently yields a reasonable number of biclusters with a good degree of homogeneity. Although GO annotation reflects only knowledge currently established in the biological community, the results still give a reasonable indication of performance. Biclusters with no GO terms assigned are also worth investigating for potential new biological discoveries.

Determination of bicluster homogeneity

In the previous experiments, the homogeneity parameter of our algorithms, i.e. the noise threshold ε, was determined empirically. In fact, the exploratory approach based on PC plots described earlier can be used to determine this parameter interactively for a given dataset. The approach relies on the assumption that homogeneity decreases monotonically with ε while biclustering accuracy is a concave function of ε. To verify this, we applied the proposed algorithm for additive models to artificial datasets with a noise s.d. of 0.3 over a wide range of ε. Figure 15 shows the biclustering accuracy and ACV plotted against ε. The biclustering accuracy first rises rapidly to its maximum as ε increases from 0.5 to 1, then decreases slightly until ε reaches 1.75, beyond which a steeper drop occurs. In other words, the biclustering accuracy is approximately concave with respect to ε. On the other hand, as ε increases, the average ACV of the detected biclusters decreases as expected, and it decreases faster once ε exceeds 1.25. Meanwhile, the biclustering accuracy remains high for ε between 1 and 1.25. These observations support the use of the interactive approach for parameter determination.

Figure 15 Biclustering accuracy (solid line) and ACV (dashed line) against the noise threshold ε.
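As a sketch of the homogeneity measure used in this tuning procedure, the MATLAB fragment below computes an ACV-style score for a candidate bicluster as the larger of the mean absolute off-diagonal row and column correlations. This follows the commonly used ACV definition and may differ in detail from our implementation; the synthetic additive test matrix is purely illustrative.

    % ACV-style homogeneity score of a candidate bicluster B (genes x conditions).
    B  = repmat(randn(20,1), 1, 6) + repmat(randn(1,6), 20, 1) + 0.1*randn(20,6);
    Cr = abs(corrcoef(B'));                       % correlations between genes (rows)
    Cc = abs(corrcoef(B));                        % correlations between conditions (columns)
    nr = size(Cr,1);
    nc = size(Cc,1);
    acvRows = (sum(Cr(:)) - nr) / (nr*nr - nr);   % mean off-diagonal row correlation
    acvCols = (sum(Cc(:)) - nc) / (nc*nc - nc);   % mean off-diagonal column correlation
    ACV = max(acvRows, acvCols);
    fprintf('ACV of the candidate bicluster: %.3f\n', ACV);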

Conclusion

In this paper, a novel biclustering algorithm for additive models is proposed. We first analysed the difference matrix computed from the gene expression matrix and showed that the column-wise differences of an additive-related bicluster appear as clusters in the corresponding columns of the difference matrix. Similarly, clusters can be found among the column-wise ratios computed from multiplicative-related biclusters. These observations were then exploited to construct biclusters greedily from the clustering results on the column-wise differences or ratios.
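A minimal MATLAB sketch of this observation is given below: a planted additive relation between two columns shows up as a dense group of nearly equal column-wise differences, which can be picked out with a simple histogram of bin width ε. The planted bicluster and the histogram-style grouping are illustrative simplifications, not the exact clustering step of the proposed algorithms.

    % For an additive-related bicluster, the difference between two of its columns
    % is nearly constant over its genes, so those genes cluster densely in the
    % corresponding column of the difference matrix.
    D = rand(100, 8);                         % hypothetical expression matrix
    D(10:40, 3) = D(10:40, 2) + 0.7;          % plant an additive relation (genes 10..40)
    epsTol = 0.1;                             % noise threshold epsilon
    d = D(:,3) - D(:,2);                      % one column of the difference matrix
    edges = min(d):epsTol:(max(d) + epsTol);  % bins of width epsilon
    [counts, bin] = histc(d, edges);          % assign each gene's difference to a bin
    [maxCount, densest] = max(counts);        % densest group of differences
    candidateGenes = find(bin == densest);    % genes sharing a near-constant offset
    % For multiplicative models, the same idea applies to column-wise ratios:
    r = D(:,3) ./ D(:,2);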

The proposed algorithms have been analysed by comparison with the pCluster algorithm. The results suggest that they can be regarded as a greedy version of the pCluster algorithm: the biclusters they find can be expressed as δ-pClusters, but clustering density is used to drive pattern discovery. Although the identified δ-pClusters are not guaranteed to be maximal, the proposed algorithms are much more efficient. Experiments showed that their computational time is lower than that of the pCluster algorithm by a factor of hundreds or more. Moreover, we have verified that the worst-case complexity of the proposed algorithms is polynomial rather than exponential, as is the case for the pCluster algorithm and other δ-pCluster based approaches.

The robustness of our algorithms to noise and regulatory complexity has been verified empirically using artificial datasets, and our algorithms are capable of discovering overlapping biclusters under noisy conditions. The biological significance of the biclustering results has been verified on the yeast cell-cycle dataset using Gene Ontology annotations. The comparative study shows that the proposed algorithm is the best, or close to the best, among several existing algorithms in terms of the percentage and the number of functionally-enriched biclusters for p-value thresholds ranging from 5 × 10-3 to 5 × 10-2. In particular, 96.0%, 88.0% and 80.0% of the biclusters are annotated with p-values below 0.01. The proposed algorithm identifies biclusters with less deviation from the additive model, and the identified biclusters have reasonable sizes, ranging from 10 to 597 genes and from 5 to 17 conditions. Comparison of processing times suggests that the proposed algorithm is the most efficient.

In the proposed algorithm, the noise threshold is a crucial parameter, as it balances the homogeneity requirement against noise tolerance in the identified biclusters. To determine an appropriate value for the noise threshold, an exploratory approach based on PC plots is adopted. We believe that the proposed biclustering algorithm together with interactive PC plots offers an effective data analysis tool for gene expression data. In the future, our research will focus on detecting bicluster types other than the additive and multiplicative models, e.g. biclusters of coherent evolution.

Availability and requirements

Project home page: http://www.eie.polyu.edu.hk/~nflaw/Biclustering/index.html.

Operating system: Windows XP

Programming language: Matlab 6.5 or above

License: Free for academic use. For non-academic use, please contact the author.

References

1. Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology. 2003, 10 (3–4): 373-384.
2. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470.
3. Lockhart DJ, Winzeler EA: Genomics, gene expression and DNA arrays. Nature. 2000, 405: 827-836.
4. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics. 1999, 22: 281-285.
5. Raychaudhuri S, Sutphin PD, Chang JT, Altman RB: Basic microarray analysis: grouping and feature reduction. Trends in Biotechnology. 2001, 19: 189-193.
6. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proceedings of the National Academy of Sciences of the United States of America. 1999, 96 (12): 6745-6750.
7. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95: 14863-14868.
8. Shamir R, Sharan R: Click: a clustering algorithm for gene expression analysis. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology. 2000, AAAI Press, 307-316.
9. Wu S, Liew AWC, Yan H: Cluster analysis of gene expression data based on self-splitting and merging competitive learning. IEEE Transactions on Information Technology in Biomedicine. 2004, 8 (1): 5-15.
10. Szeto LK, Liew AWC, Yan H, Tang SS: Gene expression data clustering and visualization based on a binary hierarchical clustering framework. Journal of Visual Languages and Computing (Special issue on Biomedical Visualization for Bioinformatics). 2003, 14: 341-362.
11. Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2004, 1 (1): 24-45.
12. Cheng Y, Church GM: Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology. 2000, 93-103.
13. Yoon S, Nardini C, Benini L, Micheli GD: Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2005, 2 (4): 339-354.
14. Zhao H, Liew AWC, Xie X, Yan H: A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. Journal of Theoretical Biology. 2008, 251 (2): 264-274.
15. Gan X, Liew AW, Yan H: Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics. 2008, 9: 209.
16. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22 (9): 1122-1129.
17. Inselberg A, Dimsdale B: Parallel coordinates: a tool for visualizing multidimensional geometry. Proceedings of Visualization. 1990, 361-378.
18. Wegman EJ: Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association. 1990, 85 (411): 664-675.
19. Peng W, Ward MO, Rundensteiner EA: Clutter reduction in multi-dimensional data visualization using dimension reordering. Proceedings of IEEE Symposium on Information Visualization. 2004, 89-96.
20. Ericson D, Johansson J, Cooper M: Visual data analysis using tracked statistical measures within parallel coordinate representations. Proceedings of the 3rd IEEE International Conference on Coordinated & Multiple Views in Exploratory Visualization. 2005, 42-53.
21. Yang J, Ward MO, Rundensteiner EA: Interactive hierarchical displays: a general framework for visualization and exploration of large multivariate data sets. Computers & Graphics. 2003, 27 (2): 265-283.
22. Prasad TV, Ahson SI: Visualization of microarray gene expression data. Bioinformation. 2006, 1: 141-145.
23. Craig P, Kennedy J: Coordinated graph and scatter-plot views for the visual exploration of microarray time-series data. Proceedings of IEEE Symposium on Information Visualization. 2003, 173-180.
24. Hochheiser H, Baehrecke EH, Mount SM, Shneiderman B: Dynamic querying for pattern identification in microarray and genomic data. Proceedings of IEEE International Conference on Multimedia and Expo. 2003, 3: 453-456.
25. Wang H, Wang W, Yang J, Yu PS: Clustering by pattern similarity in large data sets. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002, 394-405.
26. Cheng KO, Law NF, Siu WC, Lau TH: BiVisu: software tool for bicluster detection and visualization. Bioinformatics. 2007, 23 (17): 2342-2344.
27. BicAT (Biclustering Analysis Toolbox). 2006, [http://www.tik.ee.ethz.ch/sop/bicat/]
28. Clustering by Pattern Similarity: the pCluster Algorithm. 2002, [http://wis.cs.ucla.edu/~hxwang/proj/delta.html]
29. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nature Genetics. 2000, 25: 25-29.
30. Castillo-Davis CI, Hartl DL: GeneMerge – post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003, 19 (7): 891-892.
31. Teng L, Chan L-W: Biclustering gene expression profiles by alternately sorting with weighted correlated coefficient. Proceedings of IEEE International Workshop on Machine Learning for Signal Processing. 2006, 289-294.
32. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nature Genetics. 2002, 31: 370-377.
33. Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004, 20: 1993-2003.
34. Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. Pacific Symposium on Biocomputing. 2003, 77-88.
35. Yoon S, Nardini C, Benini L, Micheli GD: Enhanced pClustering and its applications to gene expression data. Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering. 2004, 275-282.

Acknowledgements

The authors thank the anonymous reviewers for their constructive comments. This work is supported by the Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering and the Hong Kong Polytechnic University (A-PA2P). K.O. Cheng acknowledges the research studentships provided by the University.

Author information

Corresponding author

Correspondence to Alan Wee-Chung Liew.

Additional information

Authors' contributions

KOC worked on the proposed biclustering algorithm, implementation and experimental analysis. NFL formulated the biclustering problem and designed the algorithm. WCS initiated the project. AWCL worked on the experimental analysis. All authors read and approved the final manuscript.

Electronic supplementary material

Additional File 1: Supplementary materials. This file describes the parallel coordinate (PC) representation of the biclusters shown in Figure 1 of the article, which can serve as a preliminary for readers unfamiliar with PC plots. It also discusses the visualization of biclusters in a data matrix, which may help readers understand the interactive adjustment of the noise threshold in the proposed algorithm. (PDF 40 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Cheng, KO., Law, NF., Siu, WC. et al. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics 9, 210 (2008). https://doi.org/10.1186/1471-2105-9-210
