Department of Information and Communications Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea

Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea

Abstract

Background

In the functional analysis of gene expression data, biclustering methods can give crucial information by revealing correlated gene expression patterns under a subset of conditions. However, conventional biclustering algorithms remain limited in their ability to produce comprehensive and stable outputs.

Results

We propose a novel biclustering approach called “BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC)” to find comprehensive sets of correlated expression patterns in biclusters using clustered seeds and their expansion with correlation of gene expression. BICLIC outperformed competing biclustering algorithms by completely recovering implanted biclusters in simulated datasets with various types of correlated patterns: shifting, scaling, and shifting-scaling. Furthermore, in a real yeast microarray dataset and a lung cancer microarray dataset, BICLIC found more comprehensive sets of biclusters that are significantly enriched for more diverse sets of biological terms than those of the competing biclustering algorithms.

Conclusions

BICLIC provides significant benefits in finding comprehensive sets of correlated patterns and their functional implications from a gene expression dataset.

Background

Genes under common regulatory mechanisms in specific conditions are likely to show similar expression patterns. Identifying those patterns and the corresponding genes is one of the most important steps of microarray analysis for revealing novel functions of genes, transcription factor-target relationships, and concerted gene functions in pathogenesis.

A bicluster can be defined as a sub-matrix of a whole gene expression data matrix that represents a group of genes showing coherent expression patterns under a subset of conditions.

The output of conventional biclustering methods lacks stability. Because common biclustering methods depend on random starting seeds, the number and the content of the resulting biclusters change from run to run, even when the algorithm is applied to the same microarray dataset. Moreover, random starting seeds guarantee neither a diverse search nor coherent biclustering results. In conventional biclustering methods, however, random starting seeds were an unavoidable compromise to limit computational complexity, and only a few studies have tried to overcome this limitation. Erten and Sözdinler

The choice of the scoring function of a bicluster is also important for improving the performance of biclustering. Mean squared residue, which measures the variability of biclusters based on the arithmetic mean of gene expression, was the first scoring function used to find biclusters

From this point of view, the correlation coefficient can be an alternative scoring function to the mean squared residue. With this measure, correlated expression patterns, including both shifting and scaling patterns, can be detected, which is more relevant to the purpose of biclustering: finding co-expressed gene clusters under the same biological regulation. Allocco

Several correlation-based biclustering approaches have recently been proposed

In this paper, we propose a novel biclustering algorithm called BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC), which aims to search for comprehensive sets of biclusters with correlated gene expression patterns. The primary process of BICLIC is not conducted with random seed biclusters, but with a full search of correlated seed biclusters that are determined by individual dimension-based clustering. The comprehensive sets of correlated seed biclusters are then expanded to larger biclusters using a greedy iterative heuristic approach with the Pearson correlation coefficient as the scoring function. As a result, BICLIC can find comprehensive biclusters accurately and also provides stable output over multiple runs.

We demonstrate that our proposed BICLIC method outperforms other conventional biclustering methods in finding correlated gene expression patterns both in simulated data sets and in real microarray datasets.

Results and discussion

The proposed BICLIC algorithm is implemented in the R language. R-code of the BICLIC algorithm is freely available from

In this section, the performance of our biclustering algorithm is compared with that of three well-known biclustering algorithms: BCCA, CPB, and QUBIC. The BCCA, CPB, and QUBIC programs were obtained from the sources cited in each paper. The performance comparison is divided into two parts. In the first part, simulated datasets are used to test the accuracy and the coverage of each biclustering algorithm in identifying implanted biclusters that have various correlated patterns. In the second part, a real microarray dataset is used to show that BICLIC can extract more diverse sets of correlation-based biclusters than the compared methods, BCCA and QUBIC, and that the biclusters extracted by BICLIC are significantly enriched in biological terms, such as the gene ontology (GO) functional category

Simulated datasets

The purpose of this test is to verify the ability of BICLIC to search for comprehensive correlated patterns and to compare the performance of BICLIC with that of the BCCA and QUBIC algorithms. BCCA is a correlation-based biclustering algorithm, whereas QUBIC is known for its ability to detect various patterns of biclusters, including correlation patterns. BICLIC can find diverse sets of correlated patterns, such as shifting, scaling, and shifting-scaling patterns. Shifting and scaling patterns are defined in

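The expression of gene $i$ under condition $j$, $a_{ij}$, is a shifted expression of a base expression $\pi_j$ with a shifting factor $\beta_i$:

$$a_{ij} = \pi_j + \beta_i \qquad (4)$$

The expression of gene $i$ under condition $j$, $a_{ij}$, is a scaled expression of a base expression $\pi_j$ with a scaling factor $\alpha_i$:

$$a_{ij} = \pi_j \times \alpha_i \qquad (5)$$

A shifting-scaling pattern combines both factors:

$$a_{ij} = \pi_j \times \alpha_i + \beta_i \qquad (6)$$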

Bozdağ proved that the value of the Pearson correlation coefficient is 1 for a perfect shifting, scaling, and shifting-scaling pattern.

To simulate each correlated pattern, a 1000 × 100 data matrix is generated with random values drawn from a normal distribution with mean 0 and standard deviation 1. For each type of correlated pattern, 10 data matrices are generated, resulting in a total of 30 data matrices. In each data matrix, 10 non-overlapping biclusters of size 100 × 10 are implanted. Shifting, scaling, and shifting-scaling patterns of biclusters are generated from equations 4, 5, and 6, respectively. Shifting and scaling factors are randomly generated from a normal distribution with mean 0 and standard deviation 1. To generate positively correlated patterns, the randomly generated scaling factors are replaced by their absolute values.
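The simulation procedure above can be illustrated with a minimal Python sketch (the published implementation is in R; this is not the authors' code). The function names `pearson` and `implant_bicluster`, the fixed random seed, and the choice to implant a single bicluster in the top-left corner are assumptions made for illustration. The sketch implants one shifting-scaling bicluster and checks that its rows are perfectly correlated over the bicluster columns.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def implant_bicluster(matrix, rows, cols, pattern, rng):
    """Overwrite matrix[rows][cols] with one correlated pattern.

    pattern is "shifting", "scaling", or "shifting-scaling".
    """
    base = [rng.gauss(0, 1) for _ in cols]  # base expression, one value per column
    for i in rows:
        # scaling factors are made positive to obtain positive correlation
        alpha = abs(rng.gauss(0, 1)) if "scaling" in pattern else 1.0
        beta = rng.gauss(0, 1) if "shifting" in pattern else 0.0
        for k, j in enumerate(cols):
            matrix[i][j] = base[k] * alpha + beta


rng = random.Random(0)  # fixed seed (assumption) so the sketch is repeatable
n_genes, n_conds = 1000, 100
data = [[rng.gauss(0, 1) for _ in range(n_conds)] for _ in range(n_genes)]

rows, cols = list(range(100)), list(range(10))  # one 100 x 10 bicluster
implant_bicluster(data, rows, cols, "shifting-scaling", rng)

# Lowest correlation between the first implanted row and every other implanted row.
first = [data[0][j] for j in cols]
lowest = min(pearson(first, [data[i][j] for j in cols]) for i in rows[1:])
print(lowest)
```

Because every implanted row is a positive linear function of the same base expression, the lowest pairwise correlation stays at 1 up to floating-point error.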

In addition, simulated datasets that have implanted biclusters with different numbers of columns are generated to study the effect of column size on the performance of the biclustering algorithms. The size of the whole data matrix is 1000 × 100, the same as that of the previous simulated dataset. The number of rows of a bicluster is fixed at 100, but the number of columns varies from 20 to 100. Five biclusters of different sizes are implanted in each 1000 × 100 data matrix. These simulated datasets are also generated for the three kinds of correlated patterns: shifting, scaling, and shifting-scaling.

To compare the accuracy of different biclustering algorithms on simulated datasets, the average match score proposed by Prelić is used:

$$S_G(M_1, M_2) = \frac{1}{|M_1|} \sum_{G_1 \in M_1} \max_{G_2 \in M_2} \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

$G_1$ and $G_2$ are gene sets in the bicluster sets $M_1$ and $M_2$, respectively. $|G_1 \cap G_2|$ is the number of data elements in the intersection of $G_1$ and $G_2$, and $|G_1 \cup G_2|$ is the number of data elements in the union of $G_1$ and $G_2$. $S_G(M_1, M_2)$ represents the average of the maximum match score for all biclusters in $M_1$ when compared to biclusters in $M_2$. If $M_1$ is the set of implanted true biclusters and $M_2$ is a set of generated biclusters, $S_G(M_1, M_2)$ represents the average recovery score. The average recovery score measures how well the biclustering algorithm recovers the implanted true biclusters. Conversely, the average match score with the arguments swapped, $S_G(M_2, M_1)$, averages over the generated biclusters and represents the average relevance score. The average relevance score measures the level of similarity of all generated biclusters to the implanted biclusters.

A correlation threshold is required to run BICLIC, BCCA, and CPB. Since all biclusters in the simulated datasets are perfectly correlated, 1 is used as the correlation threshold for BICLIC, BCCA, and CPB. The minimum numbers of rows and columns in biclusters, which are additional parameters of BICLIC, are set to five in order to filter out excessively small biclusters. We set

| **Algorithm** | **Shifting** | **Scaling** | **Shifting-Scaling** |
|---|---|---|---|
| BICLIC | 1 | 1 | 1 |
| BCCA | 0.141 | 0.181 | 0.168 |
| CPB | 1 | 0.996 | 0.915 |
| QUBIC | 0.431 | 0.169 | 0.466 |

The maximum and minimum values of the average recovery score are 1 and 0, respectively. Each average recovery score in Table

| **Algorithm** | **Shifting** | **Scaling** | **Shifting-Scaling** |
|---|---|---|---|
| BICLIC | 1 | 1 | 1 |
| BCCA | 0.060 | 0.109 | 0.094 |
| CPB | 0.143 | 0.297 | 0.258 |
| QUBIC | 0.038 | 0.043 | 0.107 |

The maximum and minimum values of the average relevance score are 1 and 0, respectively. Each average relevance score in Table
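The average recovery and relevance scores defined earlier in this section can be sketched in a few lines of Python. This is a minimal illustration of the gene-dimension match score, not the evaluation code used for the tables; the function name `match_score` and the toy gene sets are assumptions.

```python
def match_score(m1, m2):
    """Average, over gene sets in m1, of the best Jaccard index against m2."""
    if not m1 or not m2:
        return 0.0
    total = 0.0
    for g1 in m1:
        total += max(len(g1 & g2) / len(g1 | g2) for g2 in m2)
    return total / len(m1)


# Toy example: two implanted biclusters, three biclusters reported by a method.
implanted = [frozenset(range(0, 100)), frozenset(range(100, 200))]
found = [
    frozenset(range(0, 100)),    # exact hit on the first implanted bicluster
    frozenset(range(100, 150)),  # half of the second implanted bicluster
    frozenset(range(500, 520)),  # unrelated genes
]

recovery = match_score(implanted, found)   # averages over implanted biclusters
relevance = match_score(found, implanted)  # averages over found biclusters
print(recovery, relevance)
```

In this toy case the recovery score is 0.75 and the relevance score is 0.5: the unrelated bicluster lowers relevance but does not affect recovery, which is why a method can recover well and still score poorly on relevance.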

**Effect of column fraction level on average recovery score in shifting, scaling, and shifting-scaling pattern.** Each average recovery score is the mean value of average recovery scores from 10 independent datasets.

Experimental dataset

To investigate the usefulness of BICLIC in searching comprehensive sets of correlation-based biclusters, a yeast Saccharomyces cerevisiae dataset

Table

| **Method** | **Count** | **Average \|I x J\|** | **Gene cov.** | **Condition cov.** | **Cell cov.** |
|---|---|---|---|---|---|
| BICLIC | 14791 | 2249.3 | 1 | 1 | 0.999 |
| | (11172) | (7.2) | (0.905) | (1) | (0.109) |
| BCCA | 8163 | 2936.8 | 0.776 | 1 | 0.317 |
| CPB | 3634 | 8413.6 | 0.512 | 1 | 0.185 |
| QUBIC | 2146 | 847.4 | 0.884 | 0.746 | 0.112 |

Values in parentheses denote the values of seed biclusters of BICLIC. The columns “Count”, “Average |I x J|”, “Gene cov.”, “Condition cov.”, and “Cell cov.” show the number of biclusters, the average size of biclusters, the coverage of biclusters in the gene dimension, the coverage of biclusters in the condition dimension, and the coverage of biclusters over all cells in the matrix, respectively.

Table

| **Method** | **Count** | **Average \|I x J\|** | **Gene cov.** | **Condition cov.** | **Cell cov.** |
|---|---|---|---|---|---|
| BICLIC | 6019 | 2302.8 | 1 | 1 | 0.999 |
| | (3734) | (4.2) | (0.389) | (1) | (0.021) |
| CPB | 386 | 4594.8 | 0.672 | 1 | 0.344 |
| QUBIC | 1355 | 68.2 | 0.543 | 1 | 0.048 |

Values in parentheses denote the values of seed biclusters of BICLIC. The columns “Count”, “Average |I x J|”, “Gene cov.”, “Condition cov.”, and “Cell cov.” show the number of biclusters, the average size of biclusters, the coverage of biclusters in the gene dimension, the coverage of biclusters in the condition dimension, and the coverage of biclusters over all cells in the matrix, respectively.

In an additional experiment, the overlap level of the biclusters extracted from the yeast stress dataset was evaluated for BICLIC, BCCA, CPB, and QUBIC. All biclusters found by each biclustering algorithm were arranged in decreasing order of bicluster size. When a bicluster had

**Proportion of the remaining biclusters after removing overlapping biclusters in each biclustering algorithm for yeast stress dataset.**

Summary statistics of the remaining biclusters after removing overlapped biclusters by varying the overlap level for the yeast stress dataset in each biclustering algorithm. **Figure S1**. The number of significantly enriched biological terms for three biclustering algorithms in four functional categories at various significance levels. (a) GO Biological Process, (b) GO Cellular Component, (c) GO Molecular Function, (d) KEGG Pathway.

Function enrichment evaluation

To investigate the biological relevance of extracted biclusters, functional enrichment of extracted biclusters was conducted with the GO functional category and the KEGG biological pathway for each biclustering algorithm. A modified version of COFECO (composite function annotation enriched by protein complex data) was used for functional enrichment analysis

**The number of significantly enriched biological terms for four bi-clustering algorithms in four functional categories at 1% significance threshold for yeast stress data set.**

BICLIC found the largest number of significantly enriched functional terms compared to BCCA, CPB, and QUBIC in GO BP, GO CC, GO MF, and KEGG. Compared to QUBIC, BCCA and CPB found fewer unique functional terms, even though BCCA and CPB found more and larger biclusters. This means that many genes and conditions overlap heavily among the biclusters found by BCCA and CPB. Furthermore, the functionally enriched terms are also highly redundant in BCCA and CPB. In contrast, BICLIC found comprehensive sets of biclusters and obtained a large number of significant results from the functional enrichment process with GO BP, GO CC, GO MF, and KEGG.

We also conducted functional enrichment of the biclusters extracted from the lung cancer dataset in the same way as for the yeast stress dataset described above. Figure

**The number of significantly enriched biological terms for three bi-clustering algorithms in four functional categories at 1% significance threshold for lung cancer data set.**

Conclusions

In this paper, we proposed a novel biclustering method, BICLIC, to search for comprehensive sets of correlation-based biclusters. Our algorithm conducts individual dimension-based clustering for efficient determination of comprehensive sets of correlated seed biclusters, which are further expanded to larger correlation-based biclusters. Simulated and real microarray datasets were used to perform several experiments, and the results were compared to those obtained using BCCA, CPB, and QUBIC. The experiments showed that BICLIC could find implanted correlated biclusters accurately, while competing methods such as BCCA and QUBIC performed poorly. In addition, BICLIC was able to extract more comprehensive sets of biclusters than the other biclustering algorithms. Although CPB performed well on the simulated datasets, it performed poorly on the real microarray datasets. Finally, the biclusters found by BICLIC were enriched for more diverse biological terms in GO and KEGG.

Methods

The BICLIC biclustering method consists of four phases: finding comprehensive seed biclusters, expanding seed biclusters, filtering less correlated genes and conditions, and checking and removing duplicated biclusters. The process of finding comprehensive seed biclusters is summarized in Figure

**Schematic diagram of determining seed biclusters.**

**Schematic diagram of expanding seed biclusters.**

Definitions

Definition 1

An input microarray matrix, $A$, is an $n \times m$ matrix of expression values, where $G = \{g_1, g_2, \ldots, g_i, \ldots, g_{n-1}, g_n\}$ is a set of genes and $C = \{c_1, c_2, \ldots, c_j, \ldots, c_{m-1}, c_m\}$ is a set of conditions.

Definition 2

A seed bicluster,

Definition 3

An expanded bicluster,

Finding comprehensive seed biclusters

The generation of seed biclusters is illustrated in Algorithm 1. In this phase, comprehensive sets of initial biclusters are to be found and they will be expanded in a later phase. This phase consists of two steps: individual dimension-based clustering and seed bicluster determination. An

An individual dimension-based clustering method is more efficient than those conventional approaches although conventional clustering algorithms such as

After individual dimension-based clustering, the genes that have similar expression values in each individual condition are labeled with the same cluster index. The

Algorithm 1 Seed Bicluster Extraction Algorithm

**Input:**

**Output:**

Steps:

1. Individual dimension-based clustering

For each of the m individual conditions, do:

A. Align gene set {$g_1$, $g_2$, …, $g_n$} to {$g_1'$, $g_2'$, …, $g_n'$} in increasing order of gene expression value, where $g_1' \le g_2' \le \ldots \le g_{n-1}' \le g_n'$.

B. Initially, set gene index

C. Measure standard deviation of all genes in this condition and set it as

D. Let $CL_{KI}$ be the set of cluster member genes when the cluster index is $KI$. If $CL_{KI}$ == NULL, then

a. Set cumulative number of genes in cluster set,

b. $CL_{KI} = CL_{KI} \cup \{g_i'\}$.

c. Assign cluster index

E. If cluster $CL_{KI}$ != NULL, then

a. Set

b. Set

c. Set $CL_{KI} = CL_{KI} \cup \{g_i'\}$.

d. Measure standard deviation of $CL_{KI}$ when the number of member genes in the cluster set is $cum$, sd($CL_{KI, cum}$).

e. While sd($CL_{KI, cum}$) ≤ sd($CL_{KI, cum-1}$) and sd($CL_{KI, cum}$) ≤

i. Set

ii. Set $CL_{KI} = CL_{KI} \cup \{g_i'\}$.

iii. Assign cluster index

f. If sd($CL_{KI, cum}$) > sd($CL_{KI, cum-1}$) or sd($CL_{KI, cum}$) >

i. Set

ii. Set $CL_{KI} = CL_{KI} \cup \{g_i'\}$.

iii. Assign cluster index

iv. Set

F. Repeat Step 1D to 1E until

G. Align cluster indexed genes, i.e. {1, 1, 2, 2, …}, to the original gene order {$g_1$, $g_2$, …, $g_n$}.

H. Combine the m cluster index vectors to original

2. Seed bicluster determination

For each of the m individual conditions, do:

A. Initially, set seed bicluster set

B. For each cluster index $s$ in this condition

a. Take the rows of genes that have cluster index $s$.

b. Set seed cluster condition set,

c. For each condition $c_j$

i. Take the collection of genes in those rows under condition $c_j$.

ii. If these genes have the same cluster index under $c_j$, then add $c_j$ to the seed cluster condition set.

iii. If the number of elements in

- Set seed bicluster,

- Add each seed bicluster,
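The two steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' R implementation: the cluster growth rule is reduced to a single test (a gene joins the current cluster while the cluster's standard deviation stays within a fraction of the condition's standard deviation), and the names `cluster_one_condition`, `seed_biclusters`, and the parameter `sd_fraction` are hypothetical. The parameters `min_rows` and `min_cols` stand in for the minimum numbers of rows and columns mentioned earlier.

```python
import statistics


def cluster_one_condition(values, sd_fraction=0.1):
    """Label genes by 1-D clustering of their expression in one condition.

    Genes are visited in increasing order of expression. A gene joins the
    current cluster while the cluster's standard deviation stays within
    sd_fraction * (standard deviation of the whole condition); otherwise a
    new cluster is opened. sd_fraction is an assumed, illustrative parameter.
    """
    limit = sd_fraction * statistics.pstdev(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current, label = [], 0
    for i in order:
        if current and statistics.pstdev(current + [values[i]]) > limit:
            current, label = [], label + 1  # close the cluster, open a new one
        current.append(values[i])
        labels[i] = label
    return labels


def seed_biclusters(matrix, min_rows=2, min_cols=2, sd_fraction=0.1):
    """Return seed biclusters as (genes, conditions) pairs of frozensets."""
    n_cols = len(matrix[0])
    labels = [cluster_one_condition([row[j] for row in matrix], sd_fraction)
              for j in range(n_cols)]
    seeds = set()
    for j in range(n_cols):
        groups = {}
        for gene, lab in enumerate(labels[j]):
            groups.setdefault(lab, []).append(gene)
        for genes in groups.values():
            if len(genes) < min_rows:
                continue
            # keep every condition in which these genes share one cluster index
            conds = [k for k in range(n_cols)
                     if len({labels[k][g] for g in genes}) == 1]
            if len(conds) >= min_cols:
                seeds.add((frozenset(genes), frozenset(conds)))
    return seeds


matrix = [
    [1.0, 5.0, 9.0],
    [1.1, 5.1, 2.0],
    [8.0, 0.0, 9.1],
    [8.1, 0.1, 2.1],
]
seeds = seed_biclusters(matrix)
print(sorted((sorted(g), sorted(c)) for g, c in seeds))
```

Because every condition and every cluster is visited in a fixed order and no random seed is involved, repeated runs on the same matrix return the same seed set, which is the stability property the method aims for.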

Expanding seed biclusters

In this phase, the previously determined comprehensive sets of seed biclusters are expanded to larger biclusters with correlated patterns. The Pearson correlation coefficient is used as the scoring function to measure the correlation between pairs of genes over subsets of conditions while seed biclusters are expanded, keeping the similarity above a correlation threshold. BICLIC uses a heuristic approach to expand seed biclusters efficiently: genes or conditions are merged into a seed bicluster one at a time, from the most similar to the least similar. Each seed bicluster is expanded in two ways, gene-wise and condition-wise, while the average Pearson correlation coefficient of pairs of genes over conditions in each expanded bicluster is kept above the correlation threshold. This heuristic requires considerably less computation than an exhaustive search of all possible combinations of genes and conditions. Although the expanded biclusters may be less comprehensive with the proposed heuristic than with an iterative approach, this disadvantage is alleviated by the comprehensive sets of correlated seed biclusters.

In gene-wise expansion, the number of conditions in a seed bicluster must be equal to or greater than 3. Otherwise, the average Pearson correlation coefficient of gene-wise expanded biclusters will be +1, -1, or non-computable. For each seed bicluster, the Pearson correlation coefficient between the seed bicluster and each gene vector is calculated to find candidate sets of correlated genes for expansion. Then, genes are merged into the seed bicluster one at a time, in decreasing order of the correlation coefficient between the gene vector and the seed bicluster, so that similar genes are added efficiently. Merging stops when the average Pearson correlation coefficient of the gene-wise expanded bicluster would fall below the correlation threshold value, θ. This expansion approach also leads to stable expansion results, because the order in which genes are merged is fixed once the Pearson correlation coefficients between the seed bicluster and the gene vectors have been calculated. The Pearson correlation coefficient between a seed bicluster and a gene vector is calculated using equation 1.

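$$r(s_{mean}, g_i) = \frac{\mathrm{cov}(s_{mean}, g_i)}{\sigma_{s_{mean}} \, \sigma_{g_i}} \qquad (1)$$

$s_{mean}$ is the mean expression vector of a seed bicluster and $g_i$ is the expression vector of the $i$-th gene over the conditions of the seed bicluster.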
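Gene-wise expansion as described above can be sketched in Python as follows. This is a minimal illustration, not the authors' R code: the function names `pearson` and `expand_genes`, the toy matrix, and the threshold of 0.95 are assumptions, and a gene with zero variance is given a correlation of 0 as a simplifying convention.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def expand_genes(matrix, seed_genes, seed_conds, theta):
    """Greedy gene-wise expansion of a seed bicluster (simplified sketch)."""
    def profile(g):
        return [matrix[g][c] for c in seed_conds]

    # mean expression vector of the seed over its conditions
    mean_vec = [sum(matrix[g][c] for g in seed_genes) / len(seed_genes)
                for c in seed_conds]
    candidates = [g for g in range(len(matrix)) if g not in seed_genes]
    # the merge order is fixed once, by correlation with the seed mean vector
    candidates.sort(key=lambda g: pearson(mean_vec, profile(g)), reverse=True)

    genes = list(seed_genes)
    for g in candidates:
        trial = [profile(x) for x in genes + [g]]
        pairs = list(combinations(trial, 2))
        average = sum(pearson(p, q) for p, q in pairs) / len(pairs)
        if average < theta:
            break  # adding this gene would drop the average below the threshold
        genes.append(g)
    return genes


matrix = [
    [1, 2, 3, 4],  # seed gene
    [2, 4, 6, 8],  # seed gene (scaled copy)
    [3, 4, 5, 6],  # shifted copy, perfectly correlated
    [4, 3, 2, 1],  # anti-correlated
    [1, 3, 2, 4],  # partly correlated
]
expanded = expand_genes(matrix, seed_genes=[0, 1], seed_conds=[0, 1, 2, 3], theta=0.95)
print(expanded)
```

Sorting the candidates once keeps the procedure deterministic, and stopping at the first gene that breaks the threshold avoids evaluating all remaining candidates.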

In condition-wise expansion, the correlation coefficient of an expanded seed bicluster is computed when each candidate condition is merged to a seed bicluster. Condition-wise expansion checks whether genes in a seed bicluster have additional correlated expression patterns in the remaining conditions. If the average correlation coefficient of a condition-wise expanded bicluster is greater than the correlation threshold, genes in such biclusters show a correlated expression pattern over both conditions in the seed bicluster and expanded conditions. The average Pearson correlation coefficient of biclusters after expanding condition

If the average Pearson correlation coefficient after expanding condition $c_j$ is greater than the overall correlation threshold

After expanding a seed bicluster in gene-wise and condition-wise directions, a vertically and horizontally long matrix can be acquired, respectively. These two matrices can be combined to form a larger matrix that has rows in the gene-wise expanded bicluster and columns in the condition-wise expanded bicluster. This combined matrix is theoretically the largest size of matrix to which a seed bicluster can be expanded. The correlation coefficient of this matrix is less than the correlation threshold

Filtering less correlated genes and conditions

Each correlated seed bicluster is enlarged to a larger candidate bicluster by combining gene-wise expanded biclusters and condition-wise expanded biclusters. Although not all genes may show correlated patterns over all conditions in a candidate bicluster matrix, at least all genes and conditions in this candidate bicluster are correlated with the seed bicluster. Correlation-based biclusters can be acquired by backwardly eliminating less correlated sets of genes and conditions. Algorithm 2 illustrates the steps of filtering less correlated genes and conditions. The average Pearson correlation coefficient of a candidate bicluster matrix, $APCC_{CB}$, is defined in equation 3. The vectors $g_p$ and $g_q$ are the

In each iteration, the less correlated set of genes and the least correlated condition are determined from a candidate bicluster matrix. First, the least correlated condition is eliminated and the increase in the average Pearson correlation coefficient (APCC) of the remaining matrix is measured. This result is set aside; then the less correlated set of genes is eliminated from the original matrix and the increase in the APCC is measured again. The two increases are compared, and the elimination that yields the higher increase is applied. For instance, when the increase in the APCC from eliminating the least correlated condition is higher than that from eliminating the less correlated set of genes, the condition is eliminated and the genes are retained, and vice versa. The number of conditions represents the length of a correlated expression pattern. Therefore, the least correlated condition is compared to a set of less correlated genes in order to extract a large correlated expression pattern. Less correlated sets of genes or the least correlated condition are removed repeatedly until the average correlation coefficient of the matrix is equal to or greater than the correlation threshold, at which point a correlation-based bicluster matrix is acquired.
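The elimination loop described above can be sketched in Python as follows. This is a simplified illustration, not the authors' R code. It assumes that the APCC is the plain mean of the signed Pearson correlation over all gene pairs, that a gene is "less correlated" when removing it alone raises the APCC, and that the loop gives up (returns `None`) when no removal helps; the names `apcc`, `filter_bicluster`, `min_rows`, and `min_cols` are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def apcc(matrix, genes, conds):
    """Average Pearson correlation over all gene pairs (needs >= 2 genes)."""
    rows = [[matrix[g][c] for c in conds] for g in genes]
    pairs = list(combinations(rows, 2))
    return sum(pearson(p, q) for p, q in pairs) / len(pairs)


def filter_bicluster(matrix, genes, conds, theta, min_rows=2, min_cols=3):
    """Backward elimination until the APCC reaches theta (min_rows >= 2)."""
    genes, conds = list(genes), list(conds)
    while len(genes) >= min_rows and len(conds) >= min_cols:
        current = apcc(matrix, genes, conds)
        if current >= theta:
            return genes, conds
        # option 1: drop every gene whose individual removal raises the APCC
        weak = set()
        if len(genes) > 2:
            weak = {g for g in genes
                    if apcc(matrix, [x for x in genes if x != g], conds) > current}
        kept = [g for g in genes if g not in weak]
        gene_score = apcc(matrix, kept, conds) if weak and len(kept) >= 2 else -math.inf
        # option 2: drop the single condition whose removal raises the APCC most
        cond_score, worst = -math.inf, None
        if len(conds) > min_cols:
            for c in conds:
                score = apcc(matrix, genes, [x for x in conds if x != c])
                if score > current and score > cond_score:
                    cond_score, worst = score, c
        if worst is None and gene_score == -math.inf:
            return None  # no removal improves the APCC
        if cond_score > gene_score:
            conds.remove(worst)
        else:
            genes = kept
    return None


# Four genes that are perfectly correlated on conditions 0-3; condition 4 is noise.
matrix = [
    [1, 2, 3, 4, 0],
    [2, 4, 6, 8, 1],
    [3, 4, 5, 6, 7],
    [0, 1, 2, 3, 10],
]
result = filter_bicluster(matrix, genes=[0, 1, 2, 3], conds=[0, 1, 2, 3, 4], theta=0.99)
print(result)
```

In this toy case, removing the noisy condition raises the APCC to 1, which beats any gene removal, so all four genes are kept and only the last condition is dropped.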

Algorithm 2 Filtering less correlated genes and conditions Algorithm

**Input:**

**Output:**

Steps:

1. Calculating average Pearson correlation coefficient of candidate bicluster matrix

A. Calculate average Pearson correlation coefficient of all genes in candidate bicluster matrix, $APCC_{CB}$,

B. If $APCC_{CB}$ ≥

C. If $APCC_{CB}$ <

2. Calculating average Pearson correlation coefficient after eliminating less correlated sets of genes.

A. Initially, set less correlated gene set

B. For each gene $g_i$

a. Calculate average Pearson correlation coefficient after eliminating gene $g_i$, $APCC_{CB, g}$

b. If $APCC_{CB, g}$ > $APCC_{CB}$, then add $g_i$ to the less correlated gene set

C. Calculate average Pearson correlation coefficient after eliminating the less correlated gene set, $APCC_{CB, lg}$

3. Calculating average Pearson correlation coefficient after eliminating the least correlated condition.

A. Initially, set less correlated condition set

B. For each condition $c_j$

a. Calculate average Pearson correlation coefficient after eliminating condition $c_j$, $APCC_{CB, c_j}$

b. If $APCC_{CB, c_j}$ > $APCC_{CB}$, then add $c_j$ to the less correlated condition set and record $APCC_{CB, c_j}$

c. Select the condition with the maximum recorded value, $c_{j,max}$

4. Comparing average Pearson correlation coefficient increase between eliminating set of genes and condition

A. If the increase from eliminating the less correlated gene set is higher, eliminate the less correlated gene set.

B. If the increase from eliminating the least correlated condition $c_{j,max}$ is higher, eliminate $c_{j,max}$.

5. Repeat step 1 to 4 until $APCC_{CB}$ ≥ θ

6. If $APCC_{CB}$ ≥

Checking and removing duplicated biclusters

After all seed biclusters are expanded, different seed biclusters may have been expanded to the same bicluster. Also, some biclusters may include other biclusters. Therefore, it is necessary to check for duplicated biclusters. All biclusters are arranged in increasing order of bicluster size. From the smallest bicluster to the largest, the composition of genes and conditions in each bicluster is compared with that of the same-size or larger biclusters. If every gene and condition in a bicluster is included in another bicluster, the included bicluster is removed. After removing duplicated biclusters, the remaining biclusters have unique compositions of genes and conditions.
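The duplicate check described above can be sketched in Python as follows. This is a minimal illustration, not the authors' R code; it assumes each bicluster is represented as a pair of frozensets (genes, conditions), and the function name `remove_duplicates` is hypothetical.

```python
def remove_duplicates(biclusters):
    """Drop biclusters whose genes and conditions both lie inside another one."""
    # Exact duplicates collapse in the set; then sort smallest first so each
    # bicluster is only compared with same-size or larger biclusters.
    ordered = sorted(set(biclusters), key=lambda b: len(b[0]) * len(b[1]))
    kept = []
    for idx, (genes, conds) in enumerate(ordered):
        contained = any(genes <= g2 and conds <= c2
                        for g2, c2 in ordered[idx + 1:])
        if not contained:
            kept.append((genes, conds))
    return kept


a = (frozenset({1, 2, 3}), frozenset({0, 1}))
b = (frozenset({1, 2}), frozenset({0, 1}))   # contained in a
c = (frozenset({1, 2, 3}), frozenset({0, 1}))  # same composition as a
d = (frozenset({7, 8}), frozenset({2, 3}))   # unrelated

unique = remove_duplicates([a, b, c, d])
print(len(unique))
```

The pairwise comparison is quadratic in the number of biclusters, which is acceptable for a sketch; with tens of thousands of biclusters, an index from genes to biclusters would reduce the number of comparisons.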

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

TY wrote the R-based algorithm, performed experiments, and wrote the manuscript. GSY conceived and supervised this study, and reviewed the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported by the National Research Foundation of Korea (NRF) grant No. 2012–0001001, the Converging Research Center Program grant No. 2012 K001442, and the KAIST Future Systems Healthcare Project funded by the Ministry of Education, Science and Technology (MEST) of the Korean government.