School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA

Department of Statistics, Harvard University, Cambridge, Massachusetts, USA

Abstract

Background

Biclustering of gene expression data searches for local patterns of gene expression. A bicluster (or a two-way cluster) is defined as a set of genes whose expression profiles are mutually similar within a subset of experimental conditions/samples. Although several biclustering algorithms have been studied, few are based on rigorous statistical models.

Results

We developed a Bayesian biclustering (BBC) model and implemented a Gibbs sampling procedure for its statistical inference. We showed that the Bayesian biclustering model can correctly identify multiple clusters in gene expression data. Using data simulated both from the model and with realistic characteristics, we demonstrated that the BBC algorithm outperforms other methods in both robustness and accuracy. We also showed that the model is stable under two normalization methods, the interquartile range normalization and the smallest quartile range normalization. Applying the BBC algorithm to yeast expression data, we observed that the majority of the biclusters we found are supported by significant biological evidence, such as enrichment of gene functions and of transcription factor binding sites in the corresponding promoter sequences.

Conclusions

The BBC algorithm is shown to be a robust model-based biclustering method that can discover biologically significant gene-condition clusters in microarray data. The BBC model can easily handle missing data via Monte Carlo imputation and has the potential to be extended to integrated study of gene transcription networks.

Background

Clustering gene expression data has been an important problem in computational biology. While traditional clustering methods, such as hierarchical and K-means clustering, are useful for analyzing microarray data, they have some limitations. First, a gene or an experimental condition can be assigned to only one cluster. Second, all genes and conditions must be assigned to clusters. Biologically, however, a gene or a sample can participate in multiple biological pathways, and a cellular process typically involves only a subset of genes and is active only under a subset of experimental conditions. A biclustering scheme that produces gene and condition/sample clusters simultaneously can model the situation where a gene or a condition is involved in several biological functions. Furthermore, a biclustering model can exclude "noise" genes that are not active in any experimental condition.

Biclustering of microarray data was first introduced by Cheng and Church

where the noise terms ε_{ij} are assumed independent Gaussian with variance σ^{2}. It further assumes that the expression values in the overlap of two biclusters are the sum of the two module effects. The plaid model uses a greedy search strategy, so errors can accumulate easily. Moreover, in the multiple-cluster case, the clusters identified by the algorithm tend to overlap to a great extent. Tanay et al.

Here we propose a Bayesian biclustering (BBC) model. For a single bicluster, we assume the same model as in the plaid model

Results and discussion

Simulation results

Bayesian biclustering in various simulated scenarios

We simulated a dataset with 400 genes and 50 samples. The background data are i.i.d. Gaussian, and for each of the two biclusters the main effects μ_{1}, μ_{2}, the gene effects α_{i1}, α_{i2}, the condition effects β_{j1}, β_{j2}, and the noise terms ε_{ij1}, ε_{ij2} were drawn from Gaussian distributions.
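As an illustration, a dataset of this shape can be generated as follows. The Gaussian parameters and bicluster positions below are placeholder choices, since the paper's exact simulation settings are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 400, 50

# Background: i.i.d. standard normal (an assumption; every number in this
# sketch is an illustrative placeholder, not the paper's actual setting).
Y = rng.normal(0.0, 1.0, size=(n_genes, n_samples))

def plant_bicluster(Y, genes, samples, mu, effect_sd=0.5, noise_sd=0.5):
    """Overwrite a block with main effect + gene effects + sample effects + noise."""
    a = rng.normal(0.0, effect_sd, size=len(genes))    # gene effects alpha_i
    b = rng.normal(0.0, effect_sd, size=len(samples))  # sample effects beta_j
    noise = rng.normal(0.0, noise_sd, size=(len(genes), len(samples)))
    Y[np.ix_(genes, samples)] = mu + a[:, None] + b[None, :] + noise

plant_bicluster(Y, np.arange(0, 100), np.arange(0, 20), mu=2.0)
plant_bicluster(Y, np.arange(150, 250), np.arange(10, 30), mu=-2.0)  # shares samples 10-19
```

The two planted blocks share samples 10-19 but no genes, mimicking one of the overlap scenarios considered below.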

We considered three scenarios for datasets with two clusters: the two clusters have some common conditions but distinct genes; the two clusters share common genes; and the two clusters share both common samples and common genes (see the figure below).

Simulated data with two biclusters and the results of the BBC analysis

**Simulated data with two biclusters and the results of the BBC analysis.** Bayesian biclustering for simulated datasets. (a) A dataset with two non-overlapping clusters. (b)-(c) The two clusters found by the Bayesian biclustering model from (a). (d) A dataset with two clusters with common genes. (e)-(g) The three clusters found by the Bayesian biclustering model from (d). (h) A dataset with two clusters with both common samples and common genes. (i)-(k) The three clusters found by the Bayesian biclustering model from (h).

Comparison of biclustering algorithms on data simulated from statistical models

We compared six biclustering methods: the BBC method, the plaid model, ISA, SAMBA, OPSMs, and Cheng and Church's biclustering (CC). We considered both the single-cluster case and the multiple-cluster case using data simulated from the plaid model, with main effects, gene effects, condition effects, and noise terms drawn from Gaussian distributions (see the figure below).

Datasets simulated according to the plaid model

**Datasets simulated according to the plaid model.** Datasets for comparison. (a) A dataset with a single cluster. (b) A dataset with two clusters whose genes and samples both overlap.

Since each method searches for biclusters with different structures, comparing biclustering results is not straightforward. To carry out a comprehensive comparison among the various biclustering results on simulated datasets, we use the following four characteristics: sensitivity, specificity, overlapping rate, and number of clusters. Since we know which gene-condition combinations belong to the true biclusters, we use the standard definitions of sensitivity and specificity, both of which take values between 0 and 1. A higher sensitivity means that more "true" members of the clusters have been identified by the algorithm, while a higher specificity means that more background data points are excluded from the clusters. The overlapping rate measures the extent to which the identified clusters overlap one another.

Thus, if there is no overlap between the identified clusters, the overlapping rate is 0. On the other hand, if the identified clusters greatly overlap with each other, the overlapping rate is close to 1.
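Given the descriptions above, all four summary statistics can be computed directly from boolean membership masks. The overlapping-rate formula below (one minus the size of the union divided by the summed cluster sizes) is our reading of the text's description, not a formula quoted from it:

```python
import numpy as np

def evaluate_biclusters(true_mask, found_masks):
    """Score a set of identified biclusters against the known truth.

    true_mask   : boolean gene-by-condition array marking cells of the true biclusters.
    found_masks : list of boolean arrays, one per identified bicluster.
    """
    union = np.zeros_like(true_mask, dtype=bool)
    total_size = 0
    for mask in found_masks:
        union |= mask
        total_size += int(mask.sum())
    tp = int((union & true_mask).sum())    # true cluster cells recovered
    fp = int((union & ~true_mask).sum())   # background cells wrongly included
    sensitivity = tp / int(true_mask.sum())
    specificity = 1.0 - fp / int((~true_mask).sum())
    # Assumed formula: 0 when the found clusters are disjoint,
    # approaching 1 when they coincide almost completely.
    overlap_rate = 1.0 - union.sum() / total_size if total_size else 0.0
    return sensitivity, specificity, overlap_rate, len(found_masks)
```

For example, reporting the same cluster twice leaves sensitivity and specificity unchanged but drives the overlapping rate to 0.5.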

We used the BicAT software package; the results are summarized in the table below.

Biclustering results of different methods for simulated data using the plaid model

In each cell, the first value is for the single-cluster dataset and the second for the two-cluster dataset.

| Method | Sensitivity | Specificity | Overlapping rate | # of clusters |
| --- | --- | --- | --- | --- |
| ISA (0.6, 1) | 1 / 0.84 | 0.99 / 0.84 | 0 / 0.12 | 1 / 3 |
| ISA (0.6, 1.2) | 0.95 / 0.53 | 0.84 / 0.90 | 0.06 / 0.08 | 10 / 8 |
| ISA (0.7, 1.1) | 0.84 / 0.68 | 0.91 / 0.84 | 0 / 0.16 | 10 / 8 |
| SAMBA | 0.43 / 0.39 | 0.99 / 0.99 | 0.31 / 0.3 | 7 / 14 |
| CC* | 1 / 0.98 | 0 / 0 | 0.02 / 0 | 10 / 10 |
| OPSMs | 0.38 / 0.25 | 0.94 / 0.96 | 0.3 / 0.5 | 11 / 12 |
| Plaid | 1 / 1 | 1 / 0.73 | 0 / 0.63 | 1 / 11 |
| BBC** | 1 / 1 | 1 / 1 | 0 / 0 | 1 / 3 |

Note: *In CC's method, the number of clusters is preset to be 10. **In BBC, the overlapping rate is automatically 0.

It can be seen that the ISA method is very sensitive to the choice of thresholds, and its performance degrades in the case of multiple overlapping clusters. The SAMBA and OPSMs methods correctly rejected almost all background noise but tended to exclude some meaningful patterns as well. The CC method includes too much background data in its clusters. The plaid model performs well in the single-cluster case but identifies too many heavily overlapping clusters in the multiple-cluster case. Our BBC method performs well in both cases, even though the data-generation model for the overlapping part in the second case does not satisfy the BBC model assumptions.

Comparison of biclustering algorithms on data simulated with biological characteristics

Of primary practical interest is how different biclustering methods perform on real microarray datasets. We next carried out a comparison using simulated microarray datasets with realistic characteristics, in which each gene's expression is driven by a set of inhibitors and a set of activators (see the accompanying figure).

The simulated dataset with realistic characteristics

**The simulated dataset with realistic characteristics**

In this simulation, the expression level of each gene is determined by a kinetic model of transcription: each gene is regulated by a set of M_{I} inhibitors and M_{A} activators, and its transcription rate combines a basal rate with the contributions of the activators and inhibitors bound to its promoter.

We added real noise from the well-known Leukemia expression dataset.

Biclustering results of different methods for simulated data with realistic characteristics

In each cell, the first value is for the high-SNR setting and the second for the low-SNR setting.

| Method | Sensitivity | Specificity | Overlapping rate | # of clusters |
| --- | --- | --- | --- | --- |
| ISA (0.6, 1) | 0.98 / 0.70 | 0.76 / 0.78 | 0.51 / 0.65 | 7.2 / 9.8 |
| ISA (0.6, 1.2) | 0.90 / 0.75 | 0.79 / 0.73 | 0.57 / 0.57 | 11.2 / 13 |
| ISA (0.7, 1.1) | 0.94 / 0.76 | 0.80 / 0.79 | 0.48 / 0.59 | 8.3 / 10.9 |
| SAMBA | 0.38 / 0.28 | 0.99 / 0.99 | 0.37 / 0.37 | 5.8 / 5.3 |
| CC* | 0.84 / 0.70 | 0.15 / 0.25 | 0.02 / 0.01 | 10 / 10 |
| OPSMs | 0.21 / 0.16 | 0.91 / 0.91 | 0.35 / 0.35 | 9.3 / 8.9 |
| Plaid | 1.00 / 0.99 | 0.48 / 0.61 | 0.30 / 0.18 | 5 / 2.9 |
| BBC** | 1 / 0.97 | 0.99 / 0.97 | 0 / 0 | 2 / 2 |

Note: *In CC's method, the number of clusters is preset to be 10. **In BBC, the overlapping rate is automatically 0.

The BBC model performed best among these methods. Again, the ISA method was sensitive to its thresholds and produced some false positives. The OPSMs method missed most of the significant patterns. The SAMBA method found some small, tight biclusters of genes and conditions but excluded many significant patterns. CC's method misidentified many noisy data points as biclusters. The plaid model recognized almost all significant patterns, but its specificity was low. Interestingly, the plaid model gave better results in the low-SNR case, owing to the smaller number of clusters it found.

Effects of normalization for Bayesian biclustering model

Data normalization is an important step in microarray analysis. Although some clustering methods, such as ISA, incorporate a normalization step into their procedures, most clustering methods operate on normalized microarray data; the BBC model belongs to the latter group. Since normalization substantially changes the microarray data, different normalization procedures may lead to very different clustering results.

We conducted a study of how normalization methods affect the biclustering results. Five normalization procedures were considered: column standardization (CSN), row standardization (RSN), quantile normalization on the gene level (QNGL), the interquartile range normalization (IQRN), and the smallest quartile range normalization (SQRN). In CSN (or RSN), each column (or row) is re-centered and re-scaled so that its sample mean becomes 0 and its sample variance becomes 1. These are quite crude methods, but they are still used in many clustering applications. QNGL applies the quantile-normalization technique at the gene level.

IQRN and SQRN are two new methods we propose here. They are inspired by CSN but are more robust to outliers. In IQRN, one first sorts the data in each column, trims off α/2% of the data from each tail, and computes the α%-trimmed mean and standard deviation. All data in that column are then standardized by subtracting the trimmed mean and dividing by the trimmed standard deviation. This reduces the artificial normalization effects caused by outliers. In SQRN, instead of using the middle (100-α)% of the data, one first finds, for each column, the shortest interval that contains a certain percentage (e.g., 50%) of the data. The data in that column are then standardized by the sample mean and standard deviation of the data inside this shortest range. If the data in each column are symmetrically and unimodally distributed, SQRN is equivalent to IQRN, but SQRN gives better results for skewed distributions. We applied the five normalization methods to the same simulated dataset used above.
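A minimal sketch of the two proposed normalizations, following the column-wise description above; the function names, defaults, and `ddof` choice are ours:

```python
import numpy as np

def iqrn(X, alpha=50.0):
    """Interquartile-range normalization (sketch). For each column, trim
    alpha/2 percent from each tail, then standardize the full column by the
    trimmed mean and trimmed standard deviation."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        col = np.sort(X[:, j])
        k = int(len(col) * alpha / 200.0)          # alpha/2 % off each tail
        core = col[k:len(col) - k] if k > 0 else col
        out[:, j] = (X[:, j] - core.mean()) / core.std(ddof=1)
    return out

def sqrn(X, frac=0.5):
    """Smallest-quartile-range normalization (sketch). For each column, find
    the shortest interval containing `frac` of the data and standardize the
    column by the mean and standard deviation of the points inside it."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    n = X.shape[0]
    w = max(2, int(np.ceil(n * frac)))             # points inside the interval
    for j in range(X.shape[1]):
        col = np.sort(X[:, j])
        widths = col[w - 1:] - col[:n - w + 1]     # widths of all candidate windows
        s = int(np.argmin(widths))                 # shortest window start
        core = col[s:s + w]
        out[:, j] = (X[:, j] - core.mean()) / core.std(ddof=1)
    return out
```

On symmetric unimodal columns the two sketches pick essentially the same central slice of the data, matching the equivalence claimed in the text.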

As shown in the table below, the clustering results depend strongly on the normalization method: the two proposed methods, IQRN and SQRN, recovered the true clusters perfectly, whereas RSN, CSN, and QNGL sacrificed sensitivity or specificity.

Comparison of normalization methods for Bayesian Biclustering Model

| Method | Sensitivity | Specificity | Overlapping rate | # of clusters |
| --- | --- | --- | --- | --- |
| RSN | 0.84 | 0.85 | 0 | 3 |
| CSN | 0.95 | 0.58 | 0 | 3 |
| QNGL | 1 | 0.44 | 0 | 4 |
| IQRN | 1 | 1 | 0 | 3 |
| SQRN | 1 | 1 | 0 | 3 |

Bayesian biclustering for yeast datasets

We analyzed the same yeast expression data as in

We analyzed the clustering results from three aspects. First, we identified the significant categories of experimental conditions for each cluster. More precisely, we classified the 250 experimental conditions into 22 categories according to the biological nature of each experiment; examples of categories include heat shock stress and amino acid starvation.

Out of the 57 clusters, 36 show significant enrichment of gene functions, 26 show significant TFBS enrichment, 51 show significant enrichment of experimental condition categories, 22 show all three types of enrichment, and 54 show at least one type. We named a few of these biclusters and list them in the table below.

Bayesian Biclustering results for yeast expression data

| Cluster name | size* | Significant conditions (P value) | Enriched TFBS (P value) | Enriched gene functions (P value) |
| --- | --- | --- | --- | --- |
| ribosome proteins | 213, 85 | nitrogen depletion (7.1e-3), steady state (3.9e-4) | RAP1 (2.9e-60) | ribosomal protein (2.1e-160) |
| rRNA processing | 329, 113 | steady state (8.9e-4) | ABF1 (5.2e-4), PAC (1.2e-127), RRPE (2.7e-63) | rRNA processing (4.3e-77), nucleic acid binding (1.6e-25) |
| ubiquitin | 113, 88 | diamide stress (4.2e-3), menadione stress (2.7e-2) | RPN4 (4e-12) | ubiquitin/proteasomal pathway (8.3e-12) |
| oxidative stress | 40, 38 | hydrogen peroxide stress (4.8e-8), menadione stress (4e-7), diamide stress (3.2e-6) | CAD1 (5.7e-15), YAP1 (1.9e-15) | oxidative stress response (9.3e-8), metabolism of phenylalanine (4.2e-8), metabolism of tyrosine (2.7e-8) |
| respiration | 55, 97 | steady state (1.8e-7) | HAP4 (1.3e-16), SKN7 (6.3e-8), MSN24a (7.4e-4) | respiration (2.5e-38), electron transport and membrane-associated energy conservation (5.1e-45) |
| purine metabolism | 42, 48 | menadione stress (4.1e-6), amino acid starvation (4.8e-3) | BAS1 (3.2e-5) | purine nucleotide/nucleoside/nucleobase anabolism (6.2e-10) |
| stress response and protein folding | 48, 46 | heat shock (4.5e-7), diamide stress (1.7e-4), osmolarity stress (6.5e-4), MSN2/4 and YAP1 deletion (3.8e-3) | HSF1 (4.7e-3) | protein folding and stabilization (8e-8), stress response (3.0e-5) |
| stress response and heat shock | 87, 191 | heat shock (5.2e-3) | HSF1 (1.5e-3), MSN24 (6.1e-11), MSN24a (9.6e-11), STRE (1.0e-5), GIS1 (1.9e-4) | C-compound and carbohydrate metabolism (1.0e-3), energy (7.4e-4) |
| cell cycle | 86, 87 | α factor (3.5e-8), cdc15 (3.7e-8), cdc28 (4.5e-2), elu (4.0e-6) | MCM1 (1.0e-10), SWI4 (4.16e-7), FKH1 (6.6e-7), MBP1 (3.6e-4), TATA (1.3e-4) | cell cycle and DNA processing (5.1e-9), cytokinesis (cell division) (2.9e-6), pheromone response (7.6e-4) |
| DNA topology | 35, 45 | cln3, clb2 (2.1e-2) | GCN4 (4.3e-6), MBP1 (2.0e-5), MCM1 (3.2e-3), SWI4 (1.1e-3), XBP1 (1.3e-5) | DNA topology (1.3e-22), somatic/mitotic recombination (8.9e-9) |
| cell cycle (G1 phase) | 108, 62 | α factor (3.35e-11), cdc15 (2.5e-10), cdc28 (7.8e-6) | MBP1 (3.7e-14), SWI4 (6.4e-5) | cell cycle and DNA processing (1.4e-12) |
| nitrogen, sulfur & selenium metabolism | 37, 16 | amino acid starvation (1.2e-5), nitrogen depletion (4.2e-2) | CBF1 (3.3e-7), GCN4 (7.3e-5), MET31 (8.7e-4), MET4 (1e-7) | amino acid metabolism (1.5e-30), nitrogen, sulfur and selenium metabolism (1.3e-13) |
| glycolysis regulation | 38, 78 | disulfide-reducing agent stress (1.6e-4), diamide (1.5e-3) | GCR1 (4.6e-3) | sugar, glucoside, polyol and carboxylate catabolism (3.3e-10), glycolysis and gluconeogenesis (3.1e-11) |

*size: (the number of genes in the cluster, the number of conditions in the cluster)

Conclusions

We have presented a rigorous hierarchical Bayes model for clustering microarray data in both the gene and the experimental-condition directions. We used Gibbs sampling and the Bayesian information criterion to identify biclusters as well as the total number of clusters. Using simulated datasets, we showed that the BBC algorithm outperforms other clustering methods, especially when multiple clusters are present. Moreover, the BBC method performed best on data simulated from biochemical models with a realistic noise background. We also examined the impact of normalization procedures on the clustering results and found that both the interquartile range normalization and the smallest quartile range normalization are robust choices for the BBC model. When applied to a well-known yeast microarray dataset, the BBC procedure discovered many biologically significant clusters exhibiting significant enrichment of gene functions, associated experimental conditions, and transcription factor binding sites.

Unlike many other biclustering methods, the BBC is completely model-based and does not require fine-tuning of threshold parameters. Because it is a fully Bayesian model, the BBC can handle missing data easily and can incorporate likelihood-based criteria, such as AIC, BIC, maximum likelihood, and Bayes factors, for model evaluation and comparison. In addition, the BBC model can potentially be extended by incorporating other types of data, such as promoter sequence information, into the model.

Methods

Bayesian biclustering model

Consider a microarray dataset with n genes and m experimental conditions, and let Y_{ij} denote the expression level of the i^{th} gene under the j^{th} condition (i = 1, …, n; j = 1, …, m). Each observation is modeled as

Y_{ij} = Σ_{k=1}^{K} δ_{ik}κ_{jk}(μ_{k} + α_{ik} + β_{jk} + ε_{ijk}) + (1 − Σ_{k=1}^{K} δ_{ik}κ_{jk}) e_{ij},

where μ_{k} is the main effect of bicluster k, α_{ik} and β_{jk} are the gene and condition effects within bicluster k, ε_{ijk} is the noise term of bicluster k, e_{ij} is the background noise, and δ_{ik} and κ_{jk} are binary indicators of whether gene i and condition j belong to bicluster k. The gene effects α_{ik} and the condition effects β_{jk} are given zero-mean Gaussian priors.

When multiple biclusters are allowed, the original plaid model usually finds biclusters that overlap greatly with one another. This effect is quite artificial and is likely due to the nonidentifiability caused by the additive assumption made for overlapping clusters. We solve this problem by allowing biclusters to overlap in only one direction, either the gene or the condition direction, but not both. This results in two versions of the BBC model: non-overlapping gene biclustering and non-overlapping condition biclustering. In non-overlapping condition biclustering, a condition can belong to at most one cluster, while a gene can be assigned to multiple clusters. Mathematically, this constraint can be written as

Σ_{k=1}^{K} κ_{jk} ≤ 1, for all j = 1, …, m,

where K is the total number of biclusters. The indicator variables are assigned independent Bernoulli priors with cluster-specific probabilities; the corresponding hyperparameter values were fixed in our analyses, including for the yeast dataset.

We assume

The hyperpriors for the

In our model, an observation Y_{ij} belongs to bicluster k if and only if δ_{ik} = κ_{jk} = 1; otherwise Y_{ij} is treated as background.

If Y_{ij} is missing, it is imputed during the Monte Carlo iterations.

With Gaussian zero-mean priors on the effect parameters, we obtain the marginal distribution of the Y_{ij} by integrating out the effects; the data are then jointly Gaussian with a structured covariance matrix.

where **Σ** is the covariance matrix of **Y**, and **Y** = (**Y**_{0}, **Y**_{1}, …, **Y**_{K}) collects the background data points and the data points of each bicluster.

where **Σ** = cov(**Y**).

Gibbs sampling for Bayesian biclustering

In order to make inference from the BBC model, we implement a Gibbs sampling method

Since each data point belongs to at most one cluster, given all current parameter values except κ_{jk} we can divide the data points into two sets: **V**_{2} = {Y_{ij} : δ_{ik} = 1}, the entries of condition j whose genes currently belong to cluster k, and **V**_{1}, all remaining data points.

Two data points are independent if they belong to different clusters, therefore we can write the joint likelihood of **Y** as a product of the joint likelihood for data in **V**_{1} and **V**_{2}, respectively. As a consequence, the log-posterior probability ratio can be simplified as

To calculate the likelihood term in this ratio, we need the inverse and determinant of the covariance matrices of the vector **V**_{2} in both cases. In practice the dimensions of these covariance matrices are huge (on the order of thousands), so a brute-force calculation would be expensive. Because the covariance matrices have the special structure shown in equation (9), the likelihood-ratio term can be simplified; the final form involves only multiplications and additions of matrices whose dimensions are determined by the cluster sizes.

Similarly, we can obtain the log-posterior probability ratio for the gene indicators δ_{ik}.

Since our model requires that each data point belong to at most one bicluster, the indicator updates must respect this constraint.

We also sample the effect parameters given the indicators δ and κ: the gene effects α_{ik} and the condition effects β_{jk} are drawn from their Gaussian conditional posteriors, and any missing observations Y_{ij} are imputed within the same Gibbs iterations.
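The indicator updates can be sketched as follows. For simplicity this version conditions on the sampled effect parameters, under which the data points are independent, and reassigns each condition in a single categorical draw over "background" and the K biclusters, which automatically enforces the at-most-one-cluster constraint. All variable names and the prior structure shown are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sample_condition_labels(Y, mu, alpha, beta, delta, sigma_c, sigma_0, q, rng):
    """One Gibbs scan over the condition indicators (simplified sketch).

    mu[k]           : main effect of bicluster k
    alpha[i, k]     : gene effects; beta[j, k]: condition effects
    delta[i, k]     : current gene indicators (0/1)
    sigma_c[k]      : in-cluster noise sd; sigma_0: background noise sd
    q[k]            : assumed prior probability that a condition joins cluster k
    """
    n, m = Y.shape
    K = len(mu)
    kappa = np.zeros((m, K), dtype=int)
    for j in range(m):
        logp = np.zeros(K + 1)
        # label 0: condition j is pure background noise
        logp[0] = np.log(max(1.0 - q.sum(), 1e-12)) + np.sum(
            -0.5 * (Y[:, j] / sigma_0) ** 2 - np.log(sigma_0))
        for k in range(K):
            in_k = delta[:, k] == 1  # genes currently in bicluster k
            resid = np.where(in_k, Y[:, j] - mu[k] - alpha[:, k] - beta[j, k], Y[:, j])
            sd = np.where(in_k, sigma_c[k], sigma_0)
            logp[k + 1] = np.log(q[k]) + np.sum(-0.5 * (resid / sd) ** 2 - np.log(sd))
        prob = np.exp(logp - logp.max())  # stabilized softmax over labels
        prob /= prob.sum()
        label = rng.choice(K + 1, p=prob)
        if label > 0:
            kappa[j, label - 1] = 1
    return kappa
```

Sampling a single label per condition is one way to realize the non-overlapping condition constraint; the paper's actual sampler works through the log-posterior ratios described above.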

In the above procedure, we preset the total number of clusters K and then compare models with different values of K using the Bayesian information criterion.
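The model-size selection step amounts to fitting the model for each candidate number of clusters and keeping the K that minimizes the BIC. The helper below is generic, and `run_bbc` in the usage comment is a hypothetical fitting routine, not the paper's code:

```python
import numpy as np

def bic(log_likelihood, n_params, n_points):
    """Bayesian information criterion; lower values indicate a better model."""
    return -2.0 * log_likelihood + n_params * np.log(n_points)

# Hypothetical usage: fit for several K and keep the best.
# fits = {K: run_bbc(data, K) for K in range(1, 6)}
# best_K = min(fits, key=lambda K: bic(fits[K].loglik, fits[K].n_params, data.size))
```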

An executable program for the BBC algorithm is available at

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JG and JSL designed the models and simulation studies together. JG implemented the method and analyzed the data. Both authors contributed to the writing of the manuscript.

Acknowledgements

This work was supported by grants from the National Institutes of Health (R01-GM078990) and the National Science Foundation (DMS-0706989). We thank the four referees and members of the Liu lab for many helpful comments and discussions.

This article has been published as part of