Institute for Systems Biology, 1441 N. 34th St. Seattle, WA 98103-8904, USA

New York University Dept. of Biology, Dept. of Computer Science, New York, USA

Abstract

Background

The learning of global genetic regulatory networks from expression data is a severely under-constrained problem that is aided by reducing the dimensionality of the search space by means of clustering genes into putatively

Results

We have developed an algorithm, cMonkey, that detects putative co-regulated gene groupings by integrating the biclustering of gene expression data and various functional associations with the

Conclusion

We have applied this procedure to the archaeon

Background

The statistical elucidation of genetic regulatory networks from experimental data (commonly mRNA expression levels) is an important problem that has been the center of a large body of work

Co-regulated genes are often functionally (physically, spatially, genetically, and/or evolutionarily) linked

Because a biological system's interaction with its environment is complex and gene regulation is multi-factorial, genes might not be co-regulated across all experimental conditions observed in any comprehensive set of transcript or protein levels. Also, genes can be involved in multiple different processes, depending upon the state of the organism during a given experiment. Therefore, a biologically-motivated clustering method should be able to detect patterns of co-expression across subsets of the observed experiments, and to place genes into multiple clusters. So-called

We have previously described a procedure, the INFERELATOR

Guided by these motivations and requirements, we herein describe an algorithm that detects genes putatively co-regulated over subsets of experimental conditions by integrating the biclustering of gene expression data and multiple gene association networks with the

Results

In this section we summarize the results of the application of our algorithm to four organisms, and describe its usefulness as a first step in our modeling of the

The bacteriorhodopsin regulon in Halobacterium

The induction of phototrophic growth of

Bacteriorhodopsin **A**: expression ratios of the bicluster's genes, over all experimental conditions (conditions within the bicluster are to the left of the red dotted line). **B**: expression ratios over only the conditions within the bicluster. **C**: motif logos [74] and **D**: network of associations between the bicluster's genes in the various association networks used by cMonkey, including operons, KEGG [48] metabolic pathways ("met" – see Methods; only present in Figures 4 and 5), and Prolinks [23] associations. The nodes are color coded by COG [89] functional groupings. Genes labeled in red text encode known or putative transcriptional regulators. **E**: diagram of the upstream positions of the motifs, colored red, green and blue for motifs #1, 2 and 3, respectively. The genes' names are color-coded by COG functional annotation as in the network subfigure. The colors of the lines for each gene's sequence correspond to those in the expression ratio plots.

Motif logo for Bat-binding motif discovered in the bicluster of Figure 1 (top) compared to the saturation mutagenesis pattern observed for this regulator [12] (bottom)


Table

Means of Pearson correlation coefficients of genes in bR or DMSO putative regulons (rows) with mean profile of genes in bR or DMSO operons (columns) over conditions

| | bR | DMSO |
|---|---|---|
| bR in | 0.951 | 0.866 |
| DMSO in | 0.833 | 0.967 |
| bR out | 0.838 | 0.475 |
| DMSO out | 0.442 | 0.837 |

The bicluster also includes

SirR as a regulator of transport processes in Halobacterium

cMonkey detected a bicluster (#76, Figure

A potential advantage of the inclusion of

Regulation of flagellar biosynthesis in E. coli and H. pylori

In σ^{70} and FlhDC activation complex, and encode flagellar structural and assembly proteins and two regulators (σ^{70}/FlhDC binding motif in any of these clusters due to the presence of many additional unrelated sequences that added noise to the search. The cMonkey bicluster included only two (of 11) annotated "chemotaxis"-related genes (which are all in Class-3, and do not contain the detected motif), whereas the larger SAMBA bicluster, for example, did not discriminate between these two related functions (containing 9 of the 11 genes). If MEME σ^{70}/FlhDC binding motif in ~20 of them, while it does not detect a motif for the 11 chemotaxis-annotated genes (nor in the combined set of 54 sequences). This analysis suggests that while many genes in both Class-2 and Class-3 are co-expressed in the

Flagellar biosynthesis bicluster from σ^{70}(RpoD)/FlhDC activator complex binding site for activation of Class-2 flagellar genes.

The σ^{54}-regulated flagellar regulon σ^{70} binding site in

Flagellar-function


A novel putative ricin-like toxin in H. pylori

The integrated analysis of the full set of biclusters in the context of additional biological knowledge (such as detailed annotations for individual genes) can result in biological insights into the combined roles of multiple biological modules. Such an analysis requires the presentation and integration of cMonkey biclusters with the visualization and exploration tools

Biclusters in S. cerevisiae

The algorithm detected many strongly significant biclusters in

Supplementary file containing additional figures and tables, with captions, referenced in the main manuscript.


Validation and comparisons with available methods

Tracking the cMonkey optimization

By tracking the mean progression of all biclusters during their optimization, we can quantify the degree to which the biclusters improved with regard to each model component (data type). Examples of such measures for

Mean external measures of


Testing the cMonkey model

Tests of data integration

We tested whether cMonkey correctly optimizes the joint model with respect to the different data types by varying the weights that parameterize the influence of each data type on the joint model (by default, these mixing parameters are set so that the three major data types have roughly equivalent influence). When we down-weight the mixture parameter for a given data type, and thus eliminate its influence on the bicluster optimization, we find, as expected, that the down-weighted component is poorly optimized. At the same time, the remaining components are almost always optimized better. Thus each model component serves to regularize the bicluster model, preventing the biclusters from being over-fit to one or more individual subsets of the data. Not surprisingly, we also find that when certain components are up-weighted, they are better optimized, at the expense of a somewhat diminished ability to optimize the remaining components.

Additional tests of the relationship between multiple data types and model components

By successively removing individual components of the model, we can also characterize relationships that exist between an individual data type and the others, that have not been removed, by observing the degree to which the optimization of the removed data type still improves. For example, by turning off an individual network

Randomization and shuffling tests

As an alternative to the difficult task of generating biologically realistic "synthetic" data, we chose to randomize the data instead, in order to further assess the significance of patterns discovered by cMonkey. If we completely shuffle an individual data type, then we effectively eliminate any signal that exists in that component but preserve any influence that the noise component of that data type adds to the procedure (possibly interfering with optimization of other model components). The resulting effect is very similar to strongly down-weighting that component of the model, as described above. A more stringent test can be performed by randomizing only the associations between each gene's expression data, its sequence, and its location in the association networks. This preserves the higher-order structure of each data type, but scrambles the mutual support each data type might present to the overall model. On data randomized in this manner, cMonkey is unable to find biclusters that, on average, are as well-optimized (in terms of the "scores" described above) as in the original data. The significance of this result varies depending upon the organism and the quality and amount of data available; on the _{10}-unit higher than in the un-shuffled data. The algorithm does not find significant association subnetworks in any of the shuffled trial runs.
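The more stringent randomization described above can be illustrated with a minimal sketch. It assumes that every gene appears in all three data types and that the network has no self-loops; the function name `shuffle_gene_links` and the dictionary-based data layout are hypothetical and are not part of cMonkey. Gene labels are permuted independently within each data type, so each data type keeps its internal structure while the correspondence between a gene's expression profile, upstream sequence, and network position is scrambled.

```python
import random


def shuffle_gene_links(expression, sequences, network, seed=0):
    """Permute gene labels independently within each data type.

    expression: dict gene -> list of expression ratios
    sequences:  dict gene -> upstream sequence string
    network:    set of frozenset({gene_a, gene_b}) edges
    (hypothetical layout; all three are assumed to share one gene set)
    """
    rng = random.Random(seed)
    genes = sorted(expression)

    def relabel():
        # A random one-to-one mapping from old gene names to new ones.
        permuted = genes[:]
        rng.shuffle(permuted)
        return dict(zip(genes, permuted))

    expr_map, seq_map, net_map = relabel(), relabel(), relabel()
    new_expr = {expr_map[g]: row for g, row in expression.items()}
    new_seq = {seq_map[g]: s for g, s in sequences.items()}
    # Relabeling nodes preserves the edge count and degree distribution.
    new_net = {frozenset(net_map[g] for g in edge) for edge in network}
    return new_expr, new_seq, new_net
```

Because each mapping is a permutation, the set of expression profiles, the set of sequences, and the network topology are all unchanged; only their mutual alignment is lost.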

Comparison of cMonkey with other methods

In our assessment of cMonkey's performance, we compared cMonkey-generated biclusters against those generated using the following algorithms: Cheng-Church (CC

Comparison in the context of regulatory network inference

A major motivation of this work is to provide a method for deriving co-regulated groups of genes for use in subsequent regulatory network inference procedures. To do this, we wish to find coherent groups of genes over those conditions with a large amount of variation. In other words, we are hoping to detect submatrices in the expression data matrix which are coherent and simultaneously have high information content or overall variance. In addition, we need to find biclusters with many conditions/observations included, as this increases the significance of each bicluster and also of the subsequently inferred regulatory influences for that bicluster. Some relevant summary statistics of the runs of various algorithms on all four organisms are listed in [^{-22}).

Comparison against external measures

Defining an unbiased external measure of "success" of a bi/clustering algorithm is a very difficult problem

In general, we find that cMonkey performs well in comparison to all other methods when the trade-off between sensitivity, specificity, and coverage is considered, particularly in the context of the other bulk characteristics (cluster size, residual, etc.). We find that SAMBA also performs well when these measures are considered; however, because its biclusters contain on average 3× more genes than cMonkey's, and far fewer experiments (and therefore SAMBA, like most other methods, covers less of the data space), the direct comparison is difficult. cMonkey, as it was designed to do, covers more of the data space (and therefore more of the different Classes defined above) for each organism, and it is therefore more suitable for our regulatory network learning motivations. In particular, while it includes far more experiments per cluster and restricts its clusters to have significantly tighter co-expression, it still does comparably well when assessed against the external measures due to its data integration.

Bicluster visualization

Because a population of biclusters will contain some overlapping elements which can confuse their interpretation, it is important to present them to the biologist in a format that promotes their interpretation and exploration in the context of supporting information. cMonkey automatically generates, for each bicluster, a "bicluster diagram" (example in Figure

Discussion and conclusion

The integration of clustering or biclustering of expression data with additional information is a problem of growing interest. The method presented here may be compared favorably with several recently published clustering and biclustering algorithms that have integrated different types of data, including

We believe that the ability for the cMonkey user to explicitly control the contribution of different data types through their weights opens up many potential uses for the algorithm beyond the basic identification of co-expressed clusters of genes. This flexibility enables the detection of biclusters which stress certain type(s) of biological information over others. Indeed, in many cases it is still not known whether a certain type of pair-wise association between genes is actually correlated with co-expression. Such "guilt-by-association" is often assumed,

For the sake of simplicity, flexibility and statistical transparency, we have used simple models for each of the individual data types and logistic regression to integrate them into a joint model. However, this simplicity comes at the expense of several trade-offs, which could be improved upon. First, we have treated all associations the same, whereas it may be more appropriate to treat some as a property of sets rather than networks; certain types of associations (such as protein-DNA networks and functional annotation classes) could be treated differently. Second, any confidence values associated with individual edges in some of the networks are currently ignored. While edge weights could currently be included, for example, by dividing the high and low confidence edges into separate networks with different weights, it would be preferable to model such association evidence more cleanly. Third, we have reason to believe that our use of MEME for motif detection may be increasing our sensitivity to noise. The method could benefit from an assessment of different algorithms for detecting motifs in conjunction with biclustering, or the consensus of more than one method could be integrated, as in

Because the goals of the development of cMonkey are unique relative to previous biclustering methods (

Methods

Materials and data

Expression data

Expression data for

Association and metabolic networks

Genetic associations derived from comparative genomics, such as phylogenetic profile, Rosetta Stone, gene neighbor and gene cluster, were compiled from

Interaction networks

Upstream sequences

Upstream sequences for all organisms were obtained from GenBank using the Regulatory Sequence Analysis Tools (RSAT

Functional annotations for comparison tests

Gene ontology (GO)

The bicluster model

Model overview

Each bicluster is modeled via a Markov chain process, in which the bicluster is iteratively optimized, and its state is updated based upon conditional probability distributions computed using the cluster's previous state. This enables us to define probabilities that each gene or condition belongs in the bicluster,

Each bicluster begins as a _{max}) of clusters have been generated, or significant optimization is no longer possible. The complete process is shown schematically in Fig.

A schematic diagram of the cMonkey biclustering procedure. The inner (red) loop depicts the optimization for each newly-seeded bicluster.

In the following discussion, let **K **is fully defined by its set of genes **I**_{k }and experimental conditions **J**_{k}. The membership _{lk }∈ {0, 1} of an arbitrary gene _{lk }= 1).

The expression component

The expression data is a set of measurements of genes **I **over experiments **J**, comprising a |**I**| × |**J**| matrix _{ij }∈ **X**. Each bicluster **I**_{k}| × |**J**_{k}| submatrix _{i'j' }∈ **X**_{k}: **I**_{k }⊂ **I**; **J**_{k }⊂ **J**. The variance in the measured levels of condition **I**_{k}, _{ij }relative to this mean expression level is

which includes the term _{j }over all genes **I **rather than a _{jk }computed over **I**_{k }results in a lower likelihood _{ij}) for those conditions

The likelihood of the measurements of an arbitrary gene _{ik}, and for each condition _{jk}, relative to bicluster
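As a rough illustration of the expression component, the sketch below evaluates a Gaussian density for one measurement around the bicluster's mean level in one condition, with the variance of that condition taken over all genes rather than over the bicluster only, as the text describes. The exact functional form is an assumption, as are the function names and the small `eps` term (added here only to guard against zero variance); none of these should be read as the cMonkey implementation.

```python
import math


def condition_variance(matrix, j):
    """Variance of condition (column) j computed over ALL genes (rows)."""
    col = [row[j] for row in matrix]
    mean = sum(col) / len(col)
    return sum((x - mean) ** 2 for x in col) / len(col)


def expression_density(matrix, bicluster_genes, i, j, eps=1e-3):
    """Gaussian density of x_ij around the bicluster mean of condition j.

    The mean is taken over the bicluster's genes only, while the variance
    is taken over all genes. eps is a hypothetical regularizer.
    """
    mean_jk = sum(matrix[g][j] for g in bicluster_genes) / len(bicluster_genes)
    var = condition_variance(matrix, j) + eps ** 2
    z2 = (matrix[i][j] - mean_jk) ** 2 / var
    return math.exp(-0.5 * z2) / math.sqrt(2.0 * math.pi * var)
```

A gene whose level in condition j lies close to the bicluster mean receives a higher density than one that lies far from it, which is the behavior the membership decision relies on.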

Sequence component (motif co-occurrence)

Each gene _{i }(a string of DNA nucleotides of length _{S}), and bicluster **S**_{k }for all _{i'}; **I**_{k}. The decision whether an arbitrary gene's upstream sequence, _{i}, shares common motif(s) with sequences **S**_{k}, is determined via a two-step process: (1) identify one or more motif(s) **M**_{k }that is (are) significantly overrepresented in many (if not all) bicluster sequences **S**_{k}, and then (2) scan **M**_{k}.

In this work, we are not advancing the basic methodology for motif detection (step 1), as relatively mature methods exist for finding motifs given a fixed set of sequences **S**_{k}, and (b) it can produce a score (preferentially a

Thus, using these two algorithms, we can detect a set of motifs **M**_{k }in sequences **S**_{k}, and compute a _{i }contains those motifs. Note that this _{ik}, for each gene

Association network component

To build up a highly-connected subnetwork among genes that are in a bicluster (given a full set of associations), we aim to add genes preferentially that have a greater number of connections to those currently in the bicluster than one would expect (at random) based upon the overall connectivity in the network. Thus, we compute **N**. In the following discussion, genes are the primary consideration, but networks of associations between experimental conditions are conceivable (**I**_{k }in bicluster **I**_{k }and

where **A **→ **B **represents the set of associations between the elements in gene set **A **with those in set **B**, and _{A→B }is the number of these associations. Expression (2) is analogous to the hypergeometric measure of mutual clustering coefficient described by **I**_{k}. This choice of connectivity measure allows a single value to be directly computed for each gene, relative to each cluster, and gives greater preference to an individual gene
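One plausible way to score a gene's connectivity to a bicluster, in the spirit of the hypergeometric measure mentioned above, is sketched here. It is an assumption-laden illustration rather than the paper's Expression (2): it asks how surprising it is that at least the observed number of a gene's neighbors fall inside the bicluster, given the gene's degree and the bicluster's size. The function name `network_pvalue` and the edge-list layout are hypothetical, and the network is assumed to have no self-loops.

```python
from math import comb


def network_pvalue(edges, gene, bicluster, n_genes):
    """Hypergeometric upper-tail p-value for a gene's links into a bicluster.

    edges:     iterable of (gene_a, gene_b) pairs (undirected, no self-loops)
    bicluster: collection of genes currently in the bicluster
    n_genes:   total number of genes in the network
    """
    neighbors = {b if a == gene else a for a, b in edges if gene in (a, b)}
    members = set(bicluster) - {gene}
    hits = len(neighbors & members)      # observed links into the bicluster
    population = n_genes - 1             # all possible partners of `gene`
    draws = len(neighbors)               # degree of `gene`
    upper = min(draws, len(members))
    tail = sum(
        comb(len(members), x) * comb(population - len(members), draws - x)
        for x in range(hits, upper + 1)
    )
    return tail / comb(population, draws)
```

For example, in a six-gene network where gene `a` has exactly two neighbors and both are the two members of the bicluster, the p-value is 1/10; a gene with no neighbors in the bicluster gets a p-value of 1.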

The joint cluster membership probability

The ultimate goal is to decide gene or condition bicluster membership jointly on the basis of the three individual sets of _{ik}, _{ik}, and _{ik }computed above (for the remainder of this discussion, we now use _{ik }= 1) using the logistic function,

This model approximates a (probabilistic) discriminating hyperplane in the space defined by _{ik}, _{ik}, and _{ik}, parameterized by the four independent variables _{0 }(the intercept), and _{0}, _{0}, and _{0 }(the slope) that maximally discriminates the genes or conditions within the bicluster (**I**_{k}) from those outside (_{0 }(the motif parameter) is zero.

In practice, during early iterations when the bicluster is not well-discriminated from the background, such an unconstrained regression leads to unstable situations such as unwarranted over-weighting or inversion of one or more variables (_{0}, _{0}, or _{0}). Additionally, depending upon the quality of the data set(s) being used and the predisposition (or prior knowledge) of the researcher, different runs of the algorithm stressing different data types may be desired. Finally, there is good reason to expect that certain data types (

Therefore, we perform a _{ik}, _{ik}, and _{ik }into one dimension, projecting the log-_{ik},

where _{0}, _{0}, and _{0 }are specified for each iteration according to an "annealing schedule," described below. Here, each of the dimensions have been standardized to place the log-_{ik}) = [log(_{ik }) - _{k}]/_{k}, where _{k }and _{k }are the mean and standard deviation of the log-_{ik}) (_{ik }denotes either _{ik}, _{ik}, or _{ik}), only over the genes or conditions in the bicluster (_{k}). This is necessary because the _{0}, _{0}, and _{0}), with only the intercept _{0 }permitted to vary from cluster to cluster. These parameters may also be interpreted as mixing parameters that control the fractional contribution of each model component to the cluster likelihood, _{ik}. They may be defined by the user, and/or may be modified throughout the course of cluster optimization. For example, early in the procedure when the bicluster is a poorly-defined seed, co-expression and certain association networks (

$$\pi_{ik} \equiv p(y_{ik} = 1 \mid \mathbf{X}_k, S_i, \mathbf{M}_k, \mathbf{N}) \propto \exp(\beta_0 + \beta_1 g_{ik}) \qquad (5)$$

where parameters ** β **= [

One additional complication arises near the end of a bicluster optimization, that a bicluster may be perfectly discriminated from the background (resulting in an infinite negative log-likelihood and undefined regression). This may be addressed in two ways: the first is to constrain or fix the slope _{1 }of the regression, allowing only the intercept _{0 }to vary. We chose a second option, to perform a penalized maximum likelihood estimation described by

We now have a set of probabilities, _{ik}, that each gene or condition _{ik}. These probabilities may be further adjusted via additional (prior) constraints on the model, as described below.
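The dimensionality reduction and logistic step described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the names `p_expr`, `p_seq`, `p_net` and the weights `w_expr`, `w_seq`, `w_net` are hypothetical stand-ins for the three component p-values and their mixing parameters, natural logarithms are used, and the regression coefficients are simply passed in rather than fitted.

```python
import math


def standardize_log_p(pvals, members):
    """Standardize log p-values using the mean and standard deviation
    computed only over the genes currently in the bicluster."""
    logs = {g: math.log(p) for g, p in pvals.items()}
    inside = [logs[g] for g in members]
    mu = sum(inside) / len(inside)
    sd = math.sqrt(sum((v - mu) ** 2 for v in inside) / len(inside)) or 1.0
    return {g: (v - mu) / sd for g, v in logs.items()}


def joint_score(p_expr, p_seq, p_net, members,
                w_expr=1.0, w_seq=1.0, w_net=1.0):
    """Collapse the three standardized log p-values into one score per gene."""
    ez = standardize_log_p(p_expr, members)
    sz = standardize_log_p(p_seq, members)
    nz = standardize_log_p(p_net, members)
    return {g: w_expr * ez[g] + w_seq * sz[g] + w_net * nz[g] for g in p_expr}


def membership_probability(g, beta0, beta1):
    """Logistic function of the one-dimensional score g."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * g)))
```

With this sign convention, smaller p-values give more negative scores, so a fitted slope would be expected to be negative; setting one of the weights to zero removes that data type from the score entirely.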

The cMonkey iterative procedure

Seeding the clusters

The Markov chain process by which a bicluster is optimized requires "seeding" of the bicluster to start the procedure. We experimented with many data-driven methods for generating seed biclusters, including (a) single-gene seeds, (b) random or semi-correlated seeds using a pre-specified distribution of cluster sizes, and (c) seeding on the basis of co-expressed edges in association networks (for example, operon associations). In principle, any seeding method may be used, including the clusters produced by other clustering or biclustering methods. Many

Each bicluster is seeded using a random choice from one of a variety of methods, each of which utilizes one or more different types of input data. For each newly-seeded bicluster, **I**' be the set of genes that are currently not in any other biclusters, **I**' and **J**_{i }is the set of conditions in which

1. **J**_{i}. For the first few iterations of this bicluster's optimization, only gene additions are allowed (forcing the bicluster to grow in size, early on).

2. **I**' are randomly chosen from those with a high Pearson correlation (_{c }> 0.8) with **J**_{i}. n is chosen randomly from a set of pre-defined cluster seeding sizes, currently 2, 5, 10, 20,

3.

4. **I**' are added from those with _{c }> 0 with

5. **I**' are randomly chosen from those with _{c }> 0 with
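A correlation-based seed of the kind listed above might be built as in the following sketch. It is a simplification under assumptions: correlations are computed over all conditions rather than a per-gene condition subset, the seed size `n` is supplied by the caller, and the names `pearson` and `correlated_seed` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_seed(expression, free_genes, n, r_min=0.8, seed=0):
    """Pick a random unassigned gene, then up to n - 1 unassigned genes whose
    profiles correlate with it above r_min."""
    rng = random.Random(seed)
    pool = sorted(free_genes)
    anchor = rng.choice(pool)
    candidates = [
        g for g in pool
        if g != anchor and pearson(expression[anchor], expression[g]) > r_min
    ]
    rng.shuffle(candidates)
    return [anchor] + candidates[: n - 1]
```

If no other unassigned gene passes the correlation threshold, the result degenerates to a single-gene seed, which corresponds to the simplest seeding option.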

Annealing the clusters

A newly-seeded bicluster _{ik }for each gene or condition _{ik }= 0 and _{ik }≈ 1), and to drop genes or conditions from that bicluster if they have a low probability of membership (_{ik }= 1 and _{ik }≈ 0). Moves which may decrease the likelihood of the cluster model are permitted, with a frequency that decreases during the course of the procedure, as parameterized by an annealing temperature

All moves are performed by sampling them from the probability in Eq. 6. This simulated annealing procedure is damped by restricting the total number of gene/condition moves at each iteration to _{max }= 5, in order to reduce the chance that a bicluster will change drastically before its model is reevaluated. We find that simulated annealing, while not the most efficient search strategy available, improves upon greedy search strategies such as Expectation Maximization because it can escape local minima and can therefore assign genes and conditions to clusters more completely, as appropriate
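The damped move sampling can be sketched as below. The acceptance rule `exp(-cost / T)` is a hypothetical stand-in chosen to match the qualitative behavior described in the text (at low temperature, items are added only when their membership probability is near 1 and dropped only when it is near 0); it is not claimed to be the form of Eq. 6. The function name `propose_moves` and its arguments are likewise assumptions.

```python
import math
import random


def propose_moves(pi, members, temperature, max_moves=5, seed=0):
    """Sample add/drop moves for one iteration of bicluster annealing.

    pi:      dict item -> membership probability in [0, 1]
    members: set of items currently in the bicluster
    """
    rng = random.Random(seed)
    moves = []
    for item in sorted(pi):
        inside = item in members
        # Dropping a likely member or adding an unlikely one is "costly".
        cost = pi[item] if inside else 1.0 - pi[item]
        if rng.random() < math.exp(-cost / temperature):
            moves.append(("drop" if inside else "add", item))
    # Damping: apply only a handful of moves before the model is refit.
    rng.shuffle(moves)
    return moves[:max_moves]
```

At a higher temperature, costly moves are accepted more often, which is what allows the search to leave a local minimum; the cap on moves keeps the bicluster from changing drastically between model updates.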

Additional model constraints: bicluster size and overlap

The search space for this problem is often dominated by very strong attractors and if we do not restrict the gene/condition move set, biclusters are likely to repeatedly descend into the same set of deep local minima (thereby increasing the bicluster overlap, or redundancy). This is an issue seen in many biclustering algorithms, and a commonly-practiced _{i }into which each gene _{i }is modeled as a Poisson process with cumulative distribution _{v}(z_{i}) (where _{ik}) and _{ik}) (Eq. 6), is multiplied with this prior probability of observing the gene in that many biclusters (relative to the expected number):

Thus the solution is constrained to what seems to be a more biologically intuitive model: include each gene in an average of

Bicluster sizes can also vary widely between biclustering methods; some generate biclusters with only three genes on average **I**_{k}|), using a cumulative Normal distribution **I**_{k}|, **I**_{k}|. This conditional distribution is applied by further adjusting the relative ratios of the distributions (Eq. 6) from which the gene moves are sampled:

The result is that if |**I**_{k}| <**I**_{k}| >

The annealing schedule

To enforce convergence we schedule the annealing temperature _{0}, _{0}, and _{0 }with each iteration. We have found that the most effective schedule up-weights the expression (_{0}) and certain association networks (_{0}; _{0}) as the biclusters become better-defined (Fig.

Example annealing schedule applied to the three cMonkey model component weights (_{0}, _{0}, and _{0}) and annealing temperature
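A schedule of this general kind could be generated as in the sketch below. All numeric values and the linear shape are hypothetical placeholders rather than the values used by cMonkey: the temperature falls over the iterations, the expression and network weights are held fixed, and the sequence weight is ramped up as the bicluster becomes better defined.

```python
def annealing_schedule(n_iter, t_start=0.15, t_end=0.05, w_seq_final=1.0):
    """Return one dict of settings per iteration (illustrative values only)."""
    schedule = []
    for it in range(n_iter):
        frac = it / max(n_iter - 1, 1)   # progress from 0.0 to 1.0
        schedule.append({
            "temperature": t_start + (t_end - t_start) * frac,
            "w_expr": 1.0,                # expression weight, held constant
            "w_net": 1.0,                 # network weight, held constant
            "w_seq": w_seq_final * frac,  # motif weight, ramped up over time
        })
    return schedule
```

Starting the sequence weight at zero reflects the observation that motif detection on a poorly defined seed is unlikely to be informative.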


Implementation

cMonkey is implemented in the

- S

- N

- O

- W

Comparison with other biclustering and clustering methods

The different bi/clustering algorithms used for the comparative analysis included: Cheng-Church

Each different biclustering algorithm returned bicluster sets with wide differences in cluster count, cluster size (genes and experiments), amount of overlap/redundancy, expression coherence, and other general characteristics only related to their treatment of the expression data. We therefore computed "bulk" measurements for each bicluster set, such as those listed in [**X **which fall in at least one bicluster. A measurement that quantifies the degree to which each complete bicluster set recapitulates the variance in **X **is defined as follows:

where, as above, _{ij }= ∑_{k}**I**_{k }∧ **J**_{k }is the number of biclusters containing element _{ij}. This measure is dependent upon the fractional coverage **X**.
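The per-element bicluster count and the fractional coverage of the expression matrix can be computed as in this short sketch. The representation of a bicluster as a pair of index sets and the function names are assumptions made for illustration.

```python
def membership_counts(n_genes, n_conds, biclusters):
    """Count, for each matrix element (i, j), the biclusters containing it.

    biclusters: iterable of (gene_indices, condition_indices) pairs
    """
    m = [[0] * n_conds for _ in range(n_genes)]
    for genes, conds in biclusters:
        for i in genes:
            for j in conds:
                m[i][j] += 1
    return m


def fractional_coverage(m):
    """Fraction of matrix elements that fall in at least one bicluster."""
    total = sum(len(row) for row in m)
    covered = sum(1 for row in m for count in row if count > 0)
    return covered / total
```

For a 3 × 4 matrix with one bicluster on genes {0, 1} and conditions {0, 1} and another on genes {1, 2} and conditions {1, 2, 3}, nine of the twelve elements are covered and the element (1, 1) is counted twice.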

In an attempt to remove

Abbreviations

Markov chain Monte Carlo (MCMC), cumulative distribution function (CDF), position-specific scoring matrix (PSSM), Gene Ontology (GO)

Authors' contributions

**DJR** developed and implemented the cMonkey algorithm; ran the procedure and analyzed the results; wrote this manuscript.

**NSB** contributed to the inception of this project; provided important feedback on the results; assisted with the writing of this manuscript.

**RB** conceived and initiated this project; contributed to the development and implementation of cMonkey; wrote this manuscript.

Acknowledgements

We gratefully acknowledge Armadeep Kaur and Min Pan for generating the