<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1752-0509-2-33</ui>
   <ji>1752-0509</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Extracting expression modules from perturbational gene expression compendia</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Maere</snm>
               <fnm>Steven</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>steven.maere@psb.ugent.be</email>
            </au>
            <au id="A2">
               <snm>Van Dijck</snm>
               <fnm>Patrick</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <email>patrick.vandijck@bio.kuleuven.be</email>
            </au>
            <au id="A3">
               <snm>Kuiper</snm>
               <fnm>Martin</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>martin.kuiper@psb.ugent.be</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Ghent, Belgium</p>
            </ins>
            <ins id="I2">
               <p>Department of Molecular Genetics, Ghent University, Technologiepark 927, B-9052 Ghent, Belgium</p>
            </ins>
            <ins id="I3">
               <p>Department of Molecular Microbiology, VIB, Kasteelpark Arenberg 31, B-3001 Leuven, Belgium</p>
            </ins>
            <ins id="I4">
               <p>Laboratory of Molecular Cell Biology, Katholieke Universiteit Leuven, Kasteelpark Arenberg 31, B-3001 Leuven, Belgium</p>
            </ins>
         </insg>
         <source>BMC Systems Biology</source>
         <issn>1752-0509</issn>
         <pubdate>2008</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>33</fpage>
         <url>http://www.biomedcentral.com/1752-0509/2/33</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18402676</pubid>
               <pubid idtype="doi">10.1186/1752-0509-2-33</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>18</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>10</day>
               <month>4</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>10</day>
               <month>4</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Maere et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Compendia of gene expression profiles under chemical and genetic perturbations constitute an invaluable resource from a systems biology perspective. However, the perturbational nature of such data imposes specific challenges on the computational methods used to analyze them. In particular, traditional clustering algorithms have difficulties in handling one of the prominent features of perturbational compendia, namely partial coexpression relationships between genes. Biclustering methods, on the other hand, are specifically designed to capture such partial coexpression patterns, but they have a variety of other drawbacks. For instance, some biclustering methods are less suited to identify overlapping biclusters, while others generate highly redundant biclusters. Also, none of the existing biclustering tools takes advantage of the staple of perturbational expression data analysis: the identification of differentially expressed genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We introduce a novel method, called ENIGMA, that addresses some of these issues. ENIGMA leverages differential expression analysis results to extract expression modules from perturbational gene expression data. The core parameters of the ENIGMA clustering procedure are automatically optimized to reduce the redundancy between modules. In contrast to the biclusters produced by most other methods, ENIGMA modules may show internal substructure, i.e. subsets of genes with distinct but significantly related expression patterns. The grouping of these (often functionally) related patterns in one module greatly aids in the biological interpretation of the data. We show that ENIGMA outperforms other methods on artificial datasets, using a quality criterion that, unlike other criteria, can be used for algorithms that generate overlapping clusters and that can be modified to take redundancy between clusters into account. Finally, we apply ENIGMA to the Rosetta compendium of expression profiles for <it>Saccharomyces cerevisiae </it>and we analyze one pheromone response-related module in more detail, demonstrating the potential of ENIGMA to generate detailed predictions.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>It is increasingly recognized that perturbational expression compendia are essential to identify the gene networks underlying cellular function, and efforts to build these for different organisms are currently underway. We show that ENIGMA constitutes a valuable addition to the repertoire of methods to analyze such data.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Over the last decade, the availability of fully sequenced genomes and the development of high-throughput technologies such as DNA microarray-based transcript profiling have fuelled an exponential increase in the volume of functional genomics data. This has led to a renewed interest in the study of molecular biology at the system level <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>The central paradigm in systems theory is that one can learn about a system by perturbing it and measuring the response. This principle also applies to biological systems. Since mRNA levels can nowadays easily be measured on a genome-wide scale, expression profiling has become a first method of choice for assessing the molecular response to experimental perturbation (the molecular phenotype). Considerable efforts are put into creating compendia of expression profiles under genetic, chemical or environmental perturbations <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp> or in different tissues <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Such data compendia basically constitute a series of snapshots of expression states under a variety of conditions, and they contain a wealth of information concerning the underlying transcriptional network structure of an organism. However, devising methods to efficiently and reliably extract that information is still a challenging task.</p>
         <p>Clustering of gene expression data allows the inference of functional correlations between genes through what was dubbed the 'guilt-by-association' principle <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. A classical clustering process consists of two steps <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. First, a matrix of distances between expression profiles is calculated using a distance or similarity measure, such as Pearson's centered correlation coefficient (PCC). Based on this distance matrix, the actual clustering algorithm, for instance average linkage hierarchical clustering, groups similar profiles together. Traditional clustering methods are well suited for analyzing time-series expression data, but they fall short when applied to perturbational data, because the underlying similarity measures, such as PCC, primarily capture global correlation tendencies. However, in compendia of perturbed expression profiles, genes do not necessarily show similar behavior under all experimental conditions: they may be coexpressed under some conditions and follow different expression regimes under other conditions. One of the consequences is that genes may be coexpressed with multiple expression modules depending on the conditions, or in other words, expression modules may overlap.</p>
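         <p>As a minimal illustration of this limitation (not part of the ENIGMA method), the sketch below computes the PCC for two hypothetical log-ratio profiles that agree perfectly under the first four conditions and are unrelated under the remaining four. The profile values and the helper name pearson are assumptions made for this example only.</p>
         <p><![CDATA[
```python
import math


def pearson(u, v):
    """Pearson's centered correlation coefficient of two equal-length profiles."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical log-ratios for two genes under eight perturbations.
# Conditions 1-4: identical response. Conditions 5-8: unrelated response.
gene_a = [1.0, -1.0, 1.0, -1.0, 3.0, 3.0, -3.0, -3.0]
gene_b = [1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0]

global_pcc = pearson(gene_a, gene_b)            # all eight conditions
partial_pcc = pearson(gene_a[:4], gene_b[:4])   # first four conditions only

print(round(global_pcc, 3), round(partial_pcc, 3))
```
]]></p>
         <p>Over all eight conditions the PCC is only 0.1, whereas over the first four conditions it equals 1. A global similarity measure would therefore not link two genes that are perfectly coexpressed under half of the conditions.</p>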
         <p>These observations stimulated the development of alternative clustering strategies. The process of detecting subsets of genes that exhibit similar expression behavior across a subset of conditions is known as biclustering. Several biclustering strategies exist today, each using its own heuristic approach to tackle this complex problem (<abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and references therein). Some biclustering methods use a greedy iterative search strategy to uncover biclusters, progressively subdividing, or adding and removing rows and columns from the biclusters obtained in a previous iteration in order to maximize a local score function <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Others use linear algebra <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> or adopt a graph-theoretic approach to biclustering <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. Yet other methods identify biclusters by proposing a statistical model and estimating the distribution parameters that minimize a certain model fit criterion <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. A feature that most biclustering methods share is that they do not explicitly define similarity measures on the global space of expression profiles, but rely on the emergent properties of groups of genes and conditions in order to identify significant subpatterns in the data.</p>
         <p>A wide variety of biclustering algorithms thus exists, each with its own strengths and weaknesses. For example, some of these methods are intrinsically less suited to find overlap between biclusters because they mask previously found biclusters with random noise <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B22">22</abbr></abbrgrp>, or because they partition the data <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B21">21</abbr><abbr bid="B24">24</abbr></abbrgrp>. Others require extensive parameter tweaking, require the user to specify the desired number of biclusters in advance, or generate very small or very large biclusters, very few or very many biclusters, or highly redundant biclusters (see e.g. the comparison in <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>). Some have no publicly available implementation or are rather cumbersome to use, and most of them, notable exceptions being SAMBA/EXPANDER <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B26">26</abbr></abbrgrp>, Genomica <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and cMonkey <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, do not integrate or overlay other types of biological data, hampering their use as biological discovery tools.</p>
         <p>Also, to our knowledge, none of the existing biclustering methods uses the variational information in replicated expression experiments. This information is routinely and successfully used to detect genes that are differentially expressed under a given perturbation <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The main reason why biclustering methods do not use differential expression information is that they do not specifically focus on the analysis of perturbational data. Discretization-based biclustering methods such as SAMBA <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and BiMax <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> could probably easily be modified to assess up- and downregulation of gene expression based on <it>p</it>-values for differential expression. In their current implementation, however, these methods use rather arbitrary log-ratio or percentage cutoffs for this purpose.</p>
         <p>In this study, we present a novel method, called ENIGMA, that addresses some of these issues. Our goal was to build a method that: (i) leverages differential expression analysis results to extract co-differential expression networks and expression modules from perturbational gene expression data, (ii) is able to detect significant partial coexpression relationships between genes and overlap between modules, (iii) depends on parameters that can be automatically optimized or set on reasonably objective grounds, (iv) produces a realistic number of modules, and (v) visually integrates the expression modules with other data types such as Gene Ontology (GO) information <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, transcription factor (TF) binding data, protein and genetic interactions, in order to facilitate the biological interpretation of the results. Below, we outline the ENIGMA algorithm, test our methodology on artificial expression data and compare its performance to other methods. We also apply ENIGMA to a perturbational microarray compendium for budding yeast <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> in order to assess its potential to generate testable hypotheses on real biological data.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Algorithm</p>
            </st>
            <p>A global overview of the methodology is given in Figure <figr fid="F1">1</figr>. Briefly, ENIGMA takes as input a set of perturbational expression data, externally calculated <it>p</it>-values for differential expression (e.g. using the limma package in Bioconductor <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>) and other data types if available. ENIGMA uses a novel combinatorial statistic to assess which pairs of genes are significantly co-differentially expressed (henceforth abbreviated as coexpressed for the purpose of readability). The resulting coexpression <it>p</it>-values are corrected for multiple testing and translated to edges in a coexpression network, which is clustered into expression modules (i.e. groups of significantly co-differentially expressed genes) using a graph-based clustering algorithm inspired by the MCODE algorithm <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The clustering procedure depends on two parameters that control the density of individual modules and the overlap between modules. The main reason why we chose a two-tier clustering approach (data &#8594; coexpression network &#8594; clustering) is that it allows simulated annealing-based optimization of the clustering parameters to obtain optimal coverage of the coexpression network, in terms of module overlap and redundancy. The graph clustering method we use is very fast, which allows the parameters to be optimized in a reasonable amount of time. In the postprocessing phase, ENIGMA determines relevant condition sets for each module, visualizes their substructure and overlap with other modules, screens the modules for enriched GO categories, suggests potential regulators for the modules based on regulator-module coexpression links and enrichment of TF binding sites, and overlays protein and genetic interaction data.</p>
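            <p>The translation of pairwise coexpression <it>p</it>-values into network edges can be sketched as follows. The sketch assumes a Benjamini-Hochberg correction and a 0.05 significance threshold, neither of which is specified in the overview above. The gene identifiers, the <it>p</it>-values and the function names are hypothetical.</p>
            <p><![CDATA[
```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def coexpression_edges(pair_pvalues, alpha=0.05):
    """Keep the gene pairs whose adjusted p-value does not exceed alpha.

    alpha = 0.05 is an assumed threshold, chosen for illustration only.
    """
    pairs = sorted(pair_pvalues)
    raw = [pair_pvalues[pair] for pair in pairs]
    adjusted = benjamini_hochberg(raw)
    return [pair for pair, q in zip(pairs, adjusted) if q <= alpha]


# Hypothetical pairwise coexpression p-values for four genes.
pair_pvalues = {
    ("g1", "g2"): 0.001,
    ("g1", "g3"): 0.012,
    ("g2", "g3"): 0.300,
    ("g3", "g4"): 0.020,
}

edges = coexpression_edges(pair_pvalues)
print(edges)
```
]]></p>
            <p>In this toy example three of the four pairs survive the correction and become edges. The resulting edge list is the kind of network that a graph-based clustering step would then group into modules.</p>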
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Global methodology overview</p>
               </caption>
               <text>
                  <p><b>Global methodology overview</b>. To the right is a figure of module 28, a module enriched in mating-related genes learned from the Rosetta dataset [4]. See Figure 4 for interpretation guidelines.</p>
               </text>
               <graphic file="1752-0509-2-33-1"/>
            </fig>
            <sec>
               <st>
                  <p>Combinatorial statistic</p>
               </st>
               <p>Consider the expression profiles of two genes <it>A </it>and <it>B </it>under <it>N </it>perturbations (see Figure <figr fid="F1">1</figr>). Each gene is represented by a profile of <it>N </it>fields. The gene expression values are discretized into three categories (upregulated, downregulated, unchanged) based on their differential expression <it>p</it>-value. If the gene is significantly upregulated in a given experiment (by default if <it>p </it>&lt; 0.01), the corresponding field is labeled blue. Experiments in which the gene is significantly downregulated are similarly labeled yellow, and the remaining fields are labeled black. Let us now assume that the profiles of <it>A </it>and <it>B </it>contain <it>a</it><sub><it>x </it></sub>and <it>b</it><sub><it>x </it></sub>blue fields respectively, as well as <it>a</it><sub><it>y </it></sub>and <it>b</it><sub><it>y </it></sub>yellow fields, and that they have <it>x </it>blue and <it>y </it>yellow fields in common. We want to assess whether this overlap is statistically significant. If the response of the genes <it>A </it>and <it>B </it>to the perturbations were uncorrelated (null hypothesis), the blue and yellow fields would be independently distributed on both profiles. Under this hypothesis, the probability that the profiles overlap on exactly <it>x </it>blue and <it>y </it>yellow positions is given by the following recursive formula:</p>
               <p>
                  <display-formula id="M1">
                     <m:math name="1752-0509-2-33-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo>,</m:mo>
                              <m:mi>y</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>a</m:mi>
                                                         <m:mi>x</m:mi>
                                                      </m:msub>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mi>x</m:mi>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>a</m:mi>
                                                         <m:mi>y</m:mi>
                                                      </m:msub>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mi>y</m:mi>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:mi>N</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>x</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>y</m:mi>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>x</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>x</m:mi>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:mi>N</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>x</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>y</m:mi>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>y</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mi>y</m:mi>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mi>N</m:mi>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>x</m:mi>
                                                      </m:msub>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:mi>N</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>x</m:mi>
                                                      </m:msub>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:msub>
                                                         <m:mi>b</m:mi>
                                                         <m:mi>y</m:mi>
                                                      </m:msub>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                              </m:mfrac>
                              <m:mo>&#8722;</m:mo>
                              <m:munder>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munderover>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mi>x</m:mi>
                                                <m:mo>&#8242;</m:mo>
                                             </m:msup>
                                             <m:mo>=</m:mo>
                                             <m:mi>x</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>min</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>a</m:mi>
                                                <m:mi>x</m:mi>
                                             </m:msub>
                                             <m:mo>,</m:mo>
                                             <m:msub>
                                                <m:mi>b</m:mi>
                                                <m:mi>x</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:munderover>
                                       <m:mrow>
                                          <m:mstyle displaystyle="true">
                                             <m:munderover>
                                                <m:mo>&#8721;</m:mo>
                                                <m:mrow>
                                                   <m:msup>
                                                      <m:mi>y</m:mi>
                                                      <m:mo>&#8242;</m:mo>
                                                   </m:msup>
                                                   <m:mo>=</m:mo>
                                                   <m:mi>y</m:mi>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>min</m:mi>
                                                   <m:mo>&#8289;</m:mo>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>a</m:mi>
                                                      <m:mi>y</m:mi>
                                                   </m:msub>
                                                   <m:mo>,</m:mo>
                                                   <m:msub>
                                                      <m:mi>b</m:mi>
                                                      <m:mi>y</m:mi>
                                                   </m:msub>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:munderover>
                                             <m:mrow>
                                                <m:mrow>
                                                   <m:mo>(</m:mo>
                                                   <m:mrow>
                                                      <m:mtable>
                                                         <m:mtr>
                                                            <m:mtd>
                                                               <m:msup>
                                                                  <m:mi>x</m:mi>
                                                                  <m:mo>&#8242;</m:mo>
                                                               </m:msup>
                                                            </m:mtd>
                                                         </m:mtr>
                                                         <m:mtr>
                                                            <m:mtd>
                                                               <m:mi>x</m:mi>
                                                            </m:mtd>
                                                         </m:mtr>
                                                      </m:mtable>
                                                   </m:mrow>
                                                   <m:mo>)</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mo>(</m:mo>
                                                   <m:mrow>
                                                      <m:mtable>
                                                         <m:mtr>
                                                            <m:mtd>
                                                               <m:msup>
                                                                  <m:mi>y</m:mi>
                                                                  <m:mo>&#8242;</m:mo>
                                                               </m:msup>
                                                            </m:mtd>
                                                         </m:mtr>
                                                         <m:mtr>
                                                            <m:mtd>
                                                               <m:mi>y</m:mi>
                                                            </m:mtd>
                                                         </m:mtr>
                                                      </m:mtable>
                                                   </m:mrow>
                                                   <m:mo>)</m:mo>
                                                </m:mrow>
                                             </m:mrow>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msup>
                                       <m:mi>x</m:mi>
                                       <m:mo>&#8242;</m:mo>
                                    </m:msup>
                                    <m:mo>,</m:mo>
                                    <m:msup>
                                       <m:mi>y</m:mi>
                                       <m:mo>&#8242;</m:mo>
                                    </m:msup>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo>&#8800;</m:mo>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>x</m:mi>
                                    <m:mo>,</m:mo>
                                    <m:mi>y</m:mi>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:munder>
                              <m:mi>P</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:msup>
                                 <m:mi>x</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mo>,</m:mo>
                              <m:msup>
                                 <m:mi>y</m:mi>
                                 <m:mo>&#8242;</m:mo>
                              </m:msup>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaaLaeiikaGIaemiEaGNaeiilaWIaemyEaKNaeiykaKIaeyypa0tcfa4aaSaaaeaadaqadaqaauaabeqaceaaaeaacqWGHbqydaWgaaqaaiabdIha4bqabaaabaGaemiEaGhaaaGaayjkaiaawMcaamaabmaabaqbaeqabiqaaaqaaiabdggaHnaaBaaabaGaemyEaKhabeaaaeaacqWG5bqEaaaacaGLOaGaayzkaaWaaeWaaeaafaqabeGabaaabaGaemOta4KaeyOeI0IaemiEaGNaeyOeI0IaemyEaKhabaGaemOyai2aaSbaaeaacqWG4baEaeqaaiabgkHiTiabdIha4baaaiaawIcacaGLPaaadaqadaqaauaabeqaceaaaeaacqWGobGtcqGHsislcqWGIbGydaWgaaqaaiabdIha4bqabaGaeyOeI0IaemyEaKhabaGaemOyai2aaSbaaeaacqWG5bqEaeqaaiabgkHiTiabdMha5baaaiaawIcacaGLPaaaaeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqWGIbGydaWgaaqaaiabdIha4bqabaaaaaGaayjkaiaawMcaamaabmaabaqbaeqabiqaaaqaaiabd6eaojabgkHiTiabdkgaInaaBaaabaGaemiEaGhabeaaaeaacqWGIbGydaWgaaqaaiabdMha5bqabaaaaaGaayjkaiaawMcaaaaakiabgkHiTmaaxababaWaaabCaeaadaaeWbqaamaabmaabaqbaeqabiqaaaqaaiqbdIha4zaafaaabaGaemiEaGhaaaGaayjkaiaawMcaamaabmaabaqbaeqabiqaaaqaaiqbdMha5zaafaaabaGaemyEaKhaaaGaayjkaiaawMcaaaWcbaGafmyEaKNbauaacqGH9aqpcqWG5bqEaeaacyGGTbqBcqGGPbqAcqGGUbGBcqGGOaakcqWGHbqydaWgaaadbaGaemyEaKhabeaaliabcYcaSiabdkgaInaaBaaameaacqWG5bqEaeqaaSGaeiykaKcaniabggHiLdaaleaacuWG4baEgaqbaiabg2da9iabdIha4bqaaiGbc2gaTjabcMgaPjabc6gaUjabcIcaOiabdggaHnaaBaaameaacqWG4baEaeqaaSGaeiilaWIaemOyai2aaSbaaWqaaiabdIha4bqabaWccqGGPaqka0GaeyyeIuoaaSqaaiadaciy9daacIcaOiqdaciy9daadIha4zacaciy9daafaGamaiGG1paaiilaWIanaiGG1paamyEaKNbiaiGG1paauaacWaGacw=aaGGPaqkcWaGacw=aaGHGjsUcWaGacw=aaGGOaakcWaGacw=aaWG4baEcWaGacw=aaGGSaalcWaGacw=aaWG5bqEcWaGacw=aaGGPaqkaeqaaOGaemiuaaLaeiikaGIafmiEaGNbauaacqGGSaalcuWG5bqEgaqbaiabcMcaPaaa@CDC3@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>The probability of observing an overlap of at least <it>x </it>blue and <it>y </it>yellow fields by chance is then expressed by the cumulative distribution:</p>
               <p>
                  <display-formula id="M2">
                     <m:math name="1752-0509-2-33-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>P</m:mi>
                                 <m:mi>c</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo>,</m:mo>
                              <m:mi>y</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>x</m:mi>
                                          <m:mo>&#8242;</m:mo>
                                       </m:msup>
                                       <m:mo>=</m:mo>
                                       <m:mi>x</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>min</m:mi>
                                       <m:mo>&#8289;</m:mo>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>a</m:mi>
                                          <m:mi>x</m:mi>
                                       </m:msub>
                                       <m:mo>,</m:mo>
                                       <m:msub>
                                          <m:mi>b</m:mi>
                                          <m:mi>x</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munderover>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mi>y</m:mi>
                                                <m:mo>&#8242;</m:mo>
                                             </m:msup>
                                             <m:mo>=</m:mo>
                                             <m:mi>y</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>min</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>a</m:mi>
                                                <m:mi>y</m:mi>
                                             </m:msub>
                                             <m:mo>,</m:mo>
                                             <m:msub>
                                                <m:mi>b</m:mi>
                                                <m:mi>y</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:munderover>
                                       <m:mrow>
                                          <m:mi>P</m:mi>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msup>
                                             <m:mi>x</m:mi>
                                             <m:mo>&#8242;</m:mo>
                                          </m:msup>
                                          <m:mo>,</m:mo>
                                          <m:msup>
                                             <m:mi>y</m:mi>
                                             <m:mo>&#8242;</m:mo>
                                          </m:msup>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiuaa1aaSbaaSqaaiabdogaJbqabaGccqGGOaakcqWG4baEcqGGSaalcqWG5bqEcqGGPaqkcqGH9aqpdaaeWbqaamaaqahabaGaemiuaaLaeiikaGIafmiEaGNbauaacqGGSaalcuWG5bqEgaqbaiabcMcaPaWcbaGafmyEaKNbauaacqGH9aqpcqWG5bqEaeaacyGGTbqBcqGGPbqAcqGGUbGBcqGGOaakcqWGHbqydaWgaaadbaGaemyEaKhabeaaliabcYcaSiabdkgaInaaBaaameaacqWG5bqEaeqaaSGaeiykaKcaniabggHiLdaaleaacuWG4baEgaqbaiabg2da9iabdIha4bqaaiGbc2gaTjabcMgaPjabc6gaUjabcIcaOiabdggaHnaaBaaameaacqWG4baEaeqaaSGaeiilaWIaemOyai2aaSbaaWqaaiabdIha4bqabaWccqGGPaqka0GaeyyeIuoaaaa@6214@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>Equation 1 can be understood by assuming that profile <it>A </it>is given, and that we randomly distribute <it>b</it><sub><it>x </it></sub>blue and <it>b</it><sub><it>y </it></sub>yellow positions on profile <it>B</it>. The denominator of the first term then represents the total number of possible profiles <it>B</it>. The numerator represents the combinations in which <it>x </it>blue and <it>y </it>yellow matching positions are selected, and the residual positions are chosen at random. However, this count also includes combinations that have more than <it>x </it>blue and/or <it>y </it>yellow matching positions, rather than exactly <it>x </it>and <it>y</it>.</p>
               <p>Moreover, combinations with <it>x' </it>> <it>x </it>blue and/or <it>y' </it>> <it>y </it>yellow matching positions are counted C(<it>x'</it>, <it>x</it>)&#183;C(<it>y'</it>, <it>y</it>) times, hence the second term (see Additional file <supplr sid="S1">1</supplr>).</p>
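               <p>As an illustration, Equations 1 and 2 can be evaluated with a short memoized recursion in Python. The sketch below is not the ENIGMA implementation: the function name <it>make_overlap_pvalue</it> and its arguments are hypothetical, the first term of Equation 1 is written out from the counting argument given above, and it is assumed that <it>b</it><sub><it>x </it></sub>+ <it>b</it><sub><it>y </it></sub>does not exceed the profile length <it>N</it>. Exact fractions are used so that small examples can be checked by hand.</p>
               <p>
```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def make_overlap_pvalue(n, a_x, a_y, b_x, b_y):
    """Return (exact, cumulative) probability functions for two profiles.

    n is the profile length; a_x, a_y are the blue and yellow counts of
    profile A, and b_x, b_y those of profile B (assumed b_x + b_y at most n).
    """
    x_max = min(a_x, b_x)
    y_max = min(a_y, b_y)
    # Total number of ways to place b_x blue and b_y yellow fields on profile B.
    total = comb(n, b_x) * comb(n - b_x, b_y)

    @lru_cache(maxsize=None)
    def exact(x, y):
        # First term: select x blue and y yellow matches, place the rest freely.
        first = Fraction(
            comb(a_x, x) * comb(a_y, y)
            * comb(n - x - y, b_x - x) * comb(n - b_x - y, b_y - y),
            total,
        )
        # Second term: remove profiles with larger overlaps, each of which
        # was counted C(x', x) * C(y', y) times in the first term.
        correction = sum(
            comb(xp, x) * comb(yp, y) * exact(xp, yp)
            for xp in range(x, x_max + 1)
            for yp in range(y, y_max + 1)
            if (xp, yp) != (x, y)
        )
        return first - correction

    def cumulative(x, y):
        # Equation 2: probability of at least x blue and y yellow matches.
        return sum(
            exact(xp, yp)
            for xp in range(x, x_max + 1)
            for yp in range(y, y_max + 1)
        )

    return exact, cumulative
```
               </p>
               <p>For a profile of length 3 with one blue and one yellow field in both <it>A </it>and <it>B</it>, six profiles <it>B </it>are possible and exactly one of them matches both fields, so the sketch returns 1/6 for the exact probability of a full match.</p>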
               <suppl id="S1">
                  <title>
                     <p>Additional file 1</p>
                  </title>
                  <text>
                     <p>The supplementary pdf file accompanying this article contains the Supplementary Methods, Tables S1&#8211;S9 and Figures S1&#8211;S6. Additional supplementary material, including test datasets and module figures, can be downloaded from <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>.</p>
                  </text>
                  <file name="1752-0509-2-33-S1.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <p>Although the probabilistic question formulated above can be cast in terms of contingency tables, the hypothesis tested by our statistic is different from that tested by standard contingency table analysis methods such as the <it>&#967;</it><sup>2 </sup>test. For example, situations in which a large number of blue (upregulated) fields in profile <it>A </it>are perfectly mapped onto the black fields (neither up nor down) in profile <it>B </it>would yield a significant <it>&#967;</it><sup>2 </sup><it>p</it>-value, whereas they would not yield a significant <it>p</it>-value using equation 2. Our statistic only considers mappings of up- and down-regulation of the expression of a gene to up- or down-regulation of another gene to be meaningful for assessing coregulation, a premise which is motivated by the perturbational nature of the data we aim to analyze. Black fields are considered less informative from the perspective of coregulation.</p>
            </sec>
            <sec>
               <st>
                  <p>Multiple testing correction of coexpression p-values</p>
               </st>
               <p>In our probabilistic setup, each comparison of two profiles can be considered an individual test. For <it>N </it>genes, <it>N</it>(<it>N </it>- 1)/2 tests are performed to fish for co-differential expression relationships. Consequently, the obtained <it>p</it>-values have to be adjusted in order to control the type I error rate. The raw <it>p</it>-values are corrected for multiple testing with the Benjamini &amp; Hochberg procedure, which controls the False Discovery Rate (FDR) <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
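               <p>The Benjamini &amp; Hochberg adjustment itself is compact. The following Python sketch is a generic step-up implementation given for illustration only; it is not code taken from ENIGMA, and the function name is hypothetical.</p>
               <p>
```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```
               </p>
               <p>Gene pairs whose adjusted <it>p</it>-value does not exceed the chosen FDR level would then be retained as significant.</p>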
            </sec>
            <sec>
               <st>
                  <p>Graph-based clustering</p>
               </st>
               <p>The set of significant coexpression relationships at a certain FDR threshold (by default FDR = 0.05) is translated to a network, with nodes and edges representing genes and significant coexpression relationships, respectively. ENIGMA identifies coexpression modules from this network using a graph clustering technique inspired by the MCODE algorithm <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. To identify potential module seeds, all nodes <it>v </it>are weighted based on the density of the highest <it>k</it>-core of the node neighborhood <it>N</it><sub><it>v</it></sub>, denoted as the <it>k</it><sub><it>max</it></sub>-core of <it>v </it>(a <it>k</it>-core of a graph is a maximal subgraph in which each node has at least degree <it>k</it>). Analogous to Bader and Hogue <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, the core-clustering coefficient <it>C</it><sub><it>core</it>,<it>v </it></sub>is defined as the density of the <it>k</it><sub><it>max</it></sub>-core of <it>v</it>, and the weight <it>w</it><sub><it>v </it></sub>= <it>C</it><sub><it>core</it>,<it>v</it></sub>&#183;<it>k</it><sub><it>max</it>,<it>v</it></sub>.</p>
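               <p>The node weighting can be sketched as follows in Python. Several details are assumptions made for the sake of a runnable example and are not confirmed by the text: the neighborhood <it>N</it><sub><it>v </it></sub>is taken to include <it>v </it>itself (as in MCODE), density is computed as the number of edges divided by the number of possible edges without self-loops, and the network is passed as a dictionary mapping each gene to the set of its neighbors. The function names are hypothetical.</p>
               <p>
```python
def highest_k_core(adj, nodes):
    """Return (k_max, core) for the subgraph of adj induced by nodes."""
    core = set(nodes)
    k_max, best = 0, set(core)
    k = 1
    while core:
        # Peel off nodes whose degree inside the current subgraph is below k.
        while True:
            low = {v for v in core if k > len(adj[v] & core)}
            if not low:
                break
            core -= low
        if core:
            k_max, best = k, set(core)
        k += 1
    return k_max, best


def node_weight(adj, v):
    """Weight of v: density of its k_max-core multiplied by k_max."""
    neighbourhood = adj[v] | {v}  # assumption: v is part of its own neighbourhood
    k_max, core = highest_k_core(adj, neighbourhood)
    n = len(core)
    if 2 > n:
        return 0.0
    # Every edge inside the core is seen from both of its end points.
    edges = sum(len(adj[u] & core) for u in core) // 2
    density = edges / (n * (n - 1) / 2)
    return density * k_max
```
               </p>
               <p>For a triangle with one pendant node attached, the triangle node carrying the pendant has a 2-core of density 1 and therefore a weight of 2 under these assumptions.</p>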
               <p>The <it>k</it><sub><it>max</it></sub>-core of the node with the highest weight is taken as the first module seed. This module seed then grows by accreting nodes on which it exerts a pull above a certain threshold <it>&#957;</it><sub>2</sub>. The pull of a module with seed <it>S </it>on a node <it>v </it>outside the module is defined as |<it>N</it><sub><it>v </it></sub>&#8745; <it>S</it>|/|<it>S</it>|. The next module is then initiated by taking the <it>k</it><sub><it>max</it></sub>-core of the node with the highest weight in the remaining graph. An additional constraint is set by requiring that the overlap between the new seed <it>S </it>and any existing module <it>M </it>does not exceed <it>&#957;</it><sub>1</sub>&#183;min(|<it>S</it>|,|<it>M</it>|). While the threshold <it>&#957;</it><sub>2 </sub>controls the size and density of individual modules, <it>&#957;</it><sub>1 </sub>controls the spacing or overlap between modules. Both parameters are optimized automatically.</p>
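               <p>The pull and the seed overlap constraint translate directly into set operations. In the Python sketch below, the function names are hypothetical, the network is again assumed to be a dictionary of neighbor sets, and growth is shown as a single pass relative to the seed; whether ENIGMA iterates this step is not stated here.</p>
               <p>
```python
def pull(adj, seed, v):
    """Fraction of seed members that are neighbours of v."""
    return len(adj[v] & seed) / len(seed)


def grow_module(adj, seed, nu2):
    """Add every outside node on which the seed exerts a pull above nu2."""
    seed = set(seed)
    accreted = {v for v in adj if v not in seed and pull(adj, seed, v) > nu2}
    return seed | accreted


def seed_allowed(seed, modules, nu1):
    """True if the seed overlaps no module by more than nu1 * min(|S|, |M|)."""
    return all(
        nu1 * min(len(seed), len(m)) >= len(seed & m)
        for m in modules
    )
```
               </p>
               <p>A low <it>&#957;</it><sub>2 </sub>lets loosely attached nodes join a module, whereas a low <it>&#957;</it><sub>1 </sub>rejects seeds that share more than a few genes with an existing module.</p>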
            </sec>
            <sec>
               <st>
                  <p>Clustering parameter optimization</p>
               </st>
               <p>In order to optimize the clustering parameters, the quality of the clustering for a given (<it>&#957;</it><sub>1</sub>, <it>&#957;</it><sub>2</sub>) is assessed by comparing the known input coexpression network (i.e. the network obtained in the first phase of the ENIGMA algorithm) with the output coexpression network inferred by the modules. The latter is constructed by translating the modules to fully connected components in the output network (see Additional file <supplr sid="S1">1</supplr> Figure S1 A). If we consider true/false positives (<it>tp </it>resp. <it>fp</it>) to be coexpression edges inferred by the clustering that are present/absent in the input coexpression network, and false negatives (<it>fn</it>) as edges present in the input network that are not inferred by the clustering, we can define the precision <it>P' </it>= <it>tp</it>/(<it>tp </it>+ <it>fp</it>) and the recall <it>R' </it>= <it>tp</it>/(<it>tp </it>+ <it>fn</it>) of the clustering result. ENIGMA uses the <it>F'</it>-measure, i.e. the harmonic mean of recall (<it>R'</it>) and precision (<it>P'</it>), <it>F' </it>= 2<it>P'R'</it>/(<it>P' </it>+ <it>R'</it>), as a measure for the quality of the clustering. We use the notation <it>P'</it>, <it>R'</it>, <it>F' </it>instead of the more commonly used <it>P</it>, <it>R</it>, <it>F </it>in order to distinguish between two different flavours of the <it>F</it>-measure used in this study for different purposes. In contrast to the regular <it>F</it>-measure (Additional file <supplr sid="S1">1</supplr> Figure S1 C), the <it>F'</it>-measure penalizes overpredicted edges in order to avoid unnecessary overlap between the expression modules: an edge (<it>A</it>, <it>B</it>) that is inferred multiple times from the clustering, because the genes <it>A </it>and <it>B </it>belong to the intersection of multiple (say <it>x</it>) modules, is counted as 1 <it>tp </it>and <it>x </it>- 1 <it>fp</it>. 
This is equivalent to drawing <it>x </it>edges between the genes <it>A </it>and <it>B </it>in the output coexpression network. Since there is only one edge in the input network, the <it>x </it>- 1 remaining edges can be considered false. This penalization strategy has the intuitively pleasing property of not affecting the recall, but lowering the precision of the clustering result when the number of edges 'explained' by multiple modules increases.</p>
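               <p>The counting rule behind the <it>F'</it>-measure can be illustrated with a few lines of Python. This is a sketch rather than ENIGMA's own code: the function name is hypothetical, and an edge that is absent from the input network but inferred from <it>x </it>modules is assumed to contribute <it>x </it>false positives, in line with the picture of drawing <it>x </it>edges in the output network.</p>
               <p>
```python
from collections import Counter
from itertools import combinations


def f_prime(input_edges, modules):
    """Return (precision, recall, F') for possibly overlapping modules."""
    truth = {frozenset(edge) for edge in input_edges}
    # Each module is a fully connected component, so an edge shared by
    # x modules is drawn x times in the output network.
    drawn = Counter(
        frozenset(pair)
        for module in modules
        for pair in combinations(sorted(module), 2)
    )
    tp = sum(1 for edge in drawn if edge in truth)  # one tp per distinct true edge
    fp = sum(drawn.values()) - tp                   # every other drawn edge is false
    fn = len(truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```
               </p>
               <p>For an input chain a-b, b-c, c-d and the two modules {a, b, c} and {b, c, d}, six edges are drawn, of which three are true positives; the doubly inferred edge b-c adds one false positive, giving <it>P' </it>= 0.5, <it>R' </it>= 1 and <it>F' </it>= 2/3.</p>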
               <p>The parameters <it>&#957;</it><sub>1 </sub>and <it>&#957;</it><sub>2 </sub>are now optimized by Monte-Carlo Simulated Annealing (MCSA) <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp> using <it>F' </it>as the optimization criterion. Starting from a random initial guess for the parameters (<it>&#957;</it><sub>1</sub>, <it>&#957;</it><sub>2</sub>), random steps are taken in parameter space. A step is accepted if</p>
               <p>
                  <display-formula id="M3">rand(1) &lt;<it>e</it><sup>&#916;<it>F'</it>/<it>T</it></sup></display-formula>
               </p>
               <p>with rand(1) a random number drawn uniformly from the interval [0,1], &#916;<it>F' </it>the change in <it>F'</it>-measure and <it>T </it>the simulated annealing parameter or 'temperature', which gradually decreases during the course of the optimization according to an exponential scheme <it>T</it><sub><it>i </it></sub>= <it>r</it><sub><it>c</it></sub><it>T</it><sub><it>i</it>-1</sub>, with <it>r</it><sub><it>c </it></sub>the cooling rate. ENIGMA uses a two-stage MCSA procedure. In the first stage, a rough MCSA search of the clustering parameter space is performed in order to identify the most interesting parameter region (default MCSA settings: <it>T</it><sub>begin </sub>= 0.1, <it>T</it><sub>end </sub>= 0.001, <it>r</it><sub><it>c </it></sub>= 0.99, parameter step size = 0.05). In the second stage, a finer MCSA search is performed starting from the optimum obtained in the first stage (default MCSA settings: <it>T</it><sub>begin </sub>= 0.01, <it>T</it><sub>end </sub>= 0.0001, <it>r</it><sub><it>c </it></sub>= 0.995, parameter step size = 0.01). At the end of each stage, an additional gradient descent is performed toward the nearest local optimum of <it>F'</it>. By default, ENIGMA performs 3 MCSA runs, starting from randomly chosen (<it>&#957;</it><sub>1</sub>, <it>&#957;</it><sub>2</sub>). The convergence of the solutions of multiple runs can be used as a check on the adequacy of the MCSA parameter settings.</p>
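               <p>A single MCSA stage can be sketched as follows in Python, using the first-stage defaults quoted above. The acceptance test and the exponential cooling follow the description; the move proposal (shifting each parameter by minus one step, zero or plus one step), the clamping of parameters to [0, 1], the bookkeeping of the best point and all function names are assumptions made for illustration. The final gradient step is omitted.</p>
               <p>
```python
import math
import random


def accept_step(delta_f, temperature, rng):
    """Accept a step if a uniform random number is below exp(delta_f / T)."""
    if delta_f >= 0:
        return True  # exp(delta_f / T) is at least 1, so the step always passes
    return math.exp(delta_f / temperature) > rng.random()


def anneal(objective, start, t_begin=0.1, t_end=0.001, cooling=0.99,
           step=0.05, seed=0):
    """Maximize objective(nu1, nu2) with one simulated annealing stage."""
    rng = random.Random(seed)
    current = start
    current_score = objective(*current)
    best, best_score = current, current_score
    t = t_begin
    while t > t_end:
        # Hypothetical move: shift each parameter by -step, 0 or +step.
        candidate = tuple(
            min(1.0, max(0.0, p + rng.choice((-step, 0.0, step))))
            for p in current
        )
        score = objective(*candidate)
        if accept_step(score - current_score, t, rng):
            current, current_score = candidate, score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling  # exponential cooling scheme
    return best, best_score
```
               </p>
               <p>A second, finer stage would call the same routine again from the returned optimum with the second-stage settings. Running several stages from different starting points and comparing the results gives the convergence check mentioned above.</p>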
            </sec>
            <sec>
               <st>
                  <p>Postprocessing of modules</p>
               </st>
               <p>For each gene module, ENIGMA determines a condition set by selecting those conditions that show enrichment of up- or downregulated genes in the module (hypergeometric test, default FDR = 0.05). Thus, for a given module, the condition set contains the experimental conditions that elicit a significant and specific response in the module (as compared to the overall response) and, by consequence, have been most influential in shaping the module. The resulting 'bicluster' does not necessarily have a uniform expression pattern over all genes, but may show subpatterns for some genes under certain conditions, possibly indicating involvement in other expression modules. These subpatterns are visualized by hierarchically clustering the module's expression data in both dimensions, using the cosine correlation coefficient (cos<it>&#952;</it>) as a similarity measure. The clustering tree can optionally be separated into leaves to make the subdivision clearer (default threshold cos<it>&#952; </it>= 0.65). Although conditions that show differential patterning within one module might appear to be irrelevant for the module as a whole, they are important for at least part of the module and may provide insight into inter-module connections or further substructure within the module.</p>
               <p>In an attempt to provide the user with clues on how the expression modules are regulated, ENIGMA searches for 'regulators' that are significantly more connected to a module, through positive or negative coexpression edges, than expected at random (hypergeometric test, default FDR = 0.05). Potential regulators are selected from a user-defined list or a user-defined set of GO classes. When chromatin immunoprecipitation (ChIP) or TF motif data are available, ENIGMA also screens the modules for enriched TF binding sites (hypergeometric test, default FDR = 0.05). The expression profiles of significantly coexpressed or binding regulators are visualized on top of the modules. Significantly enriched GO terms for both the gene and condition sets of the modules are determined using the BiNGO <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> software, which is incorporated in ENIGMA (hypergeometric test, default FDR = 0.05). Finally, ENIGMA visually maps the available protein interaction data and genetic interaction data on the modules.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Testing on artificial data</p>
            </st>
            <sec>
               <st>
                  <p>Generating artificial expression data</p>
               </st>
               <p>To assess the performance of our method and compare ENIGMA to other methods, we performed tests on artificial gene expression data. We generated two types of artificial expression data, namely expression data containing overlapping biclusters (modular data) and expression data containing partially coexpressed genes but no biclusters (non-modular data). In both cases, we built 10 expression datasets of 1000 genes by 100 experiments (in log<sub>2 </sub>ratio format). For each dataset, artificial background expression data were randomly sampled from a normal distribution with mean <it>&#956; </it>= 0 and variance <it>&#963;</it><sup>2 </sup>= 0.16. For the modular datasets, we implanted 20 biclusters in this background, each encompassing between 1&#8211;5% of all genes and 10&#8211;50% of all conditions. Bicluster sizes, member genes and conditions are chosen at random, with the restriction that at most 30% of the genes and 50% of the conditions overlap between any 2 biclusters (percentages relative to the smallest of the 2 biclusters). Except for a noise component (see further), all genes in a bicluster share the same expression profile over the bicluster conditions. However, a bicluster can be partially overwritten by other biclusters. The bicluster profiles are sampled from a bimodal distribution consisting of 2 normal modes with means <it>&#956;</it><sub>1 </sub>= -1 (for down-regulated expression) and <it>&#956;</it><sub>2 </sub>= 1 (for up-regulated expression) and variances <inline-formula><m:math name="1752-0509-2-33-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#963;</m:mi><m:mn>1</m:mn><m:mn>2</m:mn></m:msubsup><m:mo>=</m:mo><m:msubsup><m:mi>&#963;</m:mi><m:mn>2</m:mn><m:mn>2</m:mn></m:msubsup><m:mo>=</m:mo><m:mn>0.49</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabigdaXaqaaiabikdaYaaakiabg2da9iabeo8aZnaaDaaaleaacqaIYaGmaeaacqaIYaGmaaGccqGH9aqpcqaIWaamcqGGUaGlcqaI0aancqaI5aqoaaa@3963@</m:annotation></m:semantics></m:math></inline-formula>. The expression profiles of individual genes in a bicluster are noisified by adding normally distributed noise (<it>&#956;</it><sub><it>n </it></sub>= 0 and <it>&#963;</it><sub><it>n </it></sub>= 0.2|<it>x</it>| with |<it>x</it>| the amplitude of the log ratio expression of the gene in the given condition). The variances, bicluster size and overlap parameters are chosen so that the overall distribution of the simulated log ratio expression values mimics the distribution of log ratio expression values in the Rosetta compendium <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> up to a scale factor (see Additional file <supplr sid="S1">1</supplr> Figure S2). Note that, apart from the distribution of expression ratios, the structure of these toy datasets does not necessarily bear any resemblance to real biological data.</p>
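               <p>A much simplified Python sketch of this generation scheme is given below. It only covers the background and the implantation of a single bicluster; the random choice of bicluster sizes and the restrictions on overlap between biclusters are left out. The equal mixing of the two modes of the bimodal distribution, the function names and the nested-list data layout are assumptions, not details taken from the procedure described above.</p>
               <p>
```python
import random


def make_background(n_genes, n_conditions, rng, sigma=0.4):
    """Background log2 ratios from a normal distribution (sigma 0.4, variance 0.16)."""
    return [
        [rng.gauss(0.0, sigma) for _ in range(n_conditions)]
        for _ in range(n_genes)
    ]


def implant_bicluster(data, genes, conditions, rng, sigma_mode=0.7):
    """Overwrite a gene-by-condition block with one shared, noisified profile."""
    # One profile value per condition, drawn from a mode centred on -1 or +1
    # (sigma_mode 0.7 gives variance 0.49; equal mixing is an assumption).
    profile = {
        c: rng.gauss(rng.choice((-1.0, 1.0)), sigma_mode)
        for c in conditions
    }
    for g in genes:
        for c in conditions:
            x = profile[c]
            # Gene-specific noise with standard deviation 0.2 * |x|.
            data[g][c] = x + rng.gauss(0.0, 0.2 * abs(x))
```
               </p>
               <p>Calling <it>implant_bicluster </it>repeatedly on the same matrix lets later biclusters partially overwrite earlier ones, as described above.</p>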
               <p>For the non-modular datasets, we implanted 500 pairs of partially coexpressed genes (co-differentially expressed under 10&#8211;50% of all conditions) in the background. The expression profiles are constructed as described above. The resulting expression value distribution again mimics the Rosetta distribution (see Additional file <supplr sid="S1">1</supplr> Figure S2).</p>
               <p>Unlike for real data (see below), we used log<sub>2 </sub>ratio thresholds to discretize the expression values of the artificial datasets, because the generation of meaningful artificial differential expression <it>p</it>-values proved to merit further study in its own right. Therefore, the artificial data cannot be used to assess the advantage of including variational information in ENIGMA's discretization step (instead, we performed a qualitative comparison of <it>p</it>-value and log-ratio based discretization on real data, see below). On the other hand, we can still compare the performance of ENIGMA with other methods that do not use variational information. We used a log<sub>2 </sub>ratio threshold of 1 for upregulation and -1 for downregulation, corresponding to the means of the distributions used to generate the bicluster profiles. In other words, half of the datapoints in the biclusters are presumed not to be significantly over- or underexpressed.</p>
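               <p>The threshold-based discretization amounts to a three-way mapping of each log<sub>2 </sub>ratio. In the minimal Python sketch below, the function name, the encoding of the three states as +1, -1 and 0, and the choice to treat values exactly at the threshold as differentially expressed are assumptions.</p>
               <p>
```python
def discretize(log_ratios, threshold=1.0):
    """Map log2 ratios to +1 (up), -1 (down) or 0 (neither)."""
    return [
        1 if v >= threshold else -1 if -threshold >= v else 0
        for v in log_ratios
    ]
```
               </p>
               <p>For example, the values 1.2, -1.5, 0.3 and -0.9 are mapped to up, down, neither and neither, respectively.</p>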
            </sec>
            <sec>
               <st>
                  <p>Performance of ENIGMA on artificial data and comparison with other methods</p>
               </st>
               <p>The performance of ENIGMA on these toy datasets was compared with that of two commonly used similarity measures, namely PCC and the <it>&#967;</it><sup>2</sup>-statistic, and two established biclustering methods, SAMBA <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and ISA <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B35">35</abbr></abbrgrp>. PCC was chosen as a representative of the global similarity measures used in traditional clustering algorithms, while we included the <it>&#967;</it><sup>2</sup>-statistic because of its relation to the combinatorial statistic used by ENIGMA (see Algorithm section). The selection of biclustering methods was based on the following criteria: (i) the methods should be non-partitioning in nature, (ii) they should have the capacity to generate overlapping biclusters, (iii) a suitable implementation should be publicly available, and (iv) they should produce a reasonable number of biclusters (in the order of 10&#8211;100) on the modular toy datasets. We used the version of SAMBA <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> incorporated in the EXPANDER 3.0 package <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, and the implementation of ISA <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> available as part of the biclustering tool BicAT <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, both with default parameter settings. The ISA trajectories from randomly chosen starting points (default 100) converge to a limited number of 'fixed point' biclusters. To prune nearly identical modules, we merged ISA biclusters that overlap by more than 80%.</p>
               <p>The clustering performance of all methods is only assessed in the gene dimension. Standard internal criteria for partitional clustering performance, such as the silhouette width or Dunn's index <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp>, cannot be used to assess the performance of algorithms that generate overlapping clusters. Instead, we use the <it>F</it>-measure and introduce a derivative, the <it>F'</it>-measure (also used in the ENIGMA clustering optimization procedure described above), to compare the performance of different clustering methods on artificial datasets. In both cases, the coexpression network generated by a method (either directly or by translating the clusters to network components) is compared to the artificial input coexpression network in terms of true positive, false positive and false negative edges, from which the different flavors of the <it>F</it>-measure are calculated (see Additional file <supplr sid="S1">1</supplr> Figure S1). The difference between the <it>F</it>-measure and the <it>F'</it>-measure is that the <it>F</it>-measure does not take into account the multiplicity of the inferred edges. In other words, the <it>F'</it>-measure penalizes overpredicted (redundant) edges, whereas the <it>F</it>-measure does not. This means that the <it>F'</it>-measure is more useful to compare methods that generate overlapping clusters, whereas the <it>F</it>-measure can be used more generally to compare methods that generate overlapping clusters, non-overlapping clusters or pair-wise coexpression networks.</p>
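               <p>The edge-based comparison can be illustrated with a short sketch. It assumes that the <it>F</it>-measure takes its standard form, the harmonic mean of precision and recall (the exact definition is given in Additional file 1 Figure S1, which is not reproduced here), and that edges are undirected and counted once each. The function name f_measure and the toy gene names are hypothetical.</p>

```python
def f_measure(predicted_edges, true_edges):
    """F-measure of a predicted network against a reference network.

    Assumption: F = 2PR / (P + R); every undirected edge counts once,
    so duplicated predictions are not penalized.
    """
    predicted = {frozenset(edge) for edge in predicted_edges}
    truth = {frozenset(edge) for edge in true_edges}
    tp = len(predicted.intersection(truth))
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy input network (one three-gene module) and a toy prediction.
true_edges = [("a", "b"), ("b", "c"), ("a", "c")]
predicted_edges = [("a", "b"), ("b", "a"), ("c", "d")]  # a-b listed twice
print(round(f_measure(predicted_edges, true_edges), 2))  # 0.4
```

               <p>In this toy case the duplicated edge a-b has no effect on the score, which is the behavior that distinguishes the <it>F</it>-measure from the <it>F'</it>-measure.</p>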
               <p>The performance of ENIGMA is tested on two levels by assessing the overlap between the artificial input correlation network and (i) the network of significant correlations obtained in the first step of the ENIGMA algorithm (before clustering, referred to as ENIGMA-N); (ii) the modules inferred by ENIGMA (ENIGMA-M). The output networks for ENIGMA-M and the biclustering methods SAMBA and ISA are obtained by converting the obtained modules/biclusters to fully connected network components. The <it>&#967;</it><sup>2 </sup>network is constructed by translating significant <it>&#967;</it><sup>2 </sup>correlation <it>p</it>-values between the discretized expression profiles to edges in the output network. We used the same discretization threshold (|log<sub>2 </sub>ratio| = 1) and FDR level (0.05) for the <it>&#967;</it><sup>2 </sup>and ENIGMA methods. The performance of PCC was measured for different thresholds (for each threshold <it>t</it>, gene pairs with PCC > <it>t </it>define an edge in the network).</p>
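               <p>The two ways of obtaining an output network described above can be sketched as follows: modules are expanded into fully connected components, and a PCC network is built by thresholding pairwise correlations. This is a minimal sketch with hypothetical function names and toy data; it assumes non-constant expression profiles of equal length and is not taken from any of the compared implementations.</p>

```python
from itertools import combinations
from math import sqrt


def modules_to_edges(modules):
    """Turn each module (a list of genes) into a fully connected component."""
    edges = set()
    for genes in modules:
        for a, b in combinations(sorted(set(genes)), 2):
            edges.add((a, b))
    return edges


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pcc_network(profiles, t):
    """Gene pairs with PCC greater than t define an edge."""
    edges = set()
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) > t:
            edges.add((a, b))
    return edges


# Toy modules sharing gene g3, and toy expression profiles.
print(sorted(modules_to_edges([["g1", "g2", "g3"], ["g3", "g4"]])))
profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5], "g3": [4, 3, 2, 1]}
print(pcc_network(profiles, 0.25))  # {('g1', 'g2')}
```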
               <p>Using the <it>F</it>-measure, ENIGMA outperforms all other methods on the modular artificial data (see Figure <figr fid="F2">2A</figr> and Additional file <supplr sid="S1">1</supplr> Tables S1 and S2). The performance of ENIGMA-M was consistently higher than the <it>&#967;</it><sup>2 </sup>performance (&#916;<it>F </it>= 0.11 on average) and the optimal PCC performance (at a PCC threshold of 0.20&#8211;0.30 depending on the dataset; &#916;<it>F </it>= 0.07 on average). The global similarity measure PCC appears to perform surprisingly well. However, the performance of PCC critically depends on the choice of the PCC threshold, and determining the optimal PCC threshold on real data is problematic. In contrast, ENIGMA has the advantage of having an easily tunable significance threshold: the False Discovery Rate (FDR) level. To illustrate this, we plotted the performance curve of ENIGMA for different non-corrected <it>p</it>-value thresholds (ENIGMA-T curve) in Figure <figr fid="F2">2A</figr> and <figr fid="F2">2B</figr>. For all artificial datasets, the performance of ENIGMA-N at FDR = 0.05 (medium gray dot) is close to the optimum of this curve, indicating that FDR control at a reasonable level gives near-optimal performance.</p>
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Performance on artificial data</p>
                  </caption>
                  <text>
                     <p><b>Performance on artificial data</b>. Performance of ENIGMA versus other coexpression measures and biclustering methods on (A) modular and (B) non-modular toy datasets. The ENIGMA-T curve shows the performance for the ENIGMA coexpression network at several non-corrected <it>p</it>-value thresholds, ENIGMA-N stands for the ENIGMA coexpression network at FDR = 0.05, and ENIGMA-M for the final clustering result.</p>
                  </text>
                  <graphic file="1752-0509-2-33-2"/>
               </fig>
               <p>Among the biclustering methods, the rather poor performance of the ISA algorithm (&#916;<it>F </it>with ENIGMA-M = 0.34 on average) may seem somewhat surprising. Preli&#263; et al <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, using the same implementation of ISA but other methods to generate artificial data and to assess biclustering performance, previously established that the performance of ISA decreases with increasing overlap between biclusters. Our results seem to confirm that ISA is not the optimal method when there is substantial overlap between modules. The performance of ISA did not change significantly when using 500 starting points instead of the default 100 (results not shown).</p>
               <p>The performance gain of ENIGMA-M over SAMBA is substantially smaller (&#916;<it>F </it>= 0.03 on average), and on two out of 10 datasets, the performance of SAMBA was slightly higher than that of ENIGMA-M (see Additional file <supplr sid="S1">1</supplr> Tables S1 and S2). A more tangible advantage of ENIGMA over SAMBA (and ISA) is that ENIGMA nearly always recovered the correct number of modules (20 &#177; 1), whereas SAMBA consistently predicted more modules than there were in the input data (53 &#177; 6 modules). ISA predicted only one extra module on average, but with a higher variance than ENIGMA (21 &#177; 4). In other words, SAMBA and to a lesser extent ISA produce more fragmented and/or more redundant modules. Redundancy makes the module output much harder to interpret, but it is not taken into account by the standard <it>F</it>-measure.</p>
               <p>To quantify the effect of redundancy on the clustering quality, we compared SAMBA, ISA and ENIGMA-M using the <it>F'</it>-measure. As in the calculation of the <it>F'</it>-measure used in the clustering optimization procedure (see above), edges that are inferred by multiple modules are counted multiple times, but in the present case, multiply defined edges may also occur in the input network if they overlap between multiple artificial input modules (see Additional file <supplr sid="S1">1</supplr> Figure S1 B). Specifically, edges that are inferred by <it>x </it>output modules and <it>y </it>input modules are now counted as <it>y tp </it>and <it>x </it>- <it>y fp </it>in case <it>x </it>&#8805; <it>y</it>, or <it>x tp </it>and <it>y </it>- <it>x fn </it>in case <it>x </it>&lt;<it>y</it>. Using the <it>F' </it>criterion, the performance of ENIGMA-M (<it>F' </it>= 0.85 &#177; 0.03) is substantially higher than that of SAMBA (<it>F' </it>= 0.74 &#177; 0.03) and ISA (<it>F' </it>= 0.51 &#177; 0.09, see Additional file <supplr sid="S1">1</supplr> Tables S3&#8211;S5).</p>
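               <p>The multiplicity-aware counting rule stated above can be sketched as follows. The per-edge rule (<it>y tp </it>and <it>x </it>- <it>y fp </it>when <it>x </it>&#8805; <it>y</it>, otherwise <it>x tp </it>and <it>y </it>- <it>x fn</it>) is taken from the text; combining the counts as the harmonic mean of precision and recall is an assumption, and the function names and toy modules are hypothetical.</p>

```python
from collections import Counter
from itertools import combinations


def edge_multiplicity(modules):
    """Count, for every gene pair, the number of modules that contain it."""
    counts = Counter()
    for genes in modules:
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


def f_prime_counts(output_modules, input_modules):
    """Return (tp, fp, fn) with edges counted once per module."""
    x_counts = edge_multiplicity(output_modules)
    y_counts = edge_multiplicity(input_modules)
    tp = fp = fn = 0
    for edge in set(x_counts) | set(y_counts):
        x, y = x_counts[edge], y_counts[edge]
        tp += min(x, y)      # y tp if x >= y, else x tp
        fp += max(x - y, 0)  # surplus predictions are false positives
        fn += max(y - x, 0)  # missing copies are false negatives
    return tp, fp, fn


def f_prime(output_modules, input_modules):
    """Assumed form: harmonic mean of precision and recall."""
    tp, fp, fn = f_prime_counts(output_modules, input_modules)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: edge b-c lies in two input modules, edge a-b in two outputs.
input_modules = [["a", "b", "c"], ["b", "c", "d"]]
output_modules = [["a", "b", "c"], ["a", "b"]]
print(f_prime_counts(output_modules, input_modules))  # (3, 1, 3)
print(round(f_prime(output_modules, input_modules), 2))  # 0.6
```

               <p>In the toy example, the redundant second copy of edge a-b is counted as a false positive, whereas it would go unnoticed under the standard <it>F</it>-measure.</p>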
               <p>On non-modular artificial data, the performance of ENIGMA-M and the biclustering methods SAMBA and ISA is very low (see Figure <figr fid="F2">2B</figr> and Additional file <supplr sid="S1">1</supplr> Tables S6 and S7). This is not surprising since there are no modules to be found in these datasets. In this respect, a particularly attractive feature of ENIGMA is that it finds very few modules in the non-modular data (3 &#177; 1 modules containing on average 5 genes each, precision of clustering result = 0.27), in contrast to ISA and SAMBA, which recover 78 &#177; 5 modules (containing on average 27 genes) and 127 &#177; 2 modules (containing on average 16 genes), respectively. Among the pair-wise methods, ENIGMA-N invariably featured the highest performance, indicating that our combinatorial statistic detects partial coexpression relationships more efficiently than PCC and <it>&#967;</it><sup>2</sup>. The fact that ENIGMA efficiently uncovers coexpression relationships in non-modular data opens perspectives for the exploration of the less modular parts of expression datasets. Real datasets typically contain a limited number of perturbation experiments that target a few specific processes. These processes can be expected to be rather well resolved in terms of their coexpression relationships, whereas other processes will probably give rise to more fragmented (less modular) regions in the network. Moreover, despite the success of the modularity concept in the analysis of expression data and systems biology in general, it is not inconceivable that transcriptional networks might also contain genuinely non-modular regions.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Testing on real data: the Rosetta gene expression compendium</p>
            </st>
            <sec>
               <st>
                  <p><it>p</it>-value versus log-ratio based discretization</p>
               </st>
               <p>Although useful for testing and comparing methods, artificial datasets do not capture the complexity of real biological systems. Consequently, good performance on artificial data does not guarantee good performance on real biological data. In order to assess the use of ENIGMA for analyzing real data, we applied our methodology to the Rosetta compendium of expression profiles, representing data on 300 different experimental perturbations of <it>S. cerevisiae </it><abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Experiments on 20 strains that were marked as aneuploid in the original dataset were left out, because they can give rise to artificial expression correlations between genes on the aneuploid chromosomes. The log-ratio expression data and differential expression <it>p</it>-values were downloaded in prenormalized and preprocessed form. Genome-wide ChIP data for 102 TFs were obtained from Harbison et al <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. All genes that are bound with <it>p </it>&lt; 0.005 by a certain TF were considered reliable targets. Protein and genetic interactions for <it>S. cerevisiae </it>were obtained from the BioGRID database <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>.</p>
               <p>Using a differential expression <it>p</it>-value threshold of 0.01 in the discretization step and an FDR threshold of 0.05 for defining coexpression edges, ENIGMA identified a network of 100,762 significant positive coexpression links and 30,390 negative coexpression links involving 2,871 genes. The clustering parameters (<it>&#957;</it><sub>1</sub>, <it>&#957;</it><sub>2</sub>) = (0.30, 0.55) were optimized by MCSA as described in the Algorithm section. To assess the efficiency of the MCSA procedure, we performed an exhaustive screen of the parameter space to locate the global maximum of the <it>F'</it>-measure criterion (see Additional file <supplr sid="S1">1</supplr> Figure S3). The MCSA procedure recovered the global optimum with 100% efficiency.</p>
               <p>ENIGMA discovered 206 modules in the Rosetta dataset encompassing 2,201 genes and 141 conditions (see supporting data for module details and figures). These numbers seem reasonable given that 130 of the 280 conditions included in the compendium contain fewer than five differentially expressed genes, which means that they have a small chance of contributing to a module. Given the small number of informative conditions, it is not surprising that only a third of the <it>S. cerevisiae </it>genes can be included in modules. According to the GO enrichment results, 107 out of 206 modules have a significant degree of functional coherence. Fifty-four modules are enriched in targets of one or more TFs, and 39 modules show enrichment of both GO Biological Process categories and TF binding sites. Together, 60% of the modules show enrichment of GO categories and/or TF binding sites, indicating that our method is capable of identifying biologically relevant expression modules.</p>
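               <p>The combined enrichment percentage can be checked by inclusion-exclusion. The short calculation below assumes that the 39 modules with both types of enrichment are included in both the 107 GO-enriched and the 54 TF-enriched modules; the set symbols are introduced here only for the example.</p>

```latex
\[
|\mathrm{GO} \cup \mathrm{TF}| = |\mathrm{GO}| + |\mathrm{TF}| - |\mathrm{GO} \cap \mathrm{TF}|
= 107 + 54 - 39 = 122,
\qquad
\frac{122}{206} \approx 0.59
\]
```

               <p>Under this assumption, 122 of the 206 modules (about 59%, i.e. roughly 60%) show at least one type of enrichment.</p>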
               <p>In order to qualitatively assess the effect of using a differential expression <it>p</it>-value cutoff in the discretization step instead of a fold-change cutoff, we repeated the analysis using a |log<sub>2 </sub>ratio| discretization threshold of 1 (two-fold up- or downregulation). The resulting coexpression network contains 58,612 positive and 2,837 negative links between 2,581 genes. The clustered network contains 206 modules encompassing 1,853 genes. Ninety-three modules exhibit GO enrichment, 47 exhibit TF binding enrichment and 35 exhibit both. Despite the considerably lower number of connections in the log-ratio network, the number of functionally coherent modules and the number of clustered genes are roughly similar, and the optimized clustering parameters (<it>&#957;</it><sub>1</sub>, <it>&#957;</it><sub>2</sub>) = (0.30, 0.55) are identical, indicating that the general structure of the network and its strongest modules are fairly well preserved. Indeed, many highly functionally coherent modules (among others, those related to amino acid metabolism, hexose transport, steroid biosynthesis, iron ion homeostasis and mating) are present in both networks. Not incidentally, many of these modules are related to the processes that were targeted by Hughes et al <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, which can be expected to show a pronounced expression response. However, modules that show less pronounced expression variations, for example the modules related to ribosome biosynthesis, are not recovered in the log-ratio network. This illustrates the main disadvantage of using a fixed log-ratio threshold: different processes show different amplitudes of expression change upon perturbation, which cannot be captured by a single threshold. One could argue that this can easily be remedied by standardizing the expression profiles to zero mean, unit variance before applying the threshold, as is done by some methods, e.g. SAMBA <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. However, in the case of perturbational data, this manipulation runs the risk of effectively breaking the connection to the reference condition, thereby distorting the meaning of up- and downregulation and introducing serious artifacts (see Additional file <supplr sid="S1">1</supplr> Figure S4).</p>
            </sec>
            <sec>
               <st>
                  <p>Topological characteristics of the ENIGMA coexpression network</p>
               </st>
               <p>Since many cellular functions are carried out in a highly modular manner <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, most cellular networks, including protein interaction networks, metabolic networks and gene expression networks, are modular in nature <abbrgrp><abbr bid="B42">42</abbr><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>. On the other hand, many cellular networks, including coexpression networks, have been claimed to exhibit a node degree (<it>k</it>) distribution of the power-law type, P(<it>k</it>) ~ <it>k</it><sup>-<it>&#947;</it></sup>, indicative of scale-free properties <abbrgrp><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr></abbrgrp>. The coexistence of modularity and a scale-free degree distribution can be explained by assuming a hierarchical modular network organization <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B47">47</abbr><abbr bid="B49">49</abbr></abbrgrp>. According to this view, the network consists of a hierarchy of nested topological modules of increasing size and decreasing coherence. In other words, small coherent modules combine to form larger and less coherent modules in a hierarchical fashion. At reasonable levels of module resolution, the modules consist mainly of sparsely connected but highly clustered nodes (low <it>k</it>, high <it>C</it>). The modules are linked together through a small number of highly connected nodes with a low clustering coefficient (high <it>k</it>, low <it>C</it>), often referred to as hubs. In the case of coexpression networks, these hubs represent genes that are linked to different expression modules depending on the experimental conditions.</p>
               <p>A few papers <abbrgrp><abbr bid="B50">50</abbr><abbr bid="B51">51</abbr></abbrgrp> have cast doubt on the ubiquity of power-law degree distributions in biological networks, claiming that some of the supposed power-laws actually turn out to be closer to exponentials when rigorously analyzed. Indeed, the degree distribution of the ENIGMA co-differential expression network appears to be exponentially distributed (Figure <figr fid="F3">3A</figr>), at least for lower <it>k</it>. For higher <it>k</it>, the picture is different. Relative to the distribution obtained for lower degrees, the most highly connected nodes (hubs) seem to be underconnected. This observation is exactly the opposite of what would be expected for a power-law ('fat-tailed') degree distribution (i.e. highly connected nodes should be overconnected with respect to the exponential distribution), indicating that the coexpression hubs are not nearly as central in the network as would be expected in a scale-free network. However, from the plots of the clustering coefficient <it>C </it>versus the degree <it>k </it>(Figure <figr fid="F3">3B</figr>), it is apparent that the highly connected nodes still possess hub-like characteristics: they generally have a lower clustering coefficient and are assigned to multiple modules. Thus, highly connected nodes act more as local hubs that hold together a few modules. These hubs, by virtue of their polytomous expression behavior, may represent genes that function at the interface of several processes. An example of genes that probably interface between the cell cycle, mating pheromone response and cell wall biosynthesis is given below. Overall, 1050 genes are linked to 2 or more modules and 115 are linked to 5 or more modules, indicative of extensive crosstalk at the transcriptional level.</p>
               <fig id="F3">
                  <title>
                     <p>Figure 3</p>
                  </title>
                  <caption>
                     <p>Topological characteristics of the Rosetta network</p>
                  </caption>
                  <text>
                     <p><b>Topological characteristics of the Rosetta network</b>. (A) Semilog rank-degree plot for the ENIGMA network inferred from the Rosetta data [4]. (B) Plot of the clustering coefficient of a node's neighborhood as a function of the node degree <it>k</it>. The data points are colored according to the number of modules to which the corresponding gene is assigned.</p>
                  </text>
                  <graphic file="1752-0509-2-33-3"/>
               </fig>
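               <p>The two node-level quantities discussed in this section, the degree <it>k </it>and the clustering coefficient <it>C</it>, can be computed from an edge list as sketched below. The sketch assumes the standard local clustering coefficient (the fraction of pairs of neighbors that are themselves connected), an undirected network without self-loops, and <it>C </it>= 0 for nodes with fewer than two neighbors; the function name and the toy network are hypothetical.</p>

```python
from itertools import combinations


def degree_and_clustering(edges):
    """Return {node: (degree k, clustering coefficient C)}.

    Assumptions: undirected edges, no self-loops, and C = 0 for nodes
    with fewer than two neighbors.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    result = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if 2 > k:
            result[node] = (k, 0.0)
            continue
        # Number of edges among the neighbors of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        result[node] = (k, 2 * links / (k * (k - 1)))
    return result


# Toy network: gene h links two otherwise separate triangles.
edges = [("h", "a"), ("h", "b"), ("a", "b"),
         ("h", "c"), ("h", "d"), ("c", "d")]
stats = degree_and_clustering(edges)
print(stats["h"])  # k = 4, C = 1/3: hub-like (high k, low C)
print(stats["a"])  # (2, 1.0): module member (low k, high C)
```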
            </sec>
            <sec>
               <st>
                  <p>Comparison between ENIGMA, SAMBA and ISA</p>
               </st>
               <p>Rigorously comparing the performance of (bi)clustering algorithms on real data is extremely difficult, because of the lack of adequate gold standards and the subjectivity of the available external performance criteria <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Therefore, we limit ourselves to a more qualitative comparison of the ENIGMA, SAMBA and ISA modules obtained on the Rosetta dataset. SAMBA was run with default parameter settings; for ISA we used the BicAT implementation <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> with parameters <it>t</it><sub><it>G </it></sub>= 3.1, <it>t</it><sub><it>C </it></sub>= 2.0 and 10,000 starting points (see <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> for parameter details). The ISA biclusters were pruned by merging biclusters that showed more than 80% overlap. The ISA and SAMBA biclusters were put through the ENIGMA postprocessing pipeline to functionally annotate them and to screen them for TF binding enrichment. SAMBA identified 314 modules containing 3,437 genes and 279 conditions. Of these, 203 modules were enriched in one or more GO Biological Process categories, 161 modules were enriched in binding sites for one or more TFs, and 136 modules showed both GO and TF binding enrichment. ISA identified 236 modules containing 3,065 genes and 261 conditions. Eighty-one modules were enriched in one or more GO Biological Process categories, 39 modules were enriched in binding sites for one or more TFs, and 28 modules showed both GO and TF binding enrichment. These numbers are not directly comparable between methods, because of the differing degrees of overlap (redundancy) between modules in the three formalisms. SAMBA generates many biclusters with largely overlapping gene content (but different condition sets), whereas the gene overlap between the ENIGMA modules and especially the pruned ISA modules is more limited. For instance, SAMBA identified 17 modules enriched in conjugation-related genes, containing a total of 46 genes annotated to 'conjugation' in GO (see Table <tblr tid="T1">1</tblr>). In contrast, ENIGMA and ISA identified fewer conjugation modules (10 and 11, respectively), but containing similar numbers of known conjugation genes (43 and 42, respectively).</p>
               <p>Instead of comparing general properties such as the overall coverage of genes and conditions by biclusters, the proportion of GO-enriched modules or the average specificity (functional coherence) of the enriched modules, we focused our comparison on the biological processes that were mainly targeted by Hughes et al <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> (see Table <tblr tid="T1">1</tblr>), namely mating (conjugation), ergosterol biosynthesis, cell wall biogenesis, oxidative phosphorylation and iron ion homeostasis. All three formalisms uncover modules that are highly enriched for these processes. We used two criteria to assess the module representation of a given GO class <it>A</it>, namely the overall recall, or proportion of genes annotated to <it>A </it>found across all modules enriched for <it>A</it>, and the top module precision, or the proportion of genes in the most significantly enriched module that belong to <it>A</it>. SAMBA generally detects slightly more true positive genes than ENIGMA (higher recall), but at the expense of a lower top module precision and a larger number of modules (see Table <tblr tid="T1">1</tblr>). ISA generally features a lower recall than SAMBA and ENIGMA, but frequently exhibits better top modules in terms of precision. In short, the main distinction between the formalisms seems to be a difference in balance between precision and recall. Moreover, the interpretation of the criteria defined above is not always straightforward. For instance, a lower top module precision is not always caused by a lack of functional coherence, but may be caused by the presence of genes involved in closely related processes. If we look at the overlap between the gene sets identified by the three methods (see Additional file <supplr sid="S1">1</supplr> Figure S5), it becomes clear that all three formalisms add extra information to the global picture. For all 5 processes in Table <tblr tid="T1">1</tblr>, a sizeable core of genes is identified by all three methods, but the different methods also have substantial idiosyncrasies. For instance, only 25 out of a total of 64 identified conjugation-related genes are found by all three formalisms. Eleven genes are found by ENIGMA and SAMBA but not ISA, two genes are found by ENIGMA and ISA but not SAMBA, and four are shared between SAMBA and ISA but not ENIGMA. Five genes are ENIGMA-specific, 6 are SAMBA-specific and strikingly, 11 are ISA-specific, although ISA identifies the fewest conjugation genes in total and has the 'worst' top module. This illustrates that different methods have different focuses and biases, and that integration of the results of different analysis methods often leads to a better global picture.</p>
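               <p>The two criteria defined above, overall recall and top module precision, can be sketched as simple set operations. This is an illustrative sketch with hypothetical function names and toy gene identifiers; the selection of the enriched modules and of the most significantly enriched module is assumed to have been done beforehand.</p>

```python
def overall_recall(annotated_genes, enriched_modules):
    """Fraction of genes annotated to class A found in any enriched module."""
    annotated = set(annotated_genes)
    found = set()
    for module in enriched_modules:
        found |= annotated.intersection(module)
    return len(found) / len(annotated)


def top_module_precision(annotated_genes, top_module):
    """Fraction of genes in the top module that are annotated to class A."""
    members = set(top_module)
    hits = members.intersection(annotated_genes)
    return len(hits) / len(members)


# Toy GO class with five annotated genes and two enriched modules.
annotated = ["g1", "g2", "g3", "g4", "g5"]
modules = [["g1", "g2", "x1"], ["g2", "g3", "x2", "x3"]]
print(overall_recall(annotated, modules))                   # 0.6
print(round(top_module_precision(annotated, modules[0]), 2))  # 0.67
```

               <p>Applied to the conjugation row of Table <tblr tid="T1">1</tblr>, the recall of ENIGMA corresponds to 43 of 117 annotated genes, i.e. approximately 0.37.</p>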
               <tbl id="T1">
                  <title>
                     <p>Table 1</p>
                  </title>
                  <caption>
                     <p>Performance on Rosetta data</p>
                  </caption>
                  <tblbdy cols="9">
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c cspan="3" ca="center">
                           <p>Top module</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>GO category</p>
                        </c>
                        <c ca="left">
                           <p># genes</p>
                        </c>
                        <c ca="center">
                           <p>method</p>
                        </c>
                        <c ca="center">
                           <p># modules</p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>tp</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>R</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>p</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>tp</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>P</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="9">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>conjugation (GO:0000746)</p>
                        </c>
                        <c ca="left">
                           <p>117</p>
                        </c>
                        <c ca="center">
                           <p>ENIGMA</p>
                        </c>
                        <c ca="center">
                           <p>10</p>
                        </c>
                        <c ca="center">
                           <p>43</p>
                        </c>
                        <c ca="center">
                           <p>0.37</p>
                        </c>
                        <c ca="center">
                           <p>3.98E-29</p>
                        </c>
                        <c ca="center">
                           <p>23</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.62</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>SAMBA</p>
                        </c>
                        <c ca="center">
                           <p>17</p>
                        </c>
                        <c ca="center">
                           <p>46</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.39</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>4.10E-29</p>
                        </c>
                        <c ca="center">
                           <p>24</p>
                        </c>
                        <c ca="center">
                           <p>0.55</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>ISA</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>42</p>
                        </c>
                        <c ca="center">
                           <p>0.36</p>
                        </c>
                        <c ca="center">
                           <p>1.55E-15</p>
                        </c>
                        <c ca="center">
                           <p>18</p>
                        </c>
                        <c ca="center">
                           <p>0.28</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>ergosterol biosynthesis (GO:0006696)</p>
                        </c>
                        <c ca="left">
                           <p>26</p>
                        </c>
                        <c ca="center">
                           <p>ENIGMA</p>
                        </c>
                        <c ca="center">
                           <p>4</p>
                        </c>
                        <c ca="center">
                           <p>14</p>
                        </c>
                        <c ca="center">
                           <p>0.54</p>
                        </c>
                        <c ca="center">
                           <p>1.28E-12</p>
                        </c>
                        <c ca="center">
                           <p>9</p>
                        </c>
                        <c ca="center">
                           <p>0.23</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>SAMBA</p>
                        </c>
                        <c ca="center">
                           <p>3</p>
                        </c>
                        <c ca="center">
                           <p>16</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.62</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>1.93E-14</p>
                        </c>
                        <c ca="center">
                           <p>15</p>
                        </c>
                        <c ca="center">
                           <p>0.08</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>ISA</p>
                        </c>
                        <c ca="center">
                           <p>1</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>0.42</p>
                        </c>
                        <c ca="center">
                           <p>1.23E-19</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.39</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>cell wall biogenesis (GO:0042546)</p>
                        </c>
                        <c ca="left">
                           <p>32</p>
                        </c>
                        <c ca="center">
                           <p>ENIGMA</p>
                        </c>
                        <c ca="center">
                           <p>1</p>
                        </c>
                        <c ca="center">
                           <p>8</p>
                        </c>
                        <c ca="center">
                           <p>0.25</p>
                        </c>
                        <c ca="center">
                           <p>2.35E-06</p>
                        </c>
                        <c ca="center">
                           <p>8</p>
                        </c>
                        <c ca="center">
                           <p>0.08</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>SAMBA</p>
                        </c>
                        <c ca="center">
                           <p>4</p>
                        </c>
                        <c ca="center">
                           <p>9</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.28</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>6.89E-06</p>
                        </c>
                        <c ca="center">
                           <p>9</p>
                        </c>
                        <c ca="center">
                           <p>0.06</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>ISA</p>
                        </c>
                        <c ca="center">
                           <p>1</p>
                        </c>
                        <c ca="center">
                           <p>7</p>
                        </c>
                        <c ca="center">
                           <p>0.22</p>
                        </c>
                        <c ca="center">
                           <p>6.32E-07</p>
                        </c>
                        <c ca="center">
                           <p>7</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.13</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>iron ion homeostasis (GO:0055072)</p>
                        </c>
                        <c ca="left">
                           <p>38</p>
                        </c>
                        <c ca="center">
                           <p>ENIGMA</p>
                        </c>
                        <c ca="center">
                           <p>4</p>
                        </c>
                        <c ca="center">
                           <p>15</p>
                        </c>
                        <c ca="center">
                           <p>0.39</p>
                        </c>
                        <c ca="center">
                           <p>2.35E-16</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.37</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>SAMBA</p>
                        </c>
                        <c ca="center">
                           <p>13</p>
                        </c>
                        <c ca="center">
                           <p>16</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.42</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>3.99E-18</p>
                        </c>
                        <c ca="center">
                           <p>13</p>
                        </c>
                        <c ca="center">
                           <p>0.33</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>ISA</p>
                        </c>
                        <c ca="center">
                           <p>2</p>
                        </c>
                        <c ca="center">
                           <p>13</p>
                        </c>
                        <c ca="center">
                           <p>0.34</p>
                        </c>
                        <c ca="center">
                           <p>8.43E-14</p>
                        </c>
                        <c ca="center">
                           <p>13</p>
                        </c>
                        <c ca="center">
                           <p>0.15</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>oxidative phosphorylation (GO:0006119)</p>
                        </c>
                        <c ca="left">
                           <p>38</p>
                        </c>
                        <c ca="center">
                           <p>ENIGMA</p>
                        </c>
                        <c ca="center">
                           <p>6</p>
                        </c>
                        <c ca="center">
                           <p>23</p>
                        </c>
                        <c ca="center">
                           <p>0.61</p>
                        </c>
                        <c ca="center">
                           <p>3.02E-12</p>
                        </c>
                        <c ca="center">
                           <p>9</p>
                        </c>
                        <c ca="center">
                           <p>0.35</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>SAMBA</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>30</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.79</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>2.34E-32</p>
                        </c>
                        <c ca="center">
                           <p>20</p>
                        </c>
                        <c ca="center">
                           <p>
                              <b>0.44</b>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>ISA</p>
                        </c>
                        <c ca="center">
                           <p>2</p>
                        </c>
                        <c ca="center">
                           <p>8</p>
                        </c>
                        <c ca="center">
                           <p>0.21</p>
                        </c>
                        <c ca="center">
                           <p>1.20E-05</p>
                        </c>
                        <c ca="center">
                           <p>6</p>
                        </c>
                        <c ca="center">
                           <p>0.14</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Comparison of ENIGMA, SAMBA and ISA results for selected biological processes targeted by Hughes et al. [4]. The three middle columns give the number of modules enriched for the GO class in the first column, the total number of genes annotated to that GO class in these modules (<it>tp</it>) and the corresponding recall (<it>R</it>). The last three columns refer to the top module only: its enrichment <it>p</it>-value, its number of true positives (<it>tp</it>) and the proportion of its genes annotated to the respective GO class (precision <it>P</it>).</p>
                  </tblfn>
               </tbl>
            </sec>
            <sec>
               <st>
                  <p>Pheromone response modules</p>
               </st>
               <p>In order to further assess the capacity of ENIGMA to discover biologically relevant connections between genes and processes, we took a closer look at the mating-related ENIGMA modules. The Rosetta compendium contains expression data on at least 20 mating-related perturbations, and consequently the mating pheromone response system is well resolved in the ENIGMA network. Several mating-related modules were uncovered (notably modules 28, 77, 115 and 171, see Figure <figr fid="F1">1</figr>, Figure <figr fid="F4">4</figr> and supplementary material at <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>).</p>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>Mating module 77</p>
                  </caption>
                  <text>
                     <p><b>Mating module 77</b>. A module enriched in pheromone response genes. The colors of individual spots reflect the expression ratios (experiment vs. control, blue = upregulated, yellow = downregulated, white = missing value). The module is split in leafs in both dimensions based on average linkage clustering using a cos<it>&#952; </it>threshold of 0.65. In order not to crowd the figure, leafs of size &lt; 3 are grouped in a single leaf beyond the red line (rightmost leaf and bottom leaf). Transcription factors are highlighted in yellow in the gene list if there is ChIP data available for them, while other regulators are highlighted in red. To the right of the expression matrix is a column indicating the module's seed genes (red). Further to the right is a matrix depicting the presence of enriched TF binding sites (yellow) and/or significant co- or antiexpression links with potential regulators (green and red, respectively; the hue is proportional to the <it>p</it>-value of the link; in case of overlap with an enriched binding site, the field is colored dark green or dark red). The expression profiles of these regulators are depicted on top of the module's expression matrix. Note that regulators that are part of the module are not repeated on top unless they also exhibit significant binding site enrichment. To the far right are matrices depicting the genes' membership of enriched GO categories (orange) and membership of other modules (blue). The black and magenta arcs represent protein and genetic interactions, respectively. The arrow indicates the <it>tec1</it>&#916; experiment (see main text).</p>
                  </text>
                  <graphic file="1752-0509-2-33-4"/>
               </fig>
               <p>Module 28 is most strongly related to mating (see Figure <figr fid="F1">1</figr>). Twenty-three of the 37 genes in this module are annotated to the GO category 'conjugation' (GO:0000746, <it>p </it>= 3.98E-29). Four TFs exhibit binding enrichment in module 28: Ste12, Dig1, Mcm1 and Tec1. All of these function in the regulation of the mating pheromone response (which includes mating, pseudohyphal and invasive growth). Two regulators show significant coexpression links with the module: Ste12, an important regulator of the mating response (which is in fact part of the module) and Tec1, a transcription factor involved in the regulation of haploid invasive and diploid pseudohyphal growth. The mating and invasive/pseudohyphal growth signaling pathways share many of the same components, and Tec1 is believed to mediate an invasive growth response upon low levels of pheromone signaling <abbrgrp><abbr bid="B53">53</abbr><abbr bid="B54">54</abbr></abbrgrp>. Whereas Ste12 appears to be the main regulator for module 28, Tec1 is mainly coexpressed with genes that are shared between module 28 and modules 77, 115 or 171. Modules 115 and 171 are smaller pheromone response-related modules (see figures in supplementary material <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>). Both modules contain Tec1 as a member gene, suggesting that these modules might be more related to pseudohyphal growth than to the conjugation process.</p>
               <p>Module 77 exhibits a more complicated substructure, with five major patterns (1&#8211;5) in the condition dimension and five in the gene dimension (a-e, leafs 6 and f group smaller leafs, see Figure <figr fid="F4">4</figr>). Most of the known mating-related genes in module 77 reside in the gene leafs e and f. Several genes in these leafs overlap with the mating module 28. In contrast, most of the genes in the leafs b and c overlap with module 12, a module enriched in cell wall biogenesis genes (<it>p </it>= 6.26E-10). Nevertheless, most of these genes contain binding sites for Ste12 and Dig1 and some for Tec1, justifying their presence in a pheromone response-related module. While the genes in leaf c appear to be genuinely related to cell wall biogenesis, none of the genes in leaf b is annotated as such. Compared to the genes in leaf c, the genes in leaf b show a distinctive subpattern in condition leaf 2, which mainly contains perturbations that affect the cell cycle, DNA maintenance and DNA repair. Interestingly, the genes in leaf b also distinguish themselves from the ones in leaf c (except for <it>MID2</it>) by the presence of TF binding sites for Swi4 and Swi6, which together form the SBF complex that regulates transcription at the G1/S transition <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. Additionally, the genes in leaf b show strong coexpression links with the cyclins Cln1 and Cln2. Both Swi4 and Swi6 are potential substrates of the protein kinase Cdc28, which is activated by Cln1 and Cln2 <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>. Together, these data suggest that the genes in leaf b function at the interface of cell wall biogenesis, the G1/S transition and mating/filamentous growth. Such a link makes sense since upon activation of the pheromone signaling pathway, the yeast cell cycle is arrested in G1 and extensive cell wall rearrangements take place <abbrgrp><abbr bid="B57">57</abbr></abbrgrp>.</p>
               <p>Together with the genes in leafs b and f, most genes in leaf e are strongly repressed under the pheromone response-related perturbations in condition leaf 1. Unlike leafs b and f, only a few genes in leaf e (<it>FUS1 </it>and <it>WSC3</it>) feature bona fide Ste12 or Tec1 binding sites. However, the expression of the other genes in leaf e (with the exception of <it>HAP1</it>) is specifically and strongly downregulated upon deletion of <it>TEC1 </it>in a haploid strain (arrow in Figure <figr fid="F4">4</figr>), suggesting that these genes are somehow transcriptionally regulated by Tec1. Further investigation revealed that several of the genes in leaf e are flanked by or overlap with an antisense Ty1 retrotransposon long terminal repeat (LTR) on the 3' side (<it>GAS2</it>, <it>YLR334C</it>, <it>YOL106W</it>) or the 5' side (<it>NDJ1</it>). The presence of these Ty elements is highly relevant, since <it>TEC1 </it>was originally described as a gene required together with <it>STE12 </it>for full Ty1 expression <abbrgrp><abbr bid="B58">58</abbr><abbr bid="B59">59</abbr></abbrgrp>. Three of these genes (<it>GAS2</it>, <it>YLR334C </it>and <it>NDJ1</it>) were found to be directly or indirectly associated with <it>TEC1 </it>in a previous study in which the Rosetta compendium was analyzed using a Bayesian network framework <abbrgrp><abbr bid="B60">60</abbr></abbrgrp>. A peculiar member gene of leaf e is <it>HAP1</it>, which encodes a transcription factor involved in the regulation of respiratory metabolism in response to levels of heme and oxygen. Interestingly, <it>HAP1 </it>also contains a 3' Ty1 insertion in the yeast strain used by Hughes et al. (a derivative of strain S288c) <abbrgrp><abbr bid="B61">61</abbr></abbrgrp>, which helps explain its puzzling presence in a pheromone response module and strengthens our belief that the Ty1 elements are responsible for the link between leaf e genes and mating genes.</p>
               <p>The coexpression of <it>NDJ1 </it>with <it>TEC1 </it>can be directly explained by the presence of a 5' Ty1 LTR in the antisense direction (Ty1 LTRs have been found to drive expression in an orientation-independent manner <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>). For <it>GAS2</it>, <it>YLR334C</it>, <it>HAP1 </it>and <it>YOL106W</it>, the situation is different given the 3' location of the flanking Ty1 LTRs. Tec1 and Ste12 activation of these Ty1 elements could in theory cause the production of antisense transcripts of these loci. Since the probes spotted on the microarray used by Hughes et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> contained both strands of the gene sequences, such antisense transcripts might be responsible for the observed coexpression pattern.</p>
               <p>We did not test the antisense hypothesis; the analysis we present here is merely intended as a use case to show that ENIGMA can generate hypotheses that can be tested in the lab. We did however briefly investigate whether the Ty1-associated genes (or maybe their antisense transcripts) could be functionally related to the mating process. Only two genes in leaf e (<it>PRM5 </it>and <it>FUS1</it>) are known to be involved in mating. Neither of them is flanked by a Ty1 LTR. One gene overlapping with an antisense Ty1 LTR, <it>YOL106W</it>, was previously reported to elicit a mating-related phenotype upon deletion <abbrgrp><abbr bid="B62">62</abbr></abbrgrp>. We performed mating experiments, halo assays and growth assays (see Methods) for two other 3' Ty1-associated genes in leaf e, namely <it>YLR334C </it>(overlapping antisense Ty1 LTR) and <it>GAS2 </it>(non-overlapping antisense Ty1 LTR), in addition to a wild type (WT) strain and <it>sst2</it>&#916;, a mutant that is supersensitive to mating factor-induced G1-arrest.</p>
               <p>The <it>ylr334c</it>&#916; deletion strain did not yield an interesting phenotype in any of the assays. The <it>gas2</it>&#916; deletion strain exhibited an interesting phenotype in the halo assay, characterized by extensive colony formation inside the halo (see Figure <figr fid="F5">5</figr>), which indicates that deletion of <it>GAS2 </it>somehow facilitates the recovery from <it>&#945;</it>-factor induced growth arrest. In the mating and growth assays, we did not observe any effect of <it>GAS2 </it>deletion on the mating ability (see Additional File <supplr sid="S1">1</supplr> Table S8, Table S9 and Figure S6). <it>GAS2 </it>is homologous to <it>GAS1</it>, which encodes a 1,3-<it>&#946;</it>-glucanosyltransferase required for cell wall assembly. In a recent study, <it>GAS2 </it>was found to be involved in spore wall assembly <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>. Ectopic expression of <it>GAS2 </it>under control of the <it>GAS1 </it>promoter was found to complement the <it>gas1</it>&#916; phenotype only partially, and only at pH = 6.5 <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>. It is therefore unlikely that <it>GAS2 </it>directly functions in regular cell wall assembly or maintenance. In one hypothetical scenario, antisense transcripts of <it>GAS2</it>, produced under control of Tec1, might interfere with the expression of its homolog <it>GAS1 </it>and hence indirectly with the formation and maintenance of the cell wall. An altered cell wall morphology might influence the efficiency with which <it>&#945;</it>-factor is inactivated, which could explain the observed <it>gas2</it>&#916; phenotype. Obviously, this is only a hypothesis and much more detailed experimentation is needed to unravel if and how <it>GAS2 </it>is linked to the pheromone response pathway. This is however outside the scope of the present study.</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Halo test for <it>&#945;</it>-factor based growth inhibition</p>
                  </caption>
                  <text>
                     <p><b>Halo test for <it>&#945;</it>-factor based growth inhibition</b>. Yeast strains (OD<sub>600</sub> = 1) were plated on YPD plates and 1000 pmol of <it>&#945;</it>-factor was spotted. Pictures were taken after 48 hours of incubation at 30&#176;C. Strains: A: Wild type BY4741 (<it>MAT</it><b>a </b><it>his</it>3&#916;1 <it>leu</it>2&#916;0 <it>met</it>15&#916;0 <it>ura</it>3&#916;0), B: <it>sst</it>2&#916;, C: <it>gas</it>2&#916;.</p>
                  </text>
                  <graphic file="1752-0509-2-33-5"/>
               </fig>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Implementation</p>
            </st>
            <p>ENIGMA is implemented as a command-line Java application that is open-source and freely available (under the GNU General Public License) from <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>. ENIGMA can be used for any organism for which there is sufficient gene expression data available. The only organism-specific part of the ENIGMA algorithm is the functional annotation module, which is based on the BiNGO tool <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. ENIGMA can be used out-of-the-box for 24 organisms, including yeasts, invertebrates, plants and mammals (see Manual section of <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>). Furthermore, ENIGMA allows the use of custom GO annotation files and GO Consortium files to accommodate other organisms.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have developed a novel method, called ENIGMA, to analyze perturbational microarray data. One of the innovations of our methodology is the use of a combinatorial statistic that is capable of detecting significant partial coexpression relationships between genes. In this respect, our method can be considered similar in purpose to biclustering methods, although ENIGMA assesses coexpression links between individual genes rather than expression coherence in a group of genes under a group of conditions. Our method produces both a detailed network of significant pair-wise coexpression links and a high-level representation of the modularity in the expression network.</p>
         <p>Tests on artificial data have shown that ENIGMA outperforms other methods, although it beats SAMBA on points rather than by knockout. Similar near-draws with SAMBA were reported earlier for cMonkey <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> and BiMax <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. This indicates that the (bi)clustering field has matured to a point at which it becomes increasingly difficult to improve on the performance of existing methods. However, ENIGMA does have some specific advantages. First, in contrast to other discretization-based methods such as SAMBA, ENIGMA discretizes the expression data based on differential expression <it>p</it>-values. Second, ENIGMA efficiently retrieved the correct number of modules from artificial datasets and actively avoids generating redundant modules, which greatly improves the interpretability of the results. Third, ENIGMA's clustering parameters are automatically optimized or can be set on relatively objective grounds. A fourth advantage, which is more obvious on real data, is the usefulness of ENIGMA's expression module concept for biological discovery. In contrast to the coherent biclusters generated by most methods, an ENIGMA expression module may contain distinctive subpatterns. From our analysis of the Rosetta data, it became apparent that these subpatterns frequently represent more tightly coregulated gene clusters involved in biological processes related to a common functional theme. In our view, the grouping of such different but statistically and functionally connected patterns in one module aids greatly in the biological interpretation of the data and in the assessment of crosstalk between biological processes. The interpretation of a module's substructure is further facilitated by the integration of other data types. This is illustrated in our analysis of module 77, a pheromone response module which shows links to the cell cycle, cell wall biosynthesis and Ty1 LTR-associated genes.</p>
         <p>Although numerous approaches have already been used to mine the Rosetta compendium <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B49">49</abbr><abbr bid="B60">60</abbr><abbr bid="B62">62</abbr><abbr bid="B64">64</abbr></abbrgrp>, ENIGMA offers yet another perspective on the data. This mainly illustrates that the ideal clustering method does not exist <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B65">65</abbr></abbrgrp>, and that no single approach can extract all the information hidden in large compendium datasets. The elucidation of the regulatory networks governing the many different aspects of cellular function will therefore not only require the integration of different types of data, but also the integrated use of several complementary methods to analyze these data. We believe that ENIGMA constitutes a valuable addition to the existing repertoire of analysis methods.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Mating experiments</p>
            </st>
            <p>Yeast strains were grown overnight in YPD [yeast extract (1%), peptone (2%) and glucose (2%)] and diluted to OD<sub>600 </sub>= 0.5 in fresh YPD. 500 <it>&#956;</it>l of each strain (<it>MAT</it><b>a</b>) was mixed with 500 <it>&#956;</it>l of the wild type strain (<it>MAT</it><b><it>&#945;</it></b>). The cells were shaken at 180 rpm at 30&#176;C. At time points 0 h, 4 h and 24 h, 100 <it>&#956;</it>l samples were serially diluted and plated on medium lacking either methionine (<it>MAT</it><b><it>&#945;</it></b>), lysine (<it>MAT</it><b>a</b>) or methionine and lysine (diploids).</p>
         </sec>
         <sec>
            <st>
               <p>Halo assay</p>
            </st>
            <p>A halo assay to measure response to and recovery from pheromone-induced growth arrest was performed as follows. Yeast cells (<it>MAT</it><b>a</b>) were grown overnight and diluted to OD<sub>600 </sub>= 1. 500 <it>&#956;</it>l was plated on YPD plates (1.5% agar in YPD). When the plates were dry, 2 <it>&#956;</it>l of the <it>&#945; </it>mating factor (= 1000 pmol) was spotted. The cells were allowed to grow for 48 h before the plates were scanned.</p>
         </sec>
         <sec>
            <st>
               <p>Growth assay</p>
            </st>
            <p>Yeast strains (<it>MAT</it><b>a</b>) were incubated with the wild type strain (<it>MAT</it><b><it>&#945;</it></b>) for 4 hours as described above and diluted to OD<sub>600 </sub>= 0.1. The length of the lag phase and the maximum growth rate of yeast strains in SDglu without lysine and methionine were monitored automatically by OD<sub>600 </sub>measurements with a BioscreenC apparatus (Labsystems). The parameters were as follows: 300 <it>&#956;</it>l of culture in each well, 30 s of shaking every 3 min (medium intensity), and OD<sub>600 </sub>measurement every hour. Readings saturate at OD<sub>600</sub> values above 1.5.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>SM designed the study, developed the methods, analyzed and interpreted the data, and wrote the paper. PVD performed and analyzed the mating experiments, and MK designed the study and supervised the project.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Yvan Saeys, Thomas Abeel, Yves Van de Peer, Johan Thevelein, Dirk Aeyels and two anonymous reviewers for helpful comments on the manuscript. Cindy Colombo is acknowledged for her technical assistance and Martine De Cock for help in preparing the manuscript. SM is a Postdoctoral Fellow of the Research Foundation Flanders (Belgium).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>A new approach to decoding life: systems biology</p>
            </title>
            <aug>
               <au>
                  <snm>Ideker</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Galitski</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hood</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Annu Rev Genomics Hum Genet</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>343</fpage>
            <lpage>372</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1146/annurev.genom.2.1.343</pubid>
                  <pubid idtype="pmpid" link="fulltext">11701654</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
      