Abstract
Background
Transcriptional gene regulation is one of the most important mechanisms in controlling many essential cellular processes, including cell development, cellcycle control, and the cellular response to variations in environmental conditions. Genes are regulated by transcription factors and other genes/proteins via a complex interconnection network. Such regulatory links may be predicted using microarray expression data, but most regulation models suppose transcription factor independence, which leads to spurious links when many genes have highly correlated expression levels.
Results
We propose a new algorithm to infer combinatorial control networks from geneexpression data. Based on a simple model of combinatorial gene regulation, it includes a messagepassing approach which avoids explicit sampling over putative generegulatory networks. This algorithm is shown to recover the structure of a simple artificial cellcycle network model for baker's yeast. It is then applied to a largescale yeast gene expression dataset in order to identify combinatorial regulations, and to a data set of direct medical interest, namely the Pleiotropic Drug Resistance (PDR) network.
Conclusions
The algorithm we designed is able to recover biologically meaningful interactions, as shown by recent experimental results [1]. Moreover, new cases of combinatorial control are predicted, showing how simple models taking this phenomenon into account can lead to informative predictions and allow to extract more putative regulatory interactions from microarray databases.
Background
Transcriptional gene regulation is one of the key mechanisms in living cells; the control of gene expression is crucial in processes as cell development, cellcycle regulation, and response to external stimuli [25]. While the number of sequenced genomes is growing rapidly, it becomes more and more important to study genetic information on a higher level, i.e. to understand genes in their interdependence and to capture relations between regulatory genes, e.g. transcription factors (TF) or signaling proteins, and regulated genes via the reconstruction of generegulatory networks (GRN).
Direct experimental approaches to understand gene regulation are money and time consuming. Therefore genomescale regulatory networks are only known for E. coli [6] and for baker's yeast, S. cerevisiae [7,8]. For higher organisms, the knowledge is restricted to intensively studied small functional modules, see e.g. [9,10]. Some characteristic features of these GRN are:
• Directionality: Regulatory control is directed from regulators to regulated genes.
• Sparsity: Each single gene is controlled by a limited number of other genes, which is small compared to the total gene content (and also to the total number of TFs) of an organism.
• Combinatorial control: The expression of a gene may depend on the joint activity of various regulatory proteins.
The last item is crucial, and it is the topic of very active and diversified research [1115]. One example of combinatorial control in yeast is the case of transcription factors Yrr1 and Yrm1, which compete for occupancy of the same promoter sequence [16]. Many other types of combined control exist, such as the formation of hetero or homodimers by TFs, or their posttranslational modification by other proteins, which can entirely change their targets [17]. On the other hand, the hypothesis of sparsity has been experimentally checked in wellstudied organisms, where it has been observed that the number of TFs is low compared to the total number of genes.
It is tempting to ask in how far GRN can be reconstructed from geneexpression data. After the advent of the first generation of geneexpression microarrays, more than a decade ago [18], we face an growing number of new highthroughput technologies capable of monitoring simultaneous concentrations of thousands of cellular components, in particular of mRNAs. The improved quality of new generations of microarrays, the decrease of their cost, and the amount of experiments accumulated so far call for the development of largescale methods of data analysis. Different approaches to modeling have been proposed (see [19] for a recent review), from a coarsegrained description of coregulated genes [20], classification methods [21,22], to Boolean descriptions where genes are described in terms of logical switches with only on/off states of activity [23] (and in particular [24] for the problem of inference of boolean networks), or considering more realistic systems of differential equations describing the kinetic details [25]. Also for GRN reconstruction, approaches from different origins have been proposed: system control theory [2629], Bayesian inference [3033], information theory [3436].
Many limitations of the existing algorithms arise directly from the quantity and quality of data:
Microarrays are noisy averages over cell populations, and the number of available arrays is normally much smaller than the number of probes measured in each array. Moreover, microarrays measure mRNA but not active protein concentrations (which, for TFs, are the important parameters). Both may be uncorrelated in the cell [37]. But as proteomics data are even sparser than microarray data, this is not an easytosolve problem, and many modeling approaches use mRNA concentration alone. Another problem is the existence of combinatorial control in gene regulation: Predicting such cases is a NPcomplete problem, and has therefore eluded many approaches due to computational complexity, although some recent and interesting progress has been achieved in [33].
In this paper we introduce a novel algorithmic strategy, based on messagepassing techniques, to infer the regulatory network of an organism based solely on genomewide expression data, that specifically focuses on combinatorial control. Our methodology is probabilistic and distributed, allowing for a fast exploration of the space of networks. We apply the algorithm to three yeast networks: (i) To test the efficiency of the algorithm, we first reconstruct an insilico regulatory network for cellcycle control from artificially generated data [38]. (ii) We propose a largescale reconstruction of the yeast regulatory network, using the classic Gasch microarray dataset [4], and analyze evidence for combinatorial control. (iii) We use yeast expression data from the SMD database [39] to recover the regulations affecting genes involved in pleiotropic drug resistance (PDR). This network is now under intense scrutiny because of the more and more common nosocomial infections by Candida yeasts [40], which are able to resist to drugs by exporting them out of the cell. These resistance mechanisms are genetically regulated by the PDR network, which we aim to reconstruct. An detailed description of the algorithm is given in the Methods section. An implementation in C can be downloaded at [41].
Results and Discussion
Reconstructing an insilico yeast cellcycle network
Before coming to biological data, we test our approach on the network model of Tang et al. [38] for cell cycle regulation in S. cerevisiae. The cell cycle is regulated by cyclins/CDK complexes, which sequentially activate and inhibit each other, creating a periodicity which is the clock of the cell. Recently sequential waves of transcriptional activation independent of cyclins activation have been discovered [42,43], but they are not taken into account in the model. It anyway serves as an ideal starting point for the the performance analysis of our analysis, since the data generating network is explicitly known and can be compared to our inferred regulatory interactions.
In the model of [38], the regulatory network consists of N = 11 genes/proteins, which are described by a binary state vector . Interactions are encoded into a coupling matrix with entries in total the model of [38] contains 15 activating and 19 repressing links. The definition of the network model is completed by the dynamical rule
with
Our aim here is to infer the regulatory links of this network model based on the different state vectors s^{t}. The above insilico dynamics shows 7 fixed points, i.e. stationary states of the dynamics. Each fixed point can be characterized by the size of its basin of attraction, i.e. by the number of initial random initial conditions that end on it. Tang et al. argue that the fixed point with the largest basin of attraction can be identified with the G1 phase of the cell cycle. If one perturbs the stationary G1 state by flipping the Cln3 cyclin to its active value, the network passes trough 13 different states before reaching again G_{1}. The authors of [38] argue that this trajectory robustly reproduces various aspects of the yeast cell cycle.
We test our algorithm on two different data sets: (i) the 13 states obtained by first flipping the Cln3 cyclin to the active value, and letting the system evolve until stationarity as described before, (ii) a larger dataset containing the configurations of data set (i) and additionally the trajectories obtained by evolving all configurations at Hamming distance 1 away from G_{1 }(70 different states). In Additional File 1 we include both data sets together with the links of the network.
Additional file 1. All files are in additional file 1.
Format: ZIP Size: 444KB Download file
In order to deal with time series, Eq. (9) for the prior probability distribution is transformed into , to express the conditional probability of the target gene 0 at time t + 1 given the expression profile of the other genes at time t. For both data sets we fix the diluting field h to a value giving N_{eff }~ 30 according to Eq. (5). For the original data set (i) we fix σ_{D }0. while for the larger data set (ii), convergence of Belief propagation (BP) is ensured by σ_{D }= 0.3.
In Fig 1 we display the PrecisionRecall curve for the network inferred using BP, for both cellcycle and perturbed cellcycle data sets (cf. the paragraph about observables in Methods for a precise definition of precision and recall). Results are compared to the performance of a coexpression network which ranks links j → i according to the Pearson correlation of and . We see that on the original data set BP is able to correctly infer 11 links before making the first error, whereas Pearson correlation fails already after two correctly predicted links. This result shows that BP correctly manages to take into account combinatorial control effects, which cannot be seen by purely local methods (as pair correlations). Increasing the data set improves the outcome of BP, the larger data set leads to 16 correctly predicted links before the precision drops down from one, and the precision stays always above the one obtained from the 13state trajectory. It is also interesting to note that the first links inferred by our algorithm are those which where identified in [44] as essential for reproducing the cellcycle by a complete enumeration of the space of all networks.
Fig 1. PrecisionRecall curve for the cellcycle inference. PrecisionRecall curve for the network inferred using BP (both cellcycle and perturbed cellcycle data sets) and Pearson correlation coefficient (only cellcycle data set). In the case of BPbased inference we infer correctly 11 links (cellcycle) and 16 links (perturbed cellcycle) respectively before making the first error. In the case of the correlation based inference we make the first error after only 2 correct links.
Yeast response to environmental stresses
For a second application of BP  at much larger scale  we use the data of Gasch et al. [4], which consist of 172 genomewide microarrays of S. cerevisiae under different environmental conditions. We filter out all genes, which show little differential expression (variance smaller than three times the minimal variance measured) or which miss more than 10 data points. Thereby the gene number is reduced to 2659 target genes, i.e. to roughly half of the entire genome. As putative regulators we consider (i) genes annotated as transcription factors or structurally similar to known transcription factors, and (ii) genes involved in signaling [45]: their total number sums up to 460 putative inputs.
We run our algorithm with σ = 0.25 which equals the minimal variance of a gene found in the full data set. BP giving probabilistic results, we kept regulatory links with more than 95% of confidence.
As the distribution of the marginal probabilities follows a powerlaw distribution (data not shown), changing this threshold (e.g. going to 99% or 90%) has little effect on the final network. The network contains 5779 regulatory links, giving an average of 2.17 links per target; the inconnectivity has a distribution best fitted by an exponential law k = Ce^{γ }with γ = 0.42, a value very close to the reference one in [7]. Only 182 target genes (7%) have no predicted regulator. Moreover, 1637 targets (62%) are regulated by at least 2 genes, providing a wealth of potential predictions in the field of combinatorial control. Interestingly enough the finding of 2.17 links per target can be confronted with the result of Balaji et. al. [46], based on a review of Chipchip experiments, reporting a comparable average value of 2.9 regulators per target.
Combinatorial control
In order to assess the relevance of the inferred network, we compare it first to a network based on pairwise correlations of expression data (coexpression network), which was constructed to have the same number of links as the BP network. Selected links are those of highest absolute value of the Pearson correlation between all inputoutput gene pairs. This is clearly an oversimplified model, but it allows to grasp the significant features of our model.
One advantage of our algorithm is the explicit inference of combinatorial control mechanisms by multiple transcription factors. Indeed, the number of genes with multiple regulators inferred using our methodology is 1637, while it is only 612 in the case of the pairwisecorrelation network. The average number of regulators per regulated gene (i.e. genes with at least one inferred regulator) in our BP case is 2.33, and has to be compared to 2.9 from the work of Balaji et al. [46], and 6.17 for the coexpression network. It is interesting to note that BP results are is closer to the experimental network as compared to the coexpression one. This feature shows how, for the vast majority of target genes, our algorithm is able to describe the behavior of the gene by combining few putative regulators.
Another way of investigating combinatorial control is to compare expression profiles of different regulators. Regulators having highly correlated expression profiles carry similar information to the target gene, whereas regulators having diverse profiles can be used to transmit much more information. This is directly incorporated in our model: The sparsity term introduced in Eq. 5 reduces the effect of potential regulators whose expression profiles are highly correlated. As a limiting example let us consider two input genes with identical expression profiles, regulating one target gene. The sparsity term will select randomly only one of the two, and identify it as a regulator. In more realistic cases, no two genes shows exactly the same expression, and only the most explanatory gene will be chosen as a regulator out of a set of highly correlated potential TFs.
To quantify the independent information carried by each regulator we compute, as a simple measure, one minus the Pearson correlation coefficient between any two regulators of common target genes, see Fig 2. One can see that the information content is much higher using our methodology than simply coexpression, because the latter tends to discover redundant information as displayed in the example of Fig 3 for the target gene YDR518W. This specific example also shows that secondary regulators found by BP tend to correct discrepancies between the first regulator and the target gene.
Fig 2. Statistical independence Probability distribution. Plot of the statistical independence probability distribution (i.e. 1Pearson correlation coefficient) between pairs of TF regulating combinatorially the same target. Red curve: Coexpression network. Blue curve: BP network. Note that the statistical independence is much higher between TF inferred by BP, showing higher information content in combinatorial control.
Fig 3. An example of combinatorial control. Example for combinatorial control. Top: Three top inputs found by BP, with the prediction compared to the real output. The left bars are training, the right test data. Bottom: Three most Pearsoncorrelated genes and corresponding prediction. Over (under) expression is depicted in red (blue).
Comparison to experimental TF binding data
In order to further investigate the significance of the BP inferred network, we compare it to the experimentally verified network presented by Balaji et. al. [46], as characterized by 158 TFs, 4411 target genes, and 12974 regulatory links between them. After filtering out genes with low variance in the expression data set, the set of analyzed genes consists of 1919 targets, and 132 TFs. The number of experimentally verified links between these genes reduces consequently to 5533. Again we run BP with σ = 0.25, which equals the minimal variance of a gene found in the full data set, and we keep regulatory links with more than 90% of confidence. The resulting network has 6914 directed edges. Since these edges describe logical implications between gene expression levels, it is not clear in how far they reflect physical binding between the TF related to the input gene, and the promoter sequence of the target gene. It is easy to imagine that coregulated genes are discovered as predicting each other, or secondary targets in regulatory cascades are recognized as direct targets.
In fact, the overlap with the experimentally verified network is only 206 edges (the resulting network is provided in Additional File 1). In order to give a statistical assessment of this number, we compare it to the overlap with a null model: We scramble the links in the BP network randomly preserving the indegree of the inferred network. The overlap with the null model is 176 ± 5.3 edges, implying a zscore of 5.5, and a pvalue of 1.18 × 10^{8 }(under the hypothesis that the distribution of overlaps is Gaussian with mean and variance given by the nullmodel).
To check the effect of an increased number of experiments, we downloaded 1013 microarrays from the Stanford Microarray Database (SMD) [47]. Now 2614 target genes and 157 regulatory genes pass the statistical test, and the coverage of the experimental network increases to 7635 links. With respect to Gasch's data set, we use a 6fold higher number of arrays coming from different experiments, so we run BP at a higher noise value σ = 1.5. The resulting BP network has 16176 edges (around three times the number of edges inferred with Gasch dataset alone). The overlap with the experimentally verified network is 406 edges (the resulting network is provided in Additional File 1). The overlap with the nullmodel is 314 ± 7.9 edges. Thus we find a zscore of 11.6, and a pvalue of 1.6 × 10^{31}. As a comparison, we decided to analyze the same data set and the same set of 157 potential transcription factor with the ARACNe software [35]. To obtain statistically similar networks we set the data processing inequality threshold (a tunable parameter for controlling the overall number of edges in the network) to 0.10: the resulting network has 19775 directed edges (note that ARACNe produces undirected links). The overlap with the experimentally verified network is of 480 edges (data in Addition Files). The overlap with the nullmode is 424 ± 9.8 edges, with a zscore of 5.7 and a pvalue of 3.0 × 10^{9}.
The sensible increase of statistical significance with respect to the results using Gasch's data is encouraging: It indicates in quantitative form, that larger microarray numbers would allow for extracting substantially more information about regulatory processes from gene expression data.
Inference of the PDR network
We finally apply our algorithm to a small dataset, to tackle an issue of direct medical relevance: drug resistance among yeasts. S. cerevisiae is able to resist many drugs, using an ensemble of genes connected in the "pleiotropic drug resistance" network. The basic mechanism is that these genes, regulated by the master regulator PDR1, can export a broad range of substances out of the cell  drugs included. This general feature has been discovered in many organisms, and is considered a generic and robust mechanism of drug resistance, from bacteria to yeasts [48]. The precise regulations acting in this network are yet unknown, even if numerous works have already uncovered a part of them [4952]. Here we propose to look for combinatorial regulations in this network, in order to better understand how transcription factors dedicated to drug resistance collaborate to ensure cell survival in harsh conditions  that is, in the presence of drugs. We run our algorithm on 40 genes known to be involved in PDR processes as targets  selection was based on literature , and use all 157 transcription factors annotated in the database YEASTRACT [53] as potential regulators. The expression data consist of 912 microarrays from SMD [47]. Due to its small size, the statistical properties of the inferred network (see Fig 4) are quite different from the global one : 265 links were inferred at 95% confidence, giving a high average of 6.65 regulators per regulated gene. All target genes had at least one regulator; in fact only one had a single regulator (the GIS1 → STB5 couple).
Fig 4. The PDR network inferred. The PDR regulatory network inferred by BP, comprising 157 TF and 40 targets. Targets are shown in grey.
Again, as a comparison, we decided to analyze the same data set ARACNe. To obtain statistically similar networks, we set the data processing inequality to 0.10: 247 links were inferred (note that ARACNe produces undirected links). Both networks are provided in Additional File 1. As a first observation we note that 13 out of the 40 target gene appear not regulated in the ARACNe network. We can conclude that, at least in this case, ARACNe seems to produce links which are more concentrated to a smaller target number, with an indegree of 9.14 ± 6.6 TF/regulated target (to be confronted with the BP results of 6.625 ± 3.6).
Compared to the latest version of YEASTRACT, we find the following numbers of overlapping links: 16 in our case (if we consider the TF → target direction), and 28 if the direction is not taken into account. ARACNe, which produces an undirected network, has only 22 overlapping links. We also compared our findings with the network presented in the work of Balaji et al. [46]: in the BP case we match 8 directed edges and 15 undirected ones, whereas ARACNe matches 9 undirected links.
Moreover, a closer look to some predicted cases of combinatorial control gives interesting insights into the biology of drug resistance. In particular, we find RPN4, a transcriptional regulator of the proteasome, regulated by both PDR3 and YAP1. This interaction between drug resistance and the proteasome was already hinted in previous works concerning global stress resistance [54], and was recently proved experimentally [1]. This case is not found when running ARACNe on the same dataset, emphasizing the need for specially designed algorithms in order to uncover new cases of combinatorial control. Another interesting case of combinatorial regulation predicted in this analysis is the cross regulation of YAP1 and RAS1 by PDR1, PDR3 and RPN4. This complex regulation could therefore link drug resistance and proteasome regulation to the processes of cell aging and proliferation, regulated by RAS1. However, to our best knowledge there is no experimental evidence of this link, which is to be confirmed.
Conclusions
In this work, we have presented an effcient method for genomewide inference of regulatory networks, particularly designed to take into account cases of genetic combinatorial control. The method, based on message passing, was tested on a small insilico model for the cellcycle regulation in yeast, and then applied to both a largescale and a smallscale dataset. The test shows the accuracy of the method in case of informative data, and the applications predict meaningful network structures.
One relevant feature of our algorithm is its capability of unveiling patterns of combinatorial control. Even if the model of generegulation we used (linear superposition of inputs, followed by a nonlinear function) is very simple, it allows for regulators which account only for part of the target expression, and which may be corrected for by other regulators under other conditions, cf. Fig 3.
From the algorithmic point of view, our methodology allows to explore combinatorially the full space of regulatory networks while keeping the computational time short. The flexibility of the approach allows for integrating other type of data: to give an example, information about putative transcription factor binding sites in the regulatory region of an output gene can be easily integrated via a transcriptionfactor dependent diluting field h.
Finally, our method can be generalized to tackle a variety of issues in the field of gene regulation inference. One possibility is to try to discover new regulators, by a corrective methodology, starting with a known regulatory network and looking for the most relevant regulations to be added to this network. Another possibility is to use the information of combinatorial control in conjunction with the nature of the expression data to explain which conditions allow which combinatorial controls, opening the door to a wealth of genetic experiments and to a better understanding of the complexity of gene regulation.
Methods
Data encoding
Gene expression data are encoded into a (N + 1) × M input matrix of entries , with i = 0.1,..., N and μ = 1,..., M, where M is the number of experiments (arrays), N + 1 is the number of genes. The value is a real number that quantifies the level of expression of gene i in sample μ; more precisely, is the i logratio of the actual expression of the gene i and the expression of the same gene in a reference condition. A negative (positive) value indicates the under (over)expression of gene i i sample μ with respect to the reference. Here we use the vectorial notation to indicate expression pattern μ.
The task is the reconstruction of a network model which may explain these data. Using a statisticalphysics analogy, starting from some snapshots of the microscopic state of a system one tries to infer the energy function (Hamiltonian) governing its behavior. Note that due to the directed nature of gene networks this task can be formally factorized over regulated genes: we can ask first, which genes have a regulatory influence on gene 0, and how they interact combinatorially. Then we ask the same question for the regulators of gene 1, 2,..., N. To further simplify the possible influence other genes can have on target gene 0, we aim at a ternary classification of the influence of a gene i on 0:
This classification scheme is clearly an oversimplification with respect to biological reality, where a whole range of positive and negative interaction strengths is expected. On the other hand, given the peculiar restriction posed by the limited number of available expression patterns, having a simple but meaningful model reduces the risk of overfitting and produces results which are easier to interpret. Our algorithm can be easily extended to include more than three values for the J_{i→0}; in most cases we have analyzed this generalization does not increase the predictive power.
A minimal deterministic model of gene regulation
As a minimal functional model, we assume that a gene becomes overexpressed if the joint influence of its regulators is above some threshold τ, and it is repressed if the joint influence is below τ. Hence, indicating the sign function by sign(·), and introducing , we expect
to hold for as many expression patterns μ = 1,..., M as possible. In this sense, each pattern poses a constraint on the coupling vector J = (J_{1→0},..., J_{N→0}), and the problem of finding a good candidate vector J can be understood as an instance of a constraint satisfaction problem. A cost function for this problem counts the number of errors made in Equation (3),
with Θ being the Heaviside step function. Obviously threshold functions form only a restricted function space. Functions like a XOR (or realvalued generalizations of it) are not implementable in this way. However, due to the beforementioned problems with data quantity and quality and the risk of overfitting we must restrict our model to a class of functions which is biologically reasonable but does not depend on too many parameters.
The simplest prior biological knowledge we can include at this point is the sparsity of regulatory networks. In this sense, we are looking for coupling vectors J which are as sparse as possible, i.e. which contain as few as possible nonzero elements. The number of these entries is counted by
and will be incorporated into the cost function,
with acting as a parameter controlling the relative importance of the two contributions: A small h will favor couplings of low ℋ_{0}, a high h will force couplings to be sparse at the cost of possibly not satisfying some pattern constraints. We introduce a formal inverse temperature β and the corresponding Gibbs distribution
with . At the end we are interested in the lowβ case where the Gibbs distribution concentrates in lowcost configurations.
A minimal stochastic model of gene regulation
The previous deterministic scheme is appealing for its simplicity but does not take into account the noise present in real data. We will assume first that the actual expression value of gene 0 is given as the sum of the measured value and a Gaussian noise η of zero mean and variance σ^{2}:
Denoting a centered Gaussian of variance σ^{2 }by , we can write the probability of measuring a given value for variable 0 as
For σ → 0 we go back to the deterministic model (3), σ > 0 smoothes the Heaviside function into a sigmoidal function. The noise level that we estimate from data is encoded in the value of σ.
Bayesian Inference
We turn (9) into a probabilistic Bayesian framework [55]. Assuming statistical independence of x^{μ }for μ = 1,..., N, Bayes theorem allows to write the posterior probability of a coupling vector J:
As a prior for the coupling we use the distribution P(J) ∝ exp{hN_{eff}(J)} favoring (sparse) connections with small N_{eff}. Unsurprisingly, for σ → 0 one recovers Equation (7).
From this point of view the choice of the prior is analogous ℒ_{1 }regularization method [56], but on a discrete vector of elements in {±1, 0}, i.e. in a case where the ℒ_{1 }regularization is equivalent to the ℒ_{0 }one.
Belief Propagation
The belief propagation (BP) algorithm is exact on treelike graphical models, but it has been extensively used as an heuristic procedure to solve problems defined on sparse graphs [57,58]. Recently, the same approach has been shown to be a good approximation also for problems with dense graph structure [5961]. BP is an iterative algorithm for estimating marginal probability distributions. It works by locally exchanging messages, until global consistence is achieved. The messages sent between variable nodes i (couplings) and function nodes μ (constraints) are:
• The probability ρ_{μ→i}(J_{i→0}) that constraint μ forces variable i to assume value J_{i → 0}.
• The probability P_{i→μ}(J_{i→0}) that variable i takes value J_{i→0 }in the absence of constraint μ.
The BP equations establish an approximate relation between these messages,
Proportionality constants are easily determined by normalization. The algorithm starts from randomly initialized messages and stops when convergence is reached. Our convergence criterion requires the difference between all message at time t and the corresponding ones at time t  1 to be less than a predefined threshold (10^{8 }in our simulations). Upon convergence marginal probability distributions are given by
From the point of view of algorithmic complexity, Eq. (12) still contains a sum over (3^{N}) terms, so the direct implementation of BP is not feasible for large systems. This problem can be solved approximately: Eq. (12) can be understood as the average of over N  1 independent random variables {J_{j→0}j ≠ i}, with depending on the coupling vector only via the sum . For a sufficiently large system we can use the central limit theorem and approximate the exponential sum by a single Gaussian integration,
With
The notation ⟨·⟩_{j→μ }stands for the average over P_{j→μ}(J_{j→0}).
Of course the central limit theorem is meant to be valid in the limit of N → ∞. In practice the Gaussian approximation produces the same results as the exact computation of Eq. (12) already for N ~ 10 (where the exact computation is clearly feasible).
Computational complexity
By means of the Gaussian approximation, the complexity of Eq. (12) is reduced from (3^{N}) to (N), and that of the overall iteration to (MN). The apparent complexity (MN^{2}) of updating M N messages in time (N) can be reduced to (MN). by a simple trick: The sums in Eqs. (16) can be calculated over all j once for each μ, so only the contribution of i has to be removed in the update of ρ_{μ→i }for each i. This allows to make the single update step in constant time. A precise estimate of the overall complexity of the algorithm would require to control the scaling of the number of iterations needed for convergence. A theoretical analysis of BP convergence times in a general setting remains elusive. Some recent progress for the simpler matching problem can be found in [62]. In all the simulations presented in this work, convergence is always reached in less then 50 iterations.
It would be interesting to compare the efficiency of our algorithm with the computational strategy proposed in [33], based on a Monte Carlo Markov Chain (MCMC) sampler over the model space. In our experience, however, MCMC methods have in general some intrinsically associated problems, mainly due to the fact that the convergence (or mixing) time is hard to assess and often is exponential.
Observables
Marginals  We do not aim at constructing a single highscoring coupling vector J like in a maxlikelihood approach. Depending on the shape of the probability space, this vector might be very different from the one actually generating the data. We are instead interested in characterizing the ensemble of all highscoring vectors, or more precisely in the marginal probabilities , which tell us how frequently the coupling from i to 0 takes value J_{i→0}. We can therefore base a global ranking of all potential couplings i → 0 on the probabilities 1  P_{i}(J_{i→0 }= 0) of being nonzero.
When dealing with an artificial data set generated by a known coupling vector J^{true}, we can measure the similarity of our inference result J with the true coupling. To this aim we define
The objective of inference is predicting a fraction of all couplings with high precision, i.e. to have an as high as possible number of TP with a low number of FP. The quality of the inference can be accounted for by confronting recall (or sensitivity) RC = N_{TP }/(N_{TP }+ N_{FN}) and precision (or specificity) PR = N_{TP }/(N_{TP }+ N_{FP}). The recall describes the fraction of all existing nonzero couplings which are recovered by the algorithm, whereas the precision tells us the fraction of all predicted links being actually present in the data generator.
Entropy  An interesting quantity to measure is the entropy, i.e. the logarithm of the number of highscoring coupling vectors compatible with our data set. Within BP, it can be approximated by the Betheentropy
where S^{μ }= ∑_{J }P_{μ}(J) ln P_{μ}(J), and S_{i }= ∑_{Ji }P_{i}(J_{i}) ln P_{i}(J_{i}), and P_{μ}(J) is defined as
i.e. it takes into account the contribution of a single constraint to the probability distribution of J.
Parameter fixing and zeroentropy criterion
The diluting field h is the conjugate variable of the number of effective link , so we can equivalently fix one of the two quantities. One can decide to fix the number of effective links, and thus the size of the searched gene signature, and to choose h accordingly. To find the correct value of h we apply a cooling procedure where, after each interaction of the BP equations step, we increase (resp. decrease) h depending on whether the effective number of link is higher (resp. lower) than the desired value. Since the true number of relevant genes is an unknown quantity, the chosen value for , itself is a free parameter. In practice, in the cooling procedure of the h field, we monitor the value of the entropy and we stop the iteration when as soon as it becomes lower then zero, i.e. at the point where we are able to restrict the of the number of possible solution to our problem to a subexponential number (remember that the entropy here indicates the logarithm of the number of solutions). Upon a further increase of h the entropy becomes negative, and no zero energy solution is found at that value of the dilution parameter h.
In all our simulations we have taken the limit β → ∞.
Authors' contributions
All authors equally contributed to this work. All authors read and approved the final manuscript.
Acknowledgements
This work was partially funded by Microsoft TCI. The authors want to thank M. Vergassola, F. Devaux, and G. Lelandais, who helped a lot in the interpretation of the PDR results.
References

Salin H, Fardeau V, Piccini E, Lelandais G, Tanty V, Lemoine S, Jacq C, Devaux F: Structure and properties of transcriptional networks driving selenite stress response in yeasts.
BMC Genomics 2008, 9:333. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

DeRisi JL, Iyer VR, Brown PO: Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. [http://www.sciencemag.org/cgi/content/abstract/278/5338/680] webcite
Science 1997, 278(5338):680686. PubMed Abstract  Publisher Full Text

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive Identification of Cell Cycleregulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. [http://www.molbiolcell.org/cgi/content/abstract/9/12/3273] webcite
Mol Biol Cell 1998, 9(12):32733297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gasch A, et al.: Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. [http://www.molbiolcell.org/cgi/content/abstract/11/12/4241] webcite
Mol Biol Cell 2000, 11(12):42414257. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Causton HC, Ren B, Koh SS, Harbison CT, Kanin E, Jennings EG, Lee TI, True HL, Lander ES, Young RA: Remodeling of Yeast Genome Expression in Response to Environmental Changes. [http://www.molbiolcell.org/cgi/content/abstract/12/2/323] webcite
Mol Biol Cell 2001, 12(2):323337. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

ShenOrr S, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli.
Nat Genet 2002, 31:6468. PubMed Abstract  Publisher Full Text

Guelzim N, Bottani S, Bourgine P, Kepes F: Topological and causal structure of the yeast transcriptional regulatory network.
Nat Genet 2002, 31:6063. PubMed Abstract  Publisher Full Text

Milo R, ShenOrr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network Motifs: Simple Building Blocks of Complex Networks. [http://www.sciencemag.org/cgi/content/abstract/298/5594/824] webcite
Science 2002, 298(5594):824827. PubMed Abstract  Publisher Full Text

Davidson EH, et al.: A Genomic Regulatory Network for Development. [http://www.sciencemag.org/cgi/content/abstract/295/5560/1669] webcite
Science 2002, 295(5560):16691678. PubMed Abstract  Publisher Full Text

Albert R, Othmer H: The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster.
Journal of Theoretical Biology 2003, 223:118. PubMed Abstract  Publisher Full Text

Aguilar D, Oliva B: Topological comparison of methods for predicting transcriptional cooperativity in yeast.
BMC Genomics 2008, 9:137. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Banerjee N, Zhang MQ: Identifying cooperativity among transcription factors controlling the cell cycle in yeast.
Nucleic Acids Res 2003, 31(23):70247031. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chang YH, Wang YC, Chen BS: Identification of transcription factor cooperativity via stochastic system model.
Bioinformatics 2006, 22(18):22762282. PubMed Abstract  Publisher Full Text

Nagamine N, Kawada Y, Sakakibara Y: Identifying cooperative transcriptional regulations using proteinprotein interactions.
Nucleic Acids Res 2005, 33(15):48284837. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tsai HK, Lu HHS, Li WH: Statistical methods for identifying yeast cell cycle transcription factors.
Proc Natl Acad Sci USA 2005, 102(38):1353213537. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

LucauDanila A, Delaveau T, Lelandais G, Devaux F, Jacq C: Competitive promoter occupancy by two yeast paralogous transcription factors controlling the multidrug resistance phenomenon.
J Biol Chem 2003, 278(52):5264152650. PubMed Abstract  Publisher Full Text

Benayoun BA, Veitia RA: A posttranslational modification code for transcription factors: sorting through a sea of signals.
Trends Cell Biol 2009, 19(5):189197. PubMed Abstract  Publisher Full Text

Schena M, Shalon D, Davis RW, Brown PO: Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. [http://www.sciencemag.org/cgi/content/abstract/270/5235/467] webcite
Science 1995, 270(5235):467470. PubMed Abstract  Publisher Full Text

Li H, Xuan J, Wang Y, Zhan M: Inferring regulatory networks.
Front Biosci 2008, 13:263275. PubMed Abstract  Publisher Full Text

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. [http://www.pnas.org/cgi/content/abstract/95/25/14863] webcite
Proceedings of the National Academy of Sciences 1998, 95(25):1486314868. Publisher Full Text

Ernst J, Beg QK, Kay KA, Balázsi G, Oltvai ZN, BarJoseph Z: A semisupervised method for predicting transcription factorgene interactions in Escherichia coli.
PLoS Comput Biol 2008, 4(3):e1000044. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mordelet F, Vert JP: SIRENE: supervised inference of regulatory networks. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/16/i76] webcite
Bioinformatics 2008, 24(16):i7682. PubMed Abstract  Publisher Full Text

Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures.

Liu W, Lähdesmäki H, Dougherty E, Shmulevich I: Inference of Boolean Networks Using Sensitivity Regularization.
EURASIP Journal on Bioinformatics and Systems Biology 2008., 2008 Publisher Full Text

Arkin A, Ross J, McAdams HH: Stochastic Kinetic Analysis of Developmental Pathway Bifurcation in Phage lambdaInfected Escherichia coli Cells. [http://www.genetics.org/cgi/content/abstract/149/4/1633] webcite
Genetics 1998, 149(4):16331648. PubMed Abstract  PubMed Central Full Text

D'Haeseleer P, Liang S, Somogyi R: Genetic network inference: from coexpression clustering to reverse engineering. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/16/8/707] webcite
Bioinformatics 2000, 16(8):707726. PubMed Abstract  Publisher Full Text

Yeung MKS, Tegner J, Collins JJ: Reverse engineering gene networks using singular value decomposition and robust regression. [http://www.pnas.org/cgi/content/abstract/99/9/6163] webcite
Proceedings of the National Academy of Sciences 2002, 99(9):61636168. Publisher Full Text

Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. [http://www.sciencemag.org/cgi/content/abstract/301/5629/102] webcite
Science 2003, 301(5629):102105. PubMed Abstract  Publisher Full Text

Tegner J, Yeung MKS, Hasty J, Collins JJ: Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling. [http://www.pnas.org/cgi/content/abstract/100/10/5944] webcite
Proceedings of the National Academy of Sciences 2003, 100(10):59445949. Publisher Full Text

Hartemink A, Gifford D, Jaakkola T, Young R: Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks.

Pe'er D, Regev A, Elidan G, Friedman N: Inferring subnetworks from perturbed expression profiles. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S215] webcite
Bioinformatics 2001, 17(suppl1):S215224. PubMed Abstract  Publisher Full Text

Wang K, Nemenman I, Banerjee N, Margolin A, Califano A: Genomewide discovery of modulators of transcriptional interactions in human B lymphocytes. In Research in Computational Molecular Biology. Volume 3909. Springer Berlin/Heidelberg; 2006::348362.
[doi:10.1007/11732990]
Publisher Full Text 
Mukherjee S, Pelech S, Neve RM, Kuo WL, Ziyad S, Spellman PT, Gray JW, Speed TP: Sparse combinatorial inference with an application in cancer biology. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/2/265] webcite
Bioinformatics 2009, 25(2):265271. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Basso K, Margolin AA, Stolovitzky G, Klein U, DallaFavera R, Califano A: Reverse engineering of regulatory networks in human B cells.
Nat Genet 2005, 37(4):382390. PubMed Abstract  Publisher Full Text

Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera R, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context.
BMC Bioinformatics 2006, 7(Suppl 1):S7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Santhanam N, Wainwright M: Informationtheoretic limits of selecting binary graphical models in high dimensions.

Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and mRNA abundance in yeast.
Mol Cell Biol 1999, 19(3):17201730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li F, Long T, Lu Y, Ouyang Q, Tang C: The yeast cellcycle network is robustly designed. [http://www.pnas.org/cgi/content/abstract/101/14/4781] webcite
Proceedings of the National Academy of Sciences 2004, 101(14):47814786. Publisher Full Text

Sherlock G, HernandezBoussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA, Eisen MB, Spellman PT, Brown PO, Botstein D, Cherry JM: The Stanford Microarray Database.
Nucleic Acids Res 2001, 29:152155. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Trofa D, Gàcser A, Nosanchuk JD: Candida parapsilosis, an emerging fungal pathogen.
Clin Microbiol Rev 2008, 21(4):606625. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Diluted Perceptron Algorithm [http://www.polito.it/cmp/code/dilperc] webcite

Kovacs LAS, Orlando DA, Haase SB: Transcription networks and cyclin/CDKs: the yin and yang of cell cycle oscillators.
Cell Cycle 2008, 7(17):26262629. PubMed Abstract  Publisher Full Text

Orlando DA, Lin CY, Bernard A, Wang JY, Socolar JES, Iversen ES, Hartemink AJ, Haase SB: Global control of cellcycle transcription by coupled CDK and network oscillators.
Nature 2008, 453(7197):944947. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lau KY, Ganguli S, Tang C: Function constrains network architecture and dynamics: A case study on the yeast cell cycle Boolean network. [http://link.aps.org/abstract/PRE/v75/e051907] webcite
Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 2007, 75(5):051907.

Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data.
Nat Genet 2003, 34(2):166176. PubMed Abstract  Publisher Full Text

Balaji S, Madan Babu M, Lakshminarayan M, Luscombe N, Aravind L: Comprehensive Analysis of Combinatorial Regulation using the Transcriptional Regulatory Network of Yeast.
J Mol Biol 2006, 360:213227. PubMed Abstract  Publisher Full Text

Stanford Microarray Database [http://smd.stanford.edu/] webcite

Tal N, Schuldiner S: A coordinated network of transporters with overlapping specificities provides a robust survival strategy.
Proc Natl Acad Sci USA 2009, 106(22):90519056. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Banerjee D, Lelandais G, Shukla S, Mukhopadhyay G, Jacq C, Devaux F, Prasad R: Responses of pathogenic and nonpathogenic yeast species to steroids reveal the functioning and evolution of multidrug resistance transcriptional networks.
Eukaryot Cell 2008, 7:6877. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Crom SL, Devaux F, Marc P, Zhang X, MoyeRowley WS, Jacq C: New insights into the pleiotropic drug resistance network from genomewide characterization of the YRR1 transcription factor regulation system.
Mol Cell Biol 2002, 22(8):26422649. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fardeau V, Lelandais G, Oldfield A, Salin H, Lemoine S, Garcia M, Tanty V, Crom SL, Jacq C, Devaux F: The central role of PDR1 in the foundation of yeast drug resistance.
J Biol Chem 2007, 282(7):50635074. PubMed Abstract  Publisher Full Text

Hikkel I, LucauDanila A, Delaveau T, Marc P, Devaux F, Jacq C: A general strategy to uncover transcription factor properties identifies a new regulator of drug resistance in yeast.
J Biol Chem 2003, 278(13):1142711432. PubMed Abstract  Publisher Full Text

Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, SáCorreia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae.
Nucleic Acids Res 2006, (34 Database):D446D451. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wolf DH, Hilt W: The proteasome: a proteolytic nanomachine of cell regulation and waste disposal.
Biochim Biophys Acta 2004, 1695(13):1931. PubMed Abstract  Publisher Full Text

Mackay DJC: [http://www.inference.phy.cam.ac.uk/mackay/itila/book.html] webcite
Information Theory, Inference & Learning Algorithms. Cambridge University Press; 2002.

Tibshirani R: Regression Shrinkage and Selection via the Lasso.
Journal of the Royal Statistical Society (Series B) 1996, 58:267288.

Yedidia JS, Freeman W, Weiss Y: Generalized belief propagation.
In Advances in Neural Information Processing Systems (NIPS) 13, Denver, CO Edited by press M. 2001, 772778.

Kschischang FR, Frey BJ, Loeliger HA: Factor graphs and the sumproduct algorithm. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=910572] webcite
Information Theory, IEEE Transactions on 2001, 47(2):498519. Publisher Full Text

Kabashima Y: A CDMA multiuser detection algorithm on the basis of belief propagation.
Journal of Physics A: Mathematical and General 2003, 36(43):1111111121. Publisher Full Text

Braunstein A, Zecchina R: Learning by messagepassing in networks of discrete synapses.
Phys Rev Lett 2005, 96:030201. Publisher Full Text

Pagnani A, Tria F, Weigt M: Classification and sparsesignature extraction from geneexpression data. [http://stacks.iop.org/17425468/2009/P05001] webcite
Journal of Statistical Mechanics: Theory and Experiment 2009, 2009(05):P05001.
(22pp)
Publisher Full Text 
Bayati M, Borgs C, Chayes J, Zecchina R: On the exactness of the cavity method for weighted bmatchings on arbitrary graphs and its relation to linear programs. [http://stacks.iop.org/17425468/2008/L06001] webcite
Journal of Statistical Mechanics: Theory and Experiment 2008, 2008(06):L06001.
(10pp)
Publisher Full Text