Centre INRIA Rennes Bretagne Atlantique, IRISA, Rennes, France

Université de Rennes 1, IRISA, Rennes, France

Université de Rennes 1, IRMAR, Rennes, France

CNRS, UMR 6074, IRISA, Rennes, France

Abstract

Background

Expression profiles obtained from multiple perturbation experiments are increasingly used to reconstruct transcriptional regulatory networks, from well studied, simple organisms up to higher eukaryotes. Admittedly, a key ingredient in developing a reconstruction method is its ability to integrate heterogeneous sources of information, as well as to comply with practical observability issues: measurements can be scarce or noisy. In this work, we show how to combine a network of genetic regulations with a set of expression profiles, in order to infer the functional effect of the regulations, as inducer or repressor. Our approach is based on a consistency rule between a network and the signs of variation given by expression arrays.

Results

We evaluate our approach in several settings of increasing complexity. First, we generate artificial expression data on a transcriptional network of

Conclusion

Our approach requires neither accurate expression levels nor time series. Nevertheless, we show on both real and artificial data that a relatively small number of perturbation experiments is enough to determine a significant portion of regulatory effects. This is a key practical asset compared to statistical methods for network reconstruction. We demonstrate that our approach is able to provide accurate predictions, even when the network is incomplete and the data are noisy.

Background

A central problem in molecular genetics is to understand the transcriptional regulation of gene expression. A transcription factor (TF) is a protein that binds to a typical domain on the DNA and influences transcription. The effect of this TF can be either a repression or an activation of transcription depending on the type of binding site, the distance to coding regions, or on the presence of other molecules. Finding which gene is controlled by which TF is a reverse engineering problem, usually named

A first approach to achieve this task is to collect the information spread in the primary literature. Following this idea, a large number of databases that take protein and regulatory interactions from the literature and curate them have been developed

The alternative to a literature-curated approach is a data-driven approach. This approach is supported by the availability of high-throughput experimental data including microarray expression analysis of deletion mutants (single or, more rarely, double non-lethal knockouts), over-expression of TF-encoding genes, protein-protein interactions, protein localisation, or ChIP-chip experiments coupled with promoter sequence analysis. We may cite several classes of methods that use these kinds of data, such as correlation, mutual information or causality studies, Bayesian networks, path analysis, information-theoretic approaches, and ordinary differential equations

In short, most approaches available so far are based on a probabilistic framework which defines a probability distribution over the set of models. The reconstructed network is then defined as the most likely model given the data. Such an optimization problem is usually non-convex, and finding a global optimum cannot be guaranteed in practice. Existing algorithms report a local optimum which should be interpreted with care: errors can appear and no consensus model may be produced.

As an illustration, special attention has been paid to the reconstruction of

In regulatory networks, an important and non-trivial piece of physiological information is the regulatory role of TFs as inducer or repressor, also called

In this paper, we apply formal methods to compute the sign of interactions in networks that have an available topology. By doing so, we also validate the topology of the network. Roughly, we use expression profiles to constrain the possible regulatory roles of TFs, and we report those regulations that are assigned the same role in all feasible models. Thus, we over-approximate the set of

Different sources of large-scale data are exploited in this study: gene expression arrays, which provide information on the interaction signs; and ChIP-chip experiments, which provide the topology of the regulatory network when not available.

The main tasks we address are the following:

1. Building a formal model of regulation for a set of genes that integrates information from ChIP-chip data, sequence analysis, and literature annotations.

2. Checking its consistency with expression profiles on perturbation assays.

3. Inferring the regulatory role of TFs as inducer or repressor if the model is consistent with expression profiles.

4. Isolating ambiguous pieces of information if it is not.

The Results section is organised as follows. We first introduce the mathematical framework which is used to define and to test the consistency between expression profiles and transcriptional networks. Then, we apply our algorithms to address three main issues:

• Analysis of the dependence between the number of available observations and the number of inferred regulations. In the case where all genes are observed, we prove that at most 40.8% of

• Illustration and validation of our method on the transcriptional network of

• Execution of our inference algorithms over the

Results

Detecting the role of a regulation and validating a model

Our goal is to determine the regulatory role of a TF on its target genes by using expression profiles. Let us illustrate our purpose with a simple example.

We suppose that we are given the topology of a network (this topology can be obtained from ChIP-chip data or any computational network inference method). In this network, let us consider a node

Independently, we suppose that we have several gene expression arrays at our disposal. One of these arrays indicates that


**Illustration of the simple inference rule.**

This naive rule is actually used in a large class of models; we will call it the


**A simple case of inconsistency between some data and a model.**

Let us consider now the case when

Our point of view is different; we introduce a basic rule: **any variation of A must be explained by the variation of at least one of its predecessors**. In previous papers, we introduced a formal framework to justify this basic rule under some reasonable assumptions. We also tested the consistency between expression profiles and a graphical model of cellular interactions. This formalism will be introduced here in an informal way; its full justification and extensions can be found in the references

In our example, the basic rule means that if


**Illustration of a prediction.**

A formal approach

Consider a system of

In a typical stress perturbation experiment a system leaves an initial steady state following a change in control parameters. After waiting long enough, the system may reach a new steady state. In genetic perturbation experiments, a gene of the cell is either knocked-out or over-expressed; perturbed cells are then compared to the reference. Our approach relies on the signs of the variations in expression or activity of the species in the network. Let us denote by _{i}) ∈ {+, -, **0**} the sign of the variation of species

Let us fix species _{j}) provides the sign of the

When the experiment is a genetic perturbation, the same equation holds for every node that was not genetically perturbed during the experiment and none of whose predecessors was genetically perturbed. If a predecessor _{M }of the node was knocked out, the equation becomes

The same holds with +_{M }was over-expressed. There is no equation for the genetically perturbed node.

The sign algebra is the set {+, -, **?**, **0**}, provided with a sign consistency relation ≈, and arithmetic operations + and ×. The following tables describe this algebra:
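As a rough illustration, such a sign algebra can be sketched in Python. The rules encoded below are the usual ones of qualitative sign algebra (opposite signs add up to the undetermined sign, zero is neutral for addition and absorbing for multiplication, and the undetermined sign is consistent with everything); they are stated here as an assumption, and the function names `sign_mul`, `sign_add` and `consistent` are hypothetical.

```python
# Signs are encoded as one-character strings:
# '+', '-', '0' (no variation) and '?' (undetermined).

def sign_mul(a, b):
    """Product of two signs."""
    if a == '0' or b == '0':
        return '0'          # zero is absorbing
    if a == '?' or b == '?':
        return '?'          # an undetermined factor stays undetermined
    return '+' if a == b else '-'

def sign_add(a, b):
    """Sum of two signs; opposite signs give the undetermined sign '?'."""
    if a == '0':
        return b            # zero is neutral
    if b == '0':
        return a
    if a == '?' or b == '?':
        return '?'
    return a if a == b else '?'

def consistent(a, b):
    """Sign consistency: equal signs, or at least one side undetermined."""
    return a == b or a == '?' or b == '?'
```

With these assumed rules, `consistent('+', sign_add('+', '-'))` holds, because the sum of opposite influences is undetermined and therefore compatible with any observed variation.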

For a given interaction graph _{i }∈ {+, -, **0**} are

With this material at hand, let us come back to our original problem, namely to infer the regulatory role of TFs from the combination of heterogeneous data. In the following we assume that:

• The interaction graph is either given by a model to be validated, or built from ChIP-chip data and TF binding site search in promoter sequences. Thus, as soon as a TF

• The regulatory role of a TF _{ji}, which is constrained by Eqs. (1) or (2).

• Expression profiles provide the sign of variation of the gene expression for a set of

Our inference problem can now be stated as finding values in {+, -} for _{ji}, subject to the constraints:

Most of the time, this inference problem has a huge number of solutions. However, some variables _{ji }may be assigned the same value in _{ji }is a logical consequence of the constraints (3), and a prediction of the model. We will refer to these inferred interaction signs as _{ji }that have the same value in all solutions of a qualitative system (3). When the inference problem has

Let us illustrate this formulation with a very simple (yet informative) example. Suppose that we have a system of three genes


**Interaction graph of three genes A, B, C, whose changes in expression were observed in six stress perturbation experiments.**

Illustration of the sign inference process

| Experiments used | Qualitative system | Replacing values from experiments | Consistent solutions (_{BA}, _{CA}) | Inferred signs (identical in all solutions) |
|---|---|---|---|---|
| {_{1}} | | (+) ≈ _{BA} × (+) + _{CA} × (+) | (+, +), (+, -), (-, +) | ∅ |
| {_{1}, _{2}} | | (+) ≈ _{BA} × (+) + _{CA} × (+); (+) ≈ _{BA} × (+) + _{CA} × (-) | (+, +), (+, -) | {_{BA} = +} |
| {_{1}, _{2}, _{3}} | | (+) ≈ _{BA} × (+) + _{CA} × (+); (+) ≈ _{BA} × (+) + _{CA} × (-); (-) ≈ _{BA} × (+) + _{CA} × (-) | (+, +) | {_{BA} = +, _{CA} = +} |

In this example the variables are only the roles of regulations (signs) in the interaction graph. Variations of the species in the graph are obtained from six experiments. Using different sets of experiments we infer different roles of regulation. Using experiments {_{1}, _{2}, _{3}}, for example, our qualitative system will have three constraints and not all valuations of variables _{BA}_{CA}
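The three-gene example can be reproduced by brute-force enumeration. The sketch below is only an illustration under stated assumptions: the sign algebra is the usual qualitative one, the experiment labels `e1`, `e2`, `e3` and all function names are hypothetical, and exhaustive enumeration is used instead of the decision-diagram techniques described in the Methods section, so it does not scale beyond small motifs.

```python
from itertools import product

def sign_mul(a, b):
    if a == '0' or b == '0':
        return '0'
    if a == '?' or b == '?':
        return '?'
    return '+' if a == b else '-'

def sign_add(a, b):
    if a == '0':
        return b
    if b == '0':
        return a
    if a == '?' or b == '?':
        return '?'
    return a if a == b else '?'

def consistent(a, b):
    return a == b or a == '?' or b == '?'

# Observed signs of variation of (A, B, C) in each experiment,
# read from the qualitative systems listed above.
EXPERIMENTS = {
    "e1": ('+', '+', '+'),
    "e2": ('+', '+', '-'),
    "e3": ('-', '+', '-'),
}

def solutions(experiment_ids):
    """All sign pairs (B->A, C->A) consistent with every listed experiment."""
    sols = []
    for s_ba, s_ca in product('+-', repeat=2):
        if all(
            consistent(a, sign_add(sign_mul(s_ba, b), sign_mul(s_ca, c)))
            for a, b, c in (EXPERIMENTS[e] for e in experiment_ids)
        ):
            sols.append((s_ba, s_ca))
    return sols

def inferred(sols):
    """Edge signs that take the same value in all solutions."""
    names = ("BA", "CA")
    return {
        name: sols[0][i]
        for i, name in enumerate(names)
        if sols and len({s[i] for s in sols}) == 1
    }
```

Calling `inferred(solutions(["e1", "e2"]))` fixes only the sign of the regulation from B to A, while adding `"e3"` fixes both signs, in agreement with the example.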

Algorithmic procedure

When the signs on edges of the interaction graph are known (_{ji}), finding consistent node variations _{i }is an NP-complete problem _{i}), finding the signs of edges _{ji }from _{i }can be proven NP-complete in a very similar way. However, we have been able to design algorithms that perform efficiently on a wide class of regulatory networks. These algorithms predict signs of the edges when the network topology and the expression profiles are consistent. In case of inconsistency, though, they identify ambiguous motifs and propose predictions on parts of the network that are not affected by ambiguities.

The general process flow is as follows (see the Methods section for details):

**Step 1 **Sign Inference

Divide the graph into motifs (each node with its predecessors). For each motif, find sign valuations (see Algorithm 1 in the Appendix section) that are consistent with all expression profiles. If there are no solutions, call the motif

Solve again the remaining equations and determine the edge signs that are fixed to the same value in all the solutions. These fixed signs are called

**Step 2 **Global test/correction of the inferred signs

Solutions at the previous step are not guaranteed to be global. Indeed, two node motifs at step 1 can be consistent separately, but not altogether (with respect to all expression profiles). This step checks global consistency by solving the equations for each expression profile. New

**Step 3 **Extending the original set of observations

Once all conflicts have been removed, we get a set of solutions in which signs are assigned to both nodes and edges.

**Step 4 **Filtering predictions

In the inconsistent case, the validity of the predictions depends on the accuracy of the model and on the correct identification of the MBMs. The model can be incomplete (missing interactions), and MBMs are not always identifiable in a unique way. Thus, it is useful to sort predictions according to their reliability. Our filtering parameter is a positive integer

The inference process then generates three results:

1.

• Modules of Type I are composed of several direct regulations towards the same gene. They are detected in Step 1 of the algorithm, and most of them are composed of only one edge, as illustrated in Fig.


**Classification of the Multiple Behaviours Modules (MBM) found in S. cerevisiae transcriptional network. **Green and red interactions correspond to inferred activations and repressions respectively. Significant differentially expressed genes of the MBM, during one experimental condition, are coloured green (up-regulated), or red (down-regulated) (a)

• Modules of Type II, III, IV are detected in Steps 2 or 3; hence they contain either direct regulations coming from the same protein or indirect regulations and/or loops. Each of these regulations represents a consensus of all the experiments, but when we attempt to assess them globally, they lead to contradictions. The indices II-IV have no topological meaning; they label the most frequent situations and are illustrated in Fig.

2.

3.

On a computational level, the division between Step 1 (which considers each small motif with all profiles together) and Step 2 (which considers the whole network with each profile separately) is necessary to overcome the memory complexity of the search for solutions. To handle large scale systems we combine decision diagrams and constraint solvers (see details in the Methods section).

Since our basic rule is a weak constraint, we expect it to produce very robust predictions. On the other hand, there are theoretical limits to this approach. For certain interaction graphs, not a single sign may be inferred even with a high number of experiments. In the next paragraphs, we comment on the maximum number of signs that can be inferred from a given graph.

In perturbation experiments, gene responses are observed following changes of external conditions (temperature, nutritional stress,

In the following paragraphs we describe the results we obtained. First of all, in order to validate our formal approach, we evaluated the percentage of the

On a computational level, we checked that our algorithms were able to handle large scale data, as produced by high-throughput measurement techniques (expression arrays, ChIP-chip data). This is demonstrated in the following by considering networks of thousands of genes.

Stress perturbation experiments: how many do you need?

For any given network topology, even when considering all possible experimental profiles, there are signs that cannot be determined (see Table

In order to calculate the theoretical and the average percentages of recovered signs for the transcriptional network of

From the unsigned interaction graph of _{i}}_{i = 1,...,n }that are not entirely random, for they are constrained by Eqs.(1) and (2). Then, we forget the signs of the network edges and compute the qualitative system with the signs of regulations as unknown.

The _{max }= 1551 edges.

However, this maximum can be obtained only if all conceivable (more than 2^{50}) perturbation experiments are done, which is in practice not possible. We performed computations to understand the influence of the number of experiments (


**(Both) Statistics of the sign-inference process on the regulatory network of E. coli from complete expression profiles.** The signed interaction graph is used to generate sets of

We can obtain a theoretical formula explaining the saturation aspect of the curve in Fig. _{1 }single incoming regulations. These can be inferred with probability one from only one experiment, using the naive algorithm (see Algorithm 1). Let us suppose a second category of interactions, whose signs are inferred with probability _{1 }+ _{2}, where _{2 }is the number of interactions in the second category. Supposing now that inference failures are independent for different experiments, we obtain the average number of inferred signs for _{1 }+ _{2}(1 - (1 - ^{N}). In general, we have _{1 }+ _{2 }<

In our example, the value _{1 }= 609 corresponds to the average number of signs inferred by the naive algorithm. Surprisingly, by using our method we can significantly improve the naive inference with little effort. For the whole

According to our estimates the position of the plateau is _{1 }+ _{2 }= 1420, which is smaller than the theoretical maximum _{max}. The difference, although negligible in practice (to obtain _{max }one has to perform ^{50 }experiments), suggests that the plateau has a very weak slope. This means that contributions of different experiments to sign inference are weakly dependent.
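The saturation behaviour can be illustrated numerically. The sketch below assumes the two-category model described above: a first category of signs obtained from a single experiment, and a second category whose signs are each obtained with a fixed probability per experiment, independently across experiments. The names `n_first`, `n_second` and `p` are hypothetical, as is the value chosen for `p`; only the values 609 and 1420 are taken from the text.

```python
def expected_inferred(n_experiments, n_first, n_second, p):
    """Average number of inferred signs after n_experiments >= 1 experiments.

    n_first  : signs inferred with probability one from a single experiment
    n_second : signs each inferred with probability p per experiment
    """
    return n_first + n_second * (1.0 - (1.0 - p) ** n_experiments)

N_FIRST = 609                   # signs inferred by the naive algorithm (from the text)
PLATEAU = 1420                  # estimated position of the plateau (from the text)
N_SECOND = PLATEAU - N_FIRST    # size of the second category under this model

P = 0.05  # hypothetical per-experiment probability, for illustration only

# Expected number of inferred signs for a growing number of experiments.
curve = [(n, expected_inferred(n, N_FIRST, N_SECOND, P)) for n in (1, 10, 50, 200)]
```

Under this model the curve rises quickly at first and then flattens as it approaches the plateau, since each additional experiment only contributes signs that all previous experiments failed to infer.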

The values of _{1}_{2}, _{1},_{2 }mean small number of expression profiles needed for inference.

Inferring the core of the network

Obviously, not all interactions play the same role in the network. The


**Core of E. coli network.** It consists of all oriented loops and of all oriented chains leading to loops. The core contains the dynamical information of the network, hence edge signs are more difficult to infer.
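One possible way to extract such a core from a directed graph is to prune nodes without successors repeatedly: a node survives exactly when some oriented chain starting from it reaches a loop. The sketch below assumes the characterisation given in the caption (all oriented loops and all oriented chains leading to loops); the function name `core_nodes` and the edge-list representation are hypothetical.

```python
def core_nodes(edges):
    """Nodes lying on an oriented loop or on an oriented chain leading to one.

    `edges` is an iterable of (source, target) pairs. Nodes that have no
    successor among the surviving nodes are removed until none is left to
    remove; every remaining node can then still reach a loop.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    alive = set(succ)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if not (succ[node] & alive):
                alive.discard(node)
                changed = True
    return alive
```

For instance, with the edges a→b, b→a, c→a, b→d and d→e, the nodes a and b form a loop and c leads to it, so the core is {a, b, c}; the branch through d and e is pruned.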

As in the previous section, we applied the same inference process to this graph. Not surprisingly, we noticed a rather different behaviour when inferring signs on the core graph than on the whole graph, as demonstrated in Fig.

Two observations can be made. First, a greater number of experiments is required to reach a comparable percentage of inference; the value of

This suggests not only that the core of the network is more difficult to infer, but also that a brute force approach (multiplying the number of experiments) may fail as well. This situation encourages us to apply experiment design and planning, that is, computational methods to minimise the number of perturbation experiments while inferring a maximal number of regulatory roles.

This also illustrates why our approach is complementary to dynamical modelling. In the case of large scale networks, when an interaction stands outside the core of the graph, an inference approach is suitable for inferring the sign of the interaction. However, when an interaction belongs to the core of the network, more complex behaviours occur (

Influence of missing data

In the previous paragraphs, we assumed that all products in the network were observed. That is, in each experiment each node is assigned a value in {+, **0**, -}. However, in real measurements, such as expression profiles, some of the values are discarded for technical reasons. A practical method for network inference should cope with missing data.

We studied the impact of missing values on the percentage of inference. For this, we have considered a fixed number of expression profiles (


**(All) Statistics of the sign-inference process on the regulatory network of E. coli from partial expression profiles.** The setting is similar to the one used in Fig. 6, except for the cardinal of the expression profiles (

In both cases (whole network and core), the dependency between the average percentage of inference and the percentage of missing values is qualitatively linear. Simple arguments allow us to find an analytic dependency. If not observing one node of the network implies losing information on _{i }= _{total}; where _{total }is the total number of nodes. In order to keep _{i }non negative,

Application to

We validated our method on the transcriptional

Several profiles were available, including a reference condition. We grouped together the different profiles corresponding to the same experiment; for each gene we calculated its average variation in the group of profiles. When profiles were time series, we considered that each time series ends with a steady state and we used the last state in the time series. Then, we sorted the measured genes into four classes: 2-fold up-regulated, 2-fold down-regulated, non-observed, and zero variation; this last class corresponds to genes whose expression did not vary significantly (2-fold). Only the first two classes were used in the algorithm. Therefore, there will be missing data: for some edges, neither the input nor the output is observed. Altogether, we have processed 226 sets of expression profiles corresponding to 61 different experiments (over-expression, gene-deletion, and stress perturbation). We verified, for all the experiments, that they correspond to the comparison between one perturbed condition and a control condition with identical levels in all chemical components except for the one altered in the perturbed condition.
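The classification of measured genes can be sketched as follows. This is a minimal illustration, assuming that each gene is summarised by a single perturbed-to-control expression ratio (for instance the average over the group of profiles); the function names, the dictionary layout and the gene identifiers in the usage note are hypothetical.

```python
import math

def variation_sign(ratio, fold=2.0):
    """Classify a perturbed/control expression ratio.

    Returns '+' (at least `fold`-fold up), '-' (at least `fold`-fold down),
    '0' (no significant variation) or None (gene not observed).
    """
    if ratio is None or ratio <= 0 or math.isnan(ratio):
        return None
    if ratio >= fold:
        return '+'
    if ratio <= 1.0 / fold:
        return '-'
    return '0'

def observations(profile, fold=2.0):
    """Keep only the genes classified '+' or '-', the two classes used for inference."""
    signs = {gene: variation_sign(ratio, fold) for gene, ratio in profile.items()}
    return {gene: s for gene, s in signs.items() if s in ('+', '-')}
```

For example, `observations({"g1": 4.0, "g2": 0.25, "g3": 1.1, "g4": None})` keeps only `g1` (up-regulated) and `g2` (down-regulated); the other two genes are treated as missing data.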

We applied our inference algorithm twice. The first time, we used the signed network in a pre-processing step, in order to clean the expression data. It appears that the signed network is consistent with only 31 of the 61 selected experiments. After discarding the inconsistent motifs from each experiment (deleting observations that caused conflicts), we were left with 61 experiments which only contained the data consistent with the signed network. In these 61 experiments, on average 12.62% of the network nodes were observed. When summing up all the observations, we found that 6.5% (190) of the edges (input and output) were observed in at least one expression profile; these represent the maximal set of signs that can be inferred at Steps 1 and 2 of our inference algorithm. In order to test our algorithm we wiped out the information on edge signs and then tried to recover it. Since the profiles and network were consistent, our algorithm found no ambiguity and predicted 38 signs,

Afterwards, we tested our algorithm with the full set of observations, no data being discarded. Conflicts appeared and we filtered our inference with different parameters on the full set of 61 experiments including inconsistencies. This time 12.9% of the network products were observed on average. When summing all the observations, 17.2% (497) of the edges (input and output) were observed in at least one expression profile. Several values of the filtering parameter


**Results of the inference algorithm applied to E. coli network with a compendium of 61 experiments not globally coherent. **The dark and light regions of the bars correspond to false positive and validated predictions, respectively. Without filtering, there are 28.3% of false positives. With filtering – keeping only the sign predictions confirmed by

It should be noted that we obtained very similar results whether by cleaning the data using the signed network or by using our filtering procedure. This is a clear indication that this filtering procedure is an effective strategy to produce robust predictions.

Our algorithm also detected ambiguous modules in the network. There are seven MBM of Type I (


**Interactions in the regulatory network of E. coli that are ambiguous with a compendium of expression profiles.** For each interaction, there exist at least two expression profiles that do not predict the same sign on the interaction. Dotted and filled lines represent the MBM of Type I and Type II, respectively.

A real case: inference of signs in

We applied our inference algorithm to the transcriptional regulatory network of the budding yeast

(A) The first network consists of the core of the transcriptional ChIP-chip regulatory network produced in

(B) The second network contains all the transcriptional interactions between TFs shown by

(C) The third network is the set of interactions among TFs as inferred in

(D) The last network contains all the transcriptional interactions among genes and regulators shown by

Inference process with gene-deletion expression profiles

We first applied our inference algorithm to the large scale network (D) using a panel of expression profiles for 210 gene-deletion experiments

We validated our prediction with a literature-curated network on Yeast

Gene-deletion expression profiles were used in order to compare our results to path analysis methods

First, we tested the consistency between the network inferred by path analysis methods and the 210 gene-deletion experiments. We found that the network was inconsistent with 28 of the 210 experiments. Second, we compared the inference results of both methods: the path analysis method inferred 234 roles, located on widely connected paths, whereas our method inferred 162 roles, mainly localised in the branches of the network. The two sets of results intersected on 17 interactions, with no contradiction in the inferred roles. An illustration of these results is given in the Supplementary Web site.

This suggests that our approach is complementary to path analysis methods. Our explanation is as follows: in

Inference with stress perturbation expression profiles

To overcome the problem exposed using the small amount of information contained in

List of genome expression experiments on

| Experiment Identifier | Description | Ref. |
|---|---|---|
| E1 | Diauxic Shift | [40] |
| E2 | Sporulation | [41] |
| E3 | Expression analysis of Snf2 mutant | [42] |
| E4 | Expression analysis of Swi1 mutant | [42] |
| E5 | Pho metabolism | [43] |
| E6 | Nitrogen Depletion | [44] |
| E7 | Stationary Phase | [44] |
| E8 | Heat Shock from 21°C to 37°C | [44] |
| E9 | Heat Shock from 17°C to 37°C | [44] |
| E10 | Wild type response to DNA-damaging agents | [45] |
| E11 | Mec1 mutant response to DNA-damaging agents | [45] |
| E12 | Glycosylation defects on gene expression | [46] |
| E13 | Cells grown to early log-phase in YPE (Rich medium with 2% of Ethanol) | [47] |
| E14 | Cells grown to early log-phase in YPG (Rich medium with 2% of Glycerol) | [47] |
| E15 | Titratable promoter alleles – Ero1 mutant | [48] |

All experiments contain information on steady state shift and their curated data is available in SGD (Saccharomyces Genome Database) [32].

As in the case of

We obtained our total inference rate by adding the number of inferred signs fixed in a unique way to the number of non-repeated interactions in the MBM detected, and dividing it by the total number of edges in the network. In Table

Results of the sign inference process on

| Interaction network | Nodes | Edges | Average observed nodes | In/Out observed simultaneously | Inferred signs {+, -} | MBM Type I | MBM Type II-IV | Total inference | Naive algorithm inference |
|---|---|---|---|---|---|---|---|---|---|
| (A) Core of Transc. Network [11,28] | 31 | 52 | 28% | 88% | 11 | 3 | 0 | 26.8% | 11% |
| (B) Extended Transc. Network [11] | 70 | 96 | 26% | 72% | 29 | 7 | 0 | 37.4% | 15.6% |
| (C) MacIsaac inferred network [12,13] | 83 | 131 | 33% | 69% | 21 | 4 | 0 | 19% | 11% |
| (D) Global Transc. Network [11] | 2419 | 4344 | 30% | 52% | no filter: 631; filter k = 3: 198 | 281 | 463 | 32% | 13.9% |

Sign inference process applied on four transcriptional networks of
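The total inference rate can be recomputed from the tabulated counts. The sketch below assumes that the MBM columns count non-repeated interactions and that, for network (D), the unfiltered number of inferred signs is the one used; both are assumptions, and the function and variable names are hypothetical.

```python
def total_inference_rate(inferred_signs, mbm_interactions, edges):
    """Percentage of edges that are either uniquely signed or part of a detected MBM."""
    return 100.0 * (inferred_signs + mbm_interactions) / edges

# (inferred signs, MBM Type I + MBM Type II-IV, edges), as read from the table
NETWORKS = {
    "A": (11, 3 + 0, 52),
    "B": (29, 7 + 0, 96),
    "C": (21, 4 + 0, 131),
    "D": (631, 281 + 463, 4344),  # assumed: unfiltered count of inferred signs
}

RATES = {name: total_inference_rate(*counts) for name, counts in NETWORKS.items()}
```

Under these assumptions the recomputed rates agree with the reported total inference percentages to within half a percentage point.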

We validated the inferred interactions comparing them to the literature-curated network published in

**S. cerevisiae transcriptional network.** Only interactions among transcription factors were taken into account (70 nodes, 96 edges) [11]. A total of 29 interactions were inferred. Green and red arrows correspond to inferred activations and repressions, respectively. Blue arrows correspond to the detected MBM of Type I. The diagram layout was produced using the Cytoscape package [39].

As already mentioned, the algorithm identified a large number of ambiguities. The exhaustive list of MBM is given in the Supplementary Web site and the Type I modules of size 2 found for the networks (A), (B), and (C) are detailed in Table

Ambiguous modules of Type I found for 3 transcriptional networks of

| Interaction network | Actor | Target | Experiment 1 | Experiment 2 |
|---|---|---|---|---|
| (A) Core of Transc. Network | YAP6 | CIN5 | Expression during Sporulation [41] | YPD Broth to Stationary Phase [44] |
| | GRF10 | MBP1 | YPD Broth to Stationary Phase [44] | Mec1 mutant + Heat [45] |
| | PDH1 | MSN4 | Nitrogen Depletion [44] | Heat shock 21°C to 37°C [44] |
| (B) Extended Transc. Network | YAP6 | CIN5 | Expression during Sporulation [41] | YPD Broth to Stationary Phase [44] |
| | RAP1 | SIP4 | Expression during Sporulation [41] | Expression during the diauxic shift [40] |
| | SKN7 | NRG1 | YPD Broth to Stationary Phase [44] | Expression during the diauxic shift [40] |
| | PHD1 | SOK2 | Heat shock 21°C to 37°C [44] | YPD Broth to Stationary Phase [44] |
| | RAP1 | RCS1 | Wild type + Heat [45] | Transition from fermentative to glycerol-based respiratory growth [47] |
| | PHD1 | MSN4 | Nitrogen Depletion [44] | Heat shock 21°C to 37°C [44] |
| | HAP4 | PUT3 | Expression during the diauxic shift [40] | Snf2 mutant, YPD [42] |
| (C) MacIsaac inferred network | SWI5 | ASH1 | Expression regulated by the PHO pathway [43] | YPD Broth to Stationary Phase [44] |
| | SKN7 | NRG1 | YPD Broth to Stationary Phase [44] | Nitrogen Depletion [44] |
| | NRG1 | YAP7 | Expression regulated by the PHO pathway [43] | Transition from fermentative to glycerol-based respiratory growth [47] |
| | NRG1 | GAT3 | Glycosylation [46] | Transition from fermentative to glycerol-based respiratory growth [47] |

For each ambiguous module, we list two inconsistent experiments that infer a different role of regulation.

Contribution of expression profiles to the inference

Analysing only the sign inference process on the global network (D), we wish to estimate how the 14 experiments used influence the signs in {+, -} that are inferred in a unique way. To that end, we address the following question: assuming that all the roles inferred in Step 1 of our inference algorithm are correct, which experiment marks the most inferred roles as inconsistent (

Therefore, we classified the 14 experiments according to the MBM of Type II-IV generated per experiment. MBM of Type I are not included in this computation, for they are inferred in Step 1 of the algorithm. The results of this classification are shown in Fig.


**Classification of the 14 experiments used in the sign-inference process for the global transcriptional network (2419 nodes, 4344 edges).** The experiments are represented by their identifier (see Table 2). Each experiment has a twofold contribution: it spots inconsistent modules (MBM that are further excluded from inference) and it predicts interaction roles. Some experiments have more predictive power, just because they include more genes. In order to normalise the predictive power, we divided the percentage of predictions by the percentage of observed nodes. For each experiment we have estimated: (A) Number of significant (2-fold) up/down-regulated genes. (B) Percentage of edges in the spotted MBMs of type II-IV divided by the percentage of observed genes. (C) Percentage of inferred signs divided by the percentage of observed genes. (D) Real contribution of each experiment, calculated by subtracting B (eliminated inconsistency) from C (inference); negative values correspond to experiments whose main role is to spot ambiguities.

Discussion

Predicting from a "small" number of expression profiles

In principle, inferring the functional effect of regulations could be done using general reconstruction methods. The most outstanding approaches in this domain include Bayesian networks

Generating accurate predictions

The problem of inferring functional effect of transcription factors was specifically addressed by Yeang and colleagues

Sign inference and network topology

Using simulations, we evaluated the dependence between the number of available expression profiles and the number of signs that can be inferred from them. Not surprisingly, we noticed that the topology of the regulatory network has a strong influence on the estimated relationship. This was illustrated by computing statistics on both a complete regulatory network and its core. The complete network is characterised by an over-representation of feedback-free regulatory cascades, which are controlled by a small number of TFs. In this setting, the number of inferred signs grows almost continuously with the number of observations. In contrast, the core network does not obey the simple law "the more you observe, the better", some expression profiles being clearly more informative than others. Additionally, in these core networks an unfeasible number of experiments is necessary to infer a small number of signs with high probability. For these core networks, two different strategies may be adopted. First, to build a more accurate model for these restricted subnetworks using dynamic modelling techniques (see

Conclusion

In this work we proposed a discrete approach to a particular case of the reconstruction problem: given a set of regulations between genes and a set of expression profiles, determine the functional effect of each regulation, as activation or inhibition. Our approach is based on a qualitative modelling framework that was initially introduced to check the consistency between a regulatory network and expression data

While intuitive and simple, the qualitative rule we propose can be used to infer a significant number of regulatory effects from a reasonable number of expression profiles. As shown using data on

From our results on yeast, it appears that a significant proportion of the network, as given by ChIP-chip data, is not compatible with the available expression profiles. As explained in the Results section, these data are discarded from the analysis in order to compute safe predictions, but at the expense of a loss of information. The subject of our current work is to develop an improved notion of prediction that copes better with inconsistent networks and data. The goal is to include inconsistent data in the inference process while preserving the reliability of the predictions.

Methods

Problem statement

We consider the set of equations derived from a given interaction graph

where _{ji }the sign of the influence of species **0**).

A single equation in the system (4) can be viewed as a predicate _{i,k}(_{i, k}(^{k},

Our problem can now be stated as follows: given a set of expression profiles ^{1},...,^{r}, decide if the predicate:

can be satisfied. If so, find all variables that take the same value in all admissible valuations (the so-called hard components).
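To make the problem statement concrete, the following sketch enumerates every sign assignment of a tiny network and keeps the signs that are shared by all admissible valuations. It is a brute-force illustration only, not the decision-diagram or ASP procedures described below. It assumes that every node is observed with a non-null variation, and it assumes that the qualitative rule reads "the variation of a node must agree with the sign-algebra sum of its signed inputs, an indeterminate sum being compatible with any variation". The function names and the toy network are hypothetical.

```python
from itertools import product

def qualitative_sum(terms):
    """Sum in the sign algebra: 1, -1, 0, or '?' when both signs occur."""
    pos = any(t > 0 for t in terms)
    neg = any(t < 0 for t in terms)
    if pos and neg:
        return "?"
    return 1 if pos else (-1 if neg else 0)

def consistent(edges, signs, profile):
    """True if every regulated node agrees with the sum of its signed inputs."""
    preds = {}
    for (j, i) in edges:
        preds.setdefault(i, []).append(j)
    for i, sources in preds.items():
        total = qualitative_sum([signs[(j, i)] * profile[j] for j in sources])
        if total != "?" and total != profile[i]:
            return False
    return True

def hard_components(edges, profiles):
    """Signs shared by all admissible valuations, or None if unsatisfiable."""
    admissible = []
    for values in product((1, -1), repeat=len(edges)):
        signs = dict(zip(edges, values))
        if all(consistent(edges, signs, p) for p in profiles):
            admissible.append(signs)
    if not admissible:
        return None
    first = admissible[0]
    return {e: first[e] for e in edges
            if all(s[e] == first[e] for s in admissible)}

# Hypothetical toy network: A and B both regulate C, two observed profiles.
edges = [("A", "C"), ("B", "C")]
profiles = [{"A": 1, "B": 1, "C": -1}, {"A": 1, "B": -1, "C": 1}]
print(hard_components(edges, profiles))
```

On this toy input, two valuations remain admissible and they differ on the edge from A to C, so only the sign of the edge from B to C (a repression) is a hard component. The enumeration is exponential in the number of edges, which is why it cannot be applied to networks of realistic size.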

Decision diagram encoding

In a previous work

In order to cope with the size of the problem, we propose to investigate a particular case, when all species are observed, in all experiments. In this case, _{i, k}(^{k}] and _{j, k}(^{k}] share no variables. This means that

may be satisfied. As a consequence, a variable _{ji }is a hard component of _{i,.. }_{i,. }correspond to the constraints which relate species _{i,. }is exactly the in-degree of species
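The decomposition described above can be sketched as follows: when all species are observed in all experiments, the constraints attached to a node only involve the signs of its incoming edges, so each node can be solved on its own. The code below is a minimal illustration under the same assumed form of the qualitative rule (agreement with the sign-algebra sum of the signed inputs, indeterminate sums always accepted). The names `node_hard_components` and `all_hard_components` and the example network are hypothetical.

```python
from itertools import product

def node_hard_components(preds, profiles, i):
    """Hard components among the edges j -> i, or None if node i is inconsistent."""
    sources = preds[i]
    admissible = []
    for values in product((1, -1), repeat=len(sources)):
        ok = True
        for x in profiles:
            terms = [s * x[j] for s, j in zip(values, sources)]
            pos = any(t > 0 for t in terms)
            neg = any(t < 0 for t in terms)
            if pos and neg:
                continue  # indeterminate sum: compatible with any variation
            total = 1 if pos else (-1 if neg else 0)
            if total != x[i]:
                ok = False
                break
        if ok:
            admissible.append(values)
    if not admissible:
        return None
    first = admissible[0]
    return {(j, i): first[k] for k, j in enumerate(sources)
            if all(a[k] == first[k] for a in admissible)}

def all_hard_components(preds, profiles):
    """Union of the per-node results; None as soon as one node is inconsistent."""
    hard = {}
    for i in preds:
        local = node_hard_components(preds, profiles, i)
        if local is None:
            return None
        hard.update(local)
    return hard

# Hypothetical network: A -> C, B -> C, C -> D, with all nodes observed.
preds = {"C": ["A", "B"], "D": ["C"]}
profiles = [{"A": 1, "B": 1, "C": -1, "D": 1},
            {"A": 1, "B": -1, "C": 1, "D": -1}]
print(all_hard_components(preds, profiles))
```

The cost of each local search is exponential in the in-degree of the node only, instead of the total number of edges. As the text notes, this separation is no longer exact once some species are unobserved.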

As soon as some species are not observed in some experiment, the predicates _{i,. }share some variables and it is not guaranteed to find all hard components by studying them separately. A brief investigation showed (data not shown) that due to the topology of the graph, most of the equations are not independent any more, even with few missing nodes. Note however, that any hard component of _{i,. }is still a hard component of

where _{.,k }corresponds to the constraints that relate all species in

In practice, this algorithm is very effective in terms of computation time and number of hard components found. However, as already stated, it is not guaranteed to find all hard components of

Solving with Answer Set Programming

In order to solve large qualitative systems, we also tried to encode the problem as a logic program, in the setting of answer set programming (ASP). While decision diagrams represent the set of

We use clasp for solving ASP programs

To sum up, in order to solve a system of qualitative equations (4) with only partial observations, we first use Algorithm 2 and thus determine most (if not all) hard components. Algorithm 3 is then used for the remaining components, nearly all of which are not hard.

Reduction technique

As mentioned in the Results section, interaction graphs may be reduced in a way that preserves the satisfiability of the associated qualitative system. Consider a graph

The core of an interaction graph corresponds to the most difficult part to solve, because extending a solution for the core to the entire graph can be done in polynomial time, using a breadth-first traversal.
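As an illustration of how such a core might be computed, the sketch below repeatedly prunes nodes that regulate nothing. This pruning rule is an assumption made for the example (the exact reduction is not reproduced here), chosen because a node without successors imposes no constraint on the rest of the graph and its own equation can be solved afterwards. The function name `core` and the example graph are hypothetical.

```python
def core(nodes, edges):
    """Iteratively remove nodes without successors (assumed reduction rule).

    nodes: iterable of node names; edges: iterable of (source, target) pairs.
    Returns the remaining (nodes, edges) as two sets.
    """
    nodes = set(nodes)
    edges = set(edges)
    while True:
        with_successor = {j for (j, _) in edges}
        leaves = nodes - with_successor
        if not leaves:
            return nodes, edges
        nodes -= leaves
        # Leaves have no outgoing edges, so only their incoming edges disappear.
        edges = {(j, i) for (j, i) in edges if i not in leaves}

# Hypothetical graph: a feedback loop A <-> B feeding a cascade B -> C -> D.
print(core("ABCD", [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")]))
```

Under this assumed rule, a feedback-free cascade is pruned entirely and only the part of the graph that feeds back on itself remains.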

Diagnosis for noisy data

When working with real-life data, it may happen that the predicate

• a reported expression value is wrong

• an arrow (or more generally a subgraph) is missing

• the sign on an edge depends on the state of the system

In the third case, the conditions for deriving Eq. (1) are not fulfilled for one node and its qualitative equation should be discarded. This, however, does not affect the validity of the remaining equations.

In all cases, isolating the cause of the problem is a hard task. We propose the following diagnosis approach: as _{i,. ,. }predicates, the result might directly be interpreted and visualised as a subgraph of the original model.

How to determine if a sign can be inferred

In the Results section, we have seen some examples showing that even when all feasible observations are available, it might not be possible to infer all signs in the interaction graph. Whether or not a sign can be inferred depends on the topology of the graph, and also on the actual signs of the interactions. In practice, it is thus impossible to tell from the unsigned graph alone whether a sign can be recovered. However, it is still interesting to evaluate, on fully signed interaction networks, which part can be inferred. A trivial algorithm consists in explicitly generating all feasible observations and applying the algorithms described above. This is infeasible in practice because of the number of observations.
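The trivial algorithm mentioned above can be sketched on a very small, fully signed network. The code generates every observation that is feasible under the true signs, then keeps the signs that are forced by the whole set of observations. It is a sketch under stated assumptions: variations are restricted to +1 and -1 (no null variation), the qualitative rule is assumed to be agreement with the sign-algebra sum of the signed inputs, and the names `agrees` and `inferable_signs` as well as the example network are hypothetical.

```python
from itertools import product

def agrees(preds, signs, x):
    """True if observation x satisfies the assumed rule at every regulated node."""
    for i, sources in preds.items():
        terms = [signs[(j, i)] * x[j] for j in sources]
        pos = any(t > 0 for t in terms)
        neg = any(t < 0 for t in terms)
        if pos and neg:
            continue  # indeterminate sum: any variation of node i is accepted
        if x[i] != (1 if pos else -1):
            return False
    return True

def inferable_signs(nodes, true_signs):
    """Signs recoverable from all observations feasible under true_signs."""
    edges = list(true_signs)
    preds = {}
    for (j, i) in edges:
        preds.setdefault(i, []).append(j)
    # Step 1: all feasible observations (exponential in the number of nodes).
    feasible = [x for x in (dict(zip(nodes, v))
                            for v in product((1, -1), repeat=len(nodes)))
                if agrees(preds, true_signs, x)]
    # Step 2: all sign assignments compatible with every feasible observation.
    admissible = [s for s in (dict(zip(edges, v))
                              for v in product((1, -1), repeat=len(edges)))
                  if all(agrees(preds, s, x) for x in feasible)]
    # Step 3: a sign is inferable if no admissible assignment contradicts it.
    return {e: true_signs[e] for e in edges
            if all(s[e] == true_signs[e] for s in admissible)}

# Hypothetical feed-forward loop with three activations.
signs = {("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 1}
print(inferable_signs(["A", "B", "C"], signs))
```

In this feed-forward example only the edge from A to B is recovered: A and B always vary together, so the two inputs of C cannot be told apart, whatever the number of observations.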

With the notations introduced above, consider an observation _{i}(

Then, the constraint that we can derive on _{i}(

_{i}(_{i}(_{i}(

Finally, the hard components of _{i }are exactly the signs that can be inferred using

1. compute _{1 ≤ i ≤ n }_{i}(

2. compute _{i }from

3. compute _{i}, the constraints of signs given all feasible observations

4. compute the hard components of _{i}, which are exactly the signs that can be inferred.

If it is not possible to compute

Authors' contributions

PV participated in designing the algorithms described in the Methods section and in performing the simulations. CG designed the algorithms described in the Results section, and performed the analysis on

Appendix

**Algorithm 1**

Naive Inference algorithm

**Algorithm:** Naive Inference algorithm

**Input:**

a network with its topology

a set of expression profiles

**Output:**

a set of predicted signs

a set of ambiguous interactions

**For all **Node

**if ****then return**

**if **

**then return **Ambiguous arrow B

**Algorithm 2**

Heuristic for finding hard components in large interaction networks with many expression profiles.

**Input:**

the predicates _{i,. }and _{.,k }for all

observed variations

**Output:**

a set

**while ****do**

_{i }_{i,. }[^{k},

**if ****then return **

_{k }_{.,k }[^{k},

**if ****then return **

**end**

**Algorithm 3**

Exact algorithm for finding the set of hard components of

**Algorithm:** Hard components using ASP

**Input:**

the predicates

observed variations

**Output:**

a set

_{ji}|

**if ****then return **⊥

**while ****do**

choose

**if ****then**

_{V})} ∪

**else**

delete from _{W}

**end**

**end**
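The loop of Algorithm 3 can be read as follows: compute one solution, then test each remaining candidate variable by asking for a solution in which its value is flipped. If no such solution exists, the variable is a hard component; otherwise the new solution eliminates every candidate on which it disagrees. The sketch below follows this reading, which is an interpretation of the pseudocode above. The function `solve` is a brute-force stand-in for a call to an ASP solver, not the clasp interface, and all names are hypothetical.

```python
from itertools import product

def solve(variables, constraint, forced=None):
    """Brute-force stand-in for a solver call: one satisfying valuation or None."""
    forced = forced or {}
    free = [v for v in variables if v not in forced]
    for values in product((1, -1), repeat=len(free)):
        valuation = dict(forced)
        valuation.update(zip(free, values))
        if constraint(valuation):
            return valuation
    return None

def hard_components(variables, constraint):
    """Variables with the same value in every solution; None if unsatisfiable."""
    first = solve(variables, constraint)
    if first is None:
        return None
    candidates = dict(first)  # variables not yet shown to be hard or non-hard
    hard = {}
    while candidates:
        v, value = candidates.popitem()
        other = solve(variables, constraint, forced={v: -value})
        if other is None:
            hard[v] = value  # no solution with the opposite value
        else:
            # The new solution disproves every candidate it disagrees with.
            for w in list(candidates):
                if other[w] != candidates[w]:
                    del candidates[w]
    return hard

# Hypothetical constraint over two edge signs.
constraint = lambda s: ((s["AC"] == -1 or s["BC"] == -1)
                        and (s["AC"] == 1 or s["BC"] == -1))
print(hard_components(["AC", "BC"], constraint))
```

Each iteration removes at least one candidate, so the number of solver calls is at most the number of variables plus one, whatever the number of solutions.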

Acknowledgements

The authors are particularly grateful to B. Kaufmann, M. Gebser, and T. Schaub from the University of Potsdam for their help with the clasp software. They also wish to thank the referees for their interesting and constructive remarks.