| Experimental design for efficient identification of gene regulatory networks using sparse Bayesian modelsMax Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
BMC Systems Biology 2007, 1:51doi:10.1186/1752-0509-1-51 The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1752-0509/1/51
©
2007 Steinke et al; licensee BioMed Central Ltd. AbstractBackgroundIdentifying large gene regulatory networks is an important task, while the acquisition of data through perturbation experiments (e.g., gene switches, RNAi, heterozygotes) is expensive. It is thus desirable to use an identification method that effectively incorporates available prior knowledge – such as sparse connectivity – and that allows to design experiments such that maximal information is gained from each one. ResultsOur main contributions are twofold: a method for consistent inference of network structure is provided, incorporating prior knowledge about sparse connectivity. The algorithm is time efficient and robust to violations of model assumptions. Moreover, we show how to use it for optimal experimental design, reducing the number of required experiments substantially. We employ sparse linear models, and show how to perform full Bayesian inference for these. We not only estimate a single maximum likelihood network, but compute a posterior distribution over networks, using a novel variant of the expectation propagation method. The representation of uncertainty enables us to do effective experimental design in a standard statistical setting: experiments are selected such that the experiments are maximally informative. ConclusionFew methods have addressed the design issue so far. Compared to the most well-known one, our method is more transparent, and is shown to perform qualitatively superior. In the former, hard and unrealistic constraints have to be placed on the network structure for mere computational tractability, while such are not required in our method. We demonstrate reconstruction and optimal experimental design capabilities on tasks generated from realistic non-linear network simulators. The methods described in the paper are available as a Matlab package at http://www.kyb.tuebingen.mpg.de/sparselinearmodel webcite. BackgroundRetrieving a gene regulatory network from experimental measurements and biological prior knowledge is a central issue in computational biology. The DNA micro-array technique allows to measure expression levels of hundreds of genes in parallel, and many approaches to identify network structure from micro-array experiments have been proposed. Models include dynamical systems based on ordinary differential equations (ODEs) [1-5], Bayesian networks [6,7], or Boolean networks [8]. We focus on the ODE setting, where one or few expression levels are perturbed by external means, such as RNA interference [9], gene toggle switches (plasmids) [10], or using diploid heterozygotes, and the network structure is inferred from changes in the system response. So far only few studies investigate the possibility of designing experiments actively. In an active setting, experimental design is used to choose an order of perturbations (from a set of feasible candidates) such that maximum novel information about the underlying network is obtained in each experiment. Multi-gene perturbations are becoming increasingly popular, yielding more informative data, and automated data-driven design technologies are required to deal with the combinatorial number of choices which can be opaque even for a human expert. Identifying (linear) ODE systems from observations and experimental design are well developed within the control community [11]. However, in the systems biology context, only very few measurements are available compared to the dimension of the system (i.e. number of genes), and experiments leading to such observations are severely restricted. Biological measurements are noisy, and time resolution is low, so that in practice only steady states of a system may be accurately measurable. On the other hand, there are no real-time requirements in biological control applications, and more advanced models and analysis can be used. A large body of biological knowledge can be used to counter the small number of observations, for example by specifying a prior distribution within a Bayesian method. The standard system identification and experimental design solutions of control theory may therefore not be well-suited for biology. We propose a full Bayesian framework for network recovery and optimal experimental design. Given many observed genes and rather few noisy measurements, the recovery problem is highly under-determined, and a prior distribution encoding biological knowledge about the connectivity matrix does have a large impact. One of the key assumptions is network sparsity, which holds true for all known regulatory networks. We adopt the linear model frequently used in the ODE setting [1,2,4,5,12], but use a sparsity-enforcing prior on the network matrix. The sparse linear model is the basis of the Lasso [13], previously applied to the gene network problem in [12]. However, they simply estimate the single network maximizing the posterior probability from passively acquired data, and do not address experimental design. We closely approximate the Bayesian posterior distribution over connectivity matrices, allowing us to compute established design criteria such as the information gain, which cannot be done using maximum a posteriori (MAP) estimation. The posterior distribution cannot be computed in closed form, and obtaining an accurate approximation efficiently is challenging. We apply a novel variant of the recent expectation propagation algorithm towards this end. Many other approaches for sparse network recovery have been proposed. In [1], the space of possible networks (as computed by a SVD) is scanned for the sparsest solution. A sparse Bayesian model is proposed in [14], see also [15]. While there is some work on experimental design for boolean networks [16] and Bayesian causal networks [17], none of the above mentioned methods have been used towards this goal. Experimental design remains fairly unexplored in the sparse ODE setting, with the notable exception of [3]. We compare our approach to theirs, finding our method to perform recovery with significantly less experiments and running much faster. Our method is more robust to observation noise frequently present for biological experiments, and somewhat more transparent and in line with statistical practice. Finally, their method consists of a combinatorial search and is therefore only applicable to networks with uniformly small in-degree, an assumption invalid for many known regulatory networks, e.g. [18]. Results and DiscussionAlgorithmOur ModelWe start with the common linearized ODE model: expression levels x(t) ∈ ℝN of N measured genes at time t are modeled by the stochastic dynamical system dx(t) = f(x(t))dt - u(t)dt + dW(t).(1) Here, f: ℝN → ℝN describes the non-linear system dynamics, u(t) is a user-applied disturbance, and dW (t) is white noise. With u(t) ≡ 0, we assume that the system settles in a steady state, and we linearize the system around that point. In this setting, a perturbation experiment consists of applying a constant u(t) ≡ u, then measuring the difference x between new and undisturbed steady state. Under the linearity assumption, we have that u = Ax + ε,(2) where A is the system matrix with entries aij, the non-zero aij describing the gene regulatory network. The noise εis assumed to be i.i.d. Gaussian with variance σ2. We focus on steady state differences, as in [3]. Time course measurements are modelled linearly in [4,5], and our method can easily be formulated in their setup as well. We assume that the disturbances u do not drive the system out of the linearity region around the unperturbed steady state. While this seems a fairly strong assumption, our simulation experiments show that effective network recovery is possible even if it is partly violated. Our contribution to this standard linear regression formulation is a Bayesian model, incorporating prior information about A, namely its sparsity. The unknown matrix A is inferred via a posterior distribution, rather than merely estimated, allowing us to perform experimental design within a statistically optimal framework. Observations are denoted X = (x1 ... xm)T, U = (u1 ... um)T, and the Bayesian posterior is P(A|U, X) ∝ P(U|A, X)P(A),(3) where the likelihood is Note that typically m < N, certainly in early stages of experimental design, and U = XA has no unique solution. In this situation, the encoding of knowledge in the prior P(A) is of large importance. True biological networks are known to be sparsely connected, so we would expect sparse network matrices A. The prior should force as many entries of A close to zero as possible, at the expense of allowing for fairly large values of a few components. It should be a sparsity prior. We employ a Laplace prior distribution It is instructive to compare the Laplace against the Gaussian distribution, which is commonly used as prior in the linear model. The Laplace puts much more weight close to zero than the Gaussian, while still having higher probabilities for large values. The implications are depicted in Figure 1, see also [15]. In fact, the Gaussian prior is used with the linear model mostly for convenience, since the posterior is Gaussian again and can be computed easily [19]. Even within our framework, computations with a Gaussian prior are significantly more efficient than with a Laplace. However, our results prove that theoretical arguments in favour of the Laplace prior do have real practical weight, in that the computational advantages with the Gaussian are paid for by a much worse predictive accuracy, and identification needs significantly more measurements than for the Laplace.
The bi-separation characteristic of the Laplace prior into few large and many small parameters (which is not present for the Gaussian) is embodied even more strongly in other sparsity priors, such as "spike-and-slab" (mixture of narrow and wide Gaussian), Student-t, or distributions based on α-norms, Note that the Laplace prior does not imply any strict constraints on the graph structure, i.e. the sparsity pattern of A, in contrast to other combinatorial approaches which can be run affordably only after placing hard constraints on the in-degree of all network nodes [3]. The Laplace prior P(A) and the resulting posterior have densities, so that the probability of a matrix A having entries exactly equal to zero vanishes. Sparsity priors with point masses on zero have been used in Statistics, but approximate Bayesian inference for such is very hard in general (such priors are certainly not log-concave). We predict discrete network graphs from our posterior as follows. For a small threshold δe, we take aij to represent an edge i ← j iff |aij| > δe. Moreover, the marginal posterior probability of {|aij| > δ e} is used to rank potential edges i ← j. The posterior for the sparse linear model with Laplace prior does not fall into any standard multivariate distribution family, and it is not known how to do computations with it analytically. On the other hand, experimental design requires a good approximation to the posterior, which can be updated efficiently in order to score an experiment. Denote the observations (experiments) obtained so far by D. From (3) and (4), we see that the posterior factorizes w.r.t. rows of A, in that Experimental DesignIn our setup, an experiment consists of applying a constant disturbance u to the system, then measuring the new steady state. With current technology, such an experiment is expensive and time-consuming, especially if u is to be controlled fairly accurately. The goal of sequential experimental design is to choose the next experiment among a set of candidates (of about the same cost), with the aim of decreasing the uncertainty in A using as few experiments as possible. A successful design methodology allows to obtain the same conclusion with less cost and time, compared to doing experiments at random or even following an exhaustive coverage. To this end, an information value score is computed for each candidate, and the maximizer is chosen. Different costs of experiments can be considered by multiplying the information value score with the costs. However, note that if the costs are extremely different, experiment design is often not necessary since the costs alone determine what should be done next. A straightforward choice of an information value score is the expected decrease in uncertainty. In general, experimental design thus cannot be done without a representation of uncertainty in A, and the Bayesian framework maintains such a representation at its core, namely the posterior. Methods based solely on maximum likelihood or maximum a posteriori estimation (such as Lasso) fail to represent uncertainties. Denote the current posterior by Q(A) = Q(A|D). If (u*, x*) is the outcome of an experiment, let Q'(A) = Q'(A|D ∪ {(u*, x*)}) be the posterior including the additional observation. Different information value scores have been proposed for experimental design, see [23] for an overview. A measure for the amount of uncertainty in Q is the differential entropy EQ [- log Q], so a convenient score would be the entropy difference EQ [- log Q] - EQ' [- log Q']. A related score is the information gain S(u*, x*|D) = D[Q' || Q] = EQ' [log Q' - log Q]. Here, D[Q' || Q] is the relative entropy (or Kullback-Leibler divergence), a common measure for the "cost" (in terms of information) of replacing Q' by Q. The inclusion of a new experiment leads precisely to the replacement Q → Q', so the information gain is well-motivated in our setup. While scores such as information gain or entropy difference are hard to compute for general distributions Q, Q', this can be done straightforwardly for Gaussians. If Q(a) = N(h, Σ), Q'(a) = N(h', Σ') and a = with M = (Σ')-1Σ, which can be computed very efficiently in our framework. The outcome (u*, x*) of an experiment is of course not completely known before it is performed. The central idea of Bayesian sequential design is to compute the distribution over outcomes of the experiment, based on all observations so far, with which to average the score S(u*, x*|D). Thus, some experimental candidate e is represented by a distribution Qe(· |D) over (u*, x*). In the setting of this paper, u* is completely known, say u* = u(e) for candidate e, although in an extended setting, e might only specify a distribution over u*. Given u* = u(e), TestingIn the literature, there are some small networks with known dynamics, e.g. the Drosophila segment polarity network [24]. However, a thorough evaluation of our method requires significantly larger systems for which the dynamics are known, so that disturbance experiments can be simulated, and the predictions of our method can be verified. We are not aware of such models having been established for real biological networks yet, the DREAM project [25] aims at providing such data in the future. We therefore concentrate on realistic "in-silico" models, applying our method to many randomly generated instances with different structures and dynamics in order to obtain a robust evaluation and comparison. We simulate the whole network identification process. First, we generate a biologically inspired ground-truth network together with parameters for a numerical simulator of nonlinear dynamics. We feed our method with a number of candidate perturbations {u*}, among which it can choose the experiments to be done. If some u* is selected, the corresponding x* is obtained from the simulator, and (u*, x*) is included into the posterior as new observation. We score the current posterior Q(A) against the true network after each inclusion, comparing our method against variants in different settings. Free hyperparameters (τ, σ2) are selected individually for each of the methods to be compared (see Methods section). We also compare against the experimental design method proposed in [3], and finally show results on the real, but small Drosophila segment polarity network [24]. Network SimulationCommon computational models of sparse regulatory networks often build on the scale-free or the small-world assumption [26]. In small world networks the average path length is much shorter than in a uniform random network. We sample such small-world networks with N = 50 nodes (unless otherwise said), see Figure 2 for an example. Further details about network generation and properties are given in additional file 1.
Additional file 1. Implementation details. The text describes how to sample small-world networks, how the parameters of the simulator were chosen, and describes in detail how the re-implementation of the method [3] was performed. Format: PDF Size: 103KB Download file This file can be viewed with: Adobe Acrobat Reader For a given network structure, we sample plausible interaction dynamics using Hill-type kinetics, inspired by the model in [2]. The non-linear function in (1) is where Each observation (u, x) consists of a constant disturbance u and its effect x, being the difference between a new (perturbed) and the old (unperturbed) steady state. Disturbance candidates were restricted to a small number r of non-zero entries, since experimental techniques for disturbing many genes in parallel by tightly controlled amounts are not yet available. All non-zero uj are in {±ν}, where the sign is random, so ||u|| is the same for all u. We measure ||u|| in units given by the average relative change in steady state when such disturbances u are applied. We use a pool of 200 randomly generated candidates. The SDE simulator can be used with different levels of noise, measured in terms of the signal-to-noise ratio (SNR), i.e. the ratio of ||u|| and the standard deviation of the resulting ε in (2). All results are averaged over 100 runs with independently drawn networks. In the comparative plots presented below, the different methods all see the same data in each run. Evaluation CriterionThe output from a regulatory network identification method most relevant to a practitioner is a ranking of all possible links, ordered by the probability that they are true edges. With this in mind, we choose the following evaluation score, based on ROC analysis. At any time, our method provides a posterior Q(A), of which at present we only use the marginal distributions Q(aij). We produce a ranking of the edges according to the posterior probabilities Q({|aij| > δe}), where δe = 0.1 in all experiments. δe was calibrated against average component sizes |aij|, which are roughly given through the dominant time scales in the dynamical system. The predicted rankings are robust against moderate changes of δe. In a standard ROC analysis, the true positive rate (TPR) is plotted as a function of the false positive rate (FPR), and the area under this curve (AUC) is measured. This is not useful in our setting, because only very small FPRs are acceptable at all (there are N2 potential edges). Our iAUC score is obtained by computing AUC only up to a number of FP equal to the number of edges in the true network, normalized to lie in [0, 1]. For N = 50, the "baseline" of outputting a random edge ranking has an expected iAUC of 0.02. Furthermore, on average about 25% of the true edges are "undetectable" by any method using the linearized ODE assumption: although present in the nonlinear system, their entries aij are very close to zero, and they do not contribute to the dynamics within the linearization region. Such edges were excluded from the computation of iAUC, for all competing methods. DiscussionIn Figure 3, we present reconstruction curves for our method versus competing techniques, lacking novelties of our approach (optimal experimental design, Laplace sparsity prior). Very clearly, optimal design helps to save on costly and time-consuming experiments. The effect is more pronounced for the Laplace than for the Gaussian prior. The former is a better prior for the task, and it is well known that the advantage of designed versus random experiments scales with the appropriateness of the model. In this case, the iAUC level 0.9 is attained after 36 experiments with designed disturbances, yet only after 50 measurements with randomly chosen ones, thus saving 30% of the experiments.
In general, the model with Laplace prior does significantly better than with a Gaussian one (τ of the Laplace and the variance of the Gaussian prior were of course selected independently). The difference is most pronounced at times when significantly less than N experiments have been done and the linear system (2) is strongly under-determined. This confirms our arguments in favour of the Laplace prior. The systematic underperformance of the most direct variant LD of our method, up to about N/2 observations, is not yet completely understood. One should be aware that aggressive experimental design based on very little knowledge can perform worse than a random choice. This is a variant of the well-known "explore-exploit" trade-off [27], which can be countered by either specifying prior knowledge more explicitly, or by doing a set of random inclusions (explore) before starting the active design (exploit). This is done in the LM variant. In Figure 4, experimental design is compared to the random experiment choice setting, both with a Laplace prior. In the left panel, we vary the number r of non-zero entries in the disturbances u. Recall that large r are in fact unrealistic in experimental techniques available today, but may well become accessible in the future. The less constraints there are on u, the more information one may obtain about A in each experiment, and the better our method performs. This is in line with linear systems theory, where persistent excitations [11] (i.e. full u's) are known to be most effective for exploring a system. The edge of experimental design is diminished with larger r. This is plausible, in that the informativeness of each u increases strongly with more non-zeros, thus the relative differences between u's are smaller. Experimental design can outperform random choices only if there are clear advantages in doing certain experiments over others.
The middle panel in Figure 4 explores effects of different sizes ||u||, i.e. different perturbation strengths (here, r = 3, and the noise in the SDE is very small). For larger ||u||, the real non-linear dynamics deviate more and more from the linearized ones, thus decreasing recovery performance above about 5%. On the other hand, larger ||u|| would result in a better SNR for each experiment, given that non-linear effects could be modelled as well. This is not yet done in our method, but these shortcomings are shared by all other methods relying on a linearization assumption. It is, however, encouraging that our method is quite robust to the fact that even at smaller ||u||, the residuals ε behave distinctly non-Gaussian (occasional large values). The right panel in Figure 4 shows how increasing stochastic noise in (1) influences network recovery. We keep r = 3 and set ||u|| to generate steady state deviations of 1%. Good performance is obtained at SNRs beyond 10. With a SNR of 1, one cannot expect any decent recovery with less than N measurements. At all SNRs shown, the network was recovered eventually with more and more experiments, but this is probably not an option one has in current biological practice. Comparison to Tegnér et.alThe method proposed in [3] is state-of-the-art for experimental design applied to gene network recovery, and in this section, we compare our method against theirs. Their approach can be interpreted in Bayesian terms as well, this is detailed in additional file 1. In contrast to our method, they discretize the space of possible matrices A. Observations are used to sieve out candidates which are not "consistent" with all measurements so far. They have to restrict the maximum node in-degree for each gene to 3 in order to arrive at a procedure of reasonable cost. To our knowledge, the code used in [3] has not been released. We implemented it, following all details in their paper carefully (some details of our re-implementation are given in additional file 1). In general, the diagonal of A (self-decay rates) is assumed to be known in [3]. For the comparison, we modified our method to accept a fixed known diag A and changed the iAUC score not to depend on self-edges. Results of a direct comparison are shown in Figure 5, with and without the proposed optimal design methods. Due to the high resource requirements of the method in [3], we use networks of size N = 20 (simulated as above), restricted to in-degrees at most 3. In general, our method performs much better in recovering the true network. This difference is robust even to significant changes in the ground truth simulator. We find that their method is very sensitive to measurement and system noise, or to violations of the linearization assumption, whereas our technique is markedly more robust w.r.t. all these. We give some arguments why this might be the case. Firstly, their "consistency" sieve of A candidates in light of measurements is impractical. After every experiment a number of inconsistent A is rejected from consideration, and noisy experiments may well lead to a wrong decision. Any future evidence for such a rejected solution is, however, not considered any more. At the same time, an experiment does not help to discriminate between matrices which are still consistent afterwards. Another severe problem with their approach lies in the discretization of A entries. A histogram of values of aij from our simulator reveals a very non-uniform (and also non-Gaussian) distribution: many values close to zero, but also a substantial number of quite large values. At the very least, their quantization would have to be chosen non-uniformly and adaptively, such that each bin has about equal mass under this distribution. However, it is quite likely that the best quantization depends on details of the true system which are not known a priori. Statistics with continuous variables, as we employ, is a classical way of avoiding such quantization issues. Furthermore, our Laplace prior seems to capture features of the aij distribution favourably.
In Table 1, we compare running times. Even though they restrict the node in-degree to 3, which is often unrealistic for known biological networks [18], the required running times are orders of magnitude larger than for our method. Also, their memory requirements are huge, so that networks sizes beyond N = 50 could not be dealt with on a unit with 4 GB RAM. Both are clearly consequences of their quantization approach, which we circumvent completely by applying a continuous model. The asymptotic running time for a naive implementation of our method is O(N5) (Laplace, experimental design, N experiments), independent of the true network structure, but this can be reduced to O(N4) as discussed in the Methods section. Table 1. Runtimes. Running time for full network recovery, comparing our method (Laplace, design) with [3] Drosophila segment polarity networkIn [24], von Dassow et.al. describe a realistic model of the Drosophila segment polarity network. We tested our algorithm on a single cell submodule, using the equations and parameters as described in [3, Supplement], who also used this model. So far, we modelled only mRNA levels. However, the Drosophila network also contains 5 proteins which play an important role in the regulatory network. Since proteins are hard to control and to observe, we treat them as unobserved variables and focus on identifying the effective network between the genes. A link i → j between genes i ≠ j in the effective network represents one or more interactions of the form i → P1 → ⋯ → Pq → j, where P1, ..., Pq, q ≥ 0 are intermediate proteins, but not genes. In the methods section, we give a mathematical proof that any method working on the observed part of the system only, such as ours, in fact focusses on identifying the effective network, given that the linearized ODE assumption is applied to the complete system. This is reassuring, since all regulatory networks between genes are nothing else but effective networks of larger partially unobserved systems. As shown in Figure 6, the network contains 9 inter-gene regulatory pathways, apart from the self-links that are dominated by the respective self-decay rates. Three of the inter-gene links are functionally weak (i.e.
ConclusionWe have presented a Bayesian method for identifying gene regulatory networks from micro-array measurements in perturbation experiments (e.g., RNAi, toggle-switch, heterozygotes), and shown how to use optimal design in order to reconstruct networks with a minimum number of such experiments. The approach proves robust and efficient in a realistic non-linear simulation setting. Our main improvements over previous work consist of employing a Laplace prior instead of a simpler Gaussian one, encoding the key property of sparse connectivity of regulatory networks within the model, and of actively designing rather than randomly choosing experiments. Both features are shown to lead to significant improvements. When it comes to experimental design, our method outperforms the most prominent instance of previous work significantly, both in higher recovery performance and in smaller resource requirements. Our application of the recent expectation propagation technique to the under-determined sparse linear model is novel, and variants may be useful for other models in Bioinformatics. In this paper, we have focussed on modelling mRNA levels, which can be measured easily and cost-effectively. However, protein and metabolite concentrations also play important roles in any regulatory pathway, and a concise ODE explanation of a system can probably not be formulated if they are ignored. Our method allows to treat these as unobserved variables and to identifying effective networks between the genes. However, if the additional variables can be directly measured, they can easily be treated explicitly within our method, by simply extending the state variable x(t). Throughout the paper we have assumed that u* is known for an experiment, i.e. the disturbance levels of the r targeted genes can be controlled or at least predicted in advance, before the experiment is actually done. For example, a study trying to model the efficacy of RNAi experiments is given in [28]. In the context of experiment design, we can only hope to compute the expected decrease in uncertainty for a specific experiment, and thus rank potential experiments according to their expected value, if the experimental outcome is predictable to some degree. In our method, the outcome x* for a given u* is inferred through the current posterior, i.e. the information gain from (u*, x*) is averaged over Q(x*|u*, D). This can be extended to uncertain u*, if distributions Qe(u*|D) specific to each experiment e can be specified. For experimental biology, this means that not only do we need experimental techniques which deliver quantitative measurements, but furthermore the parameters distinguishing between different experiments (u in our case) either have to be fairly tightly controlled (our assumption in this paper), or their range of outcome has to be characterized well by a mathematical model. In general, biological prior knowledge about the (effective) regulatory network may already be available before any experiments are done. In fact, in the presence of many genes N, it is typically not affordable to do on the order of N disturbance experiments, which are required for complete network identification in the absence of specific prior knowledge (it has been conjectured that O(log N) experiments are required only in [3], but we cannot confirm such a surprisingly fast scaling based on our experiments, even when using their method). Within our method, such prior knowledge can be incorporated if it can be formulated in terms of the system matrix A. No interaction i ← j is encoded as aij = 0, an activating influence i ← j as aij > 0. These types of knowledge can be included in our method, as is discussed in the Methods section. There are several other setups of formulating the network recovery problem in terms of a sparse linear model. Time-course mRNA measurements with unknown, yet time-constant disturbances u are used in [5] and [4]. Relative rather than absolute changes in expression levels are employed in [2]. Within all these setups, our general efficient Bayesian framework for the sparse linear model could be beneficial, and could lead to improvements due to the Laplace sparsity prior. The linearized ODE assumption is frequently done [1-5,12], yet it is certainly problematic. For disturbances which change steady state expression levels by more than about 5%, our simulator showed a behavior which cannot directly be captured by a linearized approach. But such perturbation levels may be necessary to achieve a useful SNR in the presence of typically high measurement noise. An important point for future work is the extension of the model by simple non-linear effects of relevance to biological systems. For example, our model can directly be extended to higher-order Taylor expansions of non-linear dynamics, since these are still linear in the parameters. MethodsApproximate Bayesian InferenceIn this section, we provide an exposition and some possible extensions of our approximate inference method. We sketch the expectation propagation (EP) method for inference in the sparse linear model. Further details are given in [29]. Our aim is to approximate Bayesian inference, given a model and prior distributions for all unknowns. The likelihood function is the probability of the observed data, given all unknowns, it is determined entirely by the model. In our case, we have a Gaussian likelihood The posterior distribution is by the rules of conditional probabilities. It factorizes w.r.t. rows of A, since both prior and likelihood do. For the sparse linear model, computations based on the posterior factor P( More specifically, fix a row index i and let a = In the under-determined case m <N we are principally interested in here, this standard application of EP fails, because the Gaussian coupling factor cannot be normalized as distribution over a. The variant of EP we are using in our experiments here, does come with essentially the same motivation, but some more complicated details. It is described in [29]. Returning to experimental design, the information gain score S(u*, x*|D) for an experimental outcome (u*, x*) is D[Q' || Q]. Note that two things happen in Q → Q'. Firstly, (u*, x*) is included, which modifies the Gaussian coupling factor in Q. Secondly, the site parameters b, πare updated by EP. For the purpose of scoring, early trials showed that the second step can be skipped in scoring without much loss in performance. Doing so, we see that M in the equation for D[Q' || Q] has the form I + x* Running TimeThe running time for a naive implementation of our method (Laplace prior, experimental design) is O(N5), if N experiments are done. Namely, after each experiment, we need to update N posterior representations, one for each row of A. For each, we require at least N EP updates, one at each Laplace site, and each such update costs O(N2) (at least once m, the number of experiments so far, is close to N). This scaling behaviour can be improved by noting that especially during later stages, it will not be necessary to do EP updates for all N2 sites after each new experiment. For a row a, we can compute the change in marginal moments of each Q(ai) upon including the new observation into the likelihood P(0) only. We then do EP updates for O(1) sites only, namely the ones with most significantly changed marginals. This cuts the scaling to O(N4). Relations to other Sparse Bayesian MethodsInterestingly, EP for the sparse linear model can be compared directly to the sparse Bayesian learning (SBL) approach [15]. While SBL is formulated in terms of Student-t priors, we can do the same scale-mixture decomposition as they do for the Laplace prior [31]. The SBL approach leads to a Gaussian posterior approximation Q(a) of the same form as in EP, with the difference that in the site approximations The πj are chosen in SBL by maximizing the likelihood of the data, integrating out the parameters a. This is a non-convex problem which requires some optimization code, while EP comes with a method of updating bj, πj which can be motivated more directly. The role of log-concavity is also less clear in SBL. A systematic comparison between these approaches is subject to future work. Note that SBL with Student-t priors has been applied to gene network recovery [14], although they did not consider experimental design. Furthermore, the Student-t distribution is not log-concave, so the true posterior is multimodal, rendering the quality of the Gaussian SBL approximation questionable. A Markov chain Monte Carlo (MCMC) method for the linear model with Laplace prior is given in [31]. In their approach, the noise variance σ2 is inferred together with a, and they give arguments why their sampler should converge quickly, based again on posterior log-concavity. While a direct comparison to our EP variant has not been done, it seems clear that the MCMC approach is much more costly. This may not be a problem for a standard application, but is likely to make the experimental design approach computationally unattractive. In general, while MCMC inference approximations are exact in the limit of large running time, it is very hard even for experts to assess at which point an MCMC estimate can be considered reliable. Incorporating Biological Prior KnowledgeIn our method presented so far, we assumed that nothing is known about the network system matrix A, apart from it being sparse. In many applications, substantial additional prior knowledge about A is available. In this section, we show how some types of such prior knowledge can be incorporated into our method, leading to fewer experiments required for identification. In general, our method can be extended by using additional sites beyond the First, suppose that mRNA degradation rates for some genes are roughly known from independent experiments, say ri for gene i. We could either fix aii = - ri and eliminate this variable, or we could use the factor with smaller τ than usual, which would allow for errors in the knowledge of ri. Using such off-center factors is of course possible in our framework with very minor changes. Next, suppose that partial connectivity knowledge is available. For example, if there is no influence j → i, then aij = 0, and the corresponding variable can simply be eliminated. If it is known that j → i is an activating influence, this means that aij > ε for some ε ≥ 0. We can incorporate a site Setting Free ParametersWe need to adjust two free parameters: the noise variance σ2, and the scale τ of the Laplace prior. Given some substantial amount of observations, these could be estimated by empirical Bayesian techniques, but this is not possible for experimental design, where we start with very few observations. One may be able to correct initial estimates of σ2, as more observations are made, and a method for doing so is subject to future work. There are two sources of noise, i.e. non-zero ε for observations (u, x) and true linearization matrix A. First, the ODE of our simulator is stochastic, and measurement errors are made for u, x. Second, we have systematic deviations between the true non-linear dynamics to ones of the linearization. It is possible to estimate the variance of errors of the first kind without knowing the true A or performing specific disturbance experiments, by observing fluctuations around the undisturbed steady state. This is not possible for errors of the second kind. However, it is reasonable to assume that a good value for σ2 does not change too much between networks with similar biological attributes, so that we can transfer it from a system whose dynamics are known, or for which sufficiently many observations are already available. This transfer was simulated in our experiments by generating 50 networks with data as mentioned above, then estimating σ2 from the size of the εresiduals. Note that these additional networks were only used to determine σ2, for the other experiments we used independent samples from our network generator. The scale parameter τ determines the a priori expected number of edges in the network. It could be determined similar to σ2, but a simple heuristic worked just as well in most setups we looked at (the exception was very high noise situations). We need a rough guess of the average node in-degree We found in practice that our method is quite robust to moderate changes in τ and σ2, as long as the correct order of magnitude is chosen. Unobserved variablesComplete gene regulatory networks consist of mRNA concentrations, but also of proteins and metabolites. In typical setups, only (some) mRNA levels are directly measured, and we will discuss here how the unobserved elements of the network influence our network inference. For simplicity, all unobserved quantities will be termed as proteins in this section. Denote the observed mRNA concentrations by x(t) ∈ ℝN as before, unobserved protein concentrations by y(t) ∈ ℝM. Let u(t) ∈ ℝN be a perturbation vector, which does not affect the unobserved variables. We now have a joint (nonlinear) ODE system for (x, y), which is again linearized around its steady state. If time constant perturbations are used, the difference between new and old steady state follows a linear equation (up to noise) From this, we deduct u = (A - BD-1C)x. Thus, given only the u and x our algorithm will not recover A, but We show that Lemma 1. Let W ∈ ℝn,n be the weighted adjacency matrix of a directed graph, in that i ← j has weight wij, and the edge is present iff wij ≠ 0. Assume that W is nonsingular. The following holds: if (W-1)ij ≠ 0, then there exists some directed path j → i. Proof. We prove the logical converse. For i = j, there is always a path of length 0 from i to i, so the lemma makes no statement. For i ≠ j, assume that there is no directed path from j to i. Let J be the set of all nodes reachable by j (note that j ∈ J), and let I be its complement. i ∈ I by our assumption. Without loss of generality, assume that J = {1, ..., |J|}, noting that this can always be achieved by renaming nodes, without changing the network. Now, If WI,J was not zero, there would be some element in I reachable from J, therefore from j, so I ∩ J ≠ ∅, a contradiction. From the special form of W we have that |W| = |WJ||WI|, so that both WJ, WI are nonsingular. Now, with Back to the effective gene network, we have that While we can thus recover an effective network, the knowledge of Availability and RequirementsThe methods described in the paper are available as a Matlab package at http://www.kyb.tuebingen.mpg.de/sparselinearmodel webcite. The code makes use of C++ MEX files for core routines, pre-complied binaries are provided for Windows and Linux 32 bit operating systems. The code is published under the GNU GPL licence, for commercial use please contact the authors. Authors' contributionsFS was involved in defining the problem statement, and FS, MS decided on the model solution. FS carried out the numerical experiments and performed the comparison to [3]. MS designed and implemented the computational framework of the approximate inference algorithm. KT proposed to look at the experimental design problem and helped to dicuss the plausibility of the work, as well as relations to other proposed approaches. All authors contributed significantly to the writing of the final manuscript. References
|



on Google Scholar







author email
corresponding author email

Figure 1.






Figure 2.

Figure 3.
Figure 4.
Figure 5.
Figure 6.














