Abstract
Background
Estimating the rate constants of a biochemical reaction system with known stoichiometry from noisy time series measurements of molecular concentrations is an important step for building predictive models of cellular function. Inference techniques currently available in the literature may produce rate constant values that defy necessary constraints imposed by the fundamental laws of thermodynamics. As a result, these techniques may lead to biochemical reaction systems whose concentration dynamics could not possibly occur in nature. Therefore, development of a thermodynamically consistent approach for estimating the rate constants of a biochemical reaction system is highly desirable.
Results
We introduce a Bayesian analysis approach for computing thermodynamically consistent estimates of the rate constants of a closed biochemical reaction system with known stoichiometry given experimental data. Our method employs an appropriately designed prior probability density function that effectively integrates fundamental biophysical and thermodynamic knowledge into the inference problem. Moreover, it takes into account experimental strategies for collecting informative observations of molecular concentrations through perturbations. The proposed method employs a maximizationexpectationmaximization algorithm that provides thermodynamically feasible estimates of the rate constant values and computes appropriate measures of estimation accuracy. We demonstrate various aspects of the proposed method on synthetic data obtained by simulating a subset of a wellknown model of the EGF/ERK signaling pathway, and examine its robustness under conditions that violate key assumptions. Software, coded in MATLAB^{®}, which implements all Bayesian analysis techniques discussed in this paper, is available free of charge at http://www.cis.jhu.edu/~goutsias/CSS%20lab/software.html webcite.
Conclusions
Our approach provides an attractive statistical methodology for estimating thermodynamically feasible values for the rate constants of a biochemical reaction system from noisy time series observations of molecular concentrations obtained through perturbations. The proposed technique is theoretically sound and computationally feasible, but restricted to quantitative data obtained from closed biochemical reaction systems. This necessitates development of similar techniques for estimating the rate constants of open biochemical reaction systems, which are more realistic models of cellular function.
Background
Biochemical reaction systems are popular models of cellular function. These models are extensively used to represent the interconnectivity and functional relationships among molecular species in cells and, most often, they provide accurate description of cellular behavior. Inferring a biochemical reaction system from experimental data is an important step towards building mathematical and computational tools for the analysis of cellular systems. This step requires both structure (stoichiometry) identification as well as parameter (rate constant) estimation [14]. Due however to the large combinatorial complexity of determining the stoichiometry of a biochemical reaction system, solving this problem requires large amounts of high quality experimental data and substantial computational resources, which are not usually available in practice.
Recently, several approaches have been proposed in the literature for addressing a simpler problem, known as model calibration. The objective of model calibration is to adjust the kinetic parameters of a biochemical reaction system with given stoichiometry in order to obtain a sufficiently good match between simulated and observed dynamics; e.g. see [2,511].
Among known model calibration techniques, the ones based on Bayesian analysis [7,10,11] are perhaps the most versatile. Bayesian analysis allows us to effectively incorporate biophysical knowledge into the problem at hand and naturally draw statistical conclusions about the unknown kinetic parameters. This is done by employing a probability density function that encapsulates prior information about the rate constants of a biochemical reaction system and by deriving a posterior probability density function over the kinetic parameters after experimental data have been collected. By taking into account the experimental data and the information contained in the prior, the posterior density summarizes all knowledge available about the unknown kinetic parameters and quantifies uncertainty about their true values [12,13]. Moreover, the posterior allows us to quantify our confidence about estimation accuracy, compute probabilities over alternative calibrations, and design additional experiments to improve inference.
Most published model calibration techniques do not take into account constraints on the reaction rate constants imposed by the fundamental laws of thermodynamics. If these constraints, known as Wegscheider conditions [14,15], are not explicitly considered by a model calibration technique, then the method will spend most time examining impossible kinetic parameter sets and will most probably produce a biochemical reaction system that is not physically realistic [16]. This issue has been recently recognized in the literature, and new modeling formalisms have been suggested in an effort to address it [1720]. The proposed formalisms describe a biochemical reaction system by welldefined thermodynamic parameters whose values always guarantee that the reaction rate constants satisfy the Wegscheider conditions. For example, in [19,20], a biochemical reaction system is parameterized in terms of molecular capacities and reaction resistances, by using a thermodynamic kinetic modeling (TKM) formalism that enjoys a number of advantages over the ones suggested in [17,18].
We believe that parameterizing a biochemical reaction system in terms of capacities and resistances is unnecessary and, in certain instances, problematic. It has been pointed out in [19] that different choices for the TKM parameters can lead to the same concentration dynamics. As a consequence, the TKM parameters cannot be determined uniquely from concentration measurements. A way to address this problem is to take the capacities to be the equilibrium concentrations (which is always possible in closed biochemical reaction systems), in which case the capacities are constrained by conservation relationships imposed by the system stoichiometry. Then, parameter estimation in the TKM formalism may be possible by arbitrarily fixing a subset of capacity values and estimating the remaining capacities and resistances. However, this approach can be very cumbersome when dealing with molecular perturbations (as we do in this paper) or when merging estimated TKM models, since, in both cases, the capacities may not refer to compatible equilibrium concentrations. It has been suggested in [19] that a way to merge two models using the TKM formalism is to first convert the capacities and resistance to the rate parameters, merge the two models, and then convert back to the TKM formalism. However, this approach seems to be overly complicated, especially in view of the model calibration methodology presented here.
In this paper, we introduce a thermodynamically consistent Bayesian analysis approach to model calibration that does not require reparametrization. Our approach relies on statistically modeling the reaction rate constants of the forward reactions as well as the equilibrium constants of individual reactions. We restrict our attention to closed systems (or systems that can be approximately considered to be closed), since thermodynamic analysis of such systems is easier to handle than open systems. The proposed approach controls thermodynamic consistency of the reaction rate constants by employing welldefined relationships between the kinetic parameters of a biochemical reaction system, imposed by the Wegscheider conditions. By embedding these relationships within an iterative algorithm that finds the mode of the posterior density, we arrive at a thermodynamically consistent Bayesian estimate for the rate constants.
Bayesian analysis can be appreciably influenced by the choice of the prior probability density functions. This is particularly true in systems biology problems in which only a small number of observations is usually available. It is therefore important to focus our effort on constructing appropriate prior densities for the unknown rate constants of the forward reactions and the equilibrium constants of individual reactions. Although a number of choices may be possible, it is imperative to use fundamental biophysical and thermodynamic principles to derive informative prior densities that effectively encapsulate such principles.
By using the classical Arrhenius formula of chemical kinetics [21], we construct an appropriate prior density for the lograte constants of the forward reactions. To do so, we assume that the prefactor and activation energy associated with the Arrhenius formula are both random variables following lognormal and exponential distributions, respectively. This approach takes into account unpredictable changes in biochemical conditions affecting the structure of the reactant molecules and the probability of reaction after collision. On the other hand, by exploiting the thermodynamic relationship between rate constants, equilibrium concentrations, and stoichiometric coefficients, we derive an analytical expression for the joint prior density of the logarithms of the equilibrium constants. This expression depends on steadystate concentration measurements and on the stoichiometry of the biochemical reaction system under consideration.
Another important issue associated with the inference problem considered in this paper is the need to collect an informative set of measurements that can lead to sufficiently accurate parameter estimation. It has been increasingly recognized in the literature that a powerful approach to accomplish this goal in problems of systems biology is to selectively perturb key molecular components and measure the effects of these perturbations on the underlying concentrations [2224]. We follow this strategy here and assume that we can selectively perturb, one at a time, the initial concentrations of a selected number of molecular species in a biochemical reaction system, by increasing or decreasing their values without altering the underlying stoichiometry. This can be achieved by a variety of experimental techniques, such as RNA interference (RNAi), transfection, or molecular injection. Therefore, our approach combines Bayesian analysis with current experimental practices, thus bridging the gap between statistical inference approaches and experimental design.
The Bayesian analysis technique discussed in this paper requires numerical evaluation of a number of statistical summaries of the posterior density. Although several methods are available to deal with this problem (e.g., see [25,26]), we employ here a maximizationexpectationmaximization (MEM) strategy that calculates a thermodynamically consistent estimate of the reaction rate constants as well as Monte Carlo estimates of posterior summaries used to evaluate the quality of inference. This strategy is based on sequentially combining a powerful stochastic optimization technique, known as simultaneous perturbation stochastic approximation (SPSA) [27], with Markov chain Monte Carlo (MCMC) sampling [25]. Our experience with extensive synthetic experiments, based on data obtained by simulating a subset of a wellknown model of the EGF/ERK signaling pathway, indicates that the proposed algorithm is robust, producing excellent estimation results even in cases of high measurement errors and limited time measurements.
This paper is structured as follows. In the "Methods" section, we provide a brief overview of biochemical reaction systems, discuss how to model perturbations, and present a standard statistical model for the measurements. We then outline our Bayesian analysis approach to model calibration and present our choices for the prior and posterior densities. By emphasizing the fact that the prior density must assign zero probability over the thermodynamically infeasible region of the parameter space, and by employing an encompassing prior approach to Bayesian analysis, we derive an appropriate posterior density that satisfies this condition. We finally outline our proposed methodology for computing thermodynamically consistent Bayesian estimates of the kinetic parameters and for assessing estimation accuracy.
In the "Results/Discussion" section, we provide simulation results, based on a subset of a wellestablished model of the EGF/ERK signal transduction pathway. These results illustrate key aspects of the proposed model calibration methodology and show its potential for producing sufficiently accurate thermodynamically consistent estimates of a biochemical reaction system from noisy timeseries measurements of molecular concentrations.
Finally, in the "Conclusions" section, we discuss a key statistical advantage of the proposed model calibration methodology, viewed from a biasvariance tradeoff perspective. Moreover, we provide suggestions for further research to address a number of practical issues associated with model calibration, such as estimating the initial concentrations and their perturbations, dealing with partially observed or missing data, and extending the proposed technique to the case of open biochemical reaction systems.
Extensive mathematical and computational details are required to rigorously formulate, derive, and understand various aspects of the proposed approach. We provide these details in three Additional files accompanying this paper. In Additional file 1, we present a detailed exposition of the underlying theory, whereas, in Additional file 2, we carefully discuss computational implementation. Finally, in Additional file 3, we provide all necessary details pertaining the biochemical reaction system we use in our simulations. Welldocumented software, coded in MATLAB^{®}, which implements all Bayesian analysis techniques discussed in this paper, is available to interested readers free of charge at http://www.cis.jhu.edu/~goutsias/CSS%20lab/software.html webcite.
Additional file 1. In this document, we provide theoretical details necessary to understand the Bayesian analysis approach introduced in the Main text.
Format: PDF Size: 224KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. This document contains a detailed description of the computational algorithms used for implementing various steps of the proposed Bayesian analysis approach.
Format: PDF Size: 81KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 3. In this document, we list the biochemical reactions associated with our numerical example and provide thermodynamically consistent values for the rate constants as well as appropriate values for the initial molecular concentrations.
Format: PDF Size: 79KB Download file
This file can be viewed with: Adobe Acrobat Reader
Methods
Biochemical reaction systems
In this paper, we consider a biochemical reaction system comprised of N molecular species X_{1}, X_{2}, ..., X_{N }that interact through M coupled reactions of the form:
The parameters k_{2m1 }and k_{2m }are the rate constants of the forward and reverse reactions, whereas, ν_{nm},
initialized by
where
s_{nm }is the net stoichiometry coefficient of the n^{th }molecular species associated with the m^{th }reaction, defined by
Equations (2)(4) are based on the assumption that we can selectively perturb, one
at a time, the concentrations of molecular species in a set
Due to the enormous complexity of biological reaction networks, (1) is used to model
a limited number of molecular interactions embedded within a larger and more complex
system. Mass flow between the biochemical reaction system given by (1) and its surroundings
complicates modeling. As a matter of fact, some molecular concentrations in the system
may be influenced by unknown reactions, not modeled by (1), or by partially known
reactions with reactants regulated by unknown biochemical mechanisms. To address this
problem, we will assume that there is no appreciable mass transfer between the biochemical
reaction system and its surroundings during the observation time interval
Measurements
We will now specify an appropriate model for the available measurements. We will assume
that, by an appropriately designed experiment, we can obtain noisy measurements
for q = 1, 2, ..., Q + 1, where
In the following, we will assume that the biochemical reaction system, and all its
perturbed versions, is sufficiently close to steadystate at time point t_{Q+1}. We can justify this assumption by taking t_{* }≤ t_{Q+1 }≤ t_{max }and by recalling our previous assumption that the biochemical reaction system is at
thermodynamic quasiequilibrium at times t_{* }≤ t ≤ t_{max}. Our Bayesian analysis approach is based on data y, whereas, we use the steadystate measurements
Finally, we will assume that the error components
Bayesian model calibration
In this paper, we deal with the following problem: Given noisy concentration measurements
y and
We should note here that it is convenient to estimate the logarithms of the rate constants
instead of the constants themselves. By focusing on the logarithms, we can reduce
the dynamic range of rate constant values and make their estimation numerically easier.
To simplify our developments, we will assume that the initial concentrations {c_{n}, n ∈
Given data y, the objective of Bayesian analysis is to evaluate the posterior probability density function p(κ  y), which summarizes our belief about the lograte constants κ after the data y have been collected. It can be shown [see Equations (S1.4) and (S1.5) in Additional file 1] that
where p ∝ q denotes that p is proportional to q, and
with z = {z_{m}, m ∈ ℳ} being the set of logequilibrium constants of the reactions, defined by
Note that the prior density of the lograte constants κ depends on z. For this reason, we view z as a set of random hyperparameters (in Bayesian analysis, parameters used to specify prior densities are known as hyperparameters), specified by means of the prior density p(z).
The posterior density p(κ  y) takes into account our prior belief about the rate constant values and the data formation process, summarized by the prior density p(z) of the logequilibrium constants, the conditional prior density p(κ  z) of the lograte constants given the logequilibrium constants, the conditional probability density p(σ^{2 } κ) of the error variance given the lograte constants, and the likelihood p(y  κ, σ^{2}). However, the posterior density is hard to interpret, especially in highdimensional problems that involve many parameters, such as the problem we are dealing with here. As a consequence, the main objective of Bayesian analysis is to produce numerical information that can be effectively used to summarize the posterior density and simplify the task of statistical inference to the extent possible. Typical summaries include measures of location and scale of the posterior, which are used to produce estimates for the parameter values and to evaluate the accuracy of such estimates, respectively.
It is clear from (6) that, to evaluate the posterior p(κ  y), we need to compute the "effective" prior density ∫ p(κ  z)p(z) dz as well as the "effective" likelihood ∫ p(y  κ, σ^{2})p(σ^{2 } κ)dσ^{2}. To do so, we must specify the aforementioned densities p(σ^{2 } κ), p(z), p(κ  z), and p(y  κ, σ^{2}). We discuss this problem next.
Prior density of error variance
In general, it is difficult to derive an informative prior probability density function p(σ^{2 } κ) for the error variance. To deal with this problem, we assume here that the error variance is independent of the rate constants; i.e., we assume that p(σ^{2 } κ) = p(σ^{2}). Moreover, we assume that σ^{2 }follows an inverse gamma distribution, in which case
for two parameters α, b > 0.
The independence assumption between σ^{2 }and κ is reasonable, in view of the fact that the errors are mainly due to the experimental methodology used to obtain the measurements, whereas, the rate constants are due to biophysical principles underlying the biochemical reaction system. On the other hand, the choice given by (9) has been welljustified in Bayesian analysis. In fact, the inverse gamma distribution is the conjugate prior for the variance of additive Gaussian errors [13]. Conjugate priors are common in Bayesian analysis, since they often lead to attractive analytical and computational simplifications. Note that E[σ^{2}] = b/(α  1) and var[σ^{2}] = {E[σ^{2}]}^{2}/(α  2) = b^{2}/[(α  1)^{2}(α  2)], for α > 2. Therefore, the parameters α, b control the location and scale of the inverse gamma distribution given by (9). We illustrate this prior in Figure S1.3 of Additional file 1. In the following, we treat α and b as hyperparameters with known values. For a practical method to determine these values, the reader is referred to Additional file 1.
Prior density of logequilibrium constants
Before we consider the problem of specifying a prior density for the logequilibrium constants z, we first investigate how much information about z can be extracted from measurements.
It is a direct consequence of thermodynamic analysis that, at steadystate, the net flux of each reaction in a closed biochemical reaction system must be zero. This implies that
by virtue of (4), where
known as Wegscheider conditions [14,15], where r_{m }is the m^{th }element of the M × 1 vector r,
From (8) and (10), note that
By employing (5) and (12), we can show that
Using this result and some straightforward algebra (see Additional file 1), we can show that, given
where ℍ is an M × M matrix with elements
The previous result suggests that we may be able to use
An important observation here is that evaluation of
which we can always evaluate, since matrix
Prior density of lograte constants
To specify the (conditional) prior density p(κ  z) of the lograte constants of a biochemical reaction system, we will first derive
a prior probability density function p(κ_{2m1}) for the lograte constant of the m^{th }forward reaction. To do so, we use the wellknown Arrhenius formula of chemical kinetics
[21] and set k_{2m1 }= α_{m }exp{E_{m}/k_{B}T}, where α_{m }is the prefactor, E_{m }is the activation energy of the reaction, k_{B }is the Boltzmann constant (k_{B }= 1.3806504 × 10^{23}J/K), and T is the temperature. Unfortunately, we cannot predict the values of the prefactor and
activation energy precisely. To deal with this problem, we set
where
Basic thermodynamic arguments (see Additional file 1) imply that z_{m}, defined by (8), is a constant characteristic to the m^{th }reaction. Since κ_{2m }= κ_{2m1 } z_{m}, this implies that the rate constants κ_{2m }and κ_{2m1 }are two dependent random variables, given z_{m}, with joint probability density p(κ_{2m}, κ_{2m1 } z_{m}) = δ(κ_{2m } κ_{2m1 }+ z_{m})p(κ_{2m1}), where δ(·) is the Dirac delta function [see Equation (S1.37) in Additional file 1]. By assuming that the reaction rate constants of different reactions are mutually independent given the z's (which is reasonable if we assume that all common factors affecting these rates, such as temperature and pressure, are kept fixed), we obtain
Equations (16) and (17) provide an analytical form for the prior density of the lograte
constants. To use this expression, we must determine appropriate values for
Effective likelihood
Calculating the effective likelihood p(y  κ), given by (7), is straightforward. From (5), (7), and (9), we can show that
where
By setting ξ = φ(κ,y)/2σ^{2 }in (18), we obtain
Note that evaluating φ(κ,y) at given values of κ and y requires integration of the system of ordinary differential equations (2).
Posterior density
Our previous developments lead finally to an analytical formula for the posterior density p(κ  y) of the lograte constants. Indeed, (6), (15), (17), (19), and (20), lead to
with
where θ_{mm' }are the elements of matrix
Note that the posterior density of the lograte constants is a compromise between the prior and the likelihood. The prior terms ω(κ) and ψ(κ, y) penalize lograte values that do not fit well with available apriori information, whereas, the likelihood term φ(κ, y) penalizes lograte values that produce concentration dynamics which deviate appreciably from measurements. As the number N(P + 1)Q of available measurements increases, this compromise is controlled to a greater extent by the data through the factor φ(κ, y).
A problem arises with the posterior density p(κ  y), given by (21) and (22), since nonzero probabilities may be assigned to thermodynamically infeasible lograte constants. A Bayesian analyst might argue that we have correctly done our job by formulating the problem as we did and that it is the data which will rule out the possibility that our biochemical reaction system can be characterized by thermodynamically infeasible parameters. However, we choose to trust thermodynamics far more than we would trust noisy data and appropriately modify the posterior density based on our knowledge that the kinetic parameters must satisfy the Wegscheider conditions given by (11).
By using the Wegscheider conditions, we can decompose the 2M lograte constants κ into two mutually exclusive sets: M + M_{1 }"free" lograte constants κ_{f }and M − M_{1 }"dependent" lograte constants κ_{d}, where
where δ is the Dirac delta function. Clearly, this density assigns zero probability to kinetic
parameters that do not satisfy the Wegscheider conditions, since
Clearly, the values of the marginal posterior p_{W }(κ_{f } y) are proportional to the corresponding values of the original posterior density p(κ_{f }, κ_{d } y) over the thermodynamically feasible region of the parameter space, given by the
hyperplane
Computing the posterior mode
In a Bayesian setting, we use the location of the posterior density over the parameter space to provide an estimate of the unknown parameter values. Typically, two measures of location are employed, namely the mode and the mean of the posterior. The posterior mean minimizes the meansquare error between the estimated and true parameters, whereas, the posterior mode is more likely to produce dynamics that closely resemble the true dynamics (see Additional file 1 for why this is true). We note here that the main objective of parameter estimation in biochemical reaction systems is not necessarily to determine parameter values that are "close" to the true values (e.g., in the mean square sense) but to obtain appropriate values for the rate constants so that the resulting molecular concentration dynamics closely reproduce the dynamics observed in the true system [33]. As a consequence, we choose the posterior mode as our parameter estimator.
The posterior logdensity lnp_{W }(κ_{f } y) is usually not concave, especially when a limited amount of highly noisy data y is available. As a consequence, there is no optimization algorithm that can find the posterior mode in a finite number of steps. A method to address this problem would be to randomly sample the parameter space at a predefined (and usually large) number of points and use these points to initialize an optimization algorithm, such as the simultaneous perturbation stochastic approximation (SPSA) algorithm discussed in the Additional file 2. We can then calculate the parameters and the associated values of lnp_{W }(κ_{f } y) obtained by each initialization after a set number of optimization steps, and declare the parameters associated with the highest logposterior value as being the desired mode estimates.
Unfortunately, SPSA (and as a matter of fact any other appropriate optimization algorithm)
is computationally costly, especially in the case of large biochemical reaction systems.
Therefore, using SPSA in the previous multiseed strategy may result in a computationally
prohibitive approach for finding the posterior mode. In order to reduce computations,
we may choose only a small number of initial points that we believe are sufficiently
proximal to the posterior mode. Two such points might be the prior and posterior means.
As a matter of fact, as the data sample size tends to infinity, we expect that the
posterior mean will coincide with the posterior mode, since, under suitable regularity
conditions, the posterior density converges asymptotically to a Gaussian distribution
[12,34]. This simple idea leads to the sequential maximizationexpectationmaximization (MEM)
algorithm we discuss in the Additional file 2. According to this algorithm, we perform a relatively small number of SPSA iterations,
initialized by the prior mode, to obtain a posterior mode estimate
Estimation accuracy
One way to quantify the accuracy of the posterior mode estimate
A small value of ϵ_{RMSE }provides us with confidence that the estimated value of that constant is accurate. On the other hand, the estimate may be perceived as inaccurate if ϵ_{RMSE }is exceedingly large.
Another useful metric for evaluating estimation accuracy is
Note that the RMSE's ϵ_{RMSE }can be computed from the diagonal elements of
When the true values κ^{true }of the lograte constants are known (which is the case when we use simulated data to evaluate the performance of the proposed Bayesian analysis approach, as we do in this paper), we can provide a more direct evaluation of estimation performance. As we have mentioned previously, calculating a measure of "closeness" (such as the square error) between the estimated and true parameter values may not be quite appropriate here. Since, in reality, our objective is to estimate the rate constant values so that the biochemical reaction system produces dynamics that closely match the true molecular dynamics, it may be more appropriate to use, as measures of estimation performance, the following median and maximum absolute error criteria:
where
Results/Discussion
To illustrate key aspects of the previous Bayesian analysis methodology, we now consider a numerical example based on a subset of a wellestablished model of the EGF/ERK signal transduction pathway proposed by Schoeberl et al. [35]. This model corresponds to an open biochemical reaction system, since it contains irreversible reactions as well as reactions governed by MichaelisMenten kinetics that involve molecular species not included in the model. We extract a closed subset of the Schoeberl model by choosing the largest connected section that contains only reversible reactions governed by mass action kinetics. The resulting biochemical reaction system is depicted in Figure 1 and is comprised of N = 13 molecular species that interact through M = 9 reversible reactions. Of course, we could attempt to generate a closed biochemical reaction system for the entire EGF/ERK signaling pathway, by including all relevant molecular species not considered by the Schoeberl model (e.g., ADP, ATP, intermediate forms in catalyzed reactions, etc.). However, since we are only interested in demonstrating the potential and key properties of our Bayesian analysis methodology, we found this to be unnecessary. We feel that the biochemical reaction system depicted in Figure 1 leads to a sufficiently rich numerical example that serves the main purpose of this section well.
Figure 1. A subset of the EGF/ERK signal transduction pathway model proposed in [35]. The biochemical reaction system is comprised of N = 13 molecular species that interact through M = 9 reactions. Bayesian analysis is focused on estimating the values of the 18 rate constants associated with the reactions.
In specifying the model depicted in Figure 1, we must provide three sets of physically reasonable values: true rate constant values, initial concentrations, and experimentally feasible perturbations to the initial concentrations. Published values for the reaction rate constants associated with our example are given in Equation (S3.1) of Additional file 3. However, these values do not correspond to a thermodynamically feasible biochemical reaction system, since they do not satisfy the Wegscheider conditions, given by (11). We should point out here that this is a common problem in systems biology. Reaction rate values are usually amalgamated from various independent sources in the literature, so it is highly unlikely that these values will correspond to a thermodynamically feasible biochemical reaction system. As a consequence, it is desirable to develop a method that uses published values for the reaction rate constants and calculates an appropriate set of thermodynamically feasible values that can be considered as the "true" parameter values. In Additional file 3, we calculate "true" values for the lograte constants by using a linear least squares approach to project the published values onto the thermodynamically feasible hyperplane. The resulting "true" values are given in Equation (S3.3) of Additional file 3.
Regarding the initial concentrations, we use the values specified in [35,36], with two minor modifications. First, molecular species with zero initial concentrations are modified to have a small number of molecules present. We do this to accommodate the fact that, in a real cellular system, these molecular species are constitutively expressed. The second modification comes from the fact that we are no longer modeling the entire EGF/ERK signaling cascade and, therefore, we must account for the upstream EGF stimulus. To take this into account, we increase the initial concentration of the most upstream molecular species in our model, namely (EGFEGFR*)2GAP. The initial concentrations used are given by Equation (S3.4) in Additional file 3.
To specify appropriate perturbations to the initial molecular concentrations, note that molecular complexes, such as dimers, trimers, etc., are far more difficult to perturb than simple monomeric molecular species. For this reason, we focus our perturbation efforts on Shc*, Grb2, and Sos. Since Shc* is commercially available in a purified and quantified form, we will assume that we can increase its initial concentration by a factor of 100 using molecular injection. We will also assume that we can perturb Grb2 and Sos by RNAi, resulting in a decrease in their initial concentrations by a factor of 100. Thus, we set π_{1 }= 99c_{1}, π_{2 }= .99c_{2}, and π_{4 }= .99c_{4}.
To avoid specifying different hyperparameter values for the prior densities of the forward lograte constants, we assume here that all densities share the same known values {κ^{0}, τ, λ}, where κ^{0 }= 5.1010, τ = 1.8990, and λ = 0.7409, whereas, we set α = 3 and b = 1 for the hyperparameters of the prior density of the variance σ^{2 }of the measurement errors. These choices correspond to the prior densities depicted in Figure S1.2(a) and Figure S1.3(a) in Additional file 1. We implement our Bayesian analysis approach using the MEM algorithm described in Additional file 2, with I = 5,000 SPSA iterations in each maximization step and a total of L = 50,000 MCMC iterations in the expectation step. Finally, we observe the biochemical reaction system within a time period of 1 min.
In Figure 2, we depict a typical result obtained by the proposed Bayesian analysis algorithm. In this figure, we compare the estimated lograte values (blue) with the thermodynamically consistent true lograte values (red) as well as the corresponding concentration dynamics of selected molecular species in the unperturbed biochemical reaction system. We have obtained these results by measuring the concentration dynamics in the unperturbed and perturbed systems at Q = 6 logarithmicallyspaced time points (green circles), with the measurements being corrupted by independent and identically distributed (i.i.d.) zeromean Gaussian noise with standard deviation σ = 0.3. Moreover, we summarize the estimated posterior RMSE values, given by (25), in Table 1. Finally, the calculated median and maximum absolute error values, given by (26), are 3.03 × 10^{2 }and 1.68 × 10^{1}, respectively.
Figure 2. True (red) vs. estimated (blue) lograte values and selected molecular dynamics in the unperturbed biochemical reaction system depicted in Figure 1. The results are based on measuring the dynamics in the unperturbed and perturbed systems at Q = 6 logarithmicallyspaced time points (green circles). Perturbations are applied on the initial concentrations of Shc*, Grb2, and Sos, one at a time. The measurement errors are i.i.d. zeromean Gaussian with standard deviation σ = 0.3.
Table 1. Estimated posterior RMSE values for the case of i.i.d. zeromean Gaussian errors with standard deviation σ = 0.3. Logarithmic sampling is used with Q = 6.
The concentration dynamics produced by the estimated rate constant values match well the dynamics produced by the true values. As a matter of fact, the calculated median and maximum absolute error values imply that half of the relative integrated absolute error values between the estimated and true concentration dynamics (across all molecular species and all applied perturbations) are smaller than 3.03%, whereas, the remaining values are between 3.03% and 16.8%. On the other hand, the estimated posterior RMSE values summarized in Table 1 indicate a high probability that, given the concentration measurements, the lograte values will lie within a relatively small region around the corresponding posterior mode values.
We expect that, in general, by selecting appropriate perturbations and by increasing the number of concentration data collected during an experiment, we can improve estimation accuracy. However, how can one know if the right perturbations have been applied on the biochemical reaction system and if enough data has been collected in a practical situation? Inspection of RMSE values can provide an answer to these important questions. If the estimated RMSE values of the lograte constants of many reactions are large, it may be worth collecting additional data by increasing P and Q. Additional data can improve estimation accuracy by shrinking the RMSE values to a size that indicates an acceptable degree of uncertainty. However, if the biochemical reaction system is insensitive to a given kinetic parameter, then the RMSE associated with that reaction may remain large even as the quality of data improves. Therefore, additional data should only be collected when the RMSE values are large and sensitivity analysis indicates that the values of the rate constants associated with these RMSE values appreciably affect the system dynamics.
The RMSE values do not provide a global measure of estimation accuracy, since some parameters may have small RMSE values and some may have large values. To address this problem, we may instead employ the Doptimal criterion as a measure of estimation accuracy. As a matter of fact, we can effectively use the Doptimal criterion as a guide for selecting appropriate perturbations and for determining the data sampling scheme we must use in order to increase estimation accuracy. In Table 2, for example, we summarize estimated values of D, for the case of uniform and logarithmic sampling, calculated for different values of Q. Clearly, the sampling scheme used may appreciably affect estimation performance. For each value of Q, uniform sampling results in higher values of D than logarithmic sampling. As a consequence, we must use logarithmic sampling over uniform sampling, since the former may produce better estimation accuracy than the latter. This is expected, since uniform sampling may result in measuring steadystate concentrations much more often than (shortlived) transient concentrations. On the other hand, logarithmic sampling may be used to gather valuable information about the transient behavior of a biochemical reaction system while placing less emphasis on its steadystate dynamics (which only provide information about the equilibrium constants of the underlying reactions). The results depicted in Table 2 also suggest an appropriate value for Q. If our goal is to find the smallest value Q* of Q (an objective dictated by the high cost of experimentally measuring molecular concentrations) which results in a value of D that is no less than, say 5%, of the value obtained when Q = Q*  1, then we must set Q* = 6.
Table 2. Estimated values of the Doptimal criterion for uniform and logarithmic sampling schemes.
In Table 3, we summarize the estimated values of D obtained from seven different perturbation experiments (logarithmic sampling is used
with Q = 6). Moreover, we report the D values obtained by repeating an experiment that does not use molecular perturbations.
Experimental replication may be an effective approach to obtain additional data, especially
when molecular perturbations are costly or difficult to apply. Our formulation allows
us to consider this scenario by setting π_{p }= 0, for every p ∈
Table 3. Estimated values of the Doptimal criterion for different replications and perturbations.
One of the underlying assumptions associated with the proposed Bayesian analysis algorithm
is that the measurement errors are statistically independent, following a zeromean
Gaussian distribution with standard deviation σ. To assess the adequacy of this assumption and evaluate its implication on estimation
performance, we depict in Table 4 calculated median and maximum absolute error values obtained when the measurement
errors
Table 4. Median and maximum absolute error values under a variety of measurement error conditions.
Figure 3. True (red) vs. estimated (blue) lograte values and selected molecular dynamics in the unperturbed biochemical reaction system depicted in Figure 1. The results are based on measuring the dynamics in the unperturbed and perturbed systems at Q = 6 logarithmicallyspaced time points (green circles). Perturbations are applied on the initial concentrations of Shc*, Grb2, and Sos, one at a time. The measurement errors are correlated zeromean Gaussian with standard deviation σ = 0.3.
Conclusions
In this paper, we have introduced a novel Bayesian analysis technique for estimating the kinetic parameters (rate constants) of a closed biochemical reaction system from time measurements of noisy concentration dynamics. The proposed procedure enjoys a clear advantage over other published estimation techniques: the estimated kinetic parameters satisfy the Wegscheider conditions imposed by the fundamental laws of thermodynamics. As a consequence, it always leads to physically plausible biochemical reaction systems.
From a statistical perspective, there are additional advantages for thermodynamically restricting the kinetic parameters of a biochemical reaction system to satisfy the Wegscheider conditions. This may be seen through the wellknown biasvariance tradeoff in estimation [27]. The mean squared error of a given estimator can be decomposed into a bias term and a variance term. In general, imposing constraints on the estimator may increase its bias but decrease its variance (hence the tradeoff). However, if the true parameter values satisfy the constraints, then the variance may decrease without increasing the bias term [27]. Since the true values of the kinetic parameters must lie on the thermodynamically feasible manifold in the parameter space, confining the Bayesian estimator to this manifold (which is of lower dimension than the parameter space itself) may lead to lower mean squared error due to a smaller variance. Since the thermodynamically feasible manifold is of lower dimension than the parameter space, gains in variance (and hence improvements in the mean squared error) are expected to be large. This may be seen through the "curse of dimensionality," which refers to the exponential increase in the volume of the parameter space as its dimension grows, making estimation exponentially harder in higher dimensional spaces (in our example, the unconstrained parameter space has 12.5% more dimensions than the thermodynamically feasible subspace). The Wegscheider conditions reduce the dimensionality of the parameter space to a feasible region in which estimation may be easier. Thus, the proposed Bayesian analysis procedure improves on other estimation techniques by producing a statistically superior, physically meaningful and plausible estimate for the kinetic parameters of a closed biochemical reaction system.
The Bayesian analysis methodology discussed in this paper has been formulated by assuming that all initial concentrations and perturbations are precisely known and that concentration measurements can be obtained by directly sampling all system dynamics. However, current experimental practices in quantitative systems biology restrict the amount and type of data that can be collected from living cells. As a consequence, further research is needed to develop approaches that can accommodate this important issue and make a Bayesian analysis approach to parameter estimation better applicable to systems biology problems.
If the initial concentrations and the perturbations applied on these concentrations are not known, then we may try to estimate them together with the unknown kinetic parameters. Although formulation of this problem is similar to the one considered in this paper, the additional computational burden will be substantial. Moreover, while quantitative biochemical techniques are improving, the vast majority of data available in problems of systems biology are obtained by measuring ratios of molecular concentrations (e.g., by using techniques such as SILAC [37]). Estimation of the rate constants of a biochemical reaction system from concentration measurements available as ratios relative to a reference system requires special consideration and extensive modification of the proposed Bayesian analysis procedure. Finally, it is very important to address the problem of missing observations. This is a common problem in systems biology, since it is not possible to monitor and measure the concentrations of all molecular species present in the system. Although appropriate modifications to the proposed algorithm can lead to a Bayesian analysis approach that can handle missing data, we think that development of a practically effective way to address this problem is challenging. Our future plan is to expand and improve the Bayesian analysis procedure discussed in this paper in order to provide practical solutions to the previous problems.
It is worth noting here that the estimation procedure suggested in this paper applies only to closed biochemical reaction systems (or to approximations of closed systems embedded in a larger open system). However, a cell is an open system, since it effectively interacts with its environment. If we include the cell's environment into our system and monitor the combined system until steadystate (i.e., until cell death), then we would have the necessary closed system. Unfortunately, this is clearly an unrealistic scenario. As a consequence, there is also a need to develop a theoretical and computational approach for dealing with thermodynamically consistent parameter estimation in open biochemical reaction systems.
To conclude, it has been argued in a recent paper [33] that most models of computational systems biology are "sloppy," in the sense that many parameters of such models do not appreciably alter system behavior. A key conclusion of this paper is that collective fitting procedures (such as the Bayesian analysis technique presented in the present paper) are far more desirable than piecewise construction of a biochemical reaction system model from individual parameter estimates (which is how most models are constructed when investigators scour the literature for individual rate constant values). Moreover, it has been pointed out in [33] that using a method to obtain precise parameter values may be difficult, even with an unlimited amount of data, since the behavior of a sloppy model is insensitive to the values of most parameters. As a consequence, the authors suggest that, instead of focusing on the quality of parameter estimation, it will be more wise to focus on the quality of prediction achieved by an estimated model (as we have also argued in this paper).
To a certain extent, our Bayesian analysis approach addresses some of the issues raised in [33]. By imposing the Wegscheider conditions on the kinetic parameters of a biochemical reaction system, we can effectively constrain these parameters to a thermodynamically feasible manifold in the parameter space, thus reducing sloppiness. Moreover, we can effectively use the RMSE values and the Doptimal criterion to determine an appropriate experimental design and distinguish those estimated values that can be trusted from those that cannot. For example, if the RMSE value associated with a kinetic parameter is small, then we may trust these values. On the other hand, a large RMSE value may indicate high uncertainty in the estimated parameter values, which may be untrustworthy. As we mentioned before, if a sensitivity analysis approach, such as the one proposed in [38], indicates that the kinetic parameters associated with large RMSE values are influential parameters, then we must reduce these RMSE values to an acceptable level of uncertainty by adopting a new and more effective experimental design approach. On the other hand, if these parameters correspond to a noninfluential reaction, then we can accept the estimated values with no further consideration, since high uncertainty in the exact values of these parameters will not affect the predicted concentration dynamics.
Authors' contributions
JG developed the basic Bayesian analysis framework, derived most theoretical results, coded a substantial portion of the software, and wrote the final version of the paper. GJ derived a number of theoretical results, generated much of the material in Additional files 2 & 3, coded much of the final version of the software, and interpreted the obtained computational results. XZ and JG advised on several theoretical aspects of the paper, algorithm design, and data analysis, and wrote early versions of the software. GJ and JG advised on various theoretical aspects of the paper, on algorithm design, and data analysis. All authors read and approved the final version of the paper.
Acknowledgements
GJ has been supported by The National Defense Science and Engineering Graduate (NDSEG) Fellowship through the Department of Defense. JG acknowledges the National Science Foundation (NSF) for support of this research. The authors thank the anonymous reviewers for their insightful comments, which resulted in a great improvement of the paper. Special thanks to Ali Afshar for identifying several problems with the original version of this paper and for suggesting appropriate fixes.
References

Crampin EJ, Schnell S, McSharry PE: Mathematical and computational techniques to deduce complex biochemical reaction mechanisms.
Prog Biophys Mol Bio 2004, 86:77112. Publisher Full Text

Feng XJ, Rabitz H: Optimal identification of biochemical reaction networks.
Biophys J 2004, 86:12701281. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Maria G: A review of algorithms and trends in kinetic model identification for chemical and biochemical systems. [http://wwwold.pbf.hr/cabeq/pdf/18_3_2004/CABEQ_18_3_1.pdf] webcite

Papin JA, Hunter T, Palsson BO, Subramaniam S: Reconstruction of cellular signalling networks and analysis of their properties.
Nat Rev Mol Cell Bio 2005, 6:99111. Publisher Full Text

Ronen M, Rosenberg R, Shraiman BI, Alon U: Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics.
P Natl Acad Sci USA 2002, 99(16):1055510560. Publisher Full Text

RodriguezFernandez M, Mendes P, Banga JR: A hybrid approach for efficient and robust parameter estimation in biochemical pathways.
BioSystems 2006, 83:248265. PubMed Abstract  Publisher Full Text

Liebermeister W, Klipp E: Bringing metabolic networks to life: integration of kinetic, metabolic, and proteomic data.
Theor Biol Med Model 2006, 3:42. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Quach M, Brunel N, d'Alché Buc F: Estimating parameters and hidden variables in nonlinear statespace models based on ODEs for biological networks inference.
Bioinformatics 2007, 23(23):32093216. PubMed Abstract  Publisher Full Text

BalsaCanto E, Peifer M, Banga JR, Timmer J, Fleck C: Hybrid optimization method with general switching strategy for parameter estimation.
BMC Syst Biol 2008, 2:26. PubMed Abstract  PubMed Central Full Text

Klinke DJ: An empirical Bayesian approach for modelbased inference of cellular signaling networks.
BMC Bioinformatics 2009, 10:371. PubMed Abstract  PubMed Central Full Text

Mazur J, Ritter D, Reinelt G, Kaderali L: Reconstructing nonlinear dynamic models of gene regulation using stochastic sampling.
BMC Bioinformatics 2009, 10:448. PubMed Abstract  PubMed Central Full Text

Berger JO: Statistical Decision Theory and Bayesian Analysis. 2nd edition. New York: SpringerVerlag; 1985.

Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Boca Raton, Florida: Chapman and Hall/CRC; 2004.

Heinrich R, Schuster S: The Regulation of Cellular Systems. New York: Chapman & Hall; 1996.

Vlad MO, Ross J: Thermodynamically based constraints for rate coefficients of large biochemical networks.
WIREs Syst Biol Med 2009, 1:348358. Publisher Full Text

Alberty RA: Princple of detailed balance in kinetics.
J Chem Educ 2004, 81(8):12061209. Publisher Full Text

Liebermeister W, Klipp E: Bringing metabolic networks to life: convenience rate law and thermodynamic constraints.
Theor Biol Med Model 2006, 3:41. PubMed Abstract  PubMed Central Full Text

Yang J, Bruno WJ, Hlavacek WS, Pearson JE: On imposing detailed balance in complex reaction mechanisms.
Biophys J 2006, 91:11361141. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ederer M, Gilles ED: Thermodynamically feasible kinetic models of reaction networks.
Biophys J 2007, 92:18461857. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ederer M, Gilles ED: Thermodynamic constraints in kinetic modeling: Thermodynamickinetic modeling in comparison to other approaches.
Eng Life Sci 2008, 8(5):467476. Publisher Full Text

Berry RS, Rice SA, Ross J: Physical Chemistry. 2nd edition. New York: Oxford University Press; 2000.

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network.
Science 2001, 292:929934. PubMed Abstract  Publisher Full Text

Tegnér J, Yeung MKS, Hasty J, Collins JJ: Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling.
Proc Nat Acad Sci USA 2003, 100:59445949. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zak DE, Gonye GE, Schwaber JS, Doyle FJ: Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: Insights from an identifiability analysis of an in silico network.
Genome Res 2003, 13:23962405. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu JS: Monte Carlo Strategies in Scientific Computing. New York: SpringerVerlag; 2001.

Moles CG, Mendes P, Banga JR: Parameter estimation in biochemical pathways: A comparison of global optimization methods.
Genome Res 2007, 13:24672474. Publisher Full Text

Spall JC: Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. New York: WileyInterscience; 2003.

Šašik R, Calvo E, Corbeil J: Statistical analysis of highdensity oligonucleotide arrays: a multiplicative noise model.
Bioinf 2002, 18:16331640. Publisher Full Text

Anderle M, Roy S, Lin H, Becker C, Joho K: Quantifying reproducibility for differential proteomics: noise analysis for protein liquid chromatographymass spectrometry of human serum.
Bioinf 2004, 20:35753582. Publisher Full Text

Listgarten J, Emili A: Statistical and computational methods for comparative proteomic profiling using liquid chromatographytandem mass spectrometry.
Mol Cell Proteomics 2005, 4:419434. PubMed Abstract  Publisher Full Text

Molina H, Parmigiani G, Pandey A: Assessing reproducibility of a protein dynamics study using in vivo labeling and liquid chromatography tandem mass spectrometry.
Anal Chem 2005, 77:27392744. PubMed Abstract  Publisher Full Text

Klugkist I, Hoijtink H: The Bayes factor for inequality and about equality constrained models.
Comput Stat Data An 2007, 51:63676379. Publisher Full Text

Gutenkunst RN, Waterfall JJ, Casey FP, Brown KS, Myers CR, Sethna JP: Universally sloppy parameter sensitivities in systems biology models.
PLOS Comp Biol 2007, 3(10):18711878. Publisher Full Text

Walker AM: On the asymptotic behaviour of posterior distributions.

Schoeberl B, EichlerJonsson C, Gilles ED, Müller G: Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors.
Nat Biotechnol 2002, 20:370375. PubMed Abstract  Publisher Full Text

Li C, Donizelli M, Rodriguez N, Dharuri H, Endler L, Chelliah V, Li L, He E, Henry A, Stefan MI, Snoep JL, Hucka M, Novère NL, Laibe C: BioModels database: An enhanced, curated and annotated resource for published quantitative kinetic models.
BMC Syst Biol 2010, 4:92. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ong SE, Mann M: A practical recipe for stable isotope labeling by amino acids in cell culture (SILAC).
Nat Protoc 2006, 1:26502660. PubMed Abstract  Publisher Full Text

Zhang HX, Dempsey WP, Goutsias J: Probabilistic sensitivity analysis of biochemical reaction systems.
J Chem Phys 2009, 131(094101):120. PubMed Abstract  Publisher Full Text