Abstract
Background
The importance of stochasticity in cellular processes having low number of molecules has resulted in the development of stochastic models such as chemical master equation. As in other modelling frameworks, the accompanying rate constants are important for the endapplications like analyzing system properties (e.g. robustness) or predicting the effects of genetic perturbations. Prior knowledge of kinetic constants is usually limited and the model identification routine typically includes parameter estimation from experimental data. Although the subject of parameter estimation is wellestablished for deterministic models, it is not yet routine for the chemical master equation. In addition, recent advances in measurement technology have made the quantification of genetic substrates possible to single molecular levels. Thus, the purpose of this work is to develop practical and effective methods for estimating kinetic model parameters in the chemical master equation and other stochastic models from single cell and cell population experimental data.
Results
Three parameter estimation methods are proposed based on the maximum likelihood and density function distance, including probability and cumulative density functions. Since stochastic models such as chemical master equations are typically solved using a Monte Carlo approach in which only a finite number of Monte Carlo realizations are computationally practical, specific considerations are given to account for the effect of finite sampling in the histogram binning of the state density functions. Applications to three practical case studies showed that while maximum likelihood method can effectively handle low replicate measurements, the density function distance methods, particularly the cumulative density function distance estimation, are more robust in estimating the parameters with consistently higher accuracy, even for systems showing multimodality.
Conclusions
The parameter estimation methodologies described in this work have provided an effective and practical approach in the estimation of kinetic parameters of stochastic systems from either sparse or dense cell population data. Nevertheless, similar to kinetic parameter estimation in other modelling frameworks, not all parameters can be estimated accurately, which is a common problem arising from the lack of complete parameter identifiability from the available data.
Background
Mathematical models form a cornerstone of systems biology and these models are usually constructed from available biological knowledge and data, which once validated, are subsequently analyzed to address specific biological questions. Many canonical modelling frameworks, from statistical Bayesian networks to differential equations, have been applied to capture a widevariety of biological behaviours. Specifically, the dynamics related to cellular processes that involve low copy number of molecules, such as mRNA transcription, are best described as random and noisy events [1]. For example, cells in an isogenetic population do not necessarily assume the same biological state, but rather exhibit variegated genetic expressions [2,3]. In these examples, the distribution of cells is simulated by stochastic models that describe the probability density function (PDF) of cellular states. However, unlike differential equation models, the identification of stochastic models from experimental data of single cell or cell population data are not yet routine.
Despite the availability of highthroughput cell biology, the estimation of unknown (kinetic) model parameters from experimental data is still considered as the bottleneck in biological model identification, especially for dynamical models [4,5]. The difficulty is generally attributed to the informativeness of the data, or the lack thereof, a property that is proportional to not only the quantity, but also the quality of data. Furthermore, in dynamical models, the time resolution of data is naturally of great importance. In recent years, advances in bioimaging allow for real time measurements of cellular components such as mRNAs and proteins in individual cells through the use of fluorescent proteins [2,3,68]. Such measurements provide more indepth and informative data about the states of a cell and variability in a cell population, than the traditional lumped measurements from cell culture lysate or tissue homogenate. The purpose of this work is to develop practical methods that can efficiently use these data in the parameter estimation framework for stochastic biochemical systems.
Chemical master equation (CME) is the most commonly adopted modelling framework to describe stochastic cellular dynamics [13] and thus is used as a benchmark application in this work. The estimation of unknown kinetic parameters from data in CME and other stochastic models has not been adequately addressed in the literature. Many of the published CME models use rate constants that are scaled from deterministic parameter values or selected adhoc to replicate desired behaviour. Since the lowcopynumber random events can generate dynamics that are characteristically different from those in thermodynamic or deterministic limit [9,10], deterministic model parameters identified from data collected under this limit or averaged over cell populations can be misleading. Furthermore, fitting deterministic models (e.g. ordinary differential equation) to stochastic data has been shown to give poor parameter estimates and model prediction [11]. Among the existing parameter estimation methods for stochastic biological models, some rely on Bayesian inference based on the stochastic differential equation [12,13], while others are based on maximum likelihood (ML) methods. One ML method obtains parameter estimates by fitting transition density functions of stochastic differential equations in biochemical pathways [11]. A similar approach based on the ML of transitional probabilities requires measurements of the state trajectories at very fast sampling rate, whereby reactions are assumed to occur at most twice in a sampling time interval [14]. The fast sampling requirement makes this approach impractical, since biological data are typically sparse.
In this work, three kinetic parameter estimation methods for stochastic models were developed based on two criteria: maximum likelihood (ML) and density function distance (DFD). Two scenarios of practical application were considered involving both sparsely and densely populated datasets (i.e. low and high replicates). Since the distribution density functions are commonly constructed using histograms, an important aspect related to the binning strategy and the noise associated with finite sampling, has been incorporated in the parameter estimation framework. The efficacy of each method was evaluated and compared based on applications to three CME case studies: RNA dynamics in Escherichia coli, gene expression network of galactose uptake model in Saccharomyces cerevisiae, and a bimodal system comprising of a genetic toggle switch in E. coli. Despite the use of CME models here, the methods are generally applicable to other stochastic models in which the system behaviour or output can be characterized by a PDF of the states.
Methods
Chemical Master Equation
Consider a well mixed volume Ω containing N species participating in M biochemical reactions. The CME of this system is given by [15]:
where the state x is an Ndimensional vector indicating the number of molecules of each species in the volume Ω, the density function P(x, tx_{0}, t_{0}) denotes the probability that the system assumes the state configuration x_{j }at time t, given the initial condition x_{0 }at time t_{0}, the vector ν_{j }gives the stoichiometric change in the molecular count of each species due to a single jth reaction event, and k is the kinetic parameter vector. The function a_{j}(x, k) is known as the propensity function, where a_{j}(x, k)dt gives the probability of the jth reaction to occur in the time interval t and t+dt given the state x and parameters k. Due to the curse of dimensionality with increasing number of reacting species, the analytical solution of a CME is usually difficult, if not practically impossible, to obtain even for moderately sized systems [16].
In this work, Stochastic Simulation Algorithm (SSA) [16] was used to generate in silico experimental data for the purpose of parameter estimation and to solve for the PDF of the CME model. Briefly, at any given time and state configuration, the algorithm takes two uniform random numbers, from which the time to the next reaction and the reaction index are determined as a function of the propensities [16,17]. The histogram should reflect the true state PDF in the limit of the number of realizations tending to infinity. Since only a finite number of data samples are computationally feasible and experimentally practical, the error associated with histogram binning strategy is important, but this is not often discussed in existing literature of the CME. The shape of the resulting density function is known to be sensitive to the number and size of the bins, and the optimal binning distribution need not be of uniform sizes [18]. Characteristic features of a distribution such as bimodality may not be apparent when using bins that are too wide, while histograms can be significantly affected by random fluctuations associated with a small number of data points in bins that are too narrow. Although there is no hard and fast rule on the selection of bin sizes, the minimum number of realizations in each bin should typically range between 5 and 20 [19]. Unless stated otherwise, the histograms here are constructed such that each bin contains no fewer than ten occurrences. The noise due to the histogram construction using finite size random sample will be taken into account in the parameter estimation below.
In practice, the choice of numerical solvers for model equations determines the performance of any parameter estimation methods. For CME, there has been a tremendous development of numerical algorithms for computing the PDF solution, directly [2022] or indirectly [15,16,23]. The SSA was selected in this work because this algorithm is equivalent to the CME [16,17], motivating its use to generate in silico data. Consequently, the CME model was also solved using SSA, such that the efficacy of the proposed methods can be evaluated independently from the solvers. In this case, deficiencies of SSA will appear equally in both in silico data and the model solution.
Parameter Estimation Methods
The methods developed here are formulated as a minimization of distance measures between model predictions and experimental data. The first method makes use of the common likelihood function and the second involves a distance metric between density functions as predicted by the CME and the data. When experimental error is known or can be determined from data, this noise should be accounted for in the PDF solution. In this work, the error is assumed to be independent and identically distributed (i.i.d.) random samples from a normal distribution with zero mean and variance σ^{2 }(N(0,σ^{2})), which are then added to the SSA realizations.
Maximum Likelihood (ML) Method
The first estimation criterion is the likelihood function given by
where the jth experimental replicate are taken at time points {t_{1}, t_{2}, ... t_{n}} for j = 1, 2, ..., m (i.e. the experiments are done in m replicates). The likelihood function is given by the CME model, which in this case is evaluated from the density function histogram of SSA realizations. The parameter estimation is then formulated as maximization of the likelihood function given by
where P(o, t_{i}x_{0}, t_{0}) is the state PDF reconstructed from SSA simulations, with added Gaussian i.i.d. noise ε ∈ N(0,σ^{2}) when appropriate, i.e. the state trajectory is simulated as o = x + ε rounded to the nearest integer. For brevity, from hereon P(o, t_{i}x_{0}, t_{0}) will be denoted by P(o, t_{i}). Specific details of the accounting of experimental errors can be found in the description of the case studies in the results section. To avoid numerical underflows, the loglikelihood formulation of the objective function (3) is used in this work, giving
Density Function Distance (DFD) Method
The next two estimation methods are based on the minimization of state density function distance, similar to a divergence measure between two distribution functions, such as the KullbackLeibler distance [24]. In particular, two estimation criteria are considered using the probability density function and cumulative density function (CDF). In the PDF distance method, the objective of the parameter estimation is to minimize the difference between the PDF of the experimental data and SSA simulations, as follows
where P_{e}(o_{l}, t_{i}) denotes the experimental PDF constructed using a histogram with L bins and o_{l }is arbitrarily taken to be the centre of each bin. Unless stated otherwise, the binning strategy is referenced to the experimental data and the same binning distribution is used for the SSA simulations. The last bin represents an extra degree of freedom due to normalization of the sum (integral) of the PDF to 1, and thus not included in the optimization procedure. The weighting factor is an estimate of the variance of the lth bin probability at time t_{i }arising due to finite random sampling. The process of classifying N elements from either the experimental data or SSA realizations into bins of a histogram can be assumed as a binomial process and thereby the variance of the bin frequency is computed according to
As a reliable construction of a PDF typically requires a large number of replicates, the PDF distance may not be appropriate when only few replicates of data are available. On the other hand, the ML method above can be applied to datasets with low replicates, as it does not require the construction of a density function from the experimental data.
The second criterion considers the minimization of the differences between the CDF constructed using the experimental data and the SSA realizations, given by
where the CDF F_{e}(o_{l}, t_{i}) gives the probability to obtain an experimental observation o < o_{l}, and F_{e}(o_{l}, t_{i}) and F(o_{l}, t_{i}) denote the CDF constructed from the cumulative sums of the PDF, and , respectively. Similar to the PDF criteria, the weighting factor is estimated using a binomial assumption to give
The binning distribution can be kept the same as the PDF, but this need not be necessarily so. Unlike PDF, the shape of CDF is less sensitive to noise from finite sampling, with the exception of the tail ends of the CDF near the minimum and maximum values of the states. An alternate formulation with a finer binning strategy gives a similar performance to the objective function above (data not shown). The lesser sensitivity to noise also makes the CDF distance method applicable to sparse datasets (low replicates), in which case the binning strategy is done based on the SSA realizations.
Global Optimization Algorithm
Aside from model solvers, the effectiveness of any parameter estimation methods also depends on the ability to find the global optima to the minimization problems. In the case of stochastic models, the error landscape is anticipated to be highly stochastic due to noise from finite experimental data points, which prevents the use of any optimization algorithms involving gradient search. Here, a variant of evolutionary algorithms, called Differential Evolution (DE), is used as a general purpose global optimization algorithm. This method can effectively handle diversified objective function planes [25], and like other evolutionary algorithms such as genetic algorithm (GA), DE starts with a random population member and looks for the global optima by generating new population members using successive recombination and mutations based on the original parent population. However, unlike GA, DE uses floating point instead of bit string encoding, and arithmetic operations instead of logical rules, thereby providing a greater flexibility in the parameter search. Among the settings of DE, the population size and total number of generations are tuned in the case studies below based on the dimensionality of the problem (i.e. number of parameters) and the choice of parameter estimation method, respectively. The remaining parameters are maintained at previously suggested values [25]. The convergence and termination of the optimization can be based on the improvement of the best objective function in the population, standard deviation of the population vector, or maximum difference between the best and worst population member. A combination of several of these criteria can provide an efficient and robust termination criterion [26]. Since the case studies considered in this work involve in silico data with known true parameters, a maximum iteration number is used as a termination criteria and the efficacy of each method is judged based on the accuracy of the respective estimates.
The SSA and DE algorithms were implemented using Message Passing Interface (MPI) in C++ and run on a Linux IBM computing cluster (CentOS; GNU C++ compiler (v4.1.1)). A combination of a long period random number generator [27] and multiple independent streams generator [28] were used to guarantee statistically independent streams of random numbers required for both the SSA and DE.
Results
Case Study 1: RNA dynamics in E. coli
The significance of intracellular noise arises from the low copy number of genetic materials and gene transcriptional machinery. Thus, the quantification of mRNA would experience a greater influence of such noise than that of proteins, which may have thousands of copies. A high resolution fluorescence microscopy method has been developed to quantify the molecular count of mRNAs in individual Escherichia coli cells [6]. This method is based on the amplification of MS2dfused fluorescence protein signal by binding to a reporter RNA that has multiple MS2d receptor sites (Figure 1A). The transcriptional response was shown to rise and plateau after 7080 minutes post induction [6]. The molecular counts of the transcripts were obtained by normalizing the fluorescence flux with that generated by a single tagged RNA molecule. A massaction kinetic model of the average mRNA level was used to fit the experimental data to obtain the kinetic parameter values [6].
Figure 1. mRNA Dynamics Model in Escherichia coli. (A) The mRNA detection system comprises two genetic elements; a fluorescence protein fused with bacteriophage protein (MS2d) and a reporter mRNA containing tandem repeats of MS2binding sites. The GFP binding site repeats facilitate imaging and quantification of cellular mRNA to single molecular level. (B) The transcriptional model constitutes 3 reactions with 3 rate constants. DNA_{S }represents the silent form, while DNA_{A }represents the activated form
The first case study uses the CME model corresponding to the reactions and kinetic parameters proposed in the original work, as shown in Figure 1B and detailed in supplementary data [Additional File 1: Supplementary Table S1] [6]. Considering this model to be the true system, four experimental datasets of mRNA copy numbers with different replicates (m = 10, 20, 100, and 10,000) were simulated using the SSA. The simulated data were contaminated with measurement errors arising due to the normalization of the fluorescence flux, were taken to be discrete rounded values of normal random samples N(0,0.25), consistent with the actual wetlab experiments [6]. The mRNA transcripts per cell generation were recorded every 0.5 minutes until 75 minutes, mimicking the original experimental protocol.
Additional file 1. Supplementary tables of the manuscript file. Six supplementary tables are included in this document; Table S1 describes the SSA formulation of the E. coli RNA dynamics model of the case study 1. Table S2 details the SSA formulation of the reduced yeast enhanced GFP galactose utilization pathway of the case study 2. Table S3 provides the SSA formulation of the complete gene expression model of the yEGFP galactose utilization pathway. Tables S5 and S6 give the parameter estimation results for the reduced and complete yEGFP gene expression models, respectively. The parameter estimation in these cases was done using the DFD methods involving the maximum distance measures (equation 10 and 11 in the main text). Table S6 lists the parameter estimation results of the Schlögl model.
Format: PDF Size: 31KB Download file
This file can be viewed with: Adobe Acrobat Reader
The parameter search was constrained to a space bounded by k ∈ [0,5]. The density functions predicted by the CME were constructed using 10,000 SSA realizations with added i.i.d and N(0,0.25) noise. In the case of low replicate datasets (m = 10, 20, and 100), only the DFDCDF method was applied, in which the CDF of the experimental data was constructed according to: [19]
where l now denotes the index of the state in replicate vector after arranging the data in ascending order (i.e., o_{1 }≤ o_{2 }≤ ...≤ o_{m}). This construction implicitly uses the differences between sorted data values as the bin sizes. As stated earlier, since the DFDPDF method requires the histograms of experimental data, which in the case of low replicate datasets, are highly inaccurate, this method was only performed for cell population data (m = 10,000). The DE optimization was implemented with a population size of 30 (10 × the number of parameters) for 4,000 generations and the optimization routine took about 1.5 hours for completion.
Table 1 presents the parameter values estimated using the ML and DFD methods for all datasets. In general, the parameter estimates were closer to the true values with increasing number of replicates, as expected from the increase of information with higher replicates. The DFD(CDF) method generally performed better than the ML. Amongst the parameters, k_{1 }is the most accurately determined parameter by all methods. At higher replicates, the DFDCDF method converged to the true solution faster than the PDF and ML methods, in this order, which could be attributed to the difference in the shape of the objective function surface. As seen in Figure 2A and 2C, the DFDCDF criterion produced a higher surface curvature (second derivatives) than those of ML and DFDPDF (Figure 2B, D and 2E). Using a larger population size and higher number of iterations (100 population members and 20,000 generations), the ML method was able to match the accuracy of the CDF estimates (see Table 1, m = 10).
Table 1. Parameter estimation of RNA dynamics model in E. coli.
Figure 2. Normalized objective function contours of the ML and DFD methods in the E. coli RNA dynamics model. The parameter values k_{2 }and k_{3 }were varied between 0.1 and 1 while keeping the value of k_{1 }at its original value. The normalization was done with respect to the optimal solution from each parameter estimation method, where the white circles represent the extrema on the normalized objective function plane. (AB) Normalized objective function contours of the DFDCDF and ML methods using sparse datasets (m = 10), respectively. (CE) Normalized objective functions of the DFDCDF, PDF and ML methods using population datasets (m = 10,000).
Case Study 2: Galactose uptake model in S. cerevisiae
The inherent stochastic nature of gene expression can lead to diversified responses in a (clonal) cell population, even when subjected to uniform external conditions. This diversity has been demonstrated in a cell population using fluorescence techniques such as flow cytometery (FACS). The second case study used in this work looks at the problem of estimating CME parameters from a cell population data. The model describes an artificial genetic construct with the green fluorescence protein (GFP) gene downstream of a galactose activated promoter UAS_{G }and a TetR repressor binding element 2xtetO_{2 }(Figure 3A). In the presence of galactose, the GFP expression can be modulated rheostatically by varying the level of inducer ATc [29]. The original publication utilized a clonal population of S. cerevisiae (yeast) to investigate the inherent cellular noise in the GFP gene expression, which is measured as the heterogeneity of fluorescence among the cells.
Figure 3. Gene Expression Model for the Preferential Galactose Uptake in Yeast Cells. (A) Genetic construct of the transcriptional control of the yeastenhanced green florescent protein expression in the galactose utilization pathway of yeast. (B) The complete gene expression pathway includes (fast) reversible transformations among different promoter configurations and subsequent irreversible RNA and protein synthesis pathways. The reduced model assumes pseudoequilibrium among the promoter configurations, and thus only describes dynamics of processes in the dashed boxes.
The CME model adapted from this work captures the random transitions among all possible promoter states as shown in Figure 3B. The states PC_{1}, PC_{2 }and PC_{3 }represent free/silent, intermediate complex, and preinitiation complex promoter configurations, respectively, while the states RC_{1 }and RC_{2 }describe different forms of repressed promoter configurations. The transcriptional (RNA synthesis) and translational (protein synthesis) processes are modelled as singlestep irreversible reactions (Figure 3B).
In the simplified model, the different promoter configurations are assumed to be in equilibrium, which reduces the model to a set of 8 irreversible reactions, 4 states, and 8 kinetic parameters, as shown in Figure 3B (dashed boxes) [29]. As in the first case study, this model was considered to be the true system and the molecular data of yEGFP and TetR were generated using SSA, giving 10^{4 }realizations at every 5 dimensionless time units up to 50 (or about 18 times the half life of yEGFP [30]). This condition corresponds to 440 minutes of post induction by 2% galactose and 40 ng ml^{1 }ATc. To study the scalability of the proposed methods, the parameter estimation of the full network with 18 reactions, 9 states, and 15 kinetic parameters was also done using a second in silico dataset with 10^{4 }SSA realizations from the complete model. The details on the CME formulation for both the reduced and the complete model of the yEGFP gene expression pathway have been included in the supplementary data [Additional File 1: Supplementary Table S2 and S3].
Both ML and DFD methods were first applied to the reduced model, in which the DE optimization was done with 80 population members for 4000 generations, which took about 50 hours for convergence. The bounds on the parameter search space are given in Table 2. As mentioned above, the binning strategy in the DFD methods was based on the simulated experimental data, while the likelihood function in the ML method was constructed based on the histogram of SSA simulations. Table 2 presents the parameter estimates from the ML and the two DFD methods along with the true parameter values. As in the first example, the DFDCDF method gave the most accurate estimates, followed by the DFDPDF and ML methods, respectively. The parameter estimates from DFDCDF gave yEGFP PDF that is in agreement with wetlab data [Additional File 2]. As illustrated in Figure 2C, D & 2E, the differences in the performance of these methods again arises from the steepness of the objective function plane. However, the lesser performing methods can potentially match the accuracy of the CDF method if population size and number of iterations in the DE optimization are increased.
Table 2. Parameter estimation of reduced yEGFP model in S. cerevisiae
Additional file 2. Supplementary figure of the manuscript file. Comparison of actual experimental data and CME model prediction using SSA simulations with the parameters estimated in case study 2.
Format: PDF Size: 158KB Download file
This file can be viewed with: Adobe Acrobat Reader
The scalability of the methods discussed in this work was evaluated by performing the estimation of the complete model. In this case, the DE optimization was performed using 150 population members for 4000 generations and took approximately 60 hours for convergence. In this case also, the CDF method again generally outperformed the PDF and ML (Table 3). But some of the parameters, especially those involving fast reversible processes, cannot be accurately identified from data. The lack of complete parameter identifiability is perhaps not surprising, when one considers that measurements of only few states are available and that the time scale of these measurements better reflects the slow kinetics of the irreversible processes.
Table 3. Parameter estimation of full yEGFP model in S. cerevisiae.
Two other estimation criteria based on the maximum density function distance, in the form of
and
for PDF and CDF, respectively, have also been evaluated, showing similar performances and observations. The outcome of the application of these criteria to the estimation of parameters in the reduced and complete yEGFP gene expression pathway is described in supplementary data [Additional File 1: Supplementary Table S4 and S5].
Case Study 3: Stochastic model of a synthetic toggle switch
Multistability is often seen in biological networks, such as in λphage decision circuit [31], MAPK cascade [32], and cell cycle regulation [33]. In particular, bistability is a common motif encountered in cellular signalling pathways [34]. Motivated by this, a genetic toggle switch had previously been engineered in E. coli to show the ability to synthesize such motif. The synthetic switch consisted of two repressorpromoter pairs, with (i) P_{L}s1conlacI repressing Ptrc2 promoter and (ii) vice versa Ptrc2cIts (thermal sensitive) repressing P_{L}s1con promoter [8], such that they are mutually inhibitory (see Figure 4A). The switching behavior was visualized by means of green fluorescence protein (GFP), inserted downstream of cIts. The ON switch was accomplished by an inducer, isopropyl βDthiogalactosepyronoside (IPTG), that represses the activity of lacI (Figure 4A). By modulating the concentrations of the IPTG, the genetic toggle system could exhibit bistability with hysteresis [8].
A simple deterministic model was proposed to examine the behaviour of the toggle switch and to analyze different conditions of bistability [8]. The corresponding CME formulation is described in the Figure 4B and 4C[35]. Here, the propensity functions are taken directly from the deterministic model and they give effective rates of reaction following a canonical Hill equation. Taking this model to be the true system, in silico data of GFP fluorescence at IPTG concentration of 6 × 10^{5 }M were simulated using 10^{4 }independent SSA realizations, emulating flow cytometry data.
As the ML performed consistently poorer than the DFD methods in the previous case studies, the stochastic rate constants here (α_{1}, α_{2}, β, γ, η, K) were estimated using the DFDCDF and PDF methods, with DE parameters: 150 population members and 4000 generations. Both CDF and PDF criteria took about 48 hours for completion. The parameter bounds and estimates are given in Table 4. Comparing to the true values, this case study, like the previous two, again showed that the DFDCDF method performed better than DFDPDF with more accurate and robust estimates of the kinetic rate constants. Performance of different estimation methods on another bistable system (Schlögl model) is presented in supplementary data [Additional File 1: Supplementary Table S6][Additional File 3].
Table 4. Parameter estimation of synthetic toggle switch in E. coli.
Additional file 3. Supplementary text of the manuscript file. Details of the SSA formulation and the parameter estimation method used in the Schlögl case study.
Format: PDF Size: 21KB Download file
This file can be viewed with: Adobe Acrobat Reader
Discussion
In this work, three practical methods are proposed for the estimation of the parameters from (noisy) single cell datasets with low and high replicates. As the methods rely on a histogram construction of density functions from a finite sample of experimental data and Monte Carlo simulations, the objective function evaluation has a tradeoff between low accuracy when using bins that are too wide, and high sensitivity to noise when bins are too small. In order to balance this tradeoff, the binning was done such that the narrowest bin has at least ten occurrences. The noise associated with this binning strategy is also taken into account in the objective function in the DFD methods, which is modelled according to a binomial distribution.
The proposed methods are developed while considering a few practical issues when dealing with real biological datasets, such as data sparsity (low replicates), data noise and relatively coarse sampling intervals. The methods developed here do not require fast timesampling like in [14], which might pose a restrictive constraint in practice. When population data are available, the DFD methods can fully exploit the additional information and rigorously handle the noise associated with the finite sample construction of a density function through the weighting factors. Although the examples considered in this work are represented by the CME, the methodologies developed in this work are generally applicable to parameter estimation of other stochastic models (e.g. Langevin), as long as the distribution density function can be constructed. Furthermore, the different methods developed in this work can be used to robustly estimate the rate constants of large scale gene expression networks as well as systems with multistability and general nonlinear propensity equations.
The case studies above showed that methods based on matching density function shapes between model and data generally performed better than maximizing likelihood function. Furthermore, the DFDCDF distance is more sensitive to parameters than both the DFDPDF and ML, and thus is the most effective method developed in this work. The higher sensitivity of the CDF with respect to parameter variations is expected as a result of the cumulative sum of the PDF sensitivity. This is evident from comparing the normalized objective function surfaces as shown in Figure. 2, in which the CDF objective functions have the steepest curvature. The increased curvature leads to a faster convergence to the minima in the DE optimization of the CDF than the PDF, though both methods eventually converge to optimal parameter estimates with similar accuracy. In addition, the CDF is generally less sensitive to noise from finite sampling as can be seen from the noise weighting factor S_{l,i }when normalized with the respective probability, i.e. the coefficient of variation (CoV) . The monotonically decreasing CoV as a function F_{e}(o_{l}, t_{i}) of indicates that the CDF construction becomes less affected by finite sampling noise with increasing F_{e}(o_{l}, t_{i}).
Similar to the parameter estimation in deterministic models, parameter identifiability is a key issue in the estimation of the CME parameters. Such problem is commonly encountered in the parameter estimation of deterministic ODE models [36]. Following the same arguments from the deterministic estimation, the identifiability problem is caused by the limited information contained in the data about the parameters governing the fast transformations among the different promoter configurations. Such problem can be alleviated by getting additional measurements with a faster sampling rate and if possible, measuring the variables that are directly affected by the parameters, e.g. the fractions of promoters in each configuration of the second case study. An analogue of deterministic parameter identifiability analysis can be performed using the parametric sensitivity of the density function and experiments can be designed to maximize the degree of information in the data [35,37,38].
Most of the computational cost of the parameter estimation related to CME is due to the large number of SSA realizations needed to construct the solution of the CME. Furthermore, every generation of DE requires multiple computations of the objective function according to the population size setting and each of population members in turn requires the SSA solution as mentioned previously. One way to alleviate the computational burden would be to lower the SSA realizations in constructing the density function. This would however increase the binning noise, and could possibly reduce the speed of convergence to the optimal solution and the accuracy of parameter estimates (see Figure 5AC). Nevertheless, there is a diminishing return with increasing number of SSA realizations, since noise variance generally scales with the inverse of the number of samples (i.e. the standard deviation is only halved for every 4 times increase in the number of data). Alternatively, efficient approximation methods for simulating the CME can be used in place of the exact SSA [20,23,3942], again at the cost of reduced estimation accuracy. In addition, the optimization parameters, namely population size and generations, can be further tuned for the proposed methods. Unfortunately, the relationship between these two parameters is most likely nonlinear and problem specific, which may require trial and error methods to find the best setting for a particular problem.
Figure 5. Effect of the finite sampling noise on the parameter estimation of E. coli RNA dynamics model. Normalized objective function contours of the DFDPDF method for SSA realizations of 10,000 (A), 5000 (B), and 1000 (C). The parameter values k_{2 }and k_{3 }were varied between 0.1 and 1 while keeping the value of k_{1 }at its original value. The normalization was done with respect to the optimal solution from each case, where the white circles represent the extrema on the normalized objective function plane.
Conclusions
The inherent stochasticity associated with low copy number processes in the cellular genetic milieu can introduce significant noise in gene expression profiles. The modelling of such noisy system requires a careful consideration of random processes and the parameters governing the probability of random events [1]. Three parameter estimation methods for stochastic models have been proposed based on the maximum likelihood criterion and density function distances of PDF and CDF. Since state density functions of stochastic systems are often constructed from a finite number of experimental data points or Monte Carlo realizations, a careful consideration has been taken to characterize the influence of noise arising from the histogram binning. Specifically, the effects of histogram noise are directly incorporated into the parameter estimation objective function as weighting functions. Applications to two case studies have shown that the proposed methods are both effective and practical. Amongst the proposed methods, the CDFDFD method has been found to be the most efficient in estimating the kinetic rate constant than the others (i.e., the ML and DFDPDF methods) due to the higher sensitivity of CDF to the parameters.
Authors' contributions
SKP and RG conceived the project, SKP carried out all the simulations, performed the analyses and drafted the manuscript; RG provided project oversight and analyses, edited the manuscript. Both the authors read and approved the final manuscript.
Acknowledgements
This work was supported by National University of Singapore Faculty Research Council grant [R279000219112/133].
References

McAdams HH, Arkin A: It's a noisy business! Genetic regulation at the nanomolar scale.
Trends Genet 1999, 15:6569. PubMed Abstract  Publisher Full Text

Elowitz MB, Leibler S: A synthetic oscillatory network of transcriptional regulators.
Nature 2000, 403:335338. PubMed Abstract  Publisher Full Text

ColmanLerner A, Gordon A, Serra E, Chin T, Resnekov O, Endy D, Pesce CG, Brent R: Regulated celltocell variation in a cellfate decision system.
Nature 2005, 437:699706. PubMed Abstract  Publisher Full Text

Yang E, van Nimwegen E, Zavolan M, Rajewsky N, Schroeder M, Magnasco M, Darnell JE Jr: Decay rates of human mRNAs: correlation with functional characteristics and sequence attributes.
Genome Res 2003, 13:18631872. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chou IC, Voit EO: Recent developments in parameter estimation and structure identification of biochemical and genomic systems.
Math Biosci 2009, 219:5783. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Golding I, Paulsson J, Zawilski SM, Cox EC: Realtime kinetics of gene activity in individual bacteria.
Cell 2005, 123:10251036. PubMed Abstract  Publisher Full Text

Yu J, Xiao J, Ren X, Lao K, Xie XS: Probing gene expression in live cells, one protein molecule at a time.
Science 2006, 311:16001603. PubMed Abstract  Publisher Full Text

Gardner TS, Cantor CR, Collins JJ: Construction of a genetic toggle switch in Escherichia coli.
Nature 2000, 403:339342. PubMed Abstract  Publisher Full Text

Fange D, Elf J: Noiseinduced Min phenotypes in E. coli.
PLoS Comput Biol 2006, 2:e80. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Samoilov M, Plyasunov S, Arkin AP: Stochastic amplification and signaling in enzymatic futile cycles through noiseinduced bistability with oscillations.
Proc Natl Acad Sci USA 2005, 102:23102315. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tian T, Xu S, Gao J, Burrage K: Simulated maximum likelihood method for estimating kinetic rates in gene expression.
Bioinformatics 2007, 23:8491. PubMed Abstract  Publisher Full Text

Golightly A, Wilkinson DJ: Bayesian sequential inference for stochastic kinetic biochemical network models.
J Comput Biol 2006, 13:838851. PubMed Abstract  Publisher Full Text

Golightly A, Wilkinson DJ: Bayesian inference for a discretely observed stochastic kinetic model.

Reinker S, Altman RM, Timmer J: Parameter estimation in stochastic biochemical reactions.
Syst Biol (Stevenage) 2006, 153:168178. PubMed Abstract

Gillespie DT: Markov Processes: An Introduction for Physical Scientists. San Diego: Academic Press; 1991.

Gillespie DT: Exact Stochastic Simulation of Coupled Chemical Reactions.
J Phys Chem 1977, 81:23402361. Publisher Full Text

Gillespie DT: A rigorous derivation of the chemical master equation.
Physica A 1992, 188:404425. Publisher Full Text

Scott DW: Multivariate Density Estimation: Theory, Practice, and Visualization (Wiley Series in Probability and Statistics). Wiley; 1992.

Montgomery DC, Runger GC: Applied Statistics and Probability for Engineers. New York: Wiley; 2006.

Macnamara S, Bersani AM, Burrage K, Sidje RB: Stochastic chemical kinetics and the total quasisteadystate assumption: application to the stochastic simulation algorithm and chemical master equation.
J Chem Phys 2008, 129:095105. PubMed Abstract  Publisher Full Text

Macnamara S, Burrage K, Sidje RB: Multiscale modeling of chemical kinetics via the master equation.

Munsky B, Khammash M: The finite state projection algorithm for the solution of the chemical master equation.
J Chem Phys 2006, 124:044104. PubMed Abstract  Publisher Full Text

Gibson MA, Bruck J: Efficient Exact Stochastic Simulation of Chemical Systems with Many Species and Many Channels.
J Phys Chem A 2000, 104:18761889. Publisher Full Text

Kullback S, Leibler S: On Information and Sufficiency.
Ann Math Stat 1951, 22:7986. Publisher Full Text

Storn R, Price K: Differential Evolution  A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces.
J Global Optim 1997, 4:341359. Publisher Full Text

Zielinski K, Peters D, Laur R: Stopping Criteria for SingleObjective Optimization.
Proceedings of the Third International Conference on Computational Intelligence, Robotics and Autonomous Systems; Singapore 2005.

Matsumoto M, Nishimura T: Mersenne twister: a 623dimensionally equidistributed uniform pseudorandom number generator.
ACM Trans Model Comput Simul 1998, 8:330. Publisher Full Text

LeCuyer P, Simard R, Chen EJ, Kelton WD: An ObjectOriented RandomNumber Package with many long Streams and Substreams.
Oper Res 2002, 50:1073. Publisher Full Text

Blake WJ, M KA, Cantor CR, Collins JJ: Noise in eukaryotic gene expression.
Nature 2003, 422:633637. PubMed Abstract  Publisher Full Text

Chen MT, Weiss R: Artificial cellcell communication in yeast Saccharomyces cerevisiae using signaling elements from Arabidopsis thaliana.
Nat Biotechnol 2005, 23:15511555. PubMed Abstract  Publisher Full Text

Arkin A, Ross J, McAdams HH: Stochastic kinetic analysis of developmental pathway bifurcation in phage lambdainfected Escherichia coli cells.
Genetics 1998, 149:16331648. PubMed Abstract  PubMed Central Full Text

Ozbudak EM, Thattai M, Lim HN, Shraiman BI, Van Oudenaarden A: Multistability in the lactose utilization network of Escherichia coli.
Nature 2004, 427:737740. PubMed Abstract  Publisher Full Text

Pomerening JR, Sontag ED, Ferrell JE Jr: Building a cell cycle oscillator: hysteresis and bistability in the activation of Cdc2.
Nat Cell Biol 2003, 5:346351. PubMed Abstract  Publisher Full Text

Bhalla US, Iyengar R: Emergent properties of networks of biological signaling pathways.
Science 1999, 283:381387. PubMed Abstract  Publisher Full Text

Gunawan R, Cao Y, Petzold L, Doyle FJ: Sensitivity analysis of discrete stochastic systems.
Biophys J 2005, 88:25302540. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nikerel IE, van Winden WA, Verheijen PJ, Heijnen JJ: Model reduction and a priori kinetic parameter identifiability analysis using metabolome time series for metabolic reaction networks with linlog kinetics.
Metab Eng 2009, 11:2030. PubMed Abstract  Publisher Full Text

Gadkar KG, Gunawan R, Doyle FJ: Iterative approach to model identification of biological networks.
BMC Bioinformatics 2005, 6:155. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Plyasunov S, Arkin AP: Efficient stochastic sensitivity analysis of discrete event systems.
J Comp Phys 2006, 221:724738. Publisher Full Text

Cao Y, Gillespie DT, Petzold LR: Efficient step size selection for the tauleaping simulation method.
J Chem Phys 2006, 124:044109. PubMed Abstract  Publisher Full Text

Chatterjee A, Vlachos DG, Katsoulakis MA: Binomial distribution based tauleap accelerated stochastic simulation.
J Chem Phys 2005, 122:024112. PubMed Abstract  Publisher Full Text

Haseltine EL, Rawlings JB: Approximate simulation of coupled fast and slow reactions for stochastic chemical kinetics.
J Chem Phys 2002, 117:69596969. Publisher Full Text

Tian T, Burrage K: Binomial leap methods for simulating stochastic chemical kinetics.
J Chem Phys 2004, 121:1035610364. PubMed Abstract  Publisher Full Text