Ottawa Hospital Research Institute, Ottawa, Ontario, Canada

Upper Austrian University of Applied Sciences, Hagenberg, Austria

Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada

Abstract

Background

A key problem in systems biology is estimating dynamical models of gene regulatory networks. Traditionally, this has been done using regression or other ad-hoc methods when the model is linear. More detailed, realistic modeling studies usually employ nonlinear dynamical models, which lead to computationally difficult parameter estimation problems. Functional data analysis methods, however, offer a means to simplify fitting by transforming the problem from one of matching modeled and observed dynamics to one of matching modeled and observed time derivatives, which is a regression problem, albeit a nonlinear one.

Results

We formulate a functional data analysis approach for estimating the parameters of nonlinear dynamical models and evaluate this approach on data from two real systems, the gap gene system of

Conclusions

Functional data analysis is a powerful approach for estimating detailed nonlinear models of gene expression dynamics, allowing efficient and accurate estimation of regulatory architecture.

Background

A key problem in systems biology is estimating dynamical models of gene regulatory networks. The mathematical modeling of expression dynamics, combined with model parameter estimation, has been crucial to unraveling complex regulatory programs

Methods for estimating dynamical models depend on the form of the model and of the data available. We focus on the problem of estimating differential equation models of gene network dynamics based on time series data. Assuming one notion of expression is associated with each gene (for example, mRNA or protein expression level, but not both), a generic ordinary differential equation (ODE) model for

where **x** is the vector of expression levels of the **f** produces a vector of time derivatives of expression depending on the current expression levels and on some adjustable parameters **f** (e.g., **f** that depend on time or to delay differential equations, where the derivatives depend on the state of the system in the past. We will also assume that the expression data is collected from the wild type network, though initial conditions may vary. Knock-out or over-expression data has also proven useful in genetic network inference, both in theory

To introduce the dynamics estimation approach we investigate, suppose for simplicity that we have access to a single time series **y**(_{0}),**y**(_{1}),…**y**(_{T}**f**(**x**,

where **x**(_{i}**x**(_{0}) = **y**(_{0})**x**(^{At}**x**(_{0}), so that the dependence of the error on the parameters (**f** is nonlinear, as is typically the case when trying to make more detailed models of network dynamics, then solving the minimization (Eq. 2) is all the more difficult.

There is another major approach to fitting ODE models, however, via functional data analysis (FDA) **y**(_{i}**ŷ**(

There are several approaches to using FDA ideas in estimating differential equations **f**, the most direct approach is to create the smooth of the expression series **ŷ**(

This error criterion is different from Equation 2. We will call the error of Equation 2 the trajectory-based error, and that of Equation 3 the derivative-based error. The FDA approach thus changes the problem being solved, rather than being an alternative method for solving the traditional formulation of ODE fitting. The derivative-based error has several major computational advantages that allow it to be optimized much more efficiently. First, evaluating the derivative-based error for any particular parameter set **f** along the estimated trajectory **ŷ**(**ŷ**(_{g}

Here, _{g}**f** pertaining to gene **f**, informally, the optimization tends to be "less nonlinear" than for the trajectory-based error. In part, this is because the error involves only the evaluation of the dynamics function rather than solutions to the dynamics equation. Typically, **f** is not taken to be anything more complicated than a generalized linear model

Despite the potential advantages of the FDA approach, we believe it has not been seriously evaluated on the problem of estimating nonlinear models of gene expression dynamics. In particular, neither its efficiency nor its ability to correctly estimate regulatory network architecture has been evaluated. Here, we formulate and test FDA approaches on data from two different real networks, the gap gene system of

Results and discussion

Systems and data

We apply FDA methods for fitting differential equation models to data from two real gene networks and simulated data from a set of

The trunk gap gene system of Drosophila melanogaster

The trunk gap gene system in

Networks and data used in our computational experiments. (A) A consensus model of regulatory interactions in the gap gene system of

Reinitz and colleagues have made detailed measurements of the protein expression of these seven genes during development of the embryo

A synthetic gene network in yeast

Cantone

Test problems generated by GeneNetWeaver

GeneNetWeaver

Unconstrained model-fitting by FDA

We smoothed and transformed the time series into continuous functions of time using the cubic spline functions built into the Matlab programming language (see Methods for details). For each of the data sets, this results in a set of functions **ŷ**^{i}^{th}

We modeled the gene expression dynamics by differential equations of the form

where _{g}_{gg}_{g}_{g} is the decay rate.

For each gene

where _{1} penalty to the error function

where _{0} is the original error function of Equation 6 and _{1} penalty is often used in an attempt to eliminate excess parameters in regression problems. If one is only concerned about prediction accuracy, and if one has statistically independent data points, then cross-validation can be used to choose a value of c that appropriately trades off model complexity and model accuracy on the training data. In our case, the data come from time series, so derivative estimates at different times are certainly not statistically independent. Nor is our primary concern the accuracy of the regression model. This is only a conduit to determining regulatory architecture. Thus, we experimented with a range of c values, as described in more detail below. Regulatory weights that remain nonzero for large values of

For the IRMA and GeneNetWeaver data sets, we fit models without autoregulatory links, as these systems do not include autoregulation. For the gap gene system, however, where autoregulation is believed to occur, we allowed autoregulatory links in the model. Three of the

Results on the Drosophila data

Figure _{0} error. Only three true links are missed: repression of Kr and Gt by Hb, the latter of which is a comparatively weak effect

Results of unregularized and _{1}-regularized fitting without constraint on regulatory architecture. (A) The estimated regulatory architecture for the gap gene network. Dashed black links are false positives that are not in the gold standard model. Dashed red are missing links that are in the gold standard. (B) Statistics regarding the accuracy of regulatory architecture, as estimated by the simulated annealing (SA) approach of Jaeger _{1}-regularization on total correct links (Corr), true positives (TP; interpreted as links shared by the gold standard and the model, regardless of sign) and true negatives (TN; interpreted as links absent in both the gold standard and the model). (D-F) The same information for the fits to the IRMA data. In panel E, TSNI refers to the best-performing approach as tested by Cantone _{1}-regularized performance of the FDA approach on the sparse networks S1 (solid lines) and S2 (dashed lines).

The gap gene network is densely connected, with 22 of the 28 possible links present in the gold standard model. Adding regularization to the optimization criterion risks eliminating true positives. Nevertheless, we tried optimizing the _{1}-regularized error function _{1} for regularization constant c ranging from 0 to 10 in increments of 0.1. The results are summarized in Figure

Results on the IRMA data

Figure _{0} criterion for the IRMA data. The IRMA network is sparse compared to the gap gene network, having only seven links among the five genes. The optimization correctly identifies six of those links, including their correct sign. It misses only the activation of GAL4 by CBF1, perhaps because the model also has the true regulators of CBF1 connected to GAL4, a case of confusing direct and indirect regulation. Without regularization, however, there are many false positive links in the estimated regulatory architecture. Figure _{0} optimization against the TSNI algorithm, which fits a linear differential equation model that is limited to at most two inputs per gene. This algorithm performed the best of several alternatives tested by Cantone

Because the unregularized fit includes a large number of false positives, we hoped that adding the _{1}-regularization would improve the accuracy of the estimated network architecture. Figure

Results on the GeneNetWeaver data

Broadly speaking, our results on the two sparse GeneNetWeaver networks mimicked our results on IRMA, and our results on the two dense GeneNetWeaver networks mimicked our results on the gap gene network. Figure

For the two sparse networks, where false positives were a concern, we evaluated the _{1}-regularization approach to improving accuracy. The results are shown in Figure

Explicit enumeration of possible network structures

As mentioned above, the FDA approach to model fitting is computationally efficient. Part of its speed is due simply to the greater ease of evaluating the derivative-based error (Eq. 3) as opposed to the more traditional trajectory-based error (Eq. 2). We tested this in Matlab, comparing our implementation of the derivative-based error against a trajectory-based error function that uses the built-in ode45 function to solve the dynamics equation. Over a range of testing conditions, we found that the derivative-based error could be computed 300 ± 40 times faster than the trajectory-based error.
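The difference between the two error criteria can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the one-gene dynamics function `f` (constant production minus linear decay) is a toy stand-in and not the model fitted in this paper, a fixed-step fourth-order Runge-Kutta integrator replaces Matlab's ode45, and the "smoothed" series and its derivative are generated analytically rather than by spline smoothing of measured data.

```python
import math


def f(x, theta):
    """Toy one-gene dynamics (an assumption): constant production minus linear decay."""
    production, decay = theta
    return production - decay * x


def derivative_error(y_hat, dy_hat, theta):
    """Sum of squared gaps between estimated derivatives and model derivatives."""
    return sum((dy - f(y, theta)) ** 2 for y, dy in zip(y_hat, dy_hat))


def rk4_trajectory(x0, times, theta, substeps=20):
    """Integrate dx/dt = f(x, theta) with fixed-step RK4; return x at each time."""
    xs = [x0]
    x = x0
    for t_prev, t_next in zip(times, times[1:]):
        h = (t_next - t_prev) / substeps
        for _ in range(substeps):
            k1 = f(x, theta)
            k2 = f(x + 0.5 * h * k1, theta)
            k3 = f(x + 0.5 * h * k2, theta)
            k4 = f(x + h * k3, theta)
            x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        xs.append(x)
    return xs


def trajectory_error(y, times, theta):
    """Sum of squared gaps between the series and a trajectory simulated from y[0]."""
    xs = rk4_trajectory(y[0], times, theta)
    return sum((x - obs) ** 2 for x, obs in zip(xs, y))


# Synthetic stand-in for a smoothed expression series and its time derivative,
# generated from the toy model with production 2.0 and decay 0.5.
true_theta = (2.0, 0.5)
times = [0.5 * i for i in range(11)]
y_hat = [4.0 * (1.0 - math.exp(-0.5 * t)) for t in times]
dy_hat = [2.0 * math.exp(-0.5 * t) for t in times]

for theta in (true_theta, (1.0, 0.5)):
    print(theta,
          derivative_error(y_hat, dy_hat, theta),
          trajectory_error(y_hat, times, theta))
```

In this sketch, the derivative-based error needs one evaluation of `f` per time point, whereas the trajectory-based error must integrate the equation across the whole time span for every candidate parameter set. The sketch makes no attempt to reproduce the timing ratio reported above.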

One of the advantages of the speed with which the FDA fits can be done is that we do not need to limit ourselves to unconstrained network architectures. We can explicitly test alternative architectures and, in fact, we are able to enumerate them all if the number of genes in the network is not too large. For the gap gene network, where all seven of the measured genes can act as input to any of the gap genes, there are 2^{7} = 128 possible input combinations for any gene. Because each gene's model is fit independently, we can test all possible regulatory architectures with a total of 4 × 2^{7} = 512 fits. This begins to be a significant computation, but on a 32-core computing cluster, it amounted to an overnight job. By enumerating all possible inputs for every gene, we are able to explicitly assess which regulators or combinations of regulators are most important for explaining each gene's observed expression. Enumeration also gives us another way to regularize the fit, by limiting the number of inputs per gene.
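The enumeration described above can be sketched in a few lines of Python. The gene labels and the `fit_error` callback are hypothetical placeholders: the text states only that seven measured genes can act as inputs and that 4 × 2^{7} = 512 fits are required, so the sketch reproduces those counts and shows one possible way to keep the best input set of each size.

```python
from itertools import combinations

# Placeholder labels (assumptions): seven candidate regulators, four target genes.
REGULATORS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]
TARGETS = ["T1", "T2", "T3", "T4"]


def input_sets(regulators):
    """Yield every subset of the candidate regulators, smallest subsets first."""
    for size in range(len(regulators) + 1):
        for subset in combinations(regulators, size):
            yield subset


def best_per_size(regulators, fit_error):
    """Return {size: (error, subset)} with the lowest-error subset of each size.

    fit_error is a caller-supplied function (hypothetical here) that fits one
    gene's model using the given inputs and returns the resulting error.
    """
    best = {}
    for subset in input_sets(regulators):
        err = fit_error(subset)
        size = len(subset)
        if size not in best or err < best[size][0]:
            best[size] = (err, subset)
    return best


subsets = list(input_sets(REGULATORS))
total_fits = len(TARGETS) * len(subsets)
print(len(subsets), total_fits)  # 128 512
```

Because each gene's model is fit independently, `best_per_size` would be called once per target gene, and the calls could be distributed across cores. Restricting attention to small sizes in the returned dictionary corresponds to limiting the number of inputs per gene.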

We performed enumerations for all six networks. The results are summarized in Figure

Results of enumerating all possible regulatory architectures. (A) For Hb in the gap gene network, the _{0} error of each possible input set is plotted on the y-axis, with the size of the input set on the x-axis. (B) A similar plot for SWI5 in the IRMA network, and (C) for gene G1 in the sparse network S1 generated by the GeneNetWeaver software. (D-I) Statistics on regulatory network accuracy using the best input combination of each size (colors dark blue through dark red indicate zero inputs through all possible inputs). For definitions of CF, PPV, Sens and CSF, see text or Figure

Figure

Figure

A visual depiction of the scores of different input combinations for the Hb gene in the gap gene network, omitting autoregulation. The graph structure depicts the partial ordering of all possible input combinations, with the no-inputs case at the bottom and all possible inputs at the top. The colors within the circles indicate the genes participating in the combination, as laid out in the key at the upper right. The size of each circle is inversely related to the error obtained by using that combination, so that small circles indicate high error and large circles indicate the smallest error.

Conclusions

Our computational studies show that functional data analysis is a powerful approach to estimating nonlinear models of gene expression dynamics, and in particular, to estimating the regulatory relationships between genes. The accuracy of FDA was comparable to state-of-the-art approaches on both the gap gene

As with any estimation problem, avoiding overfitting is an important consideration. We explored L_{1}-regularization as well as explicitly limiting the number of regulators allowed for each gene.

L_{1}-regularization was partly successful on the sparse IRMA network, and much more successful on the sparse GeneNetWeaver networks. L_{1}-regularization requires a constant,

Explicitly evaluating all possible combinations of regulators allows one to see which combinations are the best predictors. In particular, this allows one to identify the best 1-input model of each gene, the best 2-input model, and so on. It thus provides another means for determining which candidate regulators are most important. At the same time, it reveals whether there are alternative solutions of nearly equal quality, and generally gives a more in-depth view of the contributions of different regulators, especially when used in conjunction with visualization methods, as shown in the Results section.

The approach that we have described for using FDA to estimate nonlinear differential equation models of gene expression dynamics can be extended in various ways. One important extension would be to accommodate genetic perturbation data, such as knock-outs, knock-downs or overexpression conditions. In the case of a complete knock-out, this is readily handled by hard-wiring expression of the knocked-out gene to zero in the model and otherwise fitting the data as usual. However, for partial knockdowns or overexpressions of unknown or time-varying magnitudes, more sophisticated procedures are needed. Another relevant extension would be to allow for delays in the differential equations. Cantone

Methods

Data smoothing

To obtain the temporal derivatives of the time series data, it is necessary to obtain a functional representation of the data. We constructed continuous-time series by interpolating the data with cubic splines, as implemented in the Matlab Spline Toolbox. This toolbox also includes a function to compute the derivatives from the spline. Cubic splines are not wholly defined by the data, but also depend on assumptions at or near the boundaries (in our case, the start and the end of the time series). The default approach taken by Matlab's spline function is to use the "not-a-knot" assumption, which states that the third derivative of the spline function should be continuous at the second knot point and the next-to-last knot point [37]. Matlab offers other approaches for completing cubic splines. In pilot studies, we tried the default (not-a-knot) approach, natural cubic splines (which have second derivatives equal to zero at the endpoints; Matlab calls this the "variational" approach), and Matlab's "complete" approach (which sets first derivatives at the endpoints based on an estimate from the function values at the nearest four knots). We found that these different methods for completing the cubic splines had only small effects on the interpolated curves and negligible effects on parameter estimates for our models. So, throughout this paper we used the default not-a-knot approach.
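As an illustration of how a cubic spline yields both a smooth curve and its time derivative, the following Python sketch implements the natural ("variational") end condition using only the standard library. This is a swapped-in variant chosen because it is short to write out; it is not the not-a-knot default used throughout this paper, and it is not Matlab's implementation. The function names and the sample values are made up for the example.

```python
import bisect


def natural_spline_second_derivs(t, y):
    """Solve for the spline's second derivatives at the knots, with zeros at both ends."""
    n = len(t) - 1
    h = [t[i + 1] - t[i] for i in range(n)]
    m = [0.0] * (n + 1)
    if n < 2:
        return m
    # One equation per interior knot 1..n-1 (row k corresponds to knot k + 1).
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    rhs = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
           for i in range(1, n)]
    # Forward elimination of the tridiagonal system (Thomas algorithm).
    for k in range(1, n - 1):
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    m[n - 1] = rhs[-1] / diag[-1]
    for k in range(n - 3, -1, -1):
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]
    return m


def spline_value_and_slope(t, y, m, x):
    """Evaluate the spline and its first derivative at x, for t[0] <= x <= t[-1]."""
    i = min(max(bisect.bisect_right(t, x) - 1, 0), len(t) - 2)
    h = t[i + 1] - t[i]
    a, b = t[i + 1] - x, x - t[i]
    c0 = y[i] / h - m[i] * h / 6.0
    c1 = y[i + 1] / h - m[i + 1] * h / 6.0
    value = (m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h) + c0 * a + c1 * b
    slope = (m[i + 1] * b ** 2 - m[i] * a ** 2) / (2.0 * h) - c0 + c1
    return value, slope


times = [0.0, 1.0, 2.0, 3.0, 4.0]
levels = [0.0, 1.0, 0.0, 1.0, 0.0]  # made-up expression values
m = natural_spline_second_derivs(times, levels)
print(spline_value_and_slope(times, levels, m, 2.5))
```

The slope returned by `spline_value_and_slope` plays the role of the estimated time derivative that enters the derivative-based error. Other end conditions change only the first and last equations of the tridiagonal system.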

For the IRMA and GeneNetWeaver data sets, we also experimented with smoothing the data first, using the smooth function of Matlab, but this did not affect results significantly. For the

Fitting details

Minimization of the _{0} or _{1} criteria was done by the Matlab function fmincon. For each optimization, we did 1000 runs from different randomized starting conditions, initializing parameters uniformly within their allowed intervals. For

GeneNetWeaver

For the GeneNetWeaver

List of abbreviations used

ODE = Ordinary differential equation

FDA = Functional data analysis

CSF = Correct sign fraction

PPV = Positive predictive value

IRMA = A synthetic gene network created in yeast, and reported by Cantone

TSNI = A fitting algorithm identified as the best-performing among several alternatives investigated by Cantone

DREAM = An annual contest on reverse engineering

Competing interests

None.

Authors' contributions

GS and TJP conceived the experiments, analyzed the data, and wrote the paper. GS conducted the computational experiments.

Acknowledgements

We thank Gareth Palidwor for computing and Matlab support. This work was funded in part by grants from the Natural Sciences and Engineering Research Council of Canada, and the Ottawa Hospital Research Institute. GS was supported in part by a training grant from the Ontario Ministry of Research and Innovation, through its Ontario Research Fund - Research Excellence program.

This article has been published as part of