Background
Inferring transcriptional regulatory networks in animals is challenging. For example, the large number of genes, the spatial and temporal complexity of expression patterns, and the presence of many redundant and indirect interactions all make it difficult to learn the network. In the long term, it will be necessary to use multiple data sets (including gene expression, genome-wide in vivo DNA binding, and network perturbation data) to accurately represent all interactions. Combining multiple data classes in this way, however, is an open and challenging problem.
An alternative, intermediate approach is to use only gene expression data to infer regulatory networks. Here the relationships between the expression levels of one or more transcription factors and those of many putative target genes are used to predict which genes are the most likely targets of each factor. While much work has been done in this area, it is critical to understand the maximum amount of information that can be obtained about the network using this strategy.
Typical approaches for inferring regulatory networks have been to assume a model formulation and then fit the data to this formulation [1,2]. Many models have been proposed, including coexpression networks [3-5], information-theoretic representations [6-8], regression onto dynamical systems [9-14], Bayesian networks [15-17], and other graphical models [18,19], each of which has advantages and disadvantages. The primary differences between these models lie in the trade-off between statistical and interpretational issues. Techniques like Bayesian networks, graphical models, and information-theoretic models have protections against overfitting (i.e., fitting models with many parameters to a small amount of experimental data); however, these techniques do not provide dynamical models which can generate new biological insights. On the other hand, techniques such as nonlinear regression and regression onto dynamical systems provide more biologically interpretable models, but sometimes suffer from inaccurate assumptions or overfitting of the model to the data.
There is disagreement on the necessity of dynamical [8-19] as opposed to static [3-7,20-23] models. We feel that dynamical models are more philosophically pleasing because regulatory networks contain temporal characteristics: for example, a protein binds to DNA and initiates transcription, which eventually leads to transport of the mature mRNA to the cytoplasm. Yet the argument is often made that static models provide a quasi-steady-state interpretation of the network that may provide a sufficient approximation. Rigorous comparison of the two approaches, however, is lacking.
Dynamical modeling of animal regulatory networks has a long history [9-11,24-28]. It is a powerful approach in which researchers hypothesize a set of nonlinear differential equations to describe the network, but it generally requires significant prior knowledge about the network. If there is insufficient biological knowledge about the network, then the structure of the equations can be incorrectly chosen. And if the model is not carefully chosen, it will have a large number of parameters, possibly leading to weak biological effects being erroneously identified as strong effects. Furthermore, it is sometimes shown that a wide range of different parameter values can reproduce the biological behavior of the network, which could be taken as evidence for either network robustness or overfitting [26].
The purpose of this paper is to describe a novel approach for inferring regulatory networks from expression data, one that provides a new way to trade off statistical issues and model interpretability. We generate a quasi-genetic, formal model of regulatory networks using nonparametric ordinary differential equations (ODEs) which are fit using the nonparametric exterior derivative estimator (NEDE) [29,30]. For these reasons, we call our method and the resulting model the NODE (an amalgamation of NEDE and ODE) model. Our NODE model is similar to qualitative piecewise linear network modeling and identification [12-14], and we extend these models by using identification techniques that have improved statistical properties and protect against overfitting. The NEDE estimator adds constraints to the identification problem by learning correlations between factors, and these constraints protect the model from overfitting and from erroneously identifying weak biological effects as strong effects. Though we focus the discussion in this paper on temporal-spatial expression patterns, our NODE method can easily be used with time-series microarray datasets. It is also scalable to a network sized on the order of hundreds of species.
We focus our modeling effort on the formation of eve mRNA stripes during Stage 5 of Drosophila melanogaster embryogenesis. We apply our technique to this portion of the regulatory network, and compare the performance of our method to that of other more commonly used models. We show that there are significant differences in the regulatory predictions made by the NODE model and other commonly used models, including the fact that our technique predicts that factors frequently have both positive and negative effects on the same targets, depending on the concentration of the factor. We also show that the NODE model performs better than a static, spatial-correlation model.
Results and Discussion
Our NODE model is a formalization of a quasi-genetic model that seeks to capture the total net effect of the direct and indirect influence of each factor on a target gene, and it is generated by looking at the correlation between factor concentrations and the change in target mRNA concentration over time. This is done in small windows of neighboring cells on the embryo and at different time intervals during development. By looking at the change in target mRNA over time, we are able to generate a dynamic equation model that describes each factor's influence on each gene in space and time; tuning parameters for our method are selected in a data-driven manner using cross-validation (see Methods and Models for more details). In general, the model formally predicts repression in all cases where an increase in the concentration of a factor leads to a decrease in the rate of change in target mRNA over time. Similarly, it formally predicts activation in all cases where an increase in the concentration of a factor leads to an increase in the rate of change in target mRNA over time.
We applied our technique to experimental measurements, gathered by the Berkeley Drosophila Transcription Network Project (BDTNP), of spatial and temporal expression levels of transcription factor protein and mRNA in Drosophila embryos [20,31]. A NODE model was established that describes the formation of eve mRNA stripes during Stage 5 of development using data for five transcription factors known to be responsible for initiating much of the patterning of eve: Krüppel (KR), Giant (GT), Knirps (KNI), Hunchback (HB) and Bicoid (BCD) [21-23]. For each factor and for eve mRNA, there are 36,468 data points that represent 6,078 cells at 6 time points. Our technique was able to compute the model in approximately 20 hours on a desktop computer. The seven distinct eve mRNA stripes in the measured data can be seen in Figure 1, where both a three-dimensional view and a two-dimensional, cylindrical projection of the embryo are shown.
Figure 1. Quantitative cellular resolution 3D gene expression. A. A three-dimensional plot of the Drosophila embryo showing the experimentally measured pattern of eve mRNA as it appears in late Stage 5. There are seven distinct expression stripes located along the anterior-posterior axis (AP) of the embryo, with the intensity of each stripe varying moderately along the dorsal-ventral axis (DV). B. A two-dimensional cylindrical projection of a Stage 5 Drosophila embryo provides an easier visualization of the details of the eve mRNA patterns, showing that expression of each stripe is similar on either side of the ventral midline (V).
Model fit
We assess the fit of our NODE model to the experimental data both qualitatively and quantitatively. Because we have an ODE model that describes the formation of the eve mRNA stripes, we can run a simulation of the model using only the experimentally measured eve concentration at the first time point of Stage 5 as the initial condition of the ODE. Only transcription factor protein and eve mRNA data from the first two time points was used to derive the NODE model for predicted regulatory interactions. By using this model along with the transcription factor protein expression data from all time points, we can then simulate the eve mRNA pattern for all six time points and then compare this to the experimentally measured eve pattern.
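The simulation described above can be illustrated with a minimal sketch. The text does not state which integration scheme was used, so the forward-Euler update below is an assumption, as are the function name `simulate_target`, the unit time step, and the callable `rate_fn`, which stands in for the fitted model's predicted rate of change.

```python
import numpy as np

def simulate_target(y0, factors, rate_fn, dt=1.0):
    """Forward-Euler integration of dy/dt = rate_fn(t, x[t]) for every cell.

    y0      : (n_cells,) target mRNA at the first time point (initial condition)
    factors : (n_times, n_cells, n_factors) measured factor concentrations
    rate_fn : hypothetical callable (time_index, x_t) -> (n_cells,) rates,
              standing in for the fitted model
    dt      : assumed spacing between successive time points
    """
    n_times = factors.shape[0]
    y = np.empty((n_times, y0.shape[0]))
    y[0] = y0
    for t in range(n_times - 1):
        # Only the measured factors and the previously simulated target
        # value are used; measured target data never re-enters the loop.
        y[t + 1] = y[t] + dt * rate_fn(t, factors[t])
    return y
```

The point of the sketch is that measured target data enters only through `y0`; every later time point is produced by the model and the measured factor concentrations.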
Qualitatively speaking, the eve mRNA pattern generated by our NODE model simulation matches the temporal behavior of the experimental pattern quite well. The experimental and simulated eve patterns are compared in Figure 2. The black lines on each of the maps in Figure 2 show the boundaries of the experimental measurements of the eve mRNA stripes, and how they change location during Stage 5. Looking first at just the experimentally observed eve mRNA pattern shown in Figure 2, we can see that the stripe regions narrow, and eve concentration in the stripes becomes stronger. The stripes also shift anteriorly. The simulation of our NODE model matches this experimental behavior, and captures the changing boundaries of the eve stripes particularly well.
Figure 2. Comparison of the experimentally measured and the NODE model simulated patterns of eve mRNA. Cylindrical projections of the measured pattern of eve mRNA concentrations (left column), the NODE model simulated pattern of eve mRNA (center column), and the simulation error (right column) at six successive time points during blastoderm Stage 5 (rows). The eve mRNA concentration values have been normalized to range from 0 to 1 and the simulation error shown is the absolute value of the difference between experimental and simulated eve concentration in the embryo. The NODE model was generated using only data from Stage 5:0-3 and Stage 5:4-8, and the data from Stage 5:0-3 was used as the initial condition for simulation. It is able to predict the expression pattern well except for Stage 5:76-100.
To quantify the accuracy of the model, the simulation error is also shown in Figure 2. The NODE model is able to accurately predict the eve pattern at Stages 5:9-25, 5:26-50, and 5:51-75. Its predictions are less accurate for Stage 5:76-100 in some regions, especially in stripe 1, but this is not unexpected as it is known that at the end of Stage 5 a new set of transcription factors begins to regulate eve expression [32]. This could not have been learned using only data taken from early Stage 5 as we have done here. Indeed, if eve mRNA expression data from all time points is used to learn the NODE model, better agreement is seen (Figure 3).
Figure 3. Comparison of the experimentally measured and the NODE model (generated using eve mRNA expression from all time points) simulated patterns of eve mRNA. A NODE model was generated using data from all time points in Stage 5, and it was used to predict the expression pattern. The simulation of this model shows better agreement with the experimentally observed pattern than the NODE model shown in Figure 2 (which only uses two time points to generate the model). The figure is labelled using the same conventions as Figure 2 except that the simulation and error are for the NODE model which uses all time points.
Factor activity plots
The model generated by our technique can be visualized as spatiotemporal maps of factor activities. An example of a spatial map for our NODE model for Stage 5:9-25 is shown in Figure 4, which shows how the five factors (directly or indirectly) affect eve mRNA pattern formation. Blue values correspond to predicted repression (i.e., an anticorrelation between factor expression and the rate of change of target expression) and yellow/red values correspond to predicted activation (i.e., a positive correlation between factor and the change in target).
Figure 4. Embryo-wide factor activity at Stage 5:9-25 predicted by the NODE model. Cylindrical projections of the correlation between each factor and the change in target expression over time. The intensity of the factor activity values is the product of the coefficients of the model in Equation 4 and the average, local factor concentration. The mathematical definition of factor activity is given in Methods and Models.
Such factor activity plots show the intensity and variation of predicted effects of factors at different locations on the embryo and at different time points. Our model is a formal, quasi-genetic ODE model. It is not a mechanistic model, because it cannot capture the various mechanisms involved in the regulation of eve mRNA. This, however, is a strength because of the flexibility gained by not having to make a priori assumptions on the regulatory mechanisms. This comes at the cost of not being able to identify which interactions are direct or indirect.
Comparison to spatial-correlation model
To aid understanding of our NODE model and help establish its utility, we compared it to a spatial-correlation model. Such models have also been used for identifying regulatory interactions from quantitative expression data [20-23,33], and are based on the descriptions of the relationship between transcription factor and target gene expression that have been most widely used by developmental biologists. These models are not dynamic and look at the correlation, at fixed time points, between factor concentrations and target mRNA concentrations. To make the result comparable to our NODE model, we consider a new variant of spatial-correlation models which looks separately at the correlation between factor levels and target mRNA levels in different, small regions of the embryo and at different stages of development.
We first compared the embryo-wide spatial maps of factor activity in Figure 4 to that predicted by the spatial-correlation model (Figure 5). Viewed in this way, the two models show many similarities, which is encouraging because many experimentally validated regulatory interactions have been implicitly interpreted using a spatial-correlation model, and this agreement provides mutual support both for our model and the previously determined interactions.
Figure 5. Embryo-wide factor activity at Stage 5:9-25 predicted by the spatial-correlation model. Cylindrical projections of the correlations between each factor and the target expression. The intensity of the factor activity values is the product of the coefficients of the model in Equation 5 and the average, local factor concentration. The mathematical definition of factor activity is given in Methods and Models.
Closer inspection, however, reveals significant differences in the precise locations of factor activity predicted by each method and, in some cases, differences in the direction of correlation at some stripes. To examine these in more detail, we next examined interactions during Stage 5:9-25 of two transcription factors, Giant (GT) and Krüppel (KR), with part of eve stripe 2 that other data suggest they repress (Figures 6 and 7) [23]. Figure 6A shows the concentrations of GT protein (green line) and eve mRNA (red line) along the anterior-posterior (AP) axis, showing the classic anticorrelation of GT protein with the anterior boundary of eve stripe 2. The factor activity predicted by the spatial-correlation model is shown as the plot of the GT correlation (dark blue line). In contrast, Figure 6B shows GT protein (green line) concentration; the change in eve mRNA concentration over time (red line); and the factor activity predicted by the NODE model for GT protein (dark blue line). While both models use the same protein expression data (green lines), the concentrations of eve mRNA and the temporal change in mRNA (red lines) show marked differences, as do the predicted factor activity profiles (dark blue lines). Similar differences are seen for KR (Figure 7).
Figure 6. Comparison of spatial-correlation and NODE models for GT at Stage 5:9-25. A. The spatial-correlation model along part of the anterior-posterior (AP) axis. Plotted are the concentrations of GT protein (green line) and eve mRNA (red line) as well as the factor activity of GT in the spatial-correlation model (dark blue line), calculated via a joint correlation of all factors with eve mRNA. The vertical dashed lines indicate the boundaries of eve stripe 2. The colored bars above indicate where the factor activity is positive (yellow) or negative (light blue). B. The NODE model along part of the AP axis. Plotted are the concentrations of GT protein (green line) and the change in eve mRNA over time (red line) as well as the factor activity of GT in the NODE model (dark blue line), calculated via a joint correlation of all factors with the change in eve mRNA. The vertical dashed lines indicate the boundaries of eve stripe 2. The regions of the embryo where GT is a type I or II activator or a type I or II repressor are labelled (IA, IIA, IR or IIR) and delimited with dotted lines. The colored bars above indicate where the factor activity is positive (yellow) or negative (light blue). C. The portion of the embryo that is plotted in A and B is shown in gray. The ventral region is omitted because otherwise the spatial variation of eve concentration along the dorsal-ventral (DV) axis makes interpretation of one-dimensional plots difficult. The values in the one-dimensional plots of A and B were generated by averaging over the DV axis; this is done strictly for visualization purposes. This averaging is not used in our standard analyses or method.
Figure 7. Comparison of spatial-correlation and NODE models for KR at Stage 5:9-25. A. The spatial-correlation model along part of the anterior-posterior (AP) axis. B. The NODE model along part of the AP axis. C. The portion of the embryo which is plotted in A and B. The figure is labelled using the same conventions as Figure 6 except that the protein expression and models are for KR protein.
These differences raise the question: which model is more accurate and useful? To quantitatively compare the two models, we generated a spatial-correlation model which used eve mRNA expression data only from Stage 5:0-3 and Stage 5:4-8 and used it to predict the experimental eve pattern at later portions of Stage 5 (Figure 8). This spatial-correlation model predicts the eve pattern at Stages 5:9-25 and later much more poorly. (Compare the error plots in Figure 2C with those in Figure 8C.) The NODE model predicts an eve pattern that has 59% less error over the last four time points than the pattern predicted by the spatial-correlation model. Thus, in a direct comparison of a static (spatial-correlation) model and a dynamical (NODE) model, the dynamical model is superior.
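A comparison of this kind can be sketched as follows. The text does not specify how the error was aggregated, so summing the absolute differences (the per-cell error shown in Figure 2) over the last four time points is an assumption, and the function names are hypothetical.

```python
import numpy as np

def total_abs_error(measured, simulated, time_slice=slice(2, 6)):
    """Sum of |measured - simulated| over the chosen time points and all cells.

    measured, simulated : (n_times, n_cells) arrays of normalized concentrations
    time_slice          : assumed to select the last four of six time points
    """
    return float(np.abs(measured[time_slice] - simulated[time_slice]).sum())

def relative_error_reduction(measured, sim_a, sim_b, time_slice=slice(2, 6)):
    """Fraction by which model A's error is lower than model B's."""
    err_a = total_abs_error(measured, sim_a, time_slice)
    err_b = total_abs_error(measured, sim_b, time_slice)
    return 1.0 - err_a / err_b
```

Under this convention, a returned value of 0.59 would correspond to the statement that model A has 59% less error than model B.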
Figure 8. Comparison of the experimentally measured and the spatial-correlation model simulated patterns of eve mRNA. A spatial-correlation model was generated using only data from Stage 5:0-3 and Stage 5:4-8, and it was used to predict the expression pattern during later portions of Stage 5. The spatial-correlation model is unable to predict the expression pattern well, and is not as accurate as the NODE model which is shown in Figure 2. The figure is labelled using the same conventions as Figure 2 except that the simulation and error are for the spatial-correlation model.
This result fits with the idea that the NODE model is intrinsically more biologically realistic than a spatial-correlation model. As stated earlier, biological networks are marked by temporal effects. For instance, a protein binds to DNA which initiates transcription. This is not an instantaneous process, and there is some delay between when a factor initiates transcription and when the target mRNA is expressed. The spatial-correlation model does not model this notion of temporal effects, whereas the NODE model does.
Concentration-dependent effects
In many cases it is known that individual gene expression stripes can be controlled via a single cis-regulatory module (CRM), and current computational models generally assume that a given factor acts only as an activator or a repressor on a given CRM (e.g. [26,27,34-36]). However, both our NODE model and our variant of the spatial-correlation model frequently predict concentration-dependent effects whereby, on and around the same expression stripe, a factor has both repressing and activating effects (see the yellow and light blue bars above the plots in Figures 6 and 7 and more generally Figures 4 and 5). For example, consistent with previous molecular genetic evidence, KR is predicted to be a repressor of posterior eve stripe 2, but is also implied by the model to be an activator just anterior of this in cells where KR concentrations are lower (Figure 7). This and the many other similar cases could represent spurious correlations, perhaps due to other factors having dominant effects on targets in cells where the factor under study is expressed at lower levels. However, there are a number of cases where factors, including KR, have been shown to switch from activating to repressing the same target as their concentrations increase [37,38]. Thus, the predictions of both our NODE model and our variant of the spatial-correlation model make it more obvious that gene regulation can involve multiple mechanisms of factor action that should be considered henceforth.
In some cases, the NODE model predicts factor activities that are closer to biological expectations than the spatial-correlation model. Figure 6 indicates that both models predict strong repression by GT in almost the same anterior portion of eve stripe 2 (regions where the blue lines are below 0). On the other hand, Figure 7 indicates that the spatial-correlation model predicts repression by KR mostly in the interstripe region between stripes 2 and 3, whereas the NODE model predicts repression by KR in the posterior half of stripe 2. Since it has been experimentally observed that the eve stripes narrow over time [20,31], and the NODE model more accurately indicates narrowing of the stripes, this provides further support for the idea that the NODE model performs better.
Another significant difference between the two models is that the NODE model can distinguish between multiple regions of the embryo where target mRNA either increases or decreases over time, whereas spatial-correlation models, by definition, cannot. This allows the NODE model to provide more subtle distinctions of factor activity.
We make the following formal definitions (see Methods and Models for the mathematical definitions):
• Type I Repression: At current factor concentrations, the target mRNA will decrease in concentration over time. An increase in factor concentration will lead to a faster rate of decrease in target mRNA amounts over time.
• Type II Repression: At current factor concentrations, the target mRNA will increase in concentration over time. An increase in factor concentration will lead to a slower rate of increase in target mRNA amounts over time.
• Type I Activation: At current factor concentrations, the target mRNA will increase in concentration over time. An increase in factor concentration will lead to a faster rate of increase in target mRNA amounts over time.
• Type II Activation: At current factor concentrations, the target mRNA will decrease in concentration over time. An increase in factor concentration will lead to a slower rate of decrease in target mRNA amounts over time.
With these definitions in hand we can readily see that, for example, while KR is a repressor within the posterior half of eve stripe 2, for most of this region it is a type II repressor, acting in cells where eve mRNA concentrations are increasing over time (Figure 9). Only in the very posterior margin of this stripe does the level of eve mRNA decrease. Similar distinctions between the two modes of activation and repression can be seen in embryo-wide plots (Figures 9 and 10). The distinction between type I and II effects does not necessarily reflect different biochemical mechanisms, such as anti-activation versus active repression, although it might. Certainly, the ability of the NODE model to make these distinctions provides a richer understanding of the relationship between factor and target expression than spatial-correlation models.
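The four definitions reduce to two signs: the sign of the current rate of change of the target, and the sign of the factor's effect on that rate. The sketch below encodes this reading; the function name and the handling of zero values (which the definitions do not cover) are assumptions.

```python
def classify_effect(rate, coefficient):
    """Classify a factor's formal effect on a target in one cell.

    rate        : current d[target]/dt (positive means the target is increasing)
    coefficient : change in d[target]/dt per unit increase in factor
                  concentration (positive means activation)
    """
    if rate == 0 or coefficient == 0:
        # The definitions in the text do not cover these boundary cases.
        return "undefined"
    kind = "activation" if coefficient > 0 else "repression"
    # Type I: the factor pushes the target further in the direction it is
    # already moving. Type II: the factor opposes the current direction.
    same_direction = (rate > 0) == (coefficient > 0)
    return ("type I " if same_direction else "type II ") + kind
```

For example, a cell where the target is increasing (`rate > 0`) and the factor slows that increase (`coefficient < 0`) is classified as type II repression, matching the KR example in the posterior half of eve stripe 2.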
Figure 9. Locations of type I and II activation and repression of eve by GT. The factor activity of GT protein on eve as predicted by the NODE model is shown (left). The "Increasing" plot shows type I activation in yellow/red and type II repression in blue for cells where eve mRNA is increasing over time (center). The "Decreasing" plot shows type I repression in blue and type II activation in yellow/red for cells where eve mRNA is decreasing over time (right).
Figure 10. Locations of type I and II activation and repression of eve by KR. The figure is labelled using the same conventions as Figure 9 except that the models are for the factor activity of KR protein on eve.
Comparison to dynamical models
It is also instructive to compare our NODE model to existing dynamical models of spatial pattern regulation in Drosophila embryogenesis. There are dynamic models, some using nonlinear ODEs, that describe the developmental change in the expression of gap genes [26,27,35,36] and the eve stripes [34]. Some of these models only describe the network at the level of protein expression [26,27] whereas others include more detailed processes such as protein binding [34-36]. Like our model, these models can replicate experimentally measured gene expression patterns.
The models in [26,27] are similar to our work in some regards in that they concern the network at the expression level. However, they require significant biological knowledge in order to hypothesize the structural forms of their equations, which can be problematic because this limits their ability to provide new biological insights. For instance, an a priori biological assumption made by the models in [26,27] is that factors do not have concentration-dependent effects. A factor always either represses, activates, or does not affect the target gene. Biological experiments [37,38] and our models suggest that this is not always true.
The main disadvantage of the models in [34-36] is that they use in vitro data in fitting models for in vivo behavior. These models contain detailed predictions of the regulatory network such as levels of protein-DNA binding in vivo. This is problematic because the parameters of the models were calculated using only gene expression data and in vitro DNA binding data. No comparison was made between the models' inferences and actual measurements of in vivo DNA binding. Work by the BDTNP shows that there is no simple correlation between in vitro affinity and in vivo occupancy, even on highly bound functional targets [39]. This suggests that the models in [34-36] are unlikely to be accurate and that more quantitative data, such as ChIP-chip or ChIP-seq binding data, needs to be used to calculate the model parameters.
Methods
Here, we describe our NODE technique which uses time-series data to generate a dynamical model. We assume that the rate-limiting species (i.e. transcription factor protein concentrations) which drive the behavior of the network have been measured, and we do not consider actions on faster timescales (e.g. the dynamics of factors binding to target genes). Also, we assume that concentrations are large enough for the rates of interaction to be deterministic.
Under these assumptions, the system can be reasonably described by an ODE:
$$\frac{dx}{dt} = f(x), \qquad (1)$$
where x is a vector whose elements are the concentrations of the rate-limiting species. Nonlinear regression techniques [21] start with a function with unknown coefficients, and then they regress the data onto this function. This is problematic because the relationships are highly nonlinear and one risks overfitting the data by starting with a function with many unknown coefficients. In contrast, our NODE method does not make any assumptions on the functional form of f(x). We use nonparametric statistics to make local estimates of the ODE in Equation 1, and our tools can scale to networks with hundreds of species.
We focus our presentation on NODE models which describe the effect of five regulatory transcription factors on target eve mRNA, and we briefly comment on how this technique can be used with general, time-series data. The data set we use, code for our methods, and the models generated by our methods are publicly available and can be downloaded from http://bdtnp.lbl.gov/FlyNet/bioimaging.jsp?w=node.
Experimental data
We apply our technique to experimental data that has been collected and processed by the BDTNP [20,31], where measurements of protein and mRNA concentrations are taken by analyzing images of many Drosophila embryos to create a virtual embryo. The virtual embryo consists of 6078 cells and is a computational, spatial decomposition which is determined by averaging the geometry and number of cells of different embryos [20,31]. The virtual embryo has measurements of the concentration (averaged over the different embryos at fixed points in time) of various protein factors and target mRNAs at the cellular level for six different time points during Stage 5 of the Drosophila embryo. We denote the vector of factor concentrations as x[t, e] and the vector of target gene concentrations as y[t, e], where t = 1, ..., 6 is the time of the measurement and e = 1, ..., 6078 is an index which uniquely identifies each cell in the virtual embryo. Notation like x_{bcd}[t, e] denotes the [bcd] concentration in cell e at time t.
Computational and statistical methods
The NODE technique is summarized in the following algorithm. Any tuning parameters are chosen in a data-driven manner using cross-validation [29,30,40].
Inputs: Factor concentrations x[t, e], target gene concentrations y[t, e]
Outputs: NODE model
1) Pre-smooth the factor concentrations x[t, e] and then compute time derivatives of the target gene concentrations y[t, e]
a) For each e = 1, ..., 6078
i) Do a least-squares fit of the polynomial $\hat{x}[t, e] = c_0 + c_1 t + \ldots + c_r t^r$ (where c_{0}, ..., c_{r} are coefficients and r is a tuning parameter) with the data points: x[t, e], for each t = 1, ..., 6
ii) Do a least-squares fit of the polynomial $\hat{y}[t, e] = k_0 + k_1 t + \ldots + k_r t^r$ (where k_{0}, ..., k_{r} are coefficients and r is a tuning parameter) with the data points: y[t, e], for each t = 1, ..., 6
b) Pre-smoothed factor concentration data is given by $\hat{x}[t, e]$, and the time derivative of the target gene data is given by $d\hat{y}/dt\,[t, e] = k_1 + 2 k_2 t + \ldots + r k_r t^{r-1}$
2) Define matrix Y with rows given by $\left(d\hat{y}/dt\,[t, e]\right)$, for each t = 1, ..., 6 and e = 1, ..., 6078
3) Calculate the NODE model
a) For each t = 1, ..., 6 and e = 1, ..., 6078
i) Define matrix X_{[t, e]} = [1 Ξ_{[t, e]}], where the first column is all ones and Ξ_{[t, e]} is the matrix with rows given by (\hat{x}[u, v] − \hat{x}[t, e]), for each u = 1, ..., 6 and v = 1, ..., 6078
ii) Define weighting matrix W_{[t, e]} to be the diagonal matrix with entries along the diagonal given by
\[
w[u, v] =
\begin{cases}
\frac{3}{4}\left(1 - \left(n[u, v]/h\right)^{2}\right), & \text{if } n[u, v] \le h \\
0, & \text{otherwise}
\end{cases}
\tag{2}
\]
for each u = 1, ..., 6 and v = 1, ..., 6078, where n[u, v] = ‖x[u, v] − x[t, e]‖_{2} is the Euclidean distance and h is a tunable parameter
iii) Define matrix P_{[t, e]} by making its columns be the (p − d) principal components of Ξ_{[t, e]}^{T}W_{[t, e]}Ξ_{[t, e]} with the smallest eigenvalues, where p is the number of factors (p = 5 for the NODE model of target eve mRNA) and d is a tuning parameter
iv) Coefficients of the NODE model, for the e-th cell at the t-th time point, are given by the NEDE estimator
\[
\begin{bmatrix} b[t, e] & a_{bcd,[t, e]} & \cdots & a_{Kr,[t, e]} \end{bmatrix}^{T}
= \arg\min_{\beta} \left\| W_{[t, e]}^{1/2}\left(Y - X_{[t, e]}\beta\right) \right\|_{2}^{2} + \lambda \left\| P_{[t, e]}\beta \right\|_{2}^{2}
\tag{3}
\]
where
\[
\frac{d[eve]}{dt} = a_{bcd,[t, e]}\left([bcd] - \hat{x}_{bcd}[t, e]\right) + \cdots + a_{Kr,[t, e]}\left([Kr] - \hat{x}_{Kr}[t, e]\right) + b[t, e].
\tag{4}
\]
Step 1 involves presmoothing the experimental data and computing its time derivatives. We prefer to do this with local polynomial regression (LPR) 41 because it suffers from fewer transient effects than digital filters 4241. To simplify the presentation, Step 1.a describes polynomial regression (PR). LPR is a variant of PR which protects against oversmoothing the data, and it can be quickly computed by doing a weighted linear regression. More details on LPR can be found in 41.
This step is important because otherwise the NODE model will be statistically biased 43. However, caution is needed when presmoothing data sets in which the measurements are very noisy and taken on a sparse grid of time points. In such cases, there is a risk of smoothing out biologically relevant temporal trends in the data because of the sparsity of the temporal grid.
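The polynomial-regression version of Step 1 can be sketched as follows. This is a minimal illustration for a single cell and a single factor, not the LPR variant preferred above; the function name `presmooth_and_differentiate`, the degree r = 2, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def presmooth_and_differentiate(t, x_e, y_e, r=2):
    """Fit degree-r polynomials to the time series of one cell.

    t   : (T,) measurement times
    x_e : (T,) concentration of one factor in cell e
    y_e : (T,) concentration of the target in cell e
    Returns the smoothed factor values and the time derivative of the
    smoothed target, both evaluated at the measurement times.
    """
    c = P.polyfit(t, x_e, r)  # coefficients c_0, ..., c_r (low to high order)
    k = P.polyfit(t, y_e, r)  # coefficients k_0, ..., k_r
    x_hat = P.polyval(t, c)
    # polyder returns k_1, 2*k_2, ..., r*k_r
    dy_dt = P.polyval(t, P.polyder(k))
    return x_hat, dy_dt


# Synthetic (made-up) data at six time points.
t = np.arange(1, 7, dtype=float)
x_e = 0.5 + 0.1 * t
y_e = 1.0 + 2.0 * t + 0.5 * t**2
x_hat, dy_dt = presmooth_and_differentiate(t, x_e, y_e, r=2)
```

With only six time points per cell, the degree r has to stay small; in the method above it is a tuning parameter chosen by cross-validation rather than fixed in advance.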
Step 3 computes the NODE model, and we make use of the NEDE estimator: a new statistical tool that protects against overfitting 30. The computation in Step 3.a.ii determines a window of cells v at time u that have concentrations similar to those of cell e at time t. The size of this window is set by the parameter h, and cells with highly (weakly) similar concentrations are weighted highly (weakly) in the estimation of the coefficients of the NODE model. Equation 2 uses the Epanechnikov kernel to do this weighting. Note that weights for cells with very different concentrations can be similar, because the Euclidean distances computed in Step 3.a.ii can be similar. This does not cause problems because the NEDE estimator has been proven to be statistically well-behaved in the presence of such weighting schemes 412930.
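As a small sketch of the weighting in Equation 2, the function below computes Epanechnikov weights from Euclidean distances in concentration space. The function name `epanechnikov_weights` and the array layout (one row per (u, v) pair) are assumptions for illustration.

```python
import numpy as np


def epanechnikov_weights(x_all, x_ref, h):
    """Weights w[u, v] of Equation 2.

    x_all : (N, p) factor concentrations, one row per (u, v) pair
    x_ref : (p,)   factor concentrations of cell e at time t
    h     : window size (bandwidth), a tuning parameter
    """
    n = np.linalg.norm(x_all - x_ref, axis=1)  # Euclidean distances n[u, v]
    w = 0.75 * (1.0 - (n / h) ** 2)
    return np.where(n <= h, w, 0.0)            # zero outside the window


# Made-up concentrations for four cells and two factors.
x_all = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [2.0, 0.0]])
weights = epanechnikov_weights(x_all, x_ref=np.array([0.0, 0.0]), h=1.0)
```

The weights depend only on distance in concentration space, so two cells that are far apart in the embryo receive the same weight if their factor concentrations are equally close to those of the reference cell.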
Step 3.a.iv uses the NEDE estimator in Equation 3 to compute the coefficients of the NODE model. It protects against overfitting by learning constraints that the data obey (Step 3.a.iii), and then using these constraints to reduce the degrees of freedom in the regression. In general, the data points x[t, e] form a manifold, and the projection matrix P_{[t, e]} in Equation 3 enforces that the regression coefficients lie close to the manifold. This methodology is motivated by differential geometry, which says that the exterior derivative of a function on an embedded submanifold lies in the cotangent space 442930. The NEDE estimator can be calculated quickly on a computer because it is a convex optimization problem. Theoretical properties and a more detailed description of the NEDE estimator can be found in 2930.
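Because the objective in Equation 3 is a weighted least-squares term plus a quadratic penalty, it can be minimized by solving one linear system. The sketch below is an illustration under stated assumptions, not the reference implementation from 2930: the function name `nede_fit` is hypothetical, and the penalty is assumed to act only on the slope coefficients (not the intercept b) through the projector P P^T built from the p − d eigenvectors with the smallest eigenvalues.

```python
import numpy as np


def nede_fit(Xi, Y, w, d, lam):
    """Penalized weighted least squares in the spirit of Equation 3.

    Xi  : (N, p) rows are x_hat[u, v] - x_hat[t, e]
    Y   : (N,)   time derivatives of the target
    w   : (N,)   kernel weights (diagonal of W)
    d   : assumed manifold dimension (tuning parameter)
    lam : penalty strength lambda (tuning parameter)
    Returns (b, a): intercept and slope coefficients.
    """
    N, p = Xi.shape
    X = np.hstack([np.ones((N, 1)), Xi])      # X = [1  Xi]
    S = Xi.T @ (w[:, None] * Xi)              # Xi^T W Xi
    _, evecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    P = evecs[:, : p - d]                     # p - d smallest-eigenvalue directions
    M = np.zeros((p + 1, p + 1))
    M[1:, 1:] = P @ P.T                       # assumption: intercept is not penalized
    A = X.T @ (w[:, None] * X) + lam * M
    beta = np.linalg.solve(A, X.T @ (w * Y))
    return beta[0], beta[1:]


# Made-up data lying on a one-dimensional line in a three-factor space.
s = np.linspace(-1.0, 1.0, 21)
Xi = np.outer(s, [1.0, 2.0, 0.0])
Y = 0.3 + 2.0 * s
b, a = nede_fit(Xi, Y, w=np.ones_like(s), d=1, lam=1.0)
```

In this toy case ordinary least squares is not unique, because the data occupy a single direction; the penalty selects the solution whose slope vector lies along that direction, which is the sense in which the learned constraints reduce the degrees of freedom.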
NODE model interpretation
Instead of using a single ODE model to describe the regulatory network, the NODE model uses a group of ODE models consisting of the first-order Taylor expansion (i.e., linearization) of the ODE given in Equation 1. Each equation of the NODE model describes how the behaviour of the regulatory network changes if concentrations of the factors in cell e at time t are changed. The NODE model requires fewer assumptions and less prior knowledge about the system, because it does not require knowing the mathematical structure of the single ODE model in Equation 1. The disadvantage of this approach is that a series of models is more difficult to interpret. The full NODE model for formation of target eve mRNA is given by Equation 4, and there is a different equation for each cell e at time t. Though the NEDE estimator protects against overfitting, some might feel that the NODE model overfits. The predictive ability of the NODE model, as discussed in Results and Discussion, gives evidence that it does not overfit. In that test, we used our algorithm on the first two time points of data, and we assumed that the model for cell e at times t = 3, 4, 5 was the same as the model for cell e at time t = 2.
Equation 4 is difficult to interpret because the coefficients vary with cell e and time t: each equation is a linearization that is valid only when factor concentrations are close to x[t, e]. The model describes how a change in factor concentrations in the presence of all factors (the right-hand side of Equation 4) affects the change in time of eve (the left-hand side of Equation 4). If d[eve]/dt is positive (negative), then eve concentration will increase (decrease) by the next instant of time. For example, suppose the concentrations of all species are kept fixed at \hat{x}[t, e] except for the concentration of GT, which is slightly increased from \hat{x}_{gt}[t, e] to [gt] = \hat{x}_{gt}[t, e] + Δgt. In this situation, the change in time of eve concentration will be given by d[eve]/dt = a_{gt,[t, e]}Δgt + b[t, e]. The increase of GT concentration by Δgt changes the rate of change of eve concentration by the amount a_{gt,[t, e]}Δgt, and the sign of a_{gt,[t, e]} determines whether this change is positive or negative.
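The GT example can be checked numerically with a short sketch. All numbers below (coefficients, baseline concentrations, and Δgt) are made up for illustration, only three of the factors are included, and `local_eve_rate` is a hypothetical helper that evaluates the right-hand side of one local model.

```python
import numpy as np


def local_eve_rate(a, b, x, x_hat):
    """Evaluate d[eve]/dt = a . (x - x_hat) + b for one cell and time point."""
    a = np.asarray(a, dtype=float)
    return float(a @ (np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)) + b)


factors = ["bcd", "gt", "Kr"]            # subset of factors, for illustration only
x_hat = np.array([0.6, 0.4, 0.2])        # made-up baseline concentrations
a = np.array([0.2, -0.5, 0.1])           # made-up local coefficients
b = 0.3                                  # made-up intercept b[t, e]

delta_gt = 0.1
x = x_hat.copy()
x[factors.index("gt")] += delta_gt       # perturb only GT

rate = local_eve_rate(a, b, x, x_hat)    # equals a_gt * delta_gt + b
```

Because every other factor stays at its baseline, only the GT term survives, and the negative a_gt in this toy case lowers the rate relative to the unperturbed value b.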
Because this equation describes relationships in the presence of all factors, it can lead to seemingly contradictory results. For example, one species may be a putative activator (e.g., BCD protein upregulates eve mRNA), yet increasing the concentration of the activator in the presence of the other species can have a slight repressive effect because of interactions between the activator and the other factor species (i.e., the described concentration-dependent effects). Such a situation leads to an odd result: the coefficient of the "activator" will be negative.
Our NODE model is different from the spatial-correlation model 2033212223. We consider the following version of the spatial-correlation model:
\[
[eve] = a_{bcd,[t, e]}\left([bcd] - \hat{x}_{bcd}[t, e]\right) + \cdots + a_{Kr,[t, e]}\left([Kr] - \hat{x}_{Kr}[t, e]\right) + b[t, e],
\tag{5}
\]
and this model looks for the correlation of eve mRNA with protein factor concentrations. Whereas Equation 4 is a dynamical model, the model in Equation 5 is a static model, because it does not describe the temporal changes in eve concentration. A comparison of the fits between these models can be seen in Results and Discussion. Note that the coefficients in Equation 5 are computed with the algorithm for our NODE technique, with the change that Y is a vector of eve concentrations.
Factor activity
Factor activity is a quantitative measure of the impact of a factor on the target gene expression, and it is a particular scaling of the coefficients (or correlations) of the model. It takes into account the concentration of the factors and the coefficients of Equation 4, which describe the amount of influence of the factors on the target expression. Without loss of generality, we give the equation for factor activity of GT on the expression of eve mRNA
\[
a_{gt,[t, e]} \left( \frac{1}{n} \left[ \Xi_{[t, e]}^{T} W_{[t, e]} \Xi_{[t, e]} \right]_{gt} \right)^{1/2}.
\tag{6}
\]
The first term is the coefficient from Equation 4, and the second term, in parentheses, is a measure of average GT concentration within cells whose factor concentrations are similar to those of cell e at time t. It is a measure of average concentration because it measures the mean difference from the baseline concentration x[t, e]. To clarify the notation, suppose the i-th value of x[t, e] is x_{gt}[t, e], the GT concentration. Then the term [Ξ_{[t, e]}^{T}W_{[t, e]}Ξ_{[t, e]}]_{gt} denotes the i-th value along the diagonal of the matrix Ξ_{[t, e]}^{T}W_{[t, e]}Ξ_{[t, e]}.
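A minimal sketch of Equation 6 is given below. The text above does not define the normalizer n, so the sketch takes it as an explicit argument (an assumption); the function name `factor_activity` and the example numbers are also illustrative only.

```python
import numpy as np


def factor_activity(a, Xi, w, n):
    """Scale each coefficient by a weighted spread of its factor (Equation 6).

    a  : (p,)   local coefficients a_{i,[t, e]}
    Xi : (N, p) rows are x_hat[u, v] - x_hat[t, e]
    w  : (N,)   kernel weights (diagonal of W)
    n  : normalizer in Equation 6 (not defined in the text; passed explicitly)
    """
    # Diagonal of Xi^T W Xi, computed without forming the full p x p matrix.
    diag = np.einsum("ij,i,ij->j", Xi, w, Xi)
    return np.asarray(a, dtype=float) * np.sqrt(diag / n)


# Made-up example with two cells and two factors.
Xi = np.array([[1.0, 0.0], [-1.0, 2.0]])
w = np.array([1.0, 0.5])
activity = factor_activity([2.0, -3.0], Xi, w, n=2)
```

The sign of each activity value is the sign of the corresponding coefficient, while its magnitude grows with how much that factor varies inside the window.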
For the NODE model, the factor activities can be subdivided into four categories of behavior. Without loss of generality, we provide mathematical definitions for four categories of GT activity on eve mRNA. At a given concentration x[t, e], if the GT coefficient from Equation 4 is negative (i.e., a_{gt,[t, e] }< 0) and eve concentration is decreasing (i.e., d[eve]/dt < 0), then GT is formally a Type I repressor. A summary of the other mathematical definitions is given in Table 1.
<p>Table 1</p>Mathematical definition of factor activity classification in the NODE model
Classification | sign(a_{gt,[t, e]}) | sign(d[eve]/dt)
Type I Repression | − | −
Type II Repression | − | +
Type I Activation | + | +
Type II Activation | + | −
Without loss of generality, we consider the factor activity of GT on eve, as described by Equation 4. The classification is dependent on the mathematical sign of the coefficient of the model a_{gt,[t, e]} and the mathematical sign of the change in eve mRNA d[eve]/dt, and it is different for each factor concentration x[t, e]. A positive (negative) sign is denoted with the "plus" ("minus") symbol "+" ("−").
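The sign rules of Table 1 can be written as a small lookup. The function name `classify_activity` is hypothetical, and the "Unclassified" result for a zero coefficient or zero derivative is an assumption, since the table does not cover those cases.

```python
def classify_activity(a_gt, d_eve_dt):
    """Classify factor activity from the signs in Table 1."""
    if a_gt == 0 or d_eve_dt == 0:
        return "Unclassified"  # assumption: Table 1 does not define zero signs
    table = {
        (False, False): "Type I Repression",   # a < 0, d[eve]/dt < 0
        (False, True): "Type II Repression",   # a < 0, d[eve]/dt > 0
        (True, True): "Type I Activation",     # a > 0, d[eve]/dt > 0
        (True, False): "Type II Activation",   # a > 0, d[eve]/dt < 0
    }
    return table[(a_gt > 0, d_eve_dt > 0)]


label = classify_activity(-0.5, -0.1)
```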
Window sizes
An example of a window is shown in Figure 11. The NODE method uses Equation 2 to take the similarity of the cells into account when doing the regression procedure. The size of the window is determined by the parameter h, which is chosen using cross-validation, and it changes for each cell e at time t. As explained earlier, the statistical tools are well-behaved when the weighting of cells within this window is computed with Euclidean distance.
Our method can automatically identify symmetries in the embryo patterns. The window contains cells on the other half of the embryo, because the method detects that the embryo is symmetric along the left-right axis. Similarly, it can divide the embryo into stripe-like regions which correlate with the positions of the eve stripes. This happens because our method looks for cells with factor concentrations similar to those of the red-colored cell, rather than just including cells spatially near the red-colored cell.
<p>Figure 11</p>Window of cells with similar concentrations
The cell which represents x[t, e] is shown in red, and a purple line points towards this cell. The window of cells with similar factor concentrations is shown in gray, and cells farther away from the red-colored cell are less similar. Cells with more similar concentrations are shown by darker shades of gray, and cells not in the window are colored white. The black lines show the boundaries of the experimental eve pattern. The NODE method takes the amount of similarity of the cells into account when doing the regression procedure.
To check that the window sizes selected in a data-driven manner were not too small and missing important features, we performed a check in which we fixed the window to be a circle with a radius of three cells surrounding cell e at time t. This size was chosen because the eve stripes are about six cells wide at Stage 5, so a circular window of this size would not miss important regulatory features of the network. Additional file 1 shows plots of factor activity as generated by our NODE method for both data-selected and fixed windows. A visual comparison of the factor activity plots generated by these two windows shows that the data-selected windows were able to identify the same features as the fixed, circular window.
<p>Additional file 1</p>
Supplementary material. Full set of Factor Activity plots generated with both cross-validation-selected and fixed window sizes.
General time-series data
Our NODE technique can be applied to general time-series data. The NODE model is
\[
\frac{dx}{dt} = A_{\xi[n]}\left(x - \xi[n]\right) + b_{\xi[n]},
\]
where (a) ξ[n] for n = 1, ..., N is a user-selected set of linearization points of Equation 1, (b) A_{ξ[n]} = Df(ξ[n]) and b_{ξ[n]} = f(ξ[n]) are the coefficients of the model, and (c) Df is the gradient (Jacobian matrix) of f(x). The NODE technique is unchanged except that e refers to different experiments (instead of different cells), and Equation 3 is applied column-wise to Y to give the columns of the matrix of coefficients: [b_{ξ[n]}^{T} A_{ξ[n]}^{T}]^{T}.
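To illustrate what the coefficients in definition (b) represent, the sketch below linearizes a known toy system at a few points. This is only an illustration of the definitions A = Df(ξ) and b = f(ξ): in the NODE technique f is unknown and the coefficients are estimated from data with Equation 3. The dynamics `f`, the finite-difference helper `jacobian_fd`, and the chosen points are all assumptions made for the example.

```python
import numpy as np


def f(x):
    """Hypothetical two-species dynamics, used only for illustration."""
    return np.array([x[0] * (1.0 - x[1]), -x[1] + x[0] ** 2])


def jacobian_fd(func, xi, eps=1e-6):
    """Central finite-difference approximation of Df at the point xi."""
    xi = np.asarray(xi, dtype=float)
    J = np.empty((func(xi).size, xi.size))
    for j in range(xi.size):
        step = np.zeros_like(xi)
        step[j] = eps
        J[:, j] = (func(xi + step) - func(xi - step)) / (2.0 * eps)
    return J


def node_rhs(x, xi, A, b):
    """Right-hand side of one local model: A (x - xi) + b."""
    return A @ (np.asarray(x, dtype=float) - xi) + b


points = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]        # linearization points
models = [(xi, jacobian_fd(f, xi), f(xi)) for xi in points]  # (xi, A, b) triples

xi0, A0, b0 = models[0]
near = xi0 + np.array([0.01, -0.01])
approx = node_rhs(near, xi0, A0, b0)  # close to f(near) for small perturbations
```

Each triple is valid only near its own linearization point, which is why the model is a collection of local equations rather than one global one.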