Abstract
Background
Inference of generegulatory networks (GRNs) is important for understanding behaviour and potential treatment of biological systems. Knowledge about GRNs gained from transcriptome analysis can be increased by multiple experiments and/or multiple stimuli. Since GRNs are complex and dynamical, appropriate methods and algorithms are needed for constructing models describing these dynamics. Algorithms based on heuristic approaches reduce the effort in parameter identification and computation time.
Results
The NetGenerator V2.0 algorithm, a heuristic for network inference, is proposed and described. It automatically generates a system of differential equations modelling structure and dynamics of the network based on timeresolved gene expression data. In contrast to a previous version, the inference considers multistimuli multiexperiment data and contains different methods for integrating prior knowledge. The resulting significant changes in the algorithmic procedures are explained in detail. NetGenerator is applied to relevant benchmark examples evaluating the inference for data from experiments with different stimuli. Also, the underlying GRN of chondrogenic differentiation, a realworld multistimulus problem, is inferred and analysed.
Conclusions
NetGenerator is able to determine the structure and parameters of GRNs and their dynamics. The new features of the algorithm extend the range of possible experimental setups, results and biological interpretations. Based upon benchmarks, the algorithm provides good results in terms of specificity, sensitivity, efficiency and model fit.
Keywords:
Generegulatory networks; Network inference; Heuristic algorithm; ODE; NetGeneratorBackground
For the adaptation of biological systems towards external and environmental stimuli usually a complex interaction network of intracellular biochemical components is triggered. That includes changes in the gene expression at both the mRNA and protein level. Considering a certain stimulus as an input signal to the system and mRNA or protein levels as outputs, the connecting network may include interactions between signal transduction intermediates: transcription factors and target genes. Generally, the term “generegulatory network” (GRN) summarises genetic dependencies, which describe the influence of gene expression by transcriptional regulation, [1].
The inference (elucidation) of GRNs is important for understanding intracellular processes and for potential manipulation of the system either by specific gene mutations, knockdowns or by treatment of the cells with drugs, e.g. for medical purposes. Towards a full understanding in terms of a complete network, partial models of the network give intermediate results which help to refine the knowledge and to design new experiments. Still, many generegulated cellular functions, e.g. stem cell differentiation, depend on more than one stimulus and the crosstalk within the GRN, e.g. [2]. On the other hand, the stimuli might influence distinct components of a GRN. Such biologically relevant dependencies can be investigated by applying two or more stimuli and measuring the influenced genes. This approach can be called multistimuli experiment. If this is carried out in two or more separate experiments, one derives multistimuli multiexperiment data. Only algorithms with the ability to consider those data can infer such dependencies.
As shown in review articles, e.g. [1,3,4], there are different inference methods using various sources of information thus leading to different results. Amongst the typically mathematical models the application of differential equations describing timeresolved gene expression data (“time series”) has been proven successful. Unfortunately the potential complexity of the networks leads to a high number of structural connections and parameters in contrast to the comparably small number of available measurement data. Apart from the problem of identifiability, the number of possible parameter combinations is very large, thus resulting in high computational costs. Therefore, appropriate heuristic approaches can reduce this amount while providing comparably good inference results. NetGenerator is a heuristic algorithm, which considers time series data to automatically infer GRNs influenced by an external stimulus, [5] and [6]. The approach combines a structure (network topology) and parameter optimisation. The final result in form of a differential equations model can be simulated and displayed graphically. An earlier version with less functionality was applied successfully to biological problems, e.g. the regulatory network of iron acquisition in Candida albicans and the analysis of the Aspergillus fumigatus infection process, [7] and [8].
In the present article, we propose NetGenerator V2.0, an extended version of the algorithm which enables the use of multistimuli multiexperiment data, thus increasing the number of addressable biological questions. This causes significant changes in the algorithmic procedures, especially the processing of this kind of data as well as the structure and parameter optimisation. Also, some other updated features will be outlined, for example the different modes of prior knowledge integration, further knowledgebased procedures, options of graphical outputs, changed nonlinear modelling and reimplementation in the programming language / statistical computing environment R, [9]. Further, in comparison to the previous version, some of the algorithmic procedures will be explained in more detail, because they are important for understanding the overall method.
The successful application of the novel NetGenerator will be shown by inference of relevant multistimuli multiexperiment benchmark examples, namely systems with a different degree of crosstalk. Two aspects will be assessed: (i) reproduction of the benchmark systems (data and structure) and (ii) refinement / extension of a network structure by combination of different experimental data. Furthermore, the applicability of NetGenerator to a realworld problem is presented: after describing necessary data preprocessing steps, the underlying GRN of chondrogenic differentiation of human mesenchymal stem cells, a process influenced by the two stimuli TGFbeta1 and BMP2, is inferred.
Methods
In the following subsections the necessary background knowledge and methodology of the NetGenerator algorithm is described. In comparison to previous publications this includes new, updated and more detailed algorithmic procedures. First, the motivation and the goals are defined by considering the biological data. Necessary steps of data preprocessing are also explained within this subsection. Subsequently, ordinary differential equations and some of their properties are presented as a means for modelling the dynamics of gene regulatory networks. Then the heuristic approach of the algorithm is explained including the structure and parameter identification (here: optimisationbased determination). The next important topic will be the consideration of prior knowledge, followed by a subsection about the numerical simulation as well as the representation of modelling and graphical results. Finally, some important options and their influence to the algorithm are presented.
Time series data and preprocessing
Gene expression time series data as required by NetGenerator are typically derived from microarray measurements. Before starting the network inference, raw microarray data have to be processed comprising a series of steps. The three main steps are displayed in Figure 1: (i) microarray preprocessing, (ii) gene selection and (iii) time series scaling.
Figure 1. Data preprocessing work flow. This work flow illustrates inputs and outputs of NetGenerator as well as recommended data preprocessing steps: preprocessing of microarray data, selection of genes, standardisation of gene expression time series.
Microarray preprocessing applies multiple procedures to remove nonbiological noise from the data and to estimate gene expression levels. Custom probesets, as assembled by [10], reduce the number of crosshybridising probes. This initial reduction accomplishes a onetoone correspondence between probeset and gene. Background correction, normalisation and summarisation are provided by the RMA package, [11], resulting in logarithmised gene expression estimates, which can be used for the next processing step.
Gene selection (“filtering”) is the important second step of processing, since reliable network inference is only feasible for a sufficient number of measurements per gene [1]. This number is often limited and therefore a selection of genes for modelling is inevitable. Candidate genes should show pronounced temporal dynamics and significant differences compared to the control group. Statistical methods for identification of differentially expressed genes are widely used for gene selection. We use the LIMMA tool, which can operate on time series data determining significance of gene expression changes over time [12]. The statistical test (moderated tstatistics) operates on contrast terms, defined by subtracting the control group at each time point. LIMMA returns a ranked table for all genes containing columns for gene name, foldchange and adjusted pvalues. Differentially expressed genes are selected by a combination of adjusted pvalue cutoff and foldchange criterion.
Time series standardisation is the last processing step including centering and scaling
of each time series. The centering procedure subtracts the original initial value
at the starting time point from all values such that the transformed time series starts
from zero. In the subsequent scaling procedure each time series is divided by its
maximum (absolute value) across all provided experimental data. This leads to genewise
scaled data and gene expression time series varying within −1and 1. The resulting
data provided to the NetGenerator algorithm are stored in
GRNs considered as linear timeinvariant systems
The NetGenerator algorithm infers dynamical models of GRNs. Their general nonlinear dynamics can be described by a set of firstorder timeinvariant ordinary differential equations (ODEs), initial conditions, and time range (validity period)
with the vector of state variables
Even though NetGenerator has a nonlinear modelling option, the core mechanisms are based on linear modelling. Under the assumption that most networks can be considered linear and timeinvariant, the differential equation system in (1) can be modified resulting in the linear statespace equation system
with the system or interaction matrix
Under the assumption, that the behaviour of a GRN is described sufficiently by indirect transcriptional events and not by a conversion of material, activation (a_{i,j} > 0) or inhibition (a_{i,j} < 0) of state variable x_{i} is not changing the value of the originating state variable x_{j}.
Without any further assumptions all elements of
Heuristic multistimuli multiexperiment approach
The novel NetGenerator algorithm is a heuristic multistimuli and multiexperiment approach. The heuristic is based on the observation that in GRNs the number of connections is much lower than all possible connections. Further, since the applied stimulus is the cause of the observed dynamical changes, the network can be considered as a hierarchical structure originating from the input. The NetGenerator algorithm implements both observations by an iterative development of the statespace system (2) by including coupled submodels for each time series based on a structure optimisation iteratively increasing the number of connections. Structural changes are taking place only if they result in a better adaptation of simulated to measured behaviour. The terms multistimuli and multiexperiment mean that the extended algorithm can handle more than one changed input and data of several experiments, respectively.
In Figure
2 (A) the main work flow of the algorithm is displayed. One outer loop, starting with
empty
Figure 2. NetGenerator work flow. NetGenerator work flow displaying the main steps of the algorithm (A) and the improvement of best time series (B). In the main work flow, the outer loop iterates over all state variables (submodels), while in the inner loop the remaining time series are tested, i.e. a basic structure and parameters are identified. The best time series is improved further (“Growing”, “Higher Order”, “Pruning”) and included into the model as a state variable.
containing connections from state variables, N_{i} being the indices of statestate connections including the selfregulatory term a_{i,i}x_{i}, and connections from inputs with M_{i} being the indices of inputstate connections for the considered state variable x_{i}. That means that only the parameters of submodels have to be identified.
Since the algorithm aims at a low number of parameters, i.e. small N_{i} ≤ N and M_{i} ≤ M, the inner loop starts with basic models for the remaining time series containing only selfregulation, one input term as well as connections from “fix” prior knowledge if available, see respective subsection. Those basic structures can be extended by further incoming connections (“growing”) from already included submodels and further inputs. Every structural change requires a parameter identification of the active connections with respect to the considered time series, as will be explained later in the corresponding subsection. For every different set of parameters the resulting model needs to be simulated, that is the numerical solution of an initial value problem has to be found, as will be described later in another subsection.
The basic submodel which reproduces one of the remaining time series best, is chosen for further improvement, for details see Figure 2 (B), and included into the model as a state variable. The most important structural improvements are
•“Growing”: further connections added
•“Higher Order”: increase submodel order
•“Pruning”: connections removed
In the improvement step “growing” is not restricted to connections from time series that are already included in the model. For describing the influence of time series that have not yet been included as submodels, the corresponding measured and interpolated data are used as inputs. Those connections form global feedbacks in the final model.
The increase of the dynamical order within the description of a time series is realised by r−1 additional equations or intermediate state variables leading to the following form:
In this way the dynamics of a certain submodel are described by an rth order integrator chain allowing for reproduction of processes with more complex time courses. It should be emphasised that by applying this approach the number of parameters is not increased but on the other hand the number of state variables becomes larger than the number of time series data. In that case only the state variable with the highest order in such a submodel is to be compared to time series data. Still, for the sake of simplicity all following algorithmic procedures are described for firstorder submodels.
In terms of the iterative process of including submodels the different elements of the final system matrix
describe forward, local feedback and global feedback connections. Elements below the
main diagonal become forward connections, whereas the main diagonal elements
All the previously described structural procedures and the corresponding parameter identification are controlled by apriori defined settings and options of the algorithm. Some of them are balancing network complexity and error between measurement and simulation. For example, additional connections are rejected if they are not improving the objective function value to a significant extent while on the other hand connections are removed only if they are not worsening the result significantly. Further important options of the algorithm are explained in the respective subsection.
Parameter identification
The parameter values of an active submodel are identified by a nonlinear optimisation, minimising the error between simulated and measured time series data of multiple experiments. The initial parameters required for this optimisation are obtained by a linear regression. For one specific first order state variable x_{i} equation (3) can be rewritten as
with
being the parameter vector of some of the elements of
with the weight matrix
The nonlinear optimisation of the parameters for the ith submodel, initialised by the solution of the linear regression (8), is based on the minimisation of the objective function (model error)
describing the deviation between measured x_{e,i}and simulated
Consideration of prior knowledge
For improving the results, prior knowledge about the network connections can be integrated
into the network inference. This version of NetGenerator provides two modes for integration
of prior knowledge about connections of stimuli on time series as well as between
the time series: (i) “fix” and (ii) “flexible”. For both modes the knowledge can be
provided in form of connection matrices
Flexible integration allows the inference heuristic to ignore prior knowledge when the model fit is substantially worsened. This is represented by an additional term in the objective function (model error) now resulting in
The term J_{i,output} corresponds to the previously in (9) defined evaluation of output deviation, while
λ weighs the overall consideration of prior knowledge, s represent the score values of the respective prior knowledge and d describe the distances between the prior knowledge and the modelled structure (incoming
connections) evaluated by comparison of signs. That means the resulting elements of
For the extraction from published literature the software Pathway Studio V9 provides a gene relation database termed ResNet Mammalian, which has been compiled by automatic extraction of interactions from PubMed, as evaluated by [15]. As shown in the latter publication, gene relations derived from Pathway Studio V9 can be considered of high quality, since in general scientific literature is a reliable resource and the false positive rate is reported to be about 10 %.
Further, the tool matrixscan from the RSAT toolbox determines putative TFBS in the promoter regions of target genes, which might be involved in transcriptional regulation [16]. This approach requires known sequence motifs of the investigated transcription factors as well as promoter sequences. Sequence motifs are stored in form of position weight matrices (PWM), which describe relative nucleotide frequencies for each motif position, as can be obtained from the Transfac database (Version 2010) [17]. Gene promoter sequences are available from Ensembl using biomaRt, [18].
Additional prior knowledge about the regulatory potential of the individual genes can be obtained by examining the known molecular functions. For example, the interaction between genes coding for nonregulatory proteins, such as structural proteins, and target genes can be assigned “no connection”.
Simulation and graphical output
For every comparison of measurement and simulation as well as the generation of results
the model equations (2) must be integrated. This corresponds to an initial value problem
that is solved numerically. Since the recent implementation of the NetGenerator algorithm
is in R, repeated operations of certain types take a long time. Therefore, the model
itself is implemented in C, created iteratively and simulated applying the implicit
method “impAdams” of the Rpackage deSolve,
[19]. The necessary initial conditions
The final result of the NetGenerator algorithm is a parametrised model of the considered GRN. Moreover, the new implementation of the algorithm contains important graphical output facilities which have been extended to meet the needs of displaying multiinput multiexperiment data as well as different results concerning prior knowledge. First, there is a graphical comparison of measurements and simulations, showing the single measured data points and the corresponding simulated trajectory. This can be done either by comparison of each component (gene) over all experiments or by displaying the data for each experiment independently. Second the resulting network structure can be displayed as a directed graph applying the language DOT and the software collection Graphviz, [20]. Nodes denote the biochemical components, e.g. genes, and edges display connections of either activation or inhibition. In case of applying prior knowledge (see respective subsection), a comparison between the inferred network and this knowledge is displayed with a colour code. Black edges denote inferred connections without prior knowledge, green edges present an agreement, red edges could either have a wrong sign (e.g. activation instead of inhibition) or be connections that do not comply with prior knowledge, while grey dashed edges stand for prior knowledge not reproduced in the inferred network.
Further settings and updated methods
The NetGenerator algorithm itself can be controlled by parameters (settings) and also contains further methods that will be summarised in the following. An important setting is the “allowedError” that controls the structure optimisation. If the objective function value of a certain submodel structure is worse than this value the model structure must be extended as described. Therefore smaller values of “allowedError” are indirectly leading to more complex structures. Further important settings are the maximal number of connections and submodel order.
Additional updated or new methods, not described extensively here, include nonlinear modelling and knowledgebased methods. The optional nonlinear modelling approach contains an additional sigmoid transformation of the linear combination described in this publication. This transformation has its biological background in the saturating behaviour of gene expression. The additional nonlinear parameters of each submodel are determined by the described nonlinear parameter identification, too. Amongst further knowledgebased methods, the most important presents the possibility of retrieving network information from databases and combining this information with the inferred model in a directed graph. In that way, the biological interpretation can be extended by introducing unmeasured components into the network structure.
Availability
The algorithm has been implemented as a package in the programming language / statistical computing environment R, [9]. It is available in form of a testing bundle containing both the algorithm and the examples at http://www.biocontroljena.com/NetGenerator/NetGeneratorBundle.zip webcite
Results
Example networks
We applied the NetGenerator algorithm, which has been described extensively in the Methods section, to 3 benchmark examples and 1 realworld example to examine the performance of our approach. At first, we consider the three benchmark systems, their corresponding artificial data and inferred networks in order to test the reliability and performance of our algorithm. Particularly, we investigated whether network inference from multiple data sets, originating from different stimulation experiments, is beneficial. Finally, we applied NetGenerator to microarray time series data gained from human mesenchymal stem cells. We focussed on the modelling of gene regulation that occurs during in vitro stimulation of chondrogenic differentiation of these cells, with emphasis on the different effects triggered by multiple stimuli in the inferred model.
Benchmark examples
We constructed three fully parametrised benchmark systems based on linear timeinvariant
descriptions, i.e. they are composed of differential equations representing the time
series of genes and two external stimuli (u_{1} and u_{2}). The systems are characterised by a different degree of crosstalk between the components
with respect to the external stimuli, that is “full crosstalk” (FCT): all components
are influenced by all stimuli, “limited or low crosstalk” (LCT): some of the components
are influenced by more than one stimulus, and “no crosstalk” (NCT): the stimuli influence
distinct components resulting in separate networks. They also differ in the number
of genes (FCT: 5, LCT: 4, NCT: 7) and the parameters. The artificial data were generated
exhibiting characteristics of real microarray time series data, i.e. low number of
time points (six), exponentially increasing time intervals, and additional normally
distributed noise
Evaluation measures. The network inference of benchmark systems can be evaluated by determining the final objective function value (model error) J according to equation (10), the computation time t_{C}, and statistical measures that quantify the performance of the network inference by comparing the known structure with the inferred structure. The indicated computation times resulted from running the examples on a x86PC with a 2.33 GHzCPU. The measures comprise sensitivity (SE), specificity (SP), precision (PR) and Fmeasure (FM). The definitions of the measures take into account the correctly integrated edges (true positives, TP), the falsely integrated edges (false positives, FP), the truly missing edges (true negatives, TN) and true edges that are not contained in the model result (FN). False positives (FP) were further grouped into FP_{s}, connections integrated with wrong sign and FP_{n}, modelled interactions which are not present in the real network. This leads to the following definitions:
For all three benchmark examples, we evaluated the inference by those statistical measures showing the reproduction of the system structure and time series by the model.
FCT scenarios and network inference evaluation. For FCT, artificial data generation and subsequent network inference was performed within three scenarios: (i) “S_{1}”: single experiment applying only u_{1}, (ii) “S_{2}”: single experiment applying only u_{2} and (iii) “M”: multiple experiment integrating experiments “S_{1}” and “S_{2}”. For the special case of FCT, the scenarios allowed us to directly compare the inference of multiple stimuli data sets with the inferences of single stimulus data sets.
We applied the network inference to each of the three scenarios (“M”, “S_{1}”, “S_{2}”) for a series of 10 different settings varying the previously described “allowedError” = 0.001, 0.002, …, 0.01 resulting in 10 models, see Figure 3. Results for all statistical measures are depicted as connected points in individual boxes. The three scenarios are plotted in distinct colours (“M”: blue, “S_{1}”: red, “S_{2}”: green) in each box. With regard to sensitivity, M models performs best, showing gradually decreasing values. Specificity obtains highest values for the first and second model. Fmeasure results, which benefit from high sensitivity values, display good performance for all M models. The resulting model error increases gradually as expected, due to the increased “allowedError”, which is defined per time series. Analysing these results, we found “M 1” (TP = 18, TN = 15, FP_{n} = 2, FP_{s} = 0, FN = 0) to be optimal with respect to the evaluation measures (SE = 1, SP = 0.88, FM = 0.94, J = 0.004). For this model, the computation time was t_{C} = 92 s.
Figure 3. “Full crosstalk” example: inference statistics. Comparison of different NetGenerator inference results (varying the “allowedError” parameter) of three “full crosstalk” (FCT) scenarios (“M”, “S_{1}”, “S_{2}”) using three statistical evaluation measures (sensitivity, specificity, and Fmeasure) and the model error.
Dynamics of this model are displayed in Figure 4 showing a good reproduction of all time series for each of the two experiments. In Figure 5, the corresponding regulatory network is presented in form of a directed graph. Here, the colour code is not denoting a reproduction of prior knowledge but a graphical means displaying TP (green), FP (red) and FN (grey/dashed) connections.
Figure 4. “Full crosstalk” example: time courses. Comparison of the “full crosstalk” (FCT) time courses. Each panel displays the results of one gene: the simulated time course (solid line), interpolated measurements (dashed line) and the measured time series (dots) for both data sets (Experiment1 and Experiment2).
Figure 5. “Full crosstalk” example: inferred network. Network structure of the “full crosstalk” (FCT) model containing two simulated inputs (Input1, Input2) and five gene nodes. Inferred connections are highlighted in green (TP = 18), red (FP_{n}=2) and dashed grey (FN = 0).
LCT and NCT network inference evaluation. In order to test whether NetGenerator is capable of inferring different crosstalk structures, we generated benchmark systems LCT and NCT. Both contain biologically motivated types of crosstalk, such as crosstalk of downstream components or separate subnetworks (no crosstalk). Inference of both networks was successful, shown by high statistical measures (SE_{LCT} = 1, SP_{LCT} = 1, FM_{LCT} = 1, J_{LCT} = 0.0007, SE_{NCT} = 0.9, SP_{NCT} = 0.98, FM_{NCT} = 0.92, J_{NCT} = 0.003), the inferred network structures in Figure 6 and Figure 7, and the good reproduction of the time courses (Additional file 1 and Additional file 2). The computation time for inference of LCT and NCT was t_{C} = 28 s and t_{C} = 33 s, respectively.
Figure 6. “Limited crosstalk” example: inferred network. Network structure of the “limited crosstalk” (LCT) model containing two simulated inputs (Input1, Input2) and four gene nodes. Inferred connections are highlighted in green (TP = 9), red (FP_{n} = 0) and dashed grey (FN = 0).
Figure 7. “No crosstalk” example: inferred network. Network structure of the “no crosstalk” (NCT) model containing two simulated inputs (Input1, Input2) and seven gene nodes. Inferred connections are highlighted in green (TP = 18), red (FP_{n} = 1) and dashed grey (FN = 2).
Additional file 1. Figure: “Limited crosstalk” example, time courses. Comparison of the “limited crosstalk” (LCT) network time courses. Each panel displays the results of one gene: the simulated time course (solid line), interpolated measurements (dashed line) and the measured time series (dots) for both data sets (Experiment1 and Experiment2).
Format: PDF Size: 279KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Figure: “No crosstalk” example, time courses. Comparison of the “no crosstalk” (NCT) network time courses. Each panel displays the results of one gene: the simulated time course (solid line), interpolated measurements (dashed line) and the measured time series (dots) for both data sets (Experiment1 and Experiment2).
Format: PDF Size: 477KB Download file
This file can be viewed with: Adobe Acrobat Reader
Chondrogenesis model
Background of chondrogenic data. Human mesenchymal stem cells (hMSC) are multipotent adult stem cells that have the capacity to differentiate into a variety of cell types depending on the external stimulus, [2]. Regulation of lineagespecific genes is crucial in this temporal process, [21]. Transforming growth factor (TGF)beta1 is essential for induction of chondrocyte differentiation of hMSC, a process which is strongly enhanced by the additional presence of bone morphogenetic protein (BMP)2, [22] and [23]. In this section, we describe the complementary effects of TGFbeta1 and BMP2 by multistimuli multiexperiment inference applying the NetGenerator algorithm.
Microarray time series data. hMSC from bone marrow were commercially obtained (Lonza) and cultured as described in [2]. To induce chondrogenic differentiation trypsinised hMSC were pelleted and subsequently incubated in culture medium supplemented with 100 nM dexamethasone, 10 ng/mL TGFbeta1 and, if applicable, 50 ng/mLBMP2. Timedependent gene expression was studied under three experimental conditions: (i) following treatment with TGFbeta1 (“T”), (ii) following treatment with TGFbeta1 + BMP2 (“TB”) and (iii) untreated hMSC as a control. At 10 different time points (0, 3, 6, 12, 24, 48, 72, 128, 256, 384) hafter addition of the stimuli, RNA was isolated from three technical replicates per time point and measured on Affymetrix HGU133a microarrays.
Preprocessing and filtering. Raw microarray data was preprocessed as described in the corresponding subsection. This included the use of custom chip definition files provided by [10] and application of the RMA method [11]. This procedure resulted in logarithmised gene expression estimates for 12 095genes.
Modelling a smallscale GRN using microarray data requires adequate filtering of genes. We tested all genes for differential expression, used functional annotation and expert knowledge. Differentially expressed genes were identified for both experiments (“T”, “TB”) by computing adjusted pvalues using LIMMA. All genes with an adjusted pvalue less than 10^{−10}and an absolute foldchange greater than 2 for any time point were considered significant. Using those criteria, 192 genes were found to be differentially expressed compared to control as well as between “T” and “TB”. Subsequently, we selected from this group 10 annotated transcription factors (GO:0003700, sequencespecific DNA binding transcription factor activity) and associated 5 of them (SOX9, MEF2C, MSX1, TRPS1, SATB2) with our investigated process (GO:0051216, cartilage development). Those genes may be involved in promoterdependent regulation, which is important for binding site predictions. Furthermore, we added COL2A1, ACAN, COL10A1, all three essential marker genes of chondrocyte differentiation, which encode essential structural proteins of the extracellular matrix.
Prior knowledge. Prior knowledge was taken into account as described in the corresponding subsection. Gene interactions were retrieved from the Pathway Studio ResNet Mammalian database. We obtained 6 genegene and 5 inputgene regulatory interactions. Genegene interactions were passed as flexible prior knowledge to NetGenerator. Inputgene interactions were not integrated. Additionally, potential gene interactions were determined by binding site predictions. For this purpose, we obtained PWMs for SOX9, MEF2C and MSX1 from the Transfac database and promoter sequences 1000 bp upstream from the transcription start site. Both PWMs and sequences were loaded into matrixscan from RSAT, which is performed with default options (weightscore >1, pvalue <10^{−4}) and organismspecific estimation of background nucleotide frequencies. The resulting significant binding sites have been listed in the table of Additional file 3. The observed high significance of all matches minimises the risk obtaining similar results from random sequences.
Additional file 3. Table: Results of RSAT. Results of RSAT matrixscan tool using Transfac PWMs and genomic DNA sequences from Ensembl. Each row represents a predicted binding site with Transfac motif (“PWM”), target gene, start and end coordinates, the matched sequence, match score (“Weight”) and associated pvalue.
Format: CSV Size: 1KB Download file
Network inference of multistimuli (TGFbeta1 and BMP2) multiexperiment data. After preprocessing, the input and time series data of the microarray experiments were passed to NetGenerator for automatic network inference. According to the experimental setup, the available data sets describe two experiments: only TGFbeta1 stimulation (“T”) and TGFbeta1 + BMP2 stimulation (“TB”). This is mirrored by the two distinct input data matrices both describing the respective stimuli by step functions
Model evaluation and validation. The inference results of the chondrogenic system, the GRN and the graphical comparison of time series, are displayed in Figure 8 and Additional file 4, respectively. The resulting network contains 20 connections: 14 genegene connections and 6 inputgene connections. Compared to the prior knowledge, there are 10 green connections (consistent), 1 red connection (wrong sign) and 3 blue connections (additional colour code for predicted binding site).
Figure 8. Chondrogenesis system: inferred network. Network of the chondrogenesis system, which contains two inputs (TGFbeta1 and BMP2). Nodes represent either transcription factor genes (SOX9, MEF2C, MSX1, TRPS1, SATB2) or genes coding for structural proteins (COL2A1, COL10A1, ACAN). Connections are coloured in green (consistent with prior knowledge), red (contrary to prior knowledge), blue (predicted connection with existing binding site) and black (predicted interaction). Connection widths and percentage labels illustrate the frequency of occurrence in the validation procedure.
Additional file 4. Figure: Chondrogenesis system, time courses. Comparison of the chondrogenesis system time courses. Each panel displays the results of one gene: the simulated time course (solid line), interpolated measurements (dashed line) and the measured time series (dots) for both data sets (“T” and “TB”).
Format: PDF Size: 547KB Download file
This file can be viewed with: Adobe Acrobat Reader
For validation, we performed resampling which is based on random perturbation of time
series data. A Gaussian noise component
Network interpretation. SOX9 exhibits a central role in this chondrogenic network and is activated by both TGFbeta1 and BMP2. This indicates a complementary effect of both stimuli on the expression of SOX9. Activated SOX9 drives the expression of its target genes COL2A1, ACAN and COL10A1 [2426]. This regulation marks the essential formation of cartilagespecific structural components of the extracellular matrix and the differentiation of hMSC towards chondrocytes. Beside this process, SOX9 also activates the repressor gene TRPS1 and vice versa. Regulatory interactions between both factors has not yet been addressed in the literature. Additionally, the SOX9 binding motif is present in the proximal promoter of the TRPS1 gene according to the prior knowledge. There is also a modelled effect from TRPS1 on MEF2C, which in turn activates COL10A1 and ACAN, but represses SOX9. This represents a negative global feedback from MEF2C on SOX9 in our model. MEF2C also represses the expression of MSX1, which is solely activated by BMP2 stimulus and activates COL2A1 according to the prior knowledge [27]. MSX1 also activates the SATB2 gene, which in turn activates MEF2C expression. Negative regulation of ACAN by TGFbeta1 is contrary to prior knowledge, as indicated by the red connection of the network graph. However, TGFbeta1 can also activate ACAN indirectly through SOX9, [24]. In summary, the central player SOX9 is influenced by both TGFbeta1 and BMP2. Essential structural proteins are not solely regulated by SOX9, but also by other transcription factors (MEFC, MSX1). Moreover, SOX9 and MSX1 are repressed by MEF2C through negative feedback that involves TRPS1 and SATB2.
Discussion
The NetGenerator algorithm for automatic network inference from multiinput multiexperiment time series data and prior knowledge, described in the methods section, will be classified and distinguished from other methods in the next subsection. Therefore, its properties will be reviewed and justified showing advantages and disadvantages to other approaches. Our discussion contains a wide spectrum of other methods, but will only go into detail for the ones closely related to NetGenerator. Also, further specifications of NetGenerator will be summarised without a detailed comparison to other methods.
Classification of the algorithm
Good review articles on methods for automatic inference of GRNs can be found in [1,3,4]. The different methods can be classified by the data type (static or dynamic), the mathematical approach (e.g. probabilistic vs. deterministic) and the result (e.g. undirected vs. directed graphs, algebraic correlation vs. dynamic models) whereby various combinations are possible. Mutual information methods (for a review of ARACNE, CLR and MRNET see [28]) are based on evaluating the statistical dependencies of large data sets resulting in undirected graphs. In comparison to NetGenerator they possess far different preconditions and purposes, for example they do not consider a concerted influence of the variables or the dynamics of the statespace concept, and therefore a more detailed comparison is set aside. Even though dynamical Boolean networks for generegulation, first proposed by [29], possess some similarities to discretevalued statespace models, their rulebased approaches typically lead to rather qualitative results (for an overview of recent methods see the aforementioned review articles).
Very often, like in case of the core elements of NetGenerator, GRNs are based on linear modelling, i.e. the behaviour of one variable depends on a linear combination of other variables. Still the method can be a combination of either probabilistic or deterministic approach as well as algebraic correlation modelling (equations system) or dynamic modelling (differential equations system). In the case of the probabilistic modelling which is especially covered by static and dynamic Bayesian networks (see aforementioned review articles) the inference is based on the application of probability distributions to describe the uncertainties or noise inherent in GRNs. Beside the differences in the mathematical approach, probabilistic modelling includes the determination of statistical parameters and therefore generally more data replicates are required in comparison to deterministic modelling approaches such as NetGenerator.
Deterministic linear modelling applied to automatic network inference, [30], can be distinguished into at least two types depending on the results: (i) algebraic equations systems, e.g. [31], and (ii) differential (difference) equations systems, e.g. [32]. Although they have different prepositions on the dynamics of time series data, both types can be solved by linear regression. Still, there is a disproportion between the number of free parameters and available measurement data on the one hand and the property of sparsity of GRNs on the other hand. For the former interpolated data points can be applied under the assumption that the influence of the chosen interpolation on the results can be neglected. For the reproduction of sparse networks the regression can be combined with model reduction, for example using the large group of LASSObased algorithms, see e.g. [3336], on the basis of PCA (SVD), [37], or a combination of both, [38]. For further approaches, see the aforementioned review articles.
In contrast to all these methods, we propose the NetGenerator algorithm dealing with the problem of data number and sparsity in a different way. The algorithm is not inferring the network structure and parameters in one go. Instead we applied an heuristic approach of explicit structure optimisation, which iteratively generates a system of sparsely coupled submodels. In that way, the GRN property of possessing more or less hierarchical input to output structures is reproduced. Thus, only the parameters of submodels describing one time series have to be determined. A major drawback of regressionbased solutions of linear differential (difference) equations systems is the necessity of applying numerical derivatives of small sample size and noisy data, which have a strong influence on the resulting network and modelled dynamics. NetGenerator uses a different solution, whereby the regression just provides initial parameters for a nonlinear optimisation of an objective function of the least squares type. Overall, the final dynamic network can be obtained by a lower computational effort, because in comparison to the total number of parameters (N^{2} + M·N) in the model description (2) only a small number of parameters has to be determined.
Inference from multistimuli multiexperiment time series data
The concept of inferring from multiple data sets is also applied by [38], however on the basis of principal components of those data sets. The work of [39] provides a multiple methods framework to integrate distinct types of data like steadystate and time series data, focussing mainly on the combination of knockout and stimulation data.
The proposed NetGenerator V2.0 algorithm allows for integrating data sets of multiple experiments with multiple stimuli. In the inferred models, weighed input terms represent external stimuli and resulting GRNs represent the merged effects of the diverse experiments. Therefore, from a biological point of view, the algorithm is able to handle experiments which investigate the degree of crosstalk.
We applied and tested this feature for 3 benchmark examples and 1 realworld example, the gene regulation during chondrogenic differentiation. The evaluation of the benchmark examples’ results showed the power of the algorithm to infer the network structure and to reproduce the time series. Further, for a special system of “full crosstalk”, i.e. all components are influenced by all stimuli, we could show that the simultaneous utilisation of different data sets leads to higher model quality compared to modelling data sets individually. The reason for this effect is due to the different stimulation by another external input which alters the time series data qualitatively and quantitatively, something that could not be achieved by biological replicates of a single input experiment. This underlines the benefit of using our integrated approach. Further, the presented examples LCT and NCT are possible outcomes of GRN investigations. In the first case, there are two different types of genes: some are induced by one stimulus only and some are induced by multiple stimuli. The model inferred by NetGenerator contains both the separate and common structural elements. The special case of NCT occurs, if network parts are stimulated that are not connected at all. In summary, the extended NetGenerator takes advantage of multistimuli multiexperiment data by network refinement and extension.
We further inferred a twostimulus network for hMSC differentiating towards chondrocytes. This network model contains gene regulatory events following the stimulation with two distinct chondrogenic factors, therefore providing a view on how genes involved in differentiation might be controlled by external molecules. Applying a subsequent resampling gives further information about the connections of this inferred GRN: (i) the majority of connections, especially the ones of prior knowledge and predicted binding sites, occur with a high frequency which can be considered a measure of reliability and (ii) the ranking of the frequencies can be used in interpreting the results with regard to biological hypotheses. Overall, this shows the importance for an extension of NetGenerator to deal with multiple data sets.
Consideration of prior knowledge
The means to integrate prior knowledge (fix and flexible) into the network inference is a distinctive feature of the extended NetGenerator algorithm achieved by modifying the objective function. This feature can reduce the complexity of the structure optimization, although it strongly depends on the origin and quality of the given knowledge, see e.g. [7]. Using prior knowledge for network inference can also be found in several other algorithms, see [1,3,4].
For our example of chondrogenic differentiation, we exemplarily showed network inference using flexible prior knowledge about regulatory interactions extracted from a database (Pathway Studio). The graphical evaluation of the inferred network showed very good reproduction of the proposed prior knowledge. Further predicted connections could be associated with potential regulatory binding sites generated from sequence data (Transfac, Ensembl).
Further aspects
Apart from the linear modelling presented in detail, the ability of NetGenerator to infer a nonlinear model has been mentioned as a further option. The additional sigmoid function describing saturation in geneexpression has been proven successful before, e.g. [4042]. Since the sigmoid transformation has also been used for neural network models, those inference methods are sometimes classified as such.
Besides the many advantages and possible application areas, there are minor restrictions of NetGenerator: it should be applied to preprocessed data without high correlations, it infers networks from measured time series data and due to the heuristic approach it cannot be proven that the global solution was found. The latter can be improved by decreasing the influence of noisy data using a bootstrap (resampling) approach, see chondrogenesis example and [1]. One feature which might be introduced in subsequent versions is the application of “interventional” multiexperiment data, i.e. data originating from perturbations within the system. This can be dealt with by applying either experimentwise prior knowledge or an additional module in the structure optimisation explicitly dealing with that kind of data.
Conclusions
We presented the novel NetGenerator algorithm for automatic inference of GRNs, which applies multistimuli multiexperiment time series data and biological prior knowledge resulting in dynamical models of differential equations systems. This heuristic approach combines network structure and parameter optimisation of coupled submodels and takes into account the biological properties of those networks: indirect transcriptional events for information propagation, limited number of connections and mostly hierarchical structures. The analysis of benchmark examples showed a good reproduction of the networks and emphasised the biological relevance of inferred networks with a different degree of crosstalk. The ability to infer a realworld example based on multistimuli multiexperiment data was shown by application of NetGenerator to a system of growth factorinduced chondrogenesis.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MW and SGH drafted the manuscript. SGH and DD contributed to the development and programming of the NetGenerator algorithm and software as well as to the mathematical and modelling background. MW and RG contributed to data processing, application of NetGenerator to examples, statistical evaluation and the biological interpretation. SV contributed to the generation of the benchmark systems and their artificial data. EJvZ contributed to experimental setups, measurements and biological interpretation of the chondrogenic investigation. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank all our LINCONET project partners of the ERASysBio+ initiative. Also, we kindly acknowledge the support of this work by the BMBF (German Federal Ministry of Education and Research) funding MW within this initiative (Fkz. 0315719). Further, we acknowledge the Virtual Liver Network initiative of the BMBF for granting support (Fkz. 0315760 and Fkz. 0315736).
References

Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R: Gene regulatory network inference: Data integration in dynamic models—A review.
Biosystems 2009, 96:86103. PubMed Abstract  Publisher Full Text

Piek E, Sleumer LS, van Someren EP, Heuver L, Haan JRd, Grijs Id, Gilissen C, Hendriks JM, van Ravesteinvan Os RI, Bauerschmidt S, Dechering KJ, van Zoelen EJ: Osteotranscriptomics of human mesenchymal stem cells: accelerated gene expression and osteoblast differentiation induced by vitamin D reveals cMYC as an enhancer of BMP2induced osteogenesis.
Bone 2010, 46(3):613627. PubMed Abstract  Publisher Full Text

de Jong H: Modeling and simulation of genetic regulatory systems: a literature review.
J Comput Biol 2002, 9:67103. PubMed Abstract  Publisher Full Text

Bansal M, Belcastro V, AmbesiImpiombato A, Di Bernardo D: How to infer gene networks from expression profiles.
Mol Syst Biol 2007, 3:78. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guthke R, Möller U, Hoffmann M, Thies F, Töpfer S: Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection.
Bioinformatics 2005, 21(8):16261634. PubMed Abstract  Publisher Full Text

Toepfer S, Guthke R, Driesch D, Woetzel D, Pfaff M: The NetGenerator algorithm: reconstruction of gene regulatory networks. In Lecture Notes in Computer Science. Edited by Tuyls K, Westra R, Saeys Y. Nowé A, Berlin and Heidelberg: Springer Berlin Heidelberg; 2007:119130.

Linde J, Wilson D, Hube B, Guthke R: Regulatory network modelling of iron acquisition by a fungal pathogen in contact with epithelial cells.
BMC Syst Biol 2010, 4:148. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Albrecht D, Kniemeyer O, Mech F, Gunzer M, Brakhage A, Guthke R: On the way toward systems biology of Aspergillus fumigatus infection.
Int J Med Microbiol 2011, 301(5):453459. PubMed Abstract  Publisher Full Text

R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2008.
[http://www.Rproject.org webcite]

Ferrari F, Bortoluzzi S, Coppe A, Sirota A, Safran M, Shmoish M, Ferrari S, Lancet D, Danieli G, Bicciato S: Novel definition files for human GeneChips based on GeneAnnot.
BMC Bioinformatics 2007, 8:446. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics 2003, 4(2):249264. PubMed Abstract  Publisher Full Text

Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.
Stat Appl Genet Mol Biol 2004, 3:Article3. PubMed Abstract  Publisher Full Text

Draper NR, Smith H: Applied regression analysis. A WileyInterscience publication, New, York: Wiley; 1998.

Byrd RH, Lu P, Nocedal J, Zhu C: A limited memory algorithm for bound constrained optimization.
SIAM J Sci Comput 1995, 16(5):1190. Publisher Full Text

Yuryev A, Mulyukov Z, Kotelnikova E, Maslov S, Egorov S, Nikitin A, Daraselia N, Mazo I: Automatic pathway building in biological association networks.
BMC Bioinf 2006, 7:171. BioMed Central Full Text

ThomasChollier M, Defrance M, MedinaRivera A, Sand O, Herrmann C, Thieffry D, van Helden J: RSAT 2011: regulatory sequence analysis tools.
Nucleic Acids Res 2011, 39(Web Server issue):W86W91. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Matys V, KelMargoulis OV, Fricke E, Liebich I, Land S, BarreDirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, LewickiPotapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.
Nucleic Acids Res 2006, 34(Database issue):D108D110. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart – biological queries made easy.
BMC Genomics 2009, 10:22. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Soetaert K, Petzoldt T, Setzer RW: Solving differential equations in R: package deSolve.

Gansner ER, North SC: An open graph visualization system and its applications to software engineering.
Software–Practice and Experience 1999, 00:15. PubMed Abstract  PubMed Central Full Text

Hartmann C: Transcriptional networks controlling skeletal development.
Curr Opin Genet Dev 2009, 19(5):437443. PubMed Abstract  Publisher Full Text

Jin EJ, Lee SY, Choi YA, Jung JC, Bang OS, Kang SS: BMP2enhanced chondrogenesis involves p38 MAPKmediated downregulation of Wnt7a pathway.
Mol Cells 2006, 22(3):353359. PubMed Abstract

van der Kraan PM, Blaney Davidson EN, Blom A, van den Berg WB: TGFbeta signaling in chondrocyte terminal differentiation and osteoarthritis: modulation and integration of signaling pathways through receptorSmads.
Osteoarthritis and Cartilage / OARS, Osteoarthritis Res Soc 2009, 17(12):15391545. Publisher Full Text

Sekiya I, Tsuji K, Koopman P, Watanabe H, Yamada Y, Shinomiya K, Nifuji A, Noda M: SOX9 enhances aggrecan gene promoter/enhancer activity and is upregulated by retinoic acid in a cartilagederived cell line, TC6.
J Biol Chem 2000, 275(15):1073810744. PubMed Abstract  Publisher Full Text

Yamashita S, Andoh M, UenoKudoh H, Sato T, Miyaki S, Asahara H: Sox9 directly promotes Bapx1 gene expression to repress Runx2 in chondrocytes.
Exp Cell Res 2009, 315(13):22312240. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Oh Cd, Maity SN, Lu JF, Zhang J, Liang S, Coustry F, Crombrugghe Bd, Yasuda H: Identification of SOX9 interaction sites in the genome of chondrocytes.
PLoS ONE 2010, 5(4):e10113. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Craft AM, Krisky DM, Wechuck JB, Lobenhofer EK, Jiang Y, Wolfe DP, Glorioso JC: Herpes simplex virusmediated expression of Pax3 and MyoD in embryoid bodies results in lineagerelated alterations in gene expression profiles.
Stem Cells 2008, 26(12):31193129. PubMed Abstract  Publisher Full Text

Meyer PE, Lafitte F, Bontempi G: minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information.
BMC Bioinf 2008, 9:461. BioMed Central Full Text

Kauffman S: Metabolic stability and epigenesis in randomly constructed genetic nets.
J Theor Biol 1969, 22(3):437467. PubMed Abstract  Publisher Full Text

D’haeseleer P, Wen X, Somogyi R, Fuhrman S: Linear modeling of mRNA expression levels during CNS development and injury.

Altwasser R, Linde J, Buyko E, Hahn U, Guthke R: Genomewide scalefree network inference for Candida albicans.

Gustafsson M, Hornquist M, Lombardi A: Constructing and analyzing a largescale genetogene regulatory network–Lassoconstrained inference and biological validation.
IEEE/ACM Transac Comput Biol Bioinf 2005, 2(3):254261. Publisher Full Text

Tibshirani R: Regression shrinkage and selection via the Lasso.

van Someren E, Wessels L, Reinders M, Backer E: Regularization and noise injection for improving genetic network models. In Computational and Statistical Approaches to Genomics. Edited by Zhang W, Shmulevich I. NJ, USA: World Scientific Publishing Co.; 2002:211226.

Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V: The Inferelator: an algorithm for learning parsimonious regulatory networks from systemsbiology data sets de novo.
Genome Biol 2006, 7(5):R36. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hecker M, Goertsches R, Engelmann R, Thiesen HJ, Guthke R: Integrative modeling of transcriptional regulation in response to antirheumatic therapy.
BMC Bioinformatics 2009, 10:262. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bansal M, Gatta GD, Di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles.
Bioinformatics 2006, 22(7):815822. PubMed Abstract  Publisher Full Text

Wang Y, Joshi T, Zhang XS, Xu D, Chen L: Inferring gene regulatory networks from multiple microarray datasets.
Bioinformatics 2006, 22(19):24132420. PubMed Abstract  Publisher Full Text

Gupta R, Stincone A, Antczak P, Durant S, Bicknell R, Bikfalvi A, Falciani F: A computational framework for gene regulatory network inference that combines multiple methods and datasets.
BMC Syst Biol 2011, 5:52. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Weaver DC, Workman CT, Stormo GD: Modeling regulatory networks with weight matrices.

Wahde M, Hertz J: Coarsegrained reverse engineering of genetic regulatory networks.
Biosystems 2000, 55(13):129136. PubMed Abstract  Publisher Full Text

Mjolsness E, Mann T, Castaño R, Wold B: From coexpression to coregulation: An approach to inferring transcriptional regulation among gene classes from largescale expression data. In Advances in Neural Information Processing Systems, Volume 12. Edited by Solla SA, Leen TK. Müller KR: MIT Press; 2000:928934.