European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

Biological Engineering Department, Massachusetts Institute of Technology, Cambridge, MA, USA

Abstract

Background

Cells process signals using complex and dynamic networks. Studying how this is performed in a context- and cell-type-specific way is essential to understand signaling in both physiological and diseased situations. Context-specific medium- and high-throughput proteomic data measured upon perturbation are now relatively easy to obtain, but formalisms that can take advantage of these features to build models of signaling are still comparatively scarce.

Results

Here we present

Conclusions

Models generated with this pipeline have two key features. First, they are constrained by prior knowledge about the network but trained to data. They are therefore context and cell line specific, which results in enhanced predictive and mechanistic insights. Second, they can be built using different logic formalisms depending on the richness of the available data. Models built with

Background

Cells receive and interpret information through complex signaling networks. The correct processing of signals is essential and frequently altered in diseases

Gathering medium to high-throughput signaling data is becoming more feasible as proteomic technologies are getting more mature

We recently introduced a method that integrates literature and perturbation data to overcome the shortcomings of both

We present here a tool that implements the methods in

Implementation

The

The

**The ****framework. ****A**. A **B**. Only steps that are specific to a particular logic formalism are coded in add-on packages. **C**. The choice of a logic formalism depends on the data at hand and the modeling goals: with no time course data, the user can choose between the two steady-state implementations (

Import of network and data

The package then performs normalisation of the data for logic modeling, a feature described in

Processing of the network

The network is converted into logic models for training with two pre-processing steps: (1) compression and (2) expansion. In the compression step, species that are neither measured nor perturbed are removed if the logical consistency of the network is not impaired, resulting in a simplified network for training. This step is performed because such nodes are not necessary for the correct training of the model. However, starting from a prior knowledge network (PKN) facilitates: (i) identifying and preserving nodes whose presence is necessary to maintain the logical consistency of the network, (ii) mapping the trained model back onto the starting network (thereby preserving interpretability) and (iii) restricting the search to a set of interactions that are feasible based on prior knowledge. In the expansion step, interactions are converted into all possible logic gates. For example, if there is an edge from node B to A and from node C to A, the following gates are created: (i) B AND C → A, (ii) B OR C → A, (iii) B → A, (iv) C → A. The rationale behind this step is that, although databases record a potentially functional interaction between B and A and between C and A, it is rarely recorded whether these interactions are independent or not (i.e. whether B and C are both required to activate A, or only one of them), or even whether any of them is active in the specific context under investigation. Therefore,
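The expansion step can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name `expand_gates` and the string representation of gates are assumptions made for illustration only. For one target node, it enumerates the single-input gates plus an AND and an OR gate for every combination of two or more inputs.

```python
from itertools import combinations


def expand_gates(target, inputs):
    """List candidate logic gates feeding `target` (illustrative helper, not a package function)."""
    gates = []
    for size in range(1, len(inputs) + 1):
        for combo in combinations(sorted(inputs), size):
            if size == 1:
                # A single input acting on its own.
                gates.append(f"{combo[0]} -> {target}")
            else:
                # Several inputs combined, either all required (AND) or any sufficient (OR).
                gates.append(f"{' AND '.join(combo)} -> {target}")
                gates.append(f"{' OR '.join(combo)} -> {target}")
    return gates


# The two-input example from the text: edges B -> A and C -> A.
GATES_FOR_A = expand_gates("A", ["B", "C"])
```

For the two-input example this yields the four gates listed in the text. The number of candidate gates grows quickly with the number of inputs, which is one reason why compressing the network first is useful.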

Training

Next, the model is trained to data by searching for models (i.e. sub-models of the scaffold model that include a subset of its edges) that minimize a two-term objective function. This function balances the fit to data (the deviation between the data and the output of the Boolean logic model at steady state, in matched conditions) against model size, according to equation 1.

In equation 1, $\theta_f$ (equation 2) is the mean squared deviation between model prediction ($B^M$) and data ($B^E$) across the $n_g$ data points. $\theta_s$ (equation 3) penalises the model size by summing the number of inputs ($\nu_e$) of each edge selected in model $P$ and dividing by the total number of inputs across all edges.
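As a rough illustration of how the two terms of the objective function combine, the following Python sketch computes a fit term (mean squared deviation between simulated and measured values) and a size term (inputs of the selected edges divided by the inputs of all edges). The function names, the flat-list data layout and the `size_weight` parameter that trades fit against size are assumptions made for this example, not the package's interface.

```python
def fit_term(predicted, measured):
    """Mean squared deviation between model output and data over all data points."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)


def size_term(selected, n_inputs):
    """Inputs of the selected edges, normalised by the inputs of all edges."""
    return sum(n for keep, n in zip(selected, n_inputs) if keep) / sum(n_inputs)


def score(predicted, measured, selected, n_inputs, size_weight):
    """Fit plus weighted size penalty (size_weight is an assumed tuning parameter)."""
    return fit_term(predicted, measured) + size_weight * size_term(selected, n_inputs)


# Toy numbers: four data points, three candidate edges with 1, 1 and 2 inputs.
PREDICTED = [1, 0, 1, 0]
MEASURED = [0.9, 0.1, 0.5, 0.0]
SELECTED = [1, 0, 1]
N_INPUTS = [1, 1, 2]
TOY_SCORE = score(PREDICTED, MEASURED, SELECTED, N_INPUTS, size_weight=0.1)
```

With a small `size_weight`, the size term mostly acts as a tie-breaker between models that fit the data equally well, favouring the one with fewer inputs.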

Report

Finally, the results of the training are mapped to both the prior knowledge and the scaffold network. The information relating to the analysis run is then plotted, written to file and condensed in an HTML report hyperlinked to the various diagnostic plots. Networks are output in Graphviz DOT format as well as SIF files with corresponding attributes representing the status of nodes (compressed, measured, inhibited, etc.) and the frequency with which edges are selected in the family of solution models.
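The edge selection frequency reported for a family of solution models can be sketched as follows. This is an illustrative Python example under stated assumptions: models are represented as 0/1 lists over the scaffold edges, and the family is defined as all models scoring within a relative tolerance of the best one (10% is the value used in the HepG2 case study). The function name `edge_frequencies` is hypothetical.

```python
def edge_frequencies(models, scores, tolerance=0.10):
    """Fraction of near-optimal models in which each edge is selected."""
    best = min(scores)
    # Keep only the models whose score is within `tolerance` of the best score.
    family = [m for m, s in zip(models, scores) if s <= best * (1 + tolerance)]
    n_edges = len(family[0])
    return [sum(m[i] for m in family) / len(family) for i in range(n_edges)]


# Four toy models over three edges; the third model scores too poorly to be kept.
MODELS = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
SCORES = [0.100, 0.105, 0.200, 0.109]
FREQUENCIES = edge_frequencies(MODELS, SCORES)
```

An edge with a frequency close to 1 is selected in nearly all good models, whereas intermediate values point to edges that are partially redundant with others.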

Simulation variants

This general approach is extended through a series of add-on R packages that use parts of the

Languages and dependencies

All of our packages are written in R. In order to improve computational efficiency, the core of

Results and discussion

Various simulation schemes capture different features of a system

Within the scope of logic models, various formalisms can be used to represent relationships between nodes and simulate a model. The choice of which logic formalism to use depends on the data set and the system to be modeled (see Figure

**Simulation schemes in the ****and add-ons packages. **

CellNOptR: Boolean logic at steady-state

The default

In equation 4, the state of each species $x_i$ at time $t+1$ is a Boolean function of the states at time $t$ of the species $x_{i1}, \ldots, x_{iN}$ upon which $x_i$ depends. Equation 4 is applied simultaneously to all nodes in the model until all $x_i(t+1) = x_i(t)$.
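The synchronous updating described for equation 4 can be sketched in a few lines of Python. This is a minimal illustration, not the package's simulator: the node names (`egf`, `raf`, `erk`, `mek_inhibitor`) and the rule format (one function per node) are assumptions chosen for the example.

```python
def simulate_to_steady_state(rules, initial_state, max_steps=100):
    """Update all nodes simultaneously until the state no longer changes."""
    state = dict(initial_state)
    for _ in range(max_steps):
        new_state = dict(state)
        for node, rule in rules.items():
            # Every rule reads the previous state, so the update is synchronous.
            new_state[node] = int(rule(state))
        if new_state == state:
            return state
        state = new_state
    return None  # no steady state reached within max_steps


# Toy cascade: egf -> raf -> erk, with erk blocked by an inhibitor.
RULES = {
    "raf": lambda s: s["egf"],
    "erk": lambda s: s["raf"] and not s["mek_inhibitor"],
}
STIMULATED = simulate_to_steady_state(
    RULES, {"egf": 1, "mek_inhibitor": 0, "raf": 0, "erk": 0}
)
INHIBITED = simulate_to_steady_state(
    RULES, {"egf": 1, "mek_inhibitor": 1, "raf": 0, "erk": 0}
)
```

Nodes without a rule (stimuli and inhibitors) keep their initial value, which mimics the experimental condition being held fixed during the simulation.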

CellNOptR(2t): Boolean logic at 2 steady-states

If, however, we wish to capture the transient activation of ERK, we can do so using a previously unpublished modification of the Boolean steady-state method which is available in

Using this method, we first train the model using the data at the first time point ($t_1$), just as above. In a second training step, we assume that some edges only become active at the second time point ($t_2$), and therefore search through the space of edges not included in the optimal model at $t_1$. We simulate the model using the steady state of $t_1$ as the initial state, with the added constraint that nodes receiving the input of a $t_2$ edge are locked to the state defined by that edge. This avoids nodes in a negative feedback loop never reaching a Boolean steady state: for example, if protein A activates protein B and B represses A, then when A is active B is turned ON, which turns A OFF, which in turn switches B OFF and re-establishes the ON state for A, and so on. With this modified simulation procedure, in this example A would turn B ON at $t_1$, then the negative feedback between B and A would become active at $t_2$ and lock A permanently to the OFF state (see
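The negative feedback example can be reproduced with a small Python sketch. It is an illustration under stated assumptions rather than the package's algorithm: the `locked` argument, the node `stim` and the rule format are hypothetical. The sketch shows that a naive synchronous simulation of the feedback loop oscillates, whereas locking A to the state imposed by the late edge yields a steady state.

```python
def simulate(rules, initial_state, locked=None, max_steps=50):
    """Synchronous Boolean simulation; nodes in `locked` are held at a fixed value."""
    locked = locked or {}
    state = {**initial_state, **locked}
    for _ in range(max_steps):
        new_state = dict(state)
        for node, rule in rules.items():
            if node not in locked:
                new_state[node] = int(rule(state))
        if new_state == state:
            return state
        state = new_state
    return None  # the system keeps oscillating


# First time point: stim activates A, A activates B, no feedback yet.
RULES_T1 = {"A": lambda s: s["stim"], "B": lambda s: s["A"]}
STEADY_T1 = simulate(RULES_T1, {"stim": 1, "A": 0, "B": 0})

# Second time point: the feedback edge "B represses A" is added.
RULES_T2 = {"A": lambda s: s["stim"] and not s["B"], "B": lambda s: s["A"]}
NAIVE_T2 = simulate(RULES_T2, STEADY_T1)  # oscillates, so None is returned

# Lock A to the state defined by the late edge, evaluated on the first steady state.
A_LOCKED = int(not STEADY_T1["B"])
STEADY_T2 = simulate(RULES_T2, STEADY_T1, locked={"A": A_LOCKED})
```

In this toy case A and B are both ON at the first steady state, and both are OFF at the second, which corresponds to a transient activation.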

CNORdt: Boolean logic for time course data

Steady state and multiple steady states methods are useful first approximations to capture the dynamic behavior of a system when limited time resolved data is available. However, when time courses are available, we can get further insight by using methods that can fit such data.

CNORfuzzy: constrained fuzzy logic at steady-state

A main limitation of Boolean logic models is that they can only represent the activation level of a species as ON or OFF. This means that subtle effects and partial activations such as the activation of p38 in Figure

In eq. 5, the Boolean function from eq. 4 is replaced by a transfer function for each species $i$ at time

CNORode: logic-based ordinary differential equations

In equation 6, the Boolean updating function is replaced by a continuous activation function $\bar{B}_i$ and a first order decay term, divided by a time constant $\tau_i$. For each species in the Boolean logic network, the derived ODE satisfies the condition that if the inputs of the gate to that species are Boolean (i.e. when species states tend to the limit 0 or 1), then the ODE for the species considered returns a value that is consistent with the value returned by the corresponding Boolean logic gate. The formalism used to derive the logic based ODEs was developed by
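The structure of such a logic-based ODE (a continuous activation function minus a first order decay term, divided by a time constant) can be sketched in Python as follows. Several choices here are assumptions made for illustration and are not taken from the text: the normalised Hill function used as activation, the product used as a continuous AND gate, the parameter values, and the explicit Euler integration.

```python
def hill(x, k=0.5, n=3.0):
    """Normalised Hill function mapping 0 to 0 and 1 to 1 (assumed activation shape)."""
    return (x**n / (x**n + k**n)) * (1 + k**n)


def continuous_and(a, b):
    """Product of the inputs; equals the Boolean AND when inputs are 0 or 1."""
    return a * b


def simulate_species(input_a, input_b, tau=1.0, dt=0.01, t_end=20.0):
    """Integrate dx/dt = (activation - x) / tau with explicit Euler steps."""
    x = 0.0
    target = continuous_and(hill(input_a), hill(input_b))
    for _ in range(round(t_end / dt)):
        x += dt * (target - x) / tau
    return x


BOTH_ON = simulate_species(1.0, 1.0)
ONE_OFF = simulate_species(1.0, 0.0)
PARTIAL = simulate_species(0.5, 1.0)
```

With Boolean inputs the species settles at the value of the corresponding AND gate, while intermediate inputs give intermediate levels, and `tau` controls how fast the steady state is approached.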

However, compared to the methods previously mentioned, this method requires: (i) the optimization of more parameters, therefore limiting the scalability, and (ii) the availability of detailed time resolved data.

Case study: application of

We illustrate the Boolean 2 steady-states

**Experimental setting for the HepG2 analysis.** HepG2 cells were stimulated with the above stimuli, applied in different combinations with the above-mentioned inhibitors. The 16 species mentioned here were then measured using a Luminex assay at 30 minutes and 3 hours post stimulation, leading to a total of 136 samples. All species are mentioned with their UniProt identifiers (capital letters) or common name where applicable (small caps letters).

As described, $t_1$, between 24 and 27 edges are selected (based on 3 separate optimization runs, see Additional files
$t_2$, between 3 and 7 additional edges are selected, leading to an average optimization score of 0.094 (compared to an average of 0.124 if random edges are selected). Additional file
$t_1$ and 0.03 for $t_2$). The improvement at $t_2$ is not as drastic as the one at $t_1$, likely because the PKN was designed for early events and therefore might not include all the prior knowledge edges necessary to capture events happening at later times.

**Summary of results from 3 independent trainings for the HepG2 example.** Frequency of selection of each edge in the scaffold model, across all models with a score within 10% of the best scoring model, summarized across 3 independent training runs. The top panel shows the summary for the edges at time 1 and the bottom panel shows the equivalent for time 2. For time 1, 13 edges are consistently selected across most (> 80%) of the best performing models, and 24 edges are picked in over 60% of the trained models. A partial redundancy in the effect of some edges explains why different combinations of edges can be picked across different models with limited impact on their scores. At time 2 (lower panel), 5 edges are consistently selected across over 50% of the best scoring models. These lower numbers reflect the fact that the training at time 2 relies on a single trained model as a starting point for both the simulation and the edge search space. Therefore, the families of trained models obtained in the different training runs explore different search spaces and have different initial conditions.

**Technical aspects of the HepG2 analysis.** This file provides additional information regarding this analysis, such as the parameters used etc.

**Example of results for the HepG2 real data application.** A. Prior knowledge network used for this analysis. B. Example of a trained model obtained in one of the optimization rounds, with a subset of the simulation results obtained with this network (C). For the networks the color codes are as follows: nodes: green=stimulated, red=inhibited, blue=measured, blue with red stroke=measured and inhibited, dashed stroke=compressed; edges (in the trained model in panel B): green=selected at time 1, blue=selected at time 2, grey=not selected in the trained model. In panel C, black continuous lines=data, dashed blue lines=simulation results obtained with the model in B. The background color reflects the goodness of fit of the model to data: green=the chosen Boolean value is closer to the data than the opposite Boolean value (the darker, the closer), red=the chosen Boolean value is further from the data than the opposite Boolean value (the darker, the further).

Nonetheless, the resulting trained models recapitulate some important behaviors. For example, they correctly capture a context-specific decrease in creb at $t_2$ (see Figure
$t_1$ upon IL1A stimulation but this stimulated state is sustained at $t_2$ only if the signals going through KS6A1 (p90RSK) and KS6A4/KS6A5 (msk1/msk2, which are indirectly stimulated by IL1A) are both present (i.e. whereas an OR gate between these two branches accurately captures the increase of creb at $t_1$, an AND gate better captures the behavior at $t_2$). If there is an inhibition in either of these branches, creb does get activated at $t_1$ but then decreases at $t_2$. This means that for the creb signal to be maintained at $t_2$, the presence of both KS6A1 and KS6A4/KS6A5 is required. Such a behavior could, for example, be explained by a constitutive dephosphorylation of creb that can only be counteracted by the presence of both signals from KS6A1 and KS6A4/KS6A5. Sustained versus transient phosphorylation of creb following stimulation of the same receptor (NMDA) was observed in neurons and was shown to depend on the activity of the phosphatase calcineurin

**Subset of the results of a ****analysis on two time-point data from human hepatocellular carcinoma cells.** The data consists of phospho-proteomic measurements of 16 proteins in response to multiple inducers of inflammation, innate immunity and proliferation, applied in combination with selected small molecule inhibitors
$t_1$, blue edges=picked at $t_2$), along with the data associated with the creb node (right, solid black line), overlaid with the simulation results (dashed blue line) for a selected set of conditions. The background color indicates the goodness of fit of simulation results to data. We can see that the model captures the behavior of creb accurately: creb increases at $t_1$ if either MP2K2/MP2K1 or p38 are activated (in this case, because both are downstream of IL1A, they are both activated in the absence of inhibitors and presence of IL1A). This activation is maintained if both MP2K2/MP2K1 and p38 are activated, and is lost at $t_2$ (180 minutes) if only one of them is activated (i.e. in this case if either is inhibited). This behavior is captured in the model by selecting an OR gate from MP2K2/MP2K1 and p38 to creb at $t_1$, and an AND gate at time $t_2$.

Providing a user friendly interface with

Researchers who generate the kind of biochemical data that is amenable to logic modeling might not be familiar with R. Hence, we provide an intuitive and easy to learn graphical user interface (GUI) to our methods through a Cytoscape plugin,

**Screenshot of ****, the Cytoscape plugin for ** Users can load or build a network in Cytoscape and load a matching data set in the MIDAS format, i.e. a CSV file with a row for each condition/time combination, a “TR:” column for each stimulus/inhibitor (0=absent, 1=present) and, for each readout, a “DA:” column (time) and a “DV:” column (measurement).
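To make the MIDAS layout described above concrete, here is a small Python sketch that parses such a file using only the standard library. The column names (`TR:egf`, `TR:mek_inhibitor`, `DA:erk`, `DV:erk`), the values and the helper `parse_midas` are hypothetical; real MIDAS files may carry additional columns and conventions that are not handled here.

```python
import csv
import io

# A tiny, made-up MIDAS-like table: two treatments and one readout.
MIDAS_TEXT = """TR:egf,TR:mek_inhibitor,DA:erk,DV:erk
1,0,0,0.0
1,0,30,0.9
1,1,30,0.1
"""


def parse_midas(text):
    """Split each row into treatments (TR:) and readouts with time (DA:) and value (DV:)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        treatments = {k[3:]: int(v) for k, v in record.items() if k.startswith("TR:")}
        times = {k[3:]: float(v) for k, v in record.items() if k.startswith("DA:")}
        values = {k[3:]: float(v) for k, v in record.items() if k.startswith("DV:")}
        readouts = {name: (times[name], values[name]) for name in values}
        rows.append({"treatments": treatments, "readouts": readouts})
    return rows


ROWS = parse_midas(MIDAS_TEXT)
```

Each parsed row pairs one experimental condition with the time and value of every readout, which is the information needed to match simulated and measured states.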

Strengths of the

A range of tools exists for manipulating, creating and simulating logic models (see Figure

**Comparison with other software for logic modeling.**

The method described in

Future developments

We consider the existing version of

While CellNOpt already covers multiple logic formalisms, we are exploring other variants, in particular asynchronous simulation schemes for the CNORdt extension. This could lead to different results to those obtained with the synchronous scheme, which could be particularly insightful when handling single cell time course data. Given the stochastic nature of an asynchronous update scheme, when using population averaged data (as has been the case so far) one needs to run the simulation many times to generate a set of trajectories from which a consensus can be obtained. This is considerably more demanding computationally, and is not likely to provide additional insight in most simple cases. In the case of the example toy model from Figure
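The need for repeated runs under an asynchronous scheme can be illustrated with a short Python sketch. It is not the CNORdt implementation: the update rule (one randomly chosen node per step), the toy cascade and the function names are assumptions made for this example. Each run produces one stochastic trajectory, and the consensus is the average state of each node at each step across runs.

```python
import random


def run_async(rules, initial_state, n_steps, rng):
    """One stochastic trajectory: a single randomly chosen node is updated per step."""
    state = dict(initial_state)
    trajectory = [dict(state)]
    nodes = sorted(rules)
    for _ in range(n_steps):
        node = rng.choice(nodes)
        state[node] = int(rules[node](state))
        trajectory.append(dict(state))
    return trajectory


def consensus(rules, initial_state, n_steps=30, n_runs=200, seed=0):
    """Average state of each node at each step across many asynchronous runs."""
    rng = random.Random(seed)
    runs = [run_async(rules, initial_state, n_steps, rng) for _ in range(n_runs)]
    return {
        node: [sum(run[t][node] for run in runs) / n_runs for t in range(n_steps + 1)]
        for node in rules
    }


# Toy cascade: stim -> A -> B.
RULES = {"A": lambda s: s["stim"], "B": lambda s: s["A"]}
CONSENSUS = consensus(RULES, {"stim": 1, "A": 0, "B": 0})
```

The averaged trajectories rise gradually even though every individual run is Boolean, at the cost of `n_runs` simulations per model evaluation, which is the computational burden mentioned above.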

**Exploration of an asynchronous updating scheme for the CNORdt extension.** This figure shows the results obtained by training the toy model to data as in Figure

Another main area of development is the integration of data-driven reverse engineering tools to find links missing in the starting network

Finally, we are working to make communication and exchange of data and models to and from

Conclusions

Understanding signal processing in cells is an essential goal of biological research, not only for fundamental reasons but also for its implications and potential applications in disease contexts. Modeling approaches are particularly suited to this task because (i) signaling networks are complex systems assembled from the dynamic and context-dependent interactions of many components, and (ii) obtaining predictive as well as mechanistic insights is extremely valuable in this context.

Our toolkit is implemented in the free and open source R language and Cytoscape platform which benefit from a large user community and already come with a wide range of packages for biological data processing and analysis. Users should therefore be able to use

Availability and requirements

The main

More details:

- **Software name:**

- **Project home page:**

- **Operating system(s):** platform independent

- **Programming languages:** R

- **Other requirements:** R (tested on 2.13 and above), Cytoscape 2.x

- **License:** GNU GPL version 3, except CNORfuzzy, which is GNU GPL version 2.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CT and TC wrote the

Acknowledgements

The authors thank J. Banga, J. Egea and E. Balsa for help with optimisation routines, B. Penalver, I. Pertsovskaya and F. Eduati for testing and feedback, and R.F. Schwarz for reading and commenting on the manuscript, and acknowledge funding from the Institute for Collaborative Biotechnologies (contract no. W911NF-09-D-0001 from the U.S. Army Research Office), EU-7FP-BioPreDyn and the EMBL EIPOD program.