Bioinformatics Graduate Program, Medical University of South Carolina, Charleston, SC 29425, USA

Department of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston, SC 29425, USA

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15232, USA

Abstract

Background

Despite large amounts of available genomic and proteomic data, predicting the structure and response of signaling networks remains a significant challenge. While statistical methods such as Bayesian networks have been explored to meet this challenge, employing existing biological knowledge for network prediction is difficult. The objective of this study is to develop a novel approach that integrates prior biological knowledge, in the form of the Ontology Fingerprint, to infer cell-type-specific signaling networks via data-driven Bayesian network learning, and to further use the trained model to predict cellular responses.

Results

We applied our approach to the Predictive Signaling Network Modeling challenge of the fourth (2009) Dialogue for Reverse Engineering Assessments and Methods (DREAM4) competition. The challenge results showed that our method accurately captured signal transduction in a network of protein kinases and phosphoproteins: the predicted protein phosphorylation levels under all experimental conditions were highly correlated (R^2 = 0.93) with the observed results. Based on the evaluation by the DREAM4 organizers, our team ranked among the top five performers in predicting network structure and protein phosphorylation activity under test conditions.

Conclusions

Bayesian networks can be used to simulate the propagation of signals in cellular systems. Incorporating the Ontology Fingerprint as prior biological knowledge allows us to efficiently infer concise signaling network structures and to accurately predict cellular responses.

Background

New proteomics techniques have enabled large-scale experiments that monitor the phosphorylation states of many proteins under different physiological stimuli and/or pharmacological treatments. Each measurement captures a static picture of how the cellular signaling network responds to the binding of a ligand to its receptor, but the interconnections among the many ligand-activated pathways are complex and dynamic. Thus, it is of biological importance to infer which signaling path is at work in response to a particular ligand and how pathways "cross-talk" with each other in a cell-type-specific manner, and eventually to develop computational models capable of predicting cellular responses under different stimuli.

One of the most common approaches to signaling network modeling is to represent the dynamic system as a set of ordinary differential equations (ODEs) using mass action kinetics, by which the concentration of species over time can be analyzed

To assess state-of-the-art network inference methods, Columbia University, the New York Academy of Sciences, and the IBM Computational Biology Center have been organizing the Dialogue for Reverse Engineering Assessments and Methods (DREAM), an annual international competition to assess methods that infer network structures and predict cellular responses to different combinations of stimuli from actual experimental data

The provided canonical pathway consists of a union of the known signaling pathways responding to the following ligands

Our approach to this challenge is to employ an enhanced Bayesian network to identify the most plausible HepG2-specific signaling network and to predict the cellular responses to new stimuli. A Bayesian network is a directed acyclic graph (DAG) model representing the probabilistic relationships between a set of random variables

We recently developed the concept of the Ontology Fingerprint from biomedical literature and Gene Ontology (GO)

Methods

Combining prior knowledge with experimental data, we adopted a Bayesian network approach to infer the most plausible signaling network from a web of complex networks. Figure

**Schematic representation of the methodology**. The Ontology Fingerprints of the whole human genome were constructed, and gene-gene similarity scores were then calculated by pairwise comparison of the Ontology Fingerprints. When searching for a cell-type-specific network, the canonical signaling network was repeatedly and stochastically modified by adding or deleting edges based on similarity scores, i.e., the higher the similarity score of a gene pair, the greater the probability of adding the edge connecting the two genes. The candidate networks were trained in parallel using an MCEM (MCMC sampling-based EM) algorithm to infer the states of hidden nodes and estimate network parameters, and LASSO regression was applied in the last round of MCEM. A model selection criterion, the Bayesian information criterion (BIC), was then calculated for each candidate network. Finally, the best network was selected according to BIC and applied to predict the phosphorylation activities for the testing data.

**Heuristic network search algorithm based on the Ontology Fingerprint**. A) The gene-gene similarity scores among the 40 genes of interest were converted into probabilities of adding or deleting edges: i) the similarity scores were ranked in ascending order, and each pair of genes was assigned a corresponding rank R (column "Rank ascendingly"); the probability of adding an edge was its ascending rank as a fraction of the total of all ascending ranks (formula on the left of the arrow); ii) similarly, the probability of deleting an edge was the gene pair's descending rank (column "Rank decendingly") as a fraction of the total of all descending ranks (formula on the right of the arrow). These probabilities ensure that the higher the similarity score of a gene pair, the more likely the edge between the two genes is to be added, and the lower the similarity score, the more likely the edge is to be deleted. B) Heuristic rules for adding or deleting edges from the canonical network. A network was updated by either deleting or adding one edge at a time: i) for deletion, an edge was sampled according to its deletion probability (p'); the sampled edge had to exist in the current network, and edges from signals to their corresponding receptors could not be deleted; ii) for addition, an edge was sampled according to its addition probability (p); the sampled edge could not already appear in the current network, and edges between signals, between receptors, between a signal and a non-receptor, and from other nodes to a signal could not be added.

Data

The training data were provided by the DREAM4 challenge 3, including phosphorylation measurements for 7 proteins under 25 experimental conditions (combinations of different signal stimuli and kinase inhibition) at 3 time points. We used the provided canonical pathway as the original DAG which contains 40 nodes and 58 edges (Figure

**Full network comparison of the original canonical pathway and the inferred cell-type-specific pathway**. A) The original canonical pathway provided by the DREAM4 challenge contains 40 nodes connected by 58 edges: 4 nodes in green represent the 4 cytokine receptors, which originate signals; 7 nodes in blue or magenta represent observed phosphoproteins with activity measurements; 2 nodes in red represent proteins that are inhibited under some experimental conditions; and 27 hidden nodes in grey have no experimental observations. B) Predicted cell-type-specific pathway activated in the HepG2 cell line, with 37 nodes connected by 47 edges, as determined by our algorithm.

In order to incorporate independent biological knowledge to learn the network structure, we evaluated the degree of biological relevance between genes by using the gene-gene similarity scores derived from their Ontology Fingerprints; the pairwise similarity scores among the 40 nodes were calculated. The detailed procedures of constructing Ontology Fingerprint were described in

Bayesian network

A Bayesian network was constructed based on the provided canonical signal transduction network, in which nodes are proteins and directed edges represent signaling flows

where $\mu_{i,0}$ and $\mu_{i,1}$ represent the average activity reading of node $i$ in its two phosphorylation states (indexed 0 and 1), and $\sigma_{i,0}$ and $\sigma_{i,1}$ represent the variance of the activity readings of node $i$ in those states.
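For orientation, one plausible form of the observation model that these parameters describe is sketched below. The symbols are assumptions chosen for illustration: $F_i$ is the fluorescence reading of node $i$, $X_i \in \{0, 1\}$ is its latent phosphorylation state, and $\mu_{i,s}$ and $\sigma_{i,s}$ are the mean and variance of the reading in state $s$. The authors' exact notation may differ.

```latex
% Assumed notation: the second argument of N is the variance.
F_i \mid X_i = s \;\sim\; \mathcal{N}\!\left(\mu_{i,s},\, \sigma_{i,s}\right), \qquad s \in \{0, 1\}
```

Under this reading, the observed signal of a protein is a two-component Gaussian mixture whose weights are the probabilities of its two states.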

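Under the causal Markov assumption, the state of node $i$ depends only on the states of its parent nodes, which we modeled with a logistic regression, where $\beta_{i,0}$ is the intercept and $\beta_{i,j}$ is the logistic regression coefficient between node $i$ and its parent node $j$.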
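A logistic conditional of the kind described here can be written as follows. This is a sketch under assumed notation: $X_i \in \{0, 1\}$ is the state of node $i$, $\mathrm{Pa}(i)$ is its set of parent nodes, and $\beta_{i,0}$ and $\beta_{i,j}$ are the intercept and coefficients. The published equation may be parameterized differently.

```latex
P\left(X_i = 1 \mid X_{\mathrm{Pa}(i)}\right)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_{i,0} + \sum_{j \in \mathrm{Pa}(i)} \beta_{i,j} X_j\right)\right)}
```

In this form, a coefficient $\beta_{i,j} = 0$ makes node $i$ independent of parent $j$, which has the same effect as removing the edge from $j$ to $i$.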

Learning structure of cell-type-specific signaling network

The DREAM4 challenge requires inferring the cell-type-specific signaling network and predicting the cellular response under certain stimulations. We formulated these tasks as learning the structure and parameterization of the Bayesian network and adopted a Bayesian learning approach to determine the structure. Under this framework, the goal is to identify a network structure, a model

The number of all possible network structures of a Bayesian network
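To give a sense of how quickly this search space grows, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence. The sketch below is illustrative only: the function name `count_dags` is hypothetical, and the recurrence is a standard combinatorial result rather than something taken from this study.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in (1, 2, 3, 4, 5):
    print(n, count_dags(n))
```

The count already reaches 29281 at five nodes, so exhaustive enumeration over a 40-node pathway is out of reach even before biological constraints are considered, which motivates a guided stochastic search.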

Searching for biologically plausible networks using the Ontology Fingerprint

Using the provided canonical network as a starting point, we explored the space of the cell-type-specific networks by stochastically adding and deleting edges. The edge selection was based on the available prior biological knowledge in order to search for network structures that are more biologically sensible. To this end, we employed the Ontology Fingerprint

We calculated the similarity scores for all pairs of the 40 genes in the canonical pathway. The similarity score was used to decide whether an edge should be added to or deleted from the canonical network: edges linking two genes with strong biological relevance (i.e., a high similarity score) are more likely to be added to the network, while edges with weak biological relevance and weak data support are more likely to be deleted. Figure
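The rank-based conversion of similarity scores into edge-addition and edge-deletion probabilities can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `edge_probabilities`, the toy gene names, and the scores are hypothetical, and ties between scores are assumed not to occur.

```python
def edge_probabilities(similarity):
    """Convert gene-pair similarity scores into add/delete probabilities.

    similarity: dict mapping a gene pair (tuple) to its similarity score.
    Returns (p_add, p_delete), two dicts over the same pairs, each summing to 1.
    """
    pairs = sorted(similarity, key=similarity.get)  # lowest score first
    n = len(pairs)
    total = n * (n + 1) / 2                         # sum of ranks 1..n
    # Ascending rank: the highest score receives the largest rank.
    p_add = {pair: (i + 1) / total for i, pair in enumerate(pairs)}
    # Descending rank: the lowest score receives the largest rank.
    p_delete = {pair: (n - i) / total for i, pair in enumerate(pairs)}
    return p_add, p_delete


# Toy scores for three hypothetical gene pairs (not values from the study).
scores = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.5}
p_add, p_delete = edge_probabilities(scores)
```

A candidate edge could then be drawn with `random.choices(list(p_add), weights=list(p_add.values()))` and rejected if it violates one of the structural rules, such as an edge that already exists in the current network.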

Searching for network structure based on observed data

Given a candidate network produced in the aforementioned space exploration, we further evaluated if the model explains the observed experimental data well by calculating the term
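The methodology schematic names BIC as the criterion for comparing candidate networks. A minimal sketch of such a comparison is given below, assuming the standard definition of BIC and, as a simplification, treating the parameter count as an input. The function names and all numbers are illustrative, not results from the study.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_best(candidates, n_obs):
    """Return the name of the candidate network with the lowest BIC.

    candidates: dict mapping a network name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_obs))


# Illustrative numbers only: a denser network with a slightly better fit
# versus a sparser network with a slightly worse fit.
candidates = {"dense": (-520.0, 58), "sparse": (-525.0, 47)}
best = select_best(candidates, n_obs=525)
```

With these made-up values the sparser network is preferred, because its smaller complexity penalty outweighs its slightly lower likelihood.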

Bayesian learning of network model

The true phosphorylation states of the protein nodes were not observed but indirectly reflected by the fluorescence signals in the training data. Therefore the nodes representing protein phosphorylation states were latent variables. We used an expectation-maximization (EM) algorithm to infer the hidden state of each node and further estimated the parameters of candidate models

Similarly, the full conditional probability of the observed node was described in Equations (6.1)-(6.3), where the probability of each node's state conditioned on the states of its parents
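For an observed node, the idea of combining the parent-driven prior with the fluorescence reading can be sketched with Bayes' rule. The example assumes a logistic prior given the parent states, Gaussian readings in each state, and a node with no children, which is a simplification of the full conditional. All names and numbers are hypothetical, and `sigma` here is a standard deviation.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def posterior_phosphorylated(reading, parent_states, beta0, betas, mu, sigma):
    """P(state = 1 | reading, parents) for an observed node without children.

    mu and sigma are (state 0, state 1) pairs of Gaussian parameters.
    """
    # Prior probability of the phosphorylated state given the parent states.
    prior1 = logistic(beta0 + sum(b * x for b, x in zip(betas, parent_states)))
    like0 = normal_pdf(reading, mu[0], sigma[0])
    like1 = normal_pdf(reading, mu[1], sigma[1])
    numerator = prior1 * like1
    return numerator / (numerator + (1.0 - prior1) * like0)


# A reading at the state-1 mean with an active parent gives a posterior near 1.
p = posterior_phosphorylated(1000.0, [1], 0.0, [1.0], (200.0, 1000.0), (100.0, 150.0))
```

In a sampling-based E-step, a state would be drawn from this probability for each node in turn. Nodes with children would also need the children's conditional probabilities in the product.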

Logistic regression was then used in the M-step to estimate the parameters of the generalized linear model. In order to reduce the search space, LASSO regression implemented in the LARS package from R
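The effect of an L1 (LASSO) penalty on a logistic regression can be illustrated with a small proximal-gradient solver. This sketch deliberately swaps in a different algorithm from the LARS implementation mentioned above. The function names, the penalty `lam`, the step size, and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t, returning exactly 0.0 when |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.5, n_iter=500):
    """L1-penalized logistic regression by proximal gradient descent.

    Returns [intercept, coef_1, ..., coef_p]; the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err / n
            for j, x in enumerate(xi):
                grad[j + 1] += err * x / n
        beta[0] -= lr * grad[0]
        for j in range(1, p + 1):
            # Gradient step followed by soft-thresholding (the L1 proximal map).
            beta[j] = soft_threshold(beta[j] - lr * grad[j], lr * lam)
    return beta


# Toy data: the child state follows parent 1 and ignores parent 2.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [0, 0, 1, 1] * 2
beta = fit_l1_logistic(X, y)
```

On this toy data the coefficient of the uninformative second parent stays at exactly zero, which is how a LASSO penalty can prune an edge from a candidate network.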

Prediction of test data

To predict the fluorescent signals of the 7 phosphoproteins in response to cytokine stimuli under the 40 testing conditions, the phosphorylation states of these proteins were sampled using the aforementioned EM algorithm (E-step only) and the belief propagation algorithm. The fluorescent signals were then simulated as a mixture of the signals of proteins in the phosphorylated and unphosphorylated states, as defined in Equation (1). We generated 50 samples of the activation state of each protein node according to its posterior probability, and for each sample predicted the fluorescent signal strength of the monitored proteins from the learned normal distribution conditioned on the sampled state. The final prediction was produced by averaging the predicted measurements of the observed nodes across all samples.
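The sample-and-average step for a single protein can be sketched as follows. The sketch assumes that the posterior probability of the phosphorylated state is already available from inference; the function name, the `seed` parameter, and the numeric values are hypothetical, and `sigma` holds standard deviations.

```python
import random


def predict_signal(p_phos, mu, sigma, n_samples=50, seed=None):
    """Average simulated fluorescence over sampled phosphorylation states.

    p_phos: posterior probability that the protein is phosphorylated.
    mu, sigma: (state 0, state 1) Gaussian means and standard deviations.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        # Sample the latent state, then a reading conditioned on that state.
        state = 1 if rng.random() < p_phos else 0
        draws.append(rng.gauss(mu[state], sigma[state]))
    return sum(draws) / n_samples


# Hypothetical parameters for one protein under one test condition.
prediction = predict_signal(0.5, (100.0, 1000.0), (10.0, 10.0), seed=1)
```

Averaging over the samples yields a value between the two state means, weighted by how often each state was drawn.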

Results

The task of learning a cell-type-specific network is equivalent to determining which subset of vertices and edges from the canonical network should be retained for that cell type. We addressed the task of learning network structure by combining prior knowledge and experimental data in the following steps: 1) stochastically exploring candidate network structures based on prior knowledge; 2) training candidate Bayesian networks using experimental data, which further modifies network structure through parameterization, i.e., setting the parameters associated with certain edges to values that are equivalent to deleting these edges; and 3) selecting the network model that best simulates the experimental results. A Bayesian network can also readily simulate the propagation of a signal in the system using a belief propagation algorithm

The novelty of our approach is to update the network by leveraging prior biological knowledge captured in the Ontology Fingerprints

Learning cell-type-specific signaling network

Using the provided experimental data, we trained our Bayesian network-learning algorithm to infer a HepG2-cell-specific network. Figure

**Comparison of the collapsed original canonical and the inferred cell-type-specific pathways**. A) Collapsed canonical network provided by DREAM4 challenge where all hidden nodes and corresponding edges are removed; B) Collapsed network predicted by our Ontology-Fingerprint-based graph search algorithm.

The predicted network represents a biologically plausible signaling pathway specific to HepG2 cells, partially due to the novel graph search algorithm based on the Ontology Fingerprints. For instance, the connections between ^{th }percentile. In contrast, the connection between ^{th }percentile. Overall, the model updating process based on the novel graph search algorithm seamlessly included prior biological knowledge embedded in the literature and GO. Based on the training data of HepG2 cell, employing LASSO regression

Our results also indicate that Bayesian networks are particularly suitable for modeling cellular signal transduction, in that principled statistical inference algorithms, e.g., the belief propagation algorithm, enabled us to represent hidden variables (nodes without observations) in the graph and to infer detailed signal transduction in the pathway. In contrast, other modeling approaches reported at the DREAM4 conference, e.g., methods based on biochemical systems theory

Predicting cellular responses to stimuli

Using the final graph and the associated parameters learned from the Bayesian network approach, we performed simulation studies to predict cellular responses to a set of provided stimuli and compared the "predicted" results with the observed training data. The comparison showed a strong correlation (R^2 = 0.93). Figure

**Comparison of predicted and observed phosphoprotein activity of 7 proteins of interest across different experimental conditions**. We used the trained Bayesian network to predict the phosphorylation activity of the 7 proteins of interest under all experimental conditions in the training data set. The "predicted" results were compared with the provided observations, and a correlation analysis showed a strong correlation (R^2 = 0.93).

**Phosphorylation activity plots of 7 proteins of interest under the treatments of 5 different stimuli**. We used the trained Bayesian network to predict the phosphorylation levels of the 7 phosphoproteins under all conditions and compared them with the observed data in time-course plots. Within each box, the phosphorylation activities predicted or observed at 3 time points (0 min, 30 min, and 3 hours post stimulus) are plotted, with the observed data shown in black and the predicted data in red. The blue lines appearing in some boxes indicate that the activity measurement lies within the noise error of the detector (the reading is less than 300).

Using the predicted HepG2-specific network and the learned parameters, we then predicted the phosphoprotein activity levels of the 7 proteins under the test conditions given by the DREAM4 challenge. The predicted phosphoprotein activities were evaluated against experimental measurements by the organizers of the DREAM4 challenge using two criteria: first, accuracy, evaluated by a prediction cost function (sum of squared errors over all the predictions); second, network parsimony. Our group (Team 451) ranked within the top five (#4 or #5, depending on the DREAM4 ranking method) among all submissions for this challenge (

Discussion

A signaling network is a complex and dynamic system that governs biological activities and coordinates cellular functions

Participants of the DREAM4 challenge developed various computational approaches to model the signaling network and predict their cellular responses to different stimuli. Dynamic mathematical modeling implemented in a system of differential equations is one of the mainstream approaches

By contrast, Bayesian network analysis represents an effective means to encode both the prior knowledge of network topology and the probabilistic dependency in signaling networks

Our algorithm was further improved by embedding biological information from the Ontology Fingerprint into the learning stage of the Bayesian network modeling. This was accomplished through the introduction of prior distributions for the variables. The seamless integration of prior knowledge into the Bayesian network framework allowed us to construct a cell-type-specific signal transduction pathway and to use the pathway to predict novel perturbation outcomes in the DREAM4 competition. The Ontology Fingerprint, derived from PubMed literature and biomedical ontology, serves as a comprehensive characterization of genes. Unlike current gene annotations, the Ontology Fingerprints were generated by a largely unsupervised method and thus do not need a well-annotated corpus, which is difficult to assemble. In addition, the enrichment p-value associated with each ontology term in an Ontology Fingerprint can be used as a quantitative measure of biological relevance between genes, a feature that is lacking in current gene annotations. This comprehensive and quantitative characterization of genes works well as prior knowledge in our graph searching strategy. In contrast, commonly used graph searching algorithms, such as genetic algorithms, rely only on a randomized exhaustive search that cannot utilize useful prior information. This limitation not only makes these algorithms inefficient in searching the plausible model space but also potentially leads to networks that are biologically irrelevant.

To assess the contribution of the Ontology Fingerprints to the Bayesian network learning algorithm, we compared the likelihoods of Bayesian networks iteratively updated with or without the guidance of prior knowledge derived from the Ontology Fingerprints. Starting with the canonical network, we iteratively updated the network structure until a fixed number of networks were obtained. The converged likelihood of each network was obtained by the Monte Carlo EM algorithm (MCEM) ^{-2}). In addition, we investigated the performance of the Ontology Fingerprint enhanced Bayesian network in eliminating biologically irrelevant relationships from the network. We randomly added edges with similarity scores of zero to the canonical network and treated the result as a noisy network. Starting with this noisy network, we performed the same comparison as described above, and the likelihoods resulting from the Ontology Fingerprint-guided network update were again significantly higher than those from the update process without prior knowledge (Wilcoxon signed-rank test, p-value = 1.5 × 10^{-3}). Furthermore, the network update with prior knowledge identified and eliminated noisy edges within the first several iterations. These results demonstrated that integrating the Ontology Fingerprint as prior knowledge can speed up the convergence of the likelihood, increasing the efficiency of both identifying the optimal network structure and retaining biologically meaningful connections in the final network.
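A paired comparison of this kind can be sketched with a plain-Python Wilcoxon signed-rank test. The version below uses the normal approximation without tie or continuity corrections, which is rough for small samples, and it is not the authors' code. The function name and the log-likelihood values are made up for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Group tied absolute differences and give them their average rank.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p


# Made-up converged log-likelihoods for six paired runs (not study data).
guided = [-100.0, -98.0, -95.0, -91.0, -86.0, -80.0]
unguided = [-101.0, -100.0, -98.0, -95.0, -91.0, -86.0]
w, p = wilcoxon_signed_rank(guided, unguided)
```

Here every guided run has the higher likelihood, so the smaller rank sum is zero and the approximate p-value falls below 0.05.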

In addition to prior knowledge, our approach also employed the LASSO technique

Conclusion

By incorporating prior biological knowledge, utilizing advanced statistical methods for parameter estimation, and modeling unobserved nodes as latent variables, we developed a novel approach to infer active signaling networks from experimental data and a canonical network. Our results demonstrated that these improvements allow us to predict signaling network structures and responses that closely match those identified by experimental approaches.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WJZ initiated the idea of incorporating the Ontology Fingerprint for network prediction and guided the development of the Ontology Fingerprints. TQ and LCT worked on the method development and signaling network prediction. KJS advised on the biological knowledge of the signaling pathway. XL advised on the Bayesian network development. TQ and LCT drafted and WJZ and XL finalized the manuscript. WJZ supervised the overall development of the project. All authors have read and approved the manuscript.

Acknowledgements

This work is partly supported by PhRMA Foundation Research Starter Grant, Computational Biology Core of 1 UL1 RR029882-01, R01GM063265-09S1; P20 RR017677-10 and a pilot project from 5P20RR017696-05 (WJZ), as well as grants 5R01LM010144 and 5R01LM009153 (XL). LCT was supported by NLM training grant 5-T15-LM007438-02. TQ was supported by PhRMA Foundation Research Starter Grant, NIH/NCRR 5P20RR017677-10, NIH/NIGMS R01GM063265-09S1 and T32GM074934 07. KJS was funded by Grant 5K12GM081265-03, an Institutional Research and Academic Career Development Award (IRACDA) program from NIGMS. We would like to thank Dr. John Schwacke for providing us with the R code to generate the plot of protein phosphorylation activity.

This article has been published as part of