Centro de Ciências Computacionais - C3, Universidade Federal do Rio Grande - FURG, RS, Brazil

Abstract

Background

Inference of biological networks has become an important tool in Systems Biology. It is becoming clearer that the complexity of organisms is related more to the organization of their components in networks than to the individual behaviour of those components. Among the various approaches for inferring networks, Bayesian Networks are very attractive due to their probabilistic nature and their flexibility to incorporate interventions and extra sources of information. Recently, various attempts have been made to infer networks with different Bayesian Network approaches. This paper compares the performance of three inference approaches: Bayesian Networks without any modification; Bayesian Networks modified to take into account specific interventions produced during data collection; and a probabilistic hierarchical model that allows the inclusion of extra knowledge in the inference of Bayesian Networks. The inference is performed on three types of data: (i) synthetic data obtained from a Gaussian distribution, (ii) synthetic data simulated with Netbuilder and (iii) real data obtained in flow cytometry experiments.

Results

Bayesian Networks with interventions and Bayesian Networks with inclusion of extra knowledge outperform simple Bayesian Networks in all data sets when considering the reconstruction accuracy and taking the edge directions into account. In the real data, the increase in accuracy is also observed when the edge directions are not taken into account.

Conclusions

Although it comes with a small extra computational cost, the use of more refined Bayesian network models is justified. Both the inclusion of extra knowledge and the use of interventions outperformed the simple Bayesian network model in simulated and real data sets. Also, if the source of extra knowledge used in the inference is not reliable, the inferred network does not deteriorate. If the extra knowledge agrees well with the data, there is no significant difference between using Bayesian networks with interventions and Bayesian networks with extra knowledge.

Background

The rapid increase in the availability and diversity of molecular biology data has enabled many discoveries and advances in different fields related to systems biology. Many of these studies were based on a single biological entity or on the union of several such entities. The research community is now realizing that the complexity of an organism is related to the network of entities rather than to the individual biological entity. It is now clearer that the joint action of several components through a network of interactions plays a pivotal role in determining the development and sustainability of an organism. Therefore, the study of biological networks is highly relevant. The problem is that these intricate biological networks are largely unknown. Since many different types of measurements taken from the components of these networks are at our disposal, one interesting approach is to try to reconstruct such networks from the data.

In the last few years, several methods for the reconstruction of regulatory networks and biochemical pathways from data have been proposed. These methods were reviewed for example in

Differential Equations are the most refined mathematical method to describe biophysical processes. They can describe, for example, the intra-cellular processes of transcription factor binding, diffusion, and RNA degradation; see, for instance,

A promising compromise between these two extremes is offered by Machine Learning methods, which allow interactions between the nodes in the network to be represented in an abstract way - without the level of detail of the underlying pathways described by Differential Equation models - and allow these interactions to be inferred from data in a systems context, that is, distinguishing direct interactions from indirect interactions that are mediated by other nodes in the domain.

A non exhaustive list of methods used to infer the structure of networks from data includes: a system of Coupled Differential Equations

Results

Evaluation criteria

Not all of the edge directions in a Bayesian network can always be inferred. This is due to the existence of equivalence classes of networks

Inference results

MCMC simulations are performed twice for every approach and data set in order to check convergence. Convergence is verified by plotting the posterior probabilities of the edges obtained from two differently initialized simulations and checking whether the results are similar. Note that this is a necessary but not a sufficient condition for convergence. All MCMC simulations are run for 5 × 10^{5} steps, of which the first half is discarded as burn-in.
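The convergence check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: it assumes each MCMC run is stored as an array of sampled 0/1 adjacency matrices, and the function names `edge_posteriors` and `max_disagreement` are hypothetical.

```python
import numpy as np


def edge_posteriors(samples, burn_in_fraction=0.5):
    """Estimate marginal edge posterior probabilities from one MCMC run.

    samples: array of shape (n_steps, n_nodes, n_nodes) with 0/1 entries,
             where samples[t, i, j] == 1 means edge i -> j is present at step t.
    The first `burn_in_fraction` of the steps is discarded as burn-in.
    """
    samples = np.asarray(samples, dtype=float)
    start = int(len(samples) * burn_in_fraction)
    return samples[start:].mean(axis=0)


def max_disagreement(run_a, run_b, burn_in_fraction=0.5):
    """Largest absolute difference between the edge posteriors of two runs."""
    p_a = edge_posteriors(run_a, burn_in_fraction)
    p_b = edge_posteriors(run_b, burn_in_fraction)
    return float(np.abs(p_a - p_b).max())
```

A small disagreement between two independently initialized runs is consistent with convergence but, as noted above, does not prove it. Plotting `edge_posteriors(run_a).ravel()` against `edge_posteriors(run_b).ravel()` gives the corresponding scatter plot.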

The extra knowledge used in conjunction with the real data in the BN-E approach was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways database

In Figure

**Posterior distribution of hyper-parameter**. In the figure

In Figure

**Comparison of reconstruction accuracy**. Each sub-figure presents the results for one type of data set, as indicated at the top of the sub-figure. For each data set type there are two groups of results: one obtained when taking the edge directions into account (DGE) and the other obtained when taking only the skeleton of the network into account. Each bar represents the AUC averaged over five data sets for one of the methods indicated in the legend of the sub-figure. The error bars show the respective standard deviations. For the real data only one source of extra knowledge is used; therefore, there is one bar fewer in the results.
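The two evaluation criteria can be illustrated with a short Python sketch. It assumes a matrix of marginal edge posterior probabilities and a known true adjacency matrix, both hypothetical inputs, and the function names are illustrative. Combining the two directions of an edge by taking their maximum for the skeleton score is an assumption made for this sketch, not a detail stated here.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve, computed by comparing every
    (true edge, non-edge) pair of scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one true edge and one non-edge")
    diff = pos[:, None] - neg[None, :]
    # A tie between a true edge and a non-edge counts as half a correct ranking.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def dge_auc(posterior, true_adj):
    """AUC over all ordered node pairs (edge directions taken into account)."""
    posterior = np.asarray(posterior, dtype=float)
    true_adj = np.asarray(true_adj, dtype=bool)
    off_diag = ~np.eye(len(true_adj), dtype=bool)
    return auc(posterior[off_diag], true_adj[off_diag])


def skeleton_auc(posterior, true_adj):
    """AUC over unordered node pairs (only the skeleton taken into account)."""
    posterior = np.asarray(posterior, dtype=float)
    true_adj = np.asarray(true_adj, dtype=bool)
    upper = np.triu(np.ones_like(true_adj, dtype=bool), k=1)
    # Assumption: an undirected edge is scored by its stronger direction.
    undirected_score = np.maximum(posterior, posterior.T)
    undirected_truth = true_adj | true_adj.T
    return auc(undirected_score[upper], undirected_truth[upper])
```

With these definitions, a method that finds the right edge but assigns it the wrong direction is penalized by `dge_auc` and not by `skeleton_auc`, which is why the two groups of bars can differ.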

Discussion

Figure

One interesting aspect observed is the behaviour of the hyper-parameter of the BN-E approach and the reconstruction accuracy obtained with this method. As we can see from Figure

Observing the results for synthetic data in Figures, BN-I and BN-E_{100} clearly outperform the others, especially when considering the edge directions (DGE criterion). This suggests that the increase in the accuracy of the recovered networks is related to the edge directions, i.e. these methods provide a way to break the symmetries that give rise to the equivalence classes. It is also possible to note that, with the addition of the half-correct extra knowledge, the BN-E_{50} approach had the hyper-parameter sampled at very small values and, hence, did not improve the accuracy of the reconstructed network.

In Figure

Conclusion

BNs are very attractive for inferring the structure of networks for various reasons. One of their main advantages is their flexibility. In this paper we compared different BN approaches, two of which are extensions of the classical BN framework. The essence of both extensions is the inclusion of knowledge other than the data in the inference. In BN-I, interventions are taken into consideration, and in BN-E, extra knowledge is added to the learning scheme.

Observing the results in Figure, BN-I and BN-E_{100} perform better than the simple BN. This performance is significantly better when the comparison takes into account the edge directions (DGE score). This leads to the conclusion that both methods perform better because they are able to break the symmetries of the equivalence classes. Another interesting conclusion is obtained when we observe Figure

Interestingly there are no significant differences when comparing the two best methods, BN-I and BN-E_{100}, as can be observed in Figure

The main conclusion is that the use of more refined Bayesian network models significantly improves the results. Both more refined methods, BN-E and BN-I, performed equally well and, hence, their choice should be made according to the quality and availability of the data obtained from the system under investigation.

Methods

Bayesian Networks - BNs

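Bayesian Networks (BNs) are a combination of probability theory and graph theory. A graphical structure, a family of conditional probability distributions and their parameters **q** fully specify a BN. The graphical structure consists of nodes, which represent the random variables, and directed edges, which represent conditional dependence relations between them. The family of conditional probability distributions and their parameters **q** specify the functional form of the conditional probabilities associated with the edges, that is, they indicate the nature of the interactions between nodes and the intensity of these interactions. A BN is characterized by a simple and unique rule for expanding the joint probability in terms of simpler conditional probabilities. This follows the local Markov property: a node is conditionally independent of its non-descendants given its parents. For random variables X_{1}, X_{2}, ..., X_{N}, with pa[X_{i}] denoting the parents of X_{i} in the graph, the joint probability factorizes as P(X_{1}, X_{2}, ..., X_{N}) = \prod_{i=1}^{N} P(X_{i} | pa[X_{i}]).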

The task of learning a BN structure in a score-based approach consists in devising a BN structure from a given set of training data

The integral in Equation 1, our score, is analytically tractable when the data is complete and the prior

According to Equation 1 we have a way to assign a score to a graphical structure given a data set. However, the search for high scoring structures is not trivial. It is impossible to list the whole set of structures because their number increases super-exponentially with the number of nodes. Also when considering a sparse data set

In this paper we use the standard MCMC proposal, which consists of proposing, at each iteration, one of the basic operations of adding, removing or reversing an edge. For more details about this scheme see
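The proposal of single-edge operations can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the paper: the graph is a list-of-lists 0/1 adjacency matrix with `adj[i][j] == 1` meaning edge i -> j, the function names are hypothetical, and the default `max_fan_in=3` mirrors the fan-in restriction mentioned elsewhere in this paper.

```python
import random


def is_acyclic(adj):
    """Check for directed cycles with Kahn's topological-sort algorithm."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while queue:
        i = queue.pop()
        visited += 1
        for j in range(n):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    queue.append(j)
    return visited == n  # every node was sorted, so there is no cycle


def _valid(adj, max_fan_in):
    """A candidate is valid if it is a DAG and respects the fan-in limit."""
    n = len(adj)
    fan_in_ok = all(sum(adj[i][j] for i in range(n)) <= max_fan_in
                    for j in range(n))
    return fan_in_ok and is_acyclic(adj)


def neighbours(adj, max_fan_in=3):
    """All valid graphs reachable by adding, removing or reversing one edge."""
    n = len(adj)
    result = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if adj[i][j]:
                removed = [row[:] for row in adj]
                removed[i][j] = 0
                result.append(removed)  # removal never creates a cycle
                reversed_edge = [row[:] for row in removed]
                reversed_edge[j][i] = 1
                if _valid(reversed_edge, max_fan_in):
                    result.append(reversed_edge)
            elif not adj[j][i]:
                added = [row[:] for row in adj]
                added[i][j] = 1
                if _valid(added, max_fan_in):
                    result.append(added)
    return result


def propose(adj, rng=random, max_fan_in=3):
    """Draw one neighbouring graph uniformly at random."""
    return rng.choice(neighbours(adj, max_fan_in))
```

Because the number of valid neighbours generally differs between the current and the proposed graph, a Metropolis-Hastings acceptance step built on this proposal would typically include the ratio of the two neighbourhood sizes as a correction factor.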

Bayesian Networks with Interventions - BN-I

Molecular biology now offers several techniques for producing interventions in biological systems, for instance knocking genes down with RNA interference or transposon mutagenesis. The components of the biological system that are targeted by an intervention can be either activated (up-regulated) or inhibited (down-regulated). Under this external intervention they are no longer subject to the internal dynamics of the system under investigation, hence their values are deterministic rather than stochastic. However, the components that are not intervened on are still influenced by these deterministic values. Therefore, interventions are very useful for breaking the symmetries within the equivalence classes of BNs and, consequently, for discovering putative causal relationships. For a discussion about equivalence classes see

In order to incorporate the interventions under the BN framework two small modifications are necessary. The calculation of the score for observational data _{i}

The second necessary modification is related to the definition of equivalence classes. In

The sampling scheme of BN-I is the same as that of BNs and is given by Equation 2.

Bayesian Networks with addition of Extra knowledge - BN-E

In order to be able to incorporate extra knowledge in the inference of networks it is necessary to define a function that measures the agreement between a given network structure

A network structure

Having defined how to represent a BN structure,

where

Following the work of

where the energy E(

For Dynamic Bayesian Networks the summation in the denominator of Equation 4 can be computed exactly and efficiently as discussed in

In this paper we apply the method only to static BNs, and thus the summation in the denominator of Equation 4 is in fact an upper bound on the true value. This happens because the summation includes all possible structures, whereas we are only interested in the DAG structures. Furthermore, throughout this paper we use a fan-in restriction of three, as has been proposed in several other applications; for instance see
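To make the role of the fan-in restriction concrete, the following Python sketch computes a normalizing sum of this kind node by node. It rests on assumptions that are not taken from the text above: the energy is taken, purely for illustration, to be the sum of absolute differences between a prior-knowledge matrix `B` (entries in [0, 1]) and the adjacency matrix `G`, and the prior is taken to be proportional to exp(-beta * energy). The names `B`, `G`, `beta`, `log_partition` and `log_prior` are hypothetical.

```python
import itertools
import math


def node_energy(b_column, parents, node):
    """Assumed energy contribution of one node: the sum over candidate
    parents i of |B[i][node] - G[i][node]|."""
    return sum(abs(b_column[i] - (1.0 if i in parents else 0.0))
               for i in range(len(b_column)) if i != node)


def log_partition(B, beta, max_fan_in=3):
    """Log of the normalizing sum over all parent-set configurations with at
    most `max_fan_in` parents per node. Acyclicity is NOT enforced, so the
    sum also counts cyclic structures and is an upper bound on the sum over
    DAGs only."""
    n = len(B)
    total = 0.0
    for node in range(n):
        others = [i for i in range(n) if i != node]
        column = [B[i][node] for i in range(n)]
        z_node = 0.0
        for k in range(min(max_fan_in, len(others)) + 1):
            for parents in itertools.combinations(others, k):
                z_node += math.exp(-beta * node_energy(column, set(parents), node))
        total += math.log(z_node)  # the sum factorizes over nodes
    return total


def log_prior(G, B, beta, max_fan_in=3):
    """Log prior of structure G under the assumed energy and normalization."""
    n = len(G)
    energy = sum(abs(B[i][j] - G[i][j])
                 for i in range(n) for j in range(n) if i != j)
    return -beta * energy - log_partition(B, beta, max_fan_in)
```

Because the assumed energy decomposes into one term per node, the normalizing sum factorizes into a product of per-node sums, and the fan-in restriction keeps each per-node sum small.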

BN-E MCMC sampling scheme

At this point, having the prior probability distribution over network structures defined, an MCMC scheme to sample both the hyper-parameters and the network structures from the posterior distribution

A new network structure

which was expanded following the conditional independence relationships depicted in Figure

**Probabilistic Model**. The probabilistic graphical model represents conditional independence relationships between the data

In order to increase the acceptance probability, which in turn can improve the mixing and convergence of the Markov chain, the move is separated into two submoves. In the first move a new network structure

Data sets

One very interesting aspect when comparing different methods for inferring the structure of networks is how they perform when faced with real data sets. In our case a real data set means data obtained in real experiments on a real biological system. However, comparing the methods on real data is only possible if the network from which the data were generated is known. We call this known network the

**Raf signalling pathway**. The graph shows the currently accepted signalling network, taken from

Because the interest is to compare the BN approaches in the context of network inference, where the available data are usually sparse, we downsampled the original data to 100 data points. Furthermore, we average the results over five data sets. The observational data are obtained from the original data in which no interventions were performed. The interventional data are sampled from all the interventions performed in the original data and are composed of: 16 data points without intervention; 14 data points for each of the inhibited proteins (

In order to further investigate how the methods compare, synthetic data sets were also prepared. These data are obtained from two different sources: a linear Gaussian distribution and a simulation tool named Netbuilder

Considering _{i}_{ik}_{ik}

In Netbuilder a sigma-pi formalism is implemented as an approximation to the solution of a set of Ordinary Differential Equations that models enzyme-substrate reactions, allowing the acquisition of data typical of signals measured in molecular biology. The data sets simulated with Netbuilder are closer to real data sets than the Gaussian data are. For more details about the data generation see
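For the linear Gaussian source, data generation can be sketched in Python as below. This is only an illustration under assumptions: each node is taken to be a weighted sum of its parents plus independent Gaussian noise, and the weight matrix, the noise level and the function name `sample_linear_gaussian` are hypothetical rather than the settings used in this paper.

```python
import numpy as np


def sample_linear_gaussian(weights, n_samples, noise_std=1.0, seed=None):
    """Draw samples from a linear Gaussian Bayesian network.

    weights[k, i] is the (assumed) strength of edge k -> i; zero means no edge.
    Nodes must be indexed in topological order, so `weights` has to be
    strictly upper triangular.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(np.tril(weights) != 0):
        raise ValueError("weights must be strictly upper triangular")
    rng = np.random.default_rng(seed)
    n_nodes = weights.shape[0]
    data = np.zeros((n_samples, n_nodes))
    for i in range(n_nodes):
        # Mean of node i: weighted sum of its already-sampled parents.
        mean = data[:, :i] @ weights[:i, i]
        data[:, i] = mean + noise_std * rng.standard_normal(n_samples)
    return data
```

For example, `sample_linear_gaussian([[0, 1, 0], [0, 0, -1], [0, 0, 0]], 100, seed=0)` would draw 100 data points from a three-node chain, matching the data set size used for the comparisons.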

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The author acknowledges financial support from the Brazilian National Council for Research (CNPq). The author is grateful for the reviewer's valuable comments, which improved the manuscript.

This article has been published as part of