School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA

Environmental Laboratory, U.S. Army Engineer Research and Development Center, 3909 Halls Ferry Rd. Vicksburg, MS, 39180, USA

SpecPro Inc., 3909 Halls Ferry Rd, Vicksburg, MS, 39180, USA

Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, MS 39406, USA

Abstract

Background

The regulation of gene expression is achieved through gene regulatory networks (GRNs) in which collections of genes interact with one another and other substances in a cell. In order to understand the underlying function of organisms, it is necessary to study the behavior of genes in a gene regulatory network context. Several computational approaches are available for modeling gene regulatory networks with different datasets. In order to optimize modeling of GRN, these approaches must be compared and evaluated in terms of accuracy and efficiency.

Results

In this paper, two important computational approaches for modeling gene regulatory networks, probabilistic Boolean network methods and dynamic Bayesian network methods, are compared using a biological time-series dataset from the Drosophila Interaction Database to construct a Drosophila gene network. A subset of time points and gene samples from the whole dataset is used to evaluate the performance of these two approaches.

Conclusion

The comparison indicates that both approaches had good performance in modeling the gene regulatory networks. The accuracy in terms of recall and precision can be improved if a smaller subset of genes is selected for inferring GRNs. The accuracy of both approaches is dependent upon the number of selected genes and time points of gene samples. In all tested cases, DBN identified more gene interactions and gave better recall than PBN.

Background

The development of high-throughput genomic technologies (e.g., DNA microarrays) makes it possible to study dependencies and regulation among genes on a genome-wide scale. In the last decade, the amount of gene expression data has increased rapidly, necessitating the development of computational methods and mathematical techniques to analyze the resulting massive data sets. In order to understand the functioning of cellular organisms, explain why complicated response patterns to stressors are observed, and provide hypotheses for experimental verification, it is necessary to model gene regulatory networks (GRNs). Currently, clustering, classification and visualization methods are used for reconstruction or inference of gene regulatory networks from gene expression data sets. These methods generally group genes based on the similarity of expression patterns. Based on large-scale microarray data retrieved from biological experiments, many computational approaches have been proposed to reconstruct genetic regulatory networks, such as Boolean networks

Much recent work has been done to reconstruct gene regulatory networks from expression data using Bayesian networks and dynamic Bayesian networks (DBNs). Bayesian network approaches have been used in modeling genetic regulatory networks because of their probabilistic nature. However, drawbacks of Bayesian network approaches include the failure to capture temporal information and the inability to model cyclic networks. DBN is better suited for characterizing time series gene expression data than the static version. Perrin et al.

The Boolean Network model, originally introduced by Kauffman

In this paper, two important computational approaches for modeling gene regulatory networks, PBN and DBN, are compared using a biological time-series dataset from the Drosophila Interaction Database

Results

A real biological time series data set (Drosophila genes network from Drosophila Interaction Database) was used to compare PBN and DBN approaches for modeling gene regulatory networks


The interactions and scores of Mlp84B with other genes

| High confidence | Score | Other interactions | Score |
|---|---|---|---|
| CG10722 | 0.5642 | | 0.3569 |
| CG13501 | 0.9005 | | 0.1108 |
| CG17440 | 0.5811 | | 0.3155 |
| CG7046 | 0.6626 | | 0.2436 |
| CG7447 | 0.5411 | | 0.2523 |
| CG11115 | 0.7917 | | 0.1094 |

Here, we first selected 12 genes to infer GRNs using PBN and DBN. The constructed GRNs are shown in Figure

**Drosophila larval somatic muscle development network**. (a) The genetic network inferred by PBN. (b) The genetic network inferred by DBN.

More comparison results of PBN(n, e) and DBN(n, e) are given in Table

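Comparison of PBN and DBN methods using different sample networks

| Method(n, e) | Miss errors (min) | Miss errors (max) | Miss errors (avg) | False alarm errors (min) | False alarm errors (max) | False alarm errors (avg) | Correct edges (min) | Correct edges (max) | Correct edges (avg) | Recall (%) | Precision (%) | Time (s), avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PBN(12,18) | 2 | 9 | 6.4 | 0 | 4 | 2.4 | 6 | 9 | 7.8 | 54.9 | 76.5 | 13.2 |
| PBN(20,35) | 12 | 22 | 16.8 | 3 | 6 | 4.8 | 11 | 15 | 13.6 | 44.7 | 73.9 | 19.7 |
| PBN(30,60) | 33 | 41 | 36.0 | 7 | 10 | 8.0 | 17 | 20 | 18.4 | 33.8 | 69.6 | 27.9 |
| PBN(40,80) | 48 | 63 | 55.4 | 4 | 6 | 5.6 | 18 | 22 | 19.6 | 26.1 | 77.8 | 39.2 |
| DBN(12,18) | 3 | 8 | 5.8 | 1 | 3 | 2.2 | 9 | 11 | 10.4 | 64.2 | 82.5 | 20.1 |
| DBN(20,35) | 13 | 17 | 15.2 | 4 | 7 | 5.4 | 14 | 18 | 16.8 | 52.5 | 75.7 | 36.0 |
| DBN(30,60) | 30 | 39 | 33.6 | 11 | 15 | 12.6 | 24 | 30 | 20.2 | 37.5 | 61.6 | 50.6 |
| DBN(40,80) | 46 | 57 | 51.2 | 5 | 9 | 7.4 | 28 | 34 | 22.8 | 30.8 | 75.5 | 87.6 |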

We used the benchmark measures recall

The results in Table
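The recall and precision figures in the comparison table are consistent with the usual definitions, recall = correct / (correct + miss) and precision = correct / (correct + false alarm), applied to the average edge counts. The following minimal sketch checks this for the PBN(12,18) row; the function name `recall_precision` is a hypothetical helper introduced here for illustration.

```python
def recall_precision(correct, miss, false_alarm):
    """Return (recall, precision) from edge counts.

    recall    = correct / (correct + miss)
    precision = correct / (correct + false_alarm)
    """
    recall = correct / (correct + miss)
    precision = correct / (correct + false_alarm)
    return recall, precision


# Average counts reported for PBN(12,18) in the comparison table.
r, p = recall_precision(correct=7.8, miss=6.4, false_alarm=2.4)
print(f"recall={100 * r:.1f}%  precision={100 * p:.1f}%")
# prints: recall=54.9%  precision=76.5%
```

The same calculation on the DBN(12,18) averages (10.4 correct, 5.8 missed, 2.2 false alarms) gives 64.2% and 82.5%, matching the table.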

Discussion

It is challenging to infer GRNs from time series gene expression data for several reasons. Among thousands of genes, each gene interacts with one or more other genes directly or indirectly through complex, dynamic, and nonlinear relationships; the time series data used to infer genetic networks have a small sample size compared to the number of genes; and gene expression data may contain a substantial amount of noise. Different approaches may perform differently on different datasets. Moreover, inference accuracy depends not only upon the models but also on the inference schemes. In this paper, we selected only one representative inference algorithm each for PBN and DBN to model the GRNs. It is desirable to perform a more comprehensive evaluation of the two approaches with different inference methods and to develop more robust algorithms and techniques to improve the accuracy of inferring GRNs.

Conclusion

PBN-based and DBN-based methods were used to infer GRNs from a Drosophila time series dataset with 74 time points obtained from the Drosophila Interaction Database. The results showed that accuracy in terms of recall and precision can be improved if a smaller subset of genes is selected for inferring GRNs. Both PBN and DBN approaches performed well in modeling the gene regulatory networks. In all tested cases, DBN identified more gene interactions and gave better recall than PBN. The accuracy of the inferred GRNs depended not only on the choice of model but also on the particular inference algorithms selected for implementation. Different inference schemes may be applied to improve accuracy and performance.

Methods

Boolean network and probabilistic Boolean network

In a Boolean network (BN), the expression level of a target gene is functionally related to the expression states of other genes through logical rules, and the target gene is updated by other genes through a Boolean function. A BN has only two gene expression levels (states), on and off, which represent "activated" and "inhibited", respectively. A probabilistic Boolean network (PBN) consists of a family of Boolean networks and incorporates rule-based dependencies between variables. In a PBN model, BNs are allowed to switch from one to another with certain probabilities during state transitions. Since PBN is more suitable for GRN reconstruction from time series data, and a Boolean network is just a special case of a PBN, we only consider PBN for comparison.

Boolean network

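We use the standard definition of a Boolean network. A Boolean network $G(V, F)$ is defined by a set of nodes $V = \{x_1, x_2, \ldots, x_n\}$ (where $x_i \in \{0, 1\}$ is a binary variable) and a set of Boolean functions $F = \{f_1, f_2, \ldots, f_n\}$, which represents the transitional relationships between different time points. A Boolean function $f_i(x_{j_1(i)}, x_{j_2(i)}, \ldots, x_{j_{k(i)}(i)})$ with $k(i)$ specified input nodes is assigned to node $x_i$. The gene status (state) at time point $t + 1$ is determined by the values of some other genes at the previous time point $t$ using one Boolean function $f_i$ taken from a set of Boolean functions $F$:

$$x_i(t+1) = f_i\big(x_{j_1(i)}(t), x_{j_2(i)}(t), \ldots, x_{j_{k(i)}(i)}(t)\big)$$

where each $x_i$ represents the expression value of gene $i$: if $x_i = 0$, gene $i$ is inhibited; if $x_i = 1$, it is activated. The variable $j_{k(i)}$ represents the mapping between gene networks at different time points. The Boolean function set $F$ represents the rules of regulatory interactions between genes.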
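To make the update rule concrete, the following is a minimal sketch of one synchronous Boolean network step, in which every gene at time t + 1 is computed from the values of its input genes at time t. The three-gene network, its rules, and the helper name `bn_step` are hypothetical and chosen only for illustration; they are not taken from the Drosophila data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Each gene i maps to (indices of its input genes, Boolean function of those inputs).
Rule = Tuple[Sequence[int], Callable[..., int]]


def bn_step(state: Sequence[int], rules: Dict[int, Rule]) -> List[int]:
    """Synchronously update every gene from the state at time t."""
    return [
        int(rules[i][1](*(state[j] for j in rules[i][0])))
        for i in range(len(state))
    ]


# Hypothetical 3-gene network (illustrative only).
rules = {
    0: ((1, 2), lambda a, b: a and b),  # x0(t+1) = x1(t) AND x2(t)
    1: ((0,), lambda a: 1 - a),         # x1(t+1) = NOT x0(t)
    2: ((0, 1), lambda a, b: a or b),   # x2(t+1) = x0(t) OR x1(t)
}

state = [1, 0, 1]
for t in range(3):
    state = bn_step(state, rules)
    print(t + 1, state)
# prints:
# 1 [0, 0, 1]
# 2 [0, 1, 0]
# 3 [0, 1, 1]
```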

Probabilistic Boolean network

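Probabilistic Boolean network inference extends Boolean network methods by combining more than one possible transition Boolean function for each gene, so that each function can be randomly selected to update the target gene according to a selection probability that is proportional to the coefficient of determination (COD) of that Boolean function. Here we briefly give the standard PBN notation. The same set of nodes $V = \{x_1, x_2, \ldots, x_n\}$ as in a Boolean network is used in a PBN $G(V, F)$, but the list of functions $F = \{f_1, f_2, \ldots, f_n\}$ is replaced by $F = \{F_1, F_2, \ldots, F_n\}$, where each function set $F_i = \{f_j^{(i)}\}_{j = 1, \ldots, l(i)}$, composed of $l(i)$ possible Boolean functions, corresponds to node $x_i$. A realization of the PBN at a given time point is determined by a vector of Boolean functions. Each realization of the PBN maps one of the vector functions $f_k = (f_{k_1}^{(1)}, f_{k_2}^{(2)}, \ldots, f_{k_n}^{(n)})$, $1 \le k \le N$, $1 \le k_i \le l(i)$, where $f_{k_i}^{(i)} \in F_i$ and $N$ is the number of possible realizations. Given the values of all genes in the network at time point $t$ and a realization $f_k$, the state of the genes after one updating step is expressed as

$$\big(x_1(t+1), x_2(t+1), \ldots, x_n(t+1)\big) = f_k\big(x_1(t), x_2(t), \ldots, x_n(t)\big)$$

Let $f = (f^{(1)}, f^{(2)}, \ldots, f^{(n)})$ denote a random vector taking values in $F_1 \times F_2 \times \cdots \times F_n$. The probability that a specific transition function $f_j^{(i)}$ ($1 \le j \le l(i)$) is used to update gene $i$ is

$$c_j^{(i)} = \Pr\{f^{(i)} = f_j^{(i)}\} = \sum_{k:\, f_{k_i}^{(i)} = f_j^{(i)}} \Pr\{f = f_k\}$$

Given genes $V = \{x_1, x_2, \ldots, x_n\}$, each $x_i$ is assigned a set of Boolean functions $F_i = \{f_j^{(i)}\}_{j = 1, \ldots, l(i)}$ that is used to update the target gene.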
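The PBN update described above can be sketched as follows: for each gene, one Boolean function is drawn from that gene's function set according to its selection probability and then applied to the current state. This sketch assumes that functions are selected independently for each gene, which is a simplifying assumption rather than something stated in the text; the two-gene function sets and the name `pbn_step` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# For each gene: a list of (selection probability, Boolean function of the full state).
Predictor = Tuple[float, Callable[[Sequence[int]], int]]


def pbn_step(
    state: Sequence[int],
    function_sets: List[List[Predictor]],
    rng: random.Random,
) -> List[int]:
    """One PBN transition: pick one function per gene, then update all genes."""
    next_state = []
    for predictors in function_sets:
        weights = [c for c, _ in predictors]
        funcs = [f for _, f in predictors]
        # Assumption: each gene's function is drawn independently of the others.
        chosen = rng.choices(funcs, weights=weights, k=1)[0]
        next_state.append(int(chosen(state)))
    return next_state


# Hypothetical 2-gene PBN (illustrative only).
function_sets = [
    [(0.7, lambda s: s[1]), (0.3, lambda s: 1 - s[1])],  # gene 0: two candidate rules
    [(1.0, lambda s: s[0] & s[1])],                      # gene 1: a single rule
]

rng = random.Random(0)
state = [1, 1]
for t in range(5):
    state = pbn_step(state, function_sets, rng)
    print(t + 1, state)
```

When every function set contains exactly one function, the procedure reduces to an ordinary Boolean network update, which reflects the statement that a Boolean network is a special case of a PBN.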

A basic building block of a PBN.

Construction of GRNs from PBN

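The coefficient of determination (COD) is used to select a list of predictors for a given gene. Let $x_i$ be the target gene, let $x_1^{(i)}, x_2^{(i)}, \ldots, x_{l(i)}^{(i)}$ be candidate sets of predictor genes (conditioning sets), and let $f_1^{(i)}, f_2^{(i)}, \ldots, f_{l(i)}^{(i)}$ be the corresponding Boolean functions. The COD for $x_i$ relative to the conditioning set $x_k^{(i)}$ can be defined by

$$\theta_k^i = \frac{\varepsilon_i - \varepsilon\big(x_i, f_k^{(i)}(x_k^{(i)})\big)}{\varepsilon_i}$$

where $\varepsilon_i$ is the error of the best estimate of $x_i$ in the absence of any conditioning variables, and $\varepsilon\big(x_i, f_k^{(i)}(x_k^{(i)})\big)$ is the error of estimating $x_i$ by $f_k^{(i)}$ applied to the conditioning set.

Now, if a class of gene sets $x_1^{(i)}, x_2^{(i)}, \ldots, x_{l(i)}^{(i)}$ possessing high CODs has been selected, the optimal Boolean functions $f_1^{(i)}, f_2^{(i)}, \ldots, f_{l(i)}^{(i)}$ can be used as the rule set for gene $x_i$, with the probability of $f_k^{(i)}$ being chosen given by

$$c_k^{(i)} = \frac{\theta_k^i}{\sum_{j=1}^{l(i)} \theta_j^i}$$

According to the above expressions, Boolean functions with higher CODs are selected with higher probability.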
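As an illustration of how a COD might be estimated from binarized time series, the sketch below compares the error of the best constant guess for the target with the error of the best Boolean function of a candidate predictor set, and then normalizes the CODs into selection probabilities. The empirical error estimates, the convention of returning 0 for a constant target, the toy data, and the function names are all assumptions made here for illustration.

```python
from collections import Counter, defaultdict


def cod(inputs, target):
    """Estimate the coefficient of determination from binary samples.

    inputs: list of tuples with the predictor values at time t
    target: list with the target gene value at time t + 1
    """
    n = len(target)
    counts = Counter(target)
    eps_0 = min(counts[0], counts[1]) / n  # error of the best constant estimate
    if eps_0 == 0:
        return 0.0  # constant target; convention assumed in this sketch
    by_pattern = defaultdict(Counter)
    for x, y in zip(inputs, target):
        by_pattern[tuple(x)][y] += 1
    # The best Boolean function predicts the majority target value for each input pattern.
    eps_f = sum(min(c[0], c[1]) for c in by_pattern.values()) / n
    return (eps_0 - eps_f) / eps_0


def selection_probabilities(cods):
    """Normalize CODs so that each probability is proportional to its COD."""
    total = sum(cods)
    if total == 0:
        return [1.0 / len(cods)] * len(cods)  # fall back to uniform selection
    return [c / total for c in cods]


# Toy data (illustrative only): two single-gene predictor sets for one target.
target = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [(v,) for v in [0, 1, 1, 0, 1, 0, 1, 0]]
pred_b = [(v,) for v in [0, 1, 0, 0, 1, 1, 1, 1]]

cods = [cod(pred_a, target), cod(pred_b, target)]
print([round(c, 3) for c in cods])                            # [0.667, 0.333]
print([round(c, 3) for c in selection_probabilities(cods)])   # [0.667, 0.333]
```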

Bayesian networks and dynamic Bayesian networks

Among the many computational approaches that infer gene regulatory networks from time series data, Bayesian network analysis draws significant attention because of its probabilistic nature. DBN is the temporal extension of Bayesian network analysis. It is a general model class that is capable of representing complex temporal stochastic processes. It also includes several other commonly used modeling frameworks as special cases, such as hidden Markov models (and their variants) and Kalman filter models.

Bayesian network

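Given a set of variables $U = \{x_1, x_2, \ldots, x_n\}$ in a gene network, a Bayesian network for $U$ is a pair $B = (G, \Theta)$ that encodes a joint probability distribution over all states of $U$. It is composed of a directed acyclic graph (DAG) $G$, whose nodes correspond to the variables in $U$, and $\Theta$, which defines a set of local conditional probability distributions (CPDs) to quantify the network. Let $\mathrm{Pa}(x_i)$ denote the parents of the variable $x_i$ in the acyclic graph $G$ and $\mathrm{pa}(x_i)$ denote the values of the corresponding variables. Given $G$ and $\Theta$, a Bayesian network defines a unique joint probability distribution over $U$:

$$\Pr\{x_1, x_2, \ldots, x_n\} = \prod_{i=1}^{n} \Pr\{x_i \mid \mathrm{pa}(x_i)\}$$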

For more detail on Bayesian networks, see
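A Bayesian network represents the joint distribution as a product of local conditional probabilities, one per variable given its parents. The sketch below evaluates that product for binary variables; the two-node network A → B, its probability values, and the function name `joint_probability` are hypothetical and used only for illustration.

```python
from itertools import product


def joint_probability(assignment, parents, cpt):
    """Joint probability of a full assignment in a binary Bayesian network.

    assignment: dict mapping variable name -> 0 or 1
    parents:    dict mapping variable name -> tuple of parent names
    cpt:        dict mapping variable name -> {parent values: Pr(var = 1 | parents)}
    """
    p = 1.0
    for var, value in assignment.items():
        pa_values = tuple(assignment[q] for q in parents[var])
        p_one = cpt[var][pa_values]
        p *= p_one if value == 1 else 1.0 - p_one
    return p


# Hypothetical network A -> B (illustrative values only).
parents = {"A": (), "B": ("A",)}
cpt = {
    "A": {(): 0.3},                # Pr(A = 1)
    "B": {(0,): 0.1, (1,): 0.8},   # Pr(B = 1 | A)
}

print(round(joint_probability({"A": 1, "B": 1}, parents, cpt), 4))  # 0.24

# Sanity check: the probabilities of all assignments sum to one.
total = sum(
    joint_probability({"A": a, "B": b}, parents, cpt)
    for a, b in product((0, 1), repeat=2)
)
print(round(total, 6))  # 1.0
```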

Dynamic Bayesian network

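A DBN is defined by a pair $(B_0, B_1)$ that represents the joint probability distribution over all possible time series of variables $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$, where $X_i$ ($1 \le i \le n$) represents the random variables in the network and the lower-case $x_i$ ($1 \le i \le n$) denotes the value of variable $X_i$. It is composed of an initial-state Bayesian network $B_0 = (G_0, \Theta_0)$ and a transition Bayesian network $B_1 = (G_1, \Theta_1)$, where $B_0$ specifies the joint distribution of the variables in $\mathbf{X}(0)$ and $B_1$ represents the transition probabilities $\Pr\{\mathbf{X}(t+1) \mid \mathbf{X}(t)\}$ for all $t$. In slice 0, the parents of $X_i(0)$ are assumed to be those specified in the prior network $B_0$, which means $\mathrm{Pa}(X_i(0)) \subseteq \mathbf{X}(0)$ for all $1 \le i \le n$; in slice $t + 1$, the parents of $X_i(t+1)$ are nodes in slice $t$, that is, $\mathrm{Pa}(X_i(t+1)) \subseteq \mathbf{X}(t)$ for all $1 \le i \le n$ and $t \ge 0$.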

An example of a DBN is shown in Figure

A basic building block of a DBN.
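To illustrate how a DBN assigns a probability to an observed time series, the sketch below sums the log-probability of the initial slice and of every transition between consecutive slices. It assumes a first-order Markov, time-invariant transition model, an initial slice in which genes are independent, binary variables, and probabilities strictly between 0 and 1; these assumptions, the toy two-gene model, and the name `dbn_log_likelihood` are introduced here for illustration only.

```python
import math


def dbn_log_likelihood(series, parents, cpt, prior):
    """Log-probability of a binary time series under a simple DBN.

    series:  list of state tuples x(0), x(1), ..., x(T)
    parents: parents[i] = indices of the parents of gene i in the previous slice
    cpt:     cpt[i][parent values] = Pr(X_i(t+1) = 1 | parents at time t)
    prior:   prior[i] = Pr(X_i(0) = 1), genes assumed independent in slice 0
    """
    def log_bernoulli(value, p_one):
        return math.log(p_one if value == 1 else 1.0 - p_one)

    # Initial slice.
    ll = sum(log_bernoulli(x, p) for x, p in zip(series[0], prior))
    # Transitions between consecutive slices, using the same CPTs at every step.
    for prev, curr in zip(series, series[1:]):
        for i, value in enumerate(curr):
            pa_values = tuple(prev[j] for j in parents[i])
            ll += log_bernoulli(value, cpt[i][pa_values])
    return ll


# Hypothetical 2-gene DBN (illustrative values only).
parents = [(1,), (0,)]            # gene 0 depends on gene 1, and vice versa
cpt = [
    {(0,): 0.2, (1,): 0.9},       # Pr(X_0(t+1) = 1 | X_1(t))
    {(0,): 0.5, (1,): 0.5},       # Pr(X_1(t+1) = 1 | X_0(t))
]
prior = [0.5, 0.5]
series = [(0, 1), (1, 0), (0, 1)]

ll = dbn_log_likelihood(series, parents, cpt, prior)
print(round(math.exp(ll), 4))  # 0.045
```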

Construction of GRNs from DBN

Given a set of training gene data, finding the network structure that best fits the data is called learning the structure of a dynamic Bayesian network. The goal of constructing a network is to find the model with maximum likelihood (e.g., the REVEAL algorithm in

Algorithms for learning gene network structure have focused on networks with complete data. Structural Expectation Maximization (SEM) was developed to handle data with hidden variables and missing values. One of the algorithms for inferring network structure from training data is based on mutual information analysis of the data. For each node, this algorithm learns the optimal parent set independently by choosing the parent set that maximizes a scoring function. The scoring function is defined by

where

For each inferred network, scoring metrics are used to evaluate the probabilistic scores which explain relationships in the given data sets. There are two popular Bayesian scoring metrics: the BDe (Bayesian Dirichlet equivalence) score
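The mutual-information-based parent selection described above can be sketched as follows: for a given target gene, each candidate parent set in slice t is scored by its mutual information with the target's value in slice t + 1, and the best-scoring set is kept. This sketch uses the raw empirical mutual information as the score and prefers smaller parent sets on ties; it stands in for the scoring function of the text, which it does not reproduce. The toy data and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def best_parents(data, target, max_parents=2):
    """Pick the parent set at time t that best predicts the target at time t + 1."""
    n_genes = len(data[0])
    future = [row[target] for row in data[1:]]
    best, best_score = (), 0.0
    for k in range(1, max_parents + 1):
        for cand in combinations(range(n_genes), k):
            past = [tuple(row[j] for j in cand) for row in data[:-1]]
            score = mutual_information(past, future)
            # Require a strict improvement so smaller parent sets win ties.
            if score > best_score + 1e-12:
                best, best_score = cand, score
    return best, best_score


# Toy binary time series for 3 genes (illustrative only); gene 0 at t + 1
# is the negation of gene 1 at t.
data = [
    (0, 1, 0),
    (0, 0, 1),
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
    (0, 0, 1),
]

parent_set, score = best_parents(data, target=0)
print(parent_set, round(score, 3))  # (1,) 0.971
```

Exhaustive search over parent sets grows combinatorially with the number of genes, which is one reason the sketch caps the parent set size with `max_parents`.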

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PL implemented the algorithms and inferred gene networks. PL and CZ performed the statistical analysis and drafted the manuscript. CZ and YD coordinated the study. EP, PG and YD gave suggestions to improve the methods and revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the Army Environmental Quality Program of the US Army Corps of Engineers under contract #W912HZ-05-P-0145. Permission was granted by the Chief of Engineers to publish this information. The project was also supported by the Mississippi Functional Genomics Network (DHHS/NIH/NCRR Grant# 2P20RR016476-04).

This article has been published as part of