Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, Osaka, Japan

Abstract

Background

Bayesian networks (BNs) have been widely used to estimate gene regulatory networks. Many BN methods have been developed to estimate networks from microarray data. However, two serious problems reduce the effectiveness of current BN methods. The first problem is that BN-based methods require huge computational time to estimate large-scale networks. The second is that the estimated network cannot have cyclic structures, even if the actual network has such structures.

Results

In this paper, we present a novel BN-based deterministic method with reduced computational time that allows cyclic structures. Our approach generates all combinations of three genes (triplets), estimates the network of each triplet by BN, and unites these networks into a single network containing all genes. This method reduces the search space for predicting gene regulatory networks without degrading the solution accuracy compared with the greedy hill climbing (GHC) method. The computational time grows as the cube of the number of genes. In addition, the network estimated by our method can include cyclic structures.

Conclusions

We verified the effectiveness of the proposed method for all known gene regulatory networks and their expression profiles. The results demonstrate that this approach can predict regulatory networks with reduced computational time without degrading the solution accuracy compared with the GHC method.

Background

Finding gene regulations is an important objective of systems biology

Recently, microarray

A Boolean network is a discrete dynamical network

A GGM is an undirected probabilistic graphical model

A differential equation model describes gene expression changes as a function of the expression of other genes and environmental factors

BN is a graphical model for representing probabilistic relationships among a set of random variables

Using a BN, it is hard to estimate a large-scale network because the search space grows exponentially as the number of genes increases. Therefore, overcoming this problem has been the focus of much research. The proposed solutions can be divided into three types. The first type limits the number of genes to be estimated; even in a large-scale network, often only part of the network is of interest. The second type parallelizes the estimation on a supercomputer or other high-performance computer; effective parallelization makes it possible to estimate large-scale networks. The third type improves the algorithm itself; these methods reduce computational time by estimating the network with a heuristic.

An example of the first type of solution is proposed by Peña

A solution of the second type proposed by Tamada

A solution of the third type for estimating gene regulatory networks was implemented by Bøttcher

In this paper, we present a novel BN-based deterministic method with reduced computational time to overcome the above-mentioned problems. The proposed method can estimate networks as large-scale as those estimated by the GHC method, runs on a workstation, and estimates more accurately than the GHC method. To achieve this accuracy, we take a different approach. First, our method generates all subsets of three genes. Then, we score all possible networks for each subset using the BN method and unite the networks into a single network including all genes. For the same computational time, this approach yields more accurate estimates than the GHC method.

To verify the effectiveness of the proposed method, we perform two experiments that evaluate scalability and accuracy: one verifies that the proposed method can estimate networks as large-scale as those estimated by the GHC method, and the other verifies that it estimates more accurately than the GHC method. These experiments are performed using randomly sampled genes. In addition, we conduct a third experiment to confirm that our method outperforms the GHC method on real data.

Results

Bayesian networks

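Let $D = (V, E)$ be a directed acyclic graph (DAG), where $V$ is a finite set of nodes and $E$ is a finite set of directed edges between the nodes. Each node $v \in V$ corresponds to a random variable $X_v$. The set of variables associated with the graph $D$ is then $X = \{X_v\}_{v \in V}$. Often we do not distinguish between a variable $X_v$ and the corresponding node $v$. To each node $v$ with parents $pa(v)$, a local probability distribution, $p(x_v \mid x_{pa(v)})$, is attached. The set of local probability distributions for all variables in the network is denoted $P$.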

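As a measure of how well a DAG $D$ represents the conditional dependencies between the random variables, we use the relative probability

$$S(D) = p(D, d) = p(d \mid D)\,p(D)$$

and refer to it as a network score, where $d$ denotes the observed data, $p(d \mid D)$ is the likelihood of $D$, and $p(D)$ is its prior probability.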

The log network score contribution of a node is evaluated whenever the node is learned. The log network score

The number of possible DAGs grows exponentially with the number of nodes, and the problem of identifying the network with the highest score is NP-hard. If the number of random variables in a network is large, it is not computationally possible to calculate the network score for all possible DAGs. For these situations, the search strategy

The GHC method is as follows.

1. Select an initial DAG $D_0$ randomly from which to start the search.

2. Calculate the Bayes scores of $D_0$ and of all possible networks that differ from it by only one directed edge, that is, networks in which an edge is added to $D_0$, an edge in $D_0$ is deleted, or the direction of an edge in $D_0$ is reversed.

3. Among all these networks, select the one that increases the Bayes score the most.

4. If the Bayes score was not improved, stop the search. Otherwise, make the selected network $D_0$ and repeat from step 2.

In the GHC method, we can limit the maximum number of these steps in the search algorithm. Also, the search algorithm can restart an arbitrary number of times. More details on the parameter setting will be described later in this paper.
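The four steps above can be sketched in Python as follows. This is a minimal illustration, not the deal implementation used in the experiments: the `score` argument is a hypothetical callable standing in for the Bayes score, the search starts from an empty graph by default rather than from a random DAG, and restarts are omitted. The `max_steps` argument plays the role of the limit on the number of actions.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Return True if the directed graph (nodes, edges) has no directed cycle."""
    remaining = set(nodes)
    while remaining:
        # A node with no parent among the remaining nodes can be peeled off.
        sources = [v for v in remaining
                   if not any(u in remaining for (u, w) in edges if w == v)]
        if not sources:
            return False  # every remaining node has a parent: a cycle exists
        remaining -= set(sources)
    return True


def neighbours(nodes, edges):
    """Yield every DAG that differs from `edges` by one add, delete, or reversal."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                      # delete (always acyclic)
            candidate = (edges - {(u, v)}) | {(v, u)}   # reverse
        elif (v, u) not in edges:
            candidate = edges | {(u, v)}                # add
        else:
            continue
        if is_acyclic(nodes, candidate):
            yield candidate


def greedy_hill_climbing(nodes, score, start=frozenset(), max_steps=50):
    """Repeatedly move to the best-scoring neighbour until no neighbour improves."""
    current = frozenset(start)
    current_score = score(current)
    for _ in range(max_steps):
        best = max(neighbours(nodes, current), key=score, default=None)
        if best is None or score(best) <= current_score:
            break  # local optimum reached
        current, current_score = frozenset(best), score(best)
    return current, current_score


# Toy score (an assumption for illustration): closeness to a fixed target DAG.
target = frozenset({("a", "b"), ("b", "c")})
toy_score = lambda dag: -len(set(dag) ^ target)
best_dag, best_score = greedy_hill_climbing(["a", "b", "c"], toy_score)
```

Because each step keeps only the single best neighbour, the search stops at the first DAG that no one-edge change improves, which may be a local rather than a global optimum.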

Methods

We propose a new method to estimate a gene regulatory network with reduced computational time. The proposed method is composed of three steps: dividing the whole problem into partial problems, estimating gene regulatory networks of partial problems, and uniting the estimated networks. In this section, we describe our BN-based method using the analysis of a set of expression data as an example. This example includes five genes $X = \{X_i \mid 1 \le i \le 5\}$.

**Conceptual representation of our approach.** Yellow circles represent genes. Blue circles represent partial problems. Small directed edges represent regulatory relationships between genes. Large directed edges represent the flow of the method.

Step 1: Dividing the whole problem into partial problems

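Our approach first divides the set of all genes $X$ into subsets of three genes $\{X_i, X_j, X_k\}$ ($X_i, X_j, X_k \in X$), which we call partial problems. In the example, we obtain ${}_5C_3 = 10$ partial problems $\{X_1, X_2, X_3\}, \{X_1, X_2, X_4\}, \ldots, \{X_3, X_4, X_5\}$.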
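Step 1 amounts to enumerating all three-element subsets of the gene set. A minimal Python sketch for the five-gene example is shown here; the gene labels `X1` to `X5` are placeholder names chosen for illustration.

```python
from itertools import combinations
from math import comb

# Placeholder gene labels for the five-gene example.
genes = ["X1", "X2", "X3", "X4", "X5"]

# Every unordered triplet of genes is one partial problem.
partial_problems = list(combinations(genes, 3))

# The number of partial problems is the binomial coefficient "n choose 3".
n_problems = comb(len(genes), 3)
```

`combinations` emits the triplets in lexicographic order, so the list starts with `("X1", "X2", "X3")` and ends with `("X3", "X4", "X5")`.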

Step 2: Estimating gene regulatory networks

After making partial problems, we next independently calculate the scores of all the possible networks of each partial problem by exhaustive search and obtain estimated DAGs. For example, the partial problem $\{X_1, X_2, X_3\}$ has $3^3 = 27$ possible networks because there are three cases for each potential edge $(X_i, X_j)$ ($1 \le i < j \le 3$): a directed edge from $X_i$ to $X_j$, a directed edge from $X_j$ to $X_i$, and no edge.
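The enumeration of the 27 candidate networks of one partial problem, and their ranking by score, can be sketched as follows. The `score` argument is a hypothetical callable standing in for the BN network score, and the function names are chosen for illustration only. Note that two of the 27 edge configurations are directed three-cycles; the text does not say how these are handled, so this sketch simply enumerates all 27.

```python
from itertools import combinations, product

EDGE_TYPES = ("forward", "backward", "none")


def candidate_networks(triplet):
    """Yield all 3**3 = 27 edge configurations of a three-gene partial problem."""
    pairs = list(combinations(triplet, 2))  # the three potential edges
    for types in product(EDGE_TYPES, repeat=len(pairs)):
        edges = set()
        for (a, b), edge_type in zip(pairs, types):
            if edge_type == "forward":
                edges.add((a, b))
            elif edge_type == "backward":
                edges.add((b, a))
        yield frozenset(edges)


def ranked_tuples(triplet, score):
    """Return (rank, score, network) tuples, best score first, ranks from 1.

    Networks with equal scores receive consecutive ranks in enumeration order.
    """
    scored = sorted(((score(d), d) for d in candidate_networks(triplet)),
                    key=lambda item: item[0], reverse=True)
    return [(rank, s, d) for rank, (s, d) in enumerate(scored, start=1)]


# Toy score (an assumption for illustration): prefer sparser networks.
example = ranked_tuples(("X1", "X2", "X3"), score=lambda d: -len(d))
```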

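For each network $D$ of a partial problem, let $(R_D, S_D)$ be a tuple, where $S_D$ is the score of $D$ and $R_D$ is the rank of $D$ among the networks of its partial problem.

We add the tuples of all the partial problems to a single set of tuples. In the example, for the partial problems $\{X_1, X_2, X_3\}, \{X_1, X_2, X_4\}, \ldots, \{X_3, X_4, X_5\}$, we add $10 \times 27 = 270$ tuples of networks to this set.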

Step 3: Uniting estimated partial problems

To solve the original problem, this step unites the three-gene networks into a single gene regulatory network. The policy of the step is to classify the relationship between each pair of genes, i.e., to determine the type of each edge $(X_i, X_j)$ ($1 \le i < j \le 5$): a directed edge from $X_i$ to $X_j$, a directed edge from $X_j$ to $X_i$, or no edge between $X_i$ and $X_j$, according to the scores calculated in Step 2.

To select an edge type between genes $X_i$ and $X_j$, we calculate an edge $(X_i, X_j)$ value for each of the three types

where _{i}, _{j}). Then we select one edge type that has the highest total value.

When two or more edge types have the highest total value, we use edge scores of the partial problems whose ranks are 2 or more.
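One possible reading of this uniting step is sketched below in Python. Since the formula for the edge value is not reproduced here, the sketch makes several assumptions that should be checked against the original definition: scores are positive (for example, relative probabilities rather than log scores), the value of an edge type is the total score of the rank-1 networks that contain that type, and ties are broken by accumulating the scores of rank 2, rank 3, and so on. The function names and the fallback to "none" for unresolved ties are likewise illustrative.

```python
from collections import defaultdict
from itertools import combinations


def edge_type(network, a, b):
    """Classify the relationship between genes a and b in one network."""
    if (a, b) in network:
        return "forward"
    if (b, a) in network:
        return "backward"
    return "none"


def unite(genes, ranked):
    """Pick one edge type per gene pair by totalling scores over partial problems.

    `ranked` maps each triplet to a list of (rank, score, network) tuples.
    """
    max_rank = max(r for tuples in ranked.values() for r, _, _ in tuples)
    united = {}
    for a, b in combinations(genes, 2):
        totals = defaultdict(float)
        winners = []
        for rank in range(1, max_rank + 1):
            # Accumulate scores of networks with the current rank.
            for triplet, tuples in ranked.items():
                if a in triplet and b in triplet:
                    for r, s, network in tuples:
                        if r == rank:
                            totals[edge_type(network, a, b)] += s
            if totals:
                best = max(totals.values())
                winners = [t for t, v in totals.items() if v == best]
                if len(winners) == 1:
                    break  # a unique best edge type was found
        # Assumed fallback: no edge when the tie is never resolved.
        united[(a, b)] = winners[0] if len(winners) == 1 else "none"
    return united


# Toy input: two partial problems whose rank-1 networks disagree on (A, B).
ranked = {
    ("A", "B", "C"): [(1, 0.5, frozenset({("A", "B")})),
                      (2, 0.3, frozenset({("A", "B"), ("B", "C")}))],
    ("A", "B", "D"): [(1, 0.5, frozenset({("B", "A")})),
                      (2, 0.2, frozenset())],
}
united = unite(["A", "B", "C", "D"], ranked)
```

In the toy input, the rank-1 networks tie between the two directions for the pair (A, B), and the rank-2 networks resolve the tie in favour of the edge from A to B. Because every pair is decided independently, nothing prevents the united network from containing a cycle, which matches the stated property of the method.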

Algorithm

**Input**: _{1}, ...,

**Output**: _{V }: DAG including genes

**Variable**:

1: Make a collection of set **V **that includes all the subsets of

2-1: for each **V **do

2-2: Make a collection of set **D**_{u }that includes all the DAGs of

2-3: for each **D**_{u }do

2-4: calculate rank _{D }and score _{D }with GEP

2-5: add (_{D}, _{D}) to

2-6: end for

2-7: end for

3-1:

3-2: repeat

3-3: for each edge between genes (x, y) in D of (_{D},

3-4: add all _{D }of (_{D},

3-5: if one edge type has the highest total _{D }then

3-6: add an edge between genes (x, y) to _{V}

3-7: end if

3-8: if two or more edge types have the highest total _{D }then

3-9: for each edge between genes (x or y, w) in _{V }, where w is a gene ≠

3-10: select edge between genes (x, y) from _{D},

3-11: end for

3-12: add edge (x, y) selected in (3-10) with the highest _{D }to _{V}

3-13: end if

3-14: end for

3-15:

3-16: until directions of all edges in _{V }are assigned

3-17: return _{V}

A flowchart of the algorithm can be found in Figure

**Flowchart of the algorithm.** Circles represent start and end points. Rectangles represent generic processing steps. Diamonds represent decision steps.

Computational experiments

To verify the effectiveness of the proposed method, we performed three experiments. The first experiment determines computational time for different numbers of genes. The purpose of this experiment is to verify that the proposed method is able to estimate gene regulatory networks that are as large-scale as those estimated by the GHC method. The second experiment demonstrates that the proposed method is more accurate than the GHC method. The third experiment shows, through an example, that our algorithm works well for inferring real gene regulatory networks. We estimate the networks, including the known gene regulatory network, and compare the network estimated by the proposed method and that by the GHC method.

Implementation, system, and materials

Steps 1 and 2 are implemented using the deal package version 1.2-33 written in R. We use R 2.10.1. Step 3 is implemented using Perl 5.10.1.

The GHC method is implemented in the deal package version 1.2-33. In these experiments, the maximum number of actions, i.e., adding, deleting, or reversing a directed edge, is set at 50 and the number of restarts is set at 0. We call these parameters the default parameter set.

We performed all the experiments on a computer with an Intel Core2 Duo 6600 CPU (2.40 GHz) and 3.0 GB of memory. The operating system is Ubuntu 10.04.

We used a dataset of two time-series gene expression profiles including 45102 genes from a mouse adipocyte and osteoblast. The number of time points is 62.

Experiment 1

We verified that the proposed method can estimate gene regulatory networks as large-scale as those estimated by the GHC method. We ran the proposed method, an exhaustive search, and the GHC method, and compared their estimation times for 3 to 70 genes. In this experiment, we selected genes from the mouse adipocyte gene expression profile by random sampling. We repeated this process 50 times and calculated the mean estimation time. The results are summarized in Figure

**Comparison of the estimation time.** The estimation time of the exhaustive search, the GHC method, and the proposed method.

In Figure

Experiment 2

We verified that the estimation accuracy of the proposed method is higher than that of the GHC method for nearly identical estimation times. We compared the estimation results of the exhaustive search with the results of the proposed method and the GHC method. In this experiment, we randomly selected five genes from the mouse adipocyte and osteoblast gene expression profiles, repeating the selection 100 times. Each time, we estimated the network of the five genes by the proposed method and by the GHC method. There are 59049 DAGs for five genes, and all the DAGs are ranked by the scores of the exhaustive search. This ranking was used to evaluate the networks estimated by the proposed method and the GHC method. The results are listed in Figure

**Comparison of the estimated network.** Frequency that the networks estimated by the GHC method and the proposed method correspond to those of the exhaustive search (from 1 to 59049).

The two bar charts in Figure

The correspondence count of the proposed method for the 1st to 10th networks of the exhaustive search exceeded 50. For the 30001st to 59049th networks of the exhaustive search, the correspondence count of the GHC method exceeded 50, whereas that of the proposed method was less than 10.
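The evaluation used in this experiment, locating an estimated network in the exhaustive ranking and counting how often it falls in a given rank range, can be sketched as follows. The function names and the toy scores are assumptions for illustration; the bin boundaries used in the figure are not restated here.

```python
def rank_in_exhaustive(scores, estimated):
    """Return the 1-based rank of `estimated` among all scored networks.

    `scores` maps every candidate network to its exhaustive-search score.
    Networks with equal scores share the better rank.
    """
    target = scores[estimated]
    return 1 + sum(1 for s in scores.values() if s > target)


def count_in_range(ranks, low, high):
    """Count how many trial ranks fall within [low, high], inclusive."""
    return sum(1 for r in ranks if low <= r <= high)


# Toy example with three candidate networks identified by name.
toy_scores = {"n1": 0.9, "n2": 0.5, "n3": 0.7}
toy_ranks = [rank_in_exhaustive(toy_scores, name) for name in ("n1", "n3", "n2")]
```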

Experiment 3

We used a known gene regulatory network and verified that the proposed method can estimate more accurately than the GHC method with the same or less computational time. We compared the regulations estimated by the proposed method with those of the GHC method. In this experiment, we used 40 genes from the gene expression profile from a mouse adipocyte. Of these, 7 genes are

**Comparison of the network including Pparγ and genes that regulate or are regulated by Pparγ.** (a) is the known gene regulatory network. (b) is the network estimated by the GHC method with the maximum number of actions set at 50 and the number of restarts set at 0. (c) is the network estimated by the GHC method with the maximum number of actions set at 100 and the number of restarts set at 10. (d) is the network estimated by the proposed method. Blue circles represent genes. Red edges indicate edges also in network (a), blue edges indicate edges with a different direction from those in network (a), and black edges indicate that there are no such relationships in network (a).

In Figure

Figure

Discussion

The GHC method tends to produce local optimal solutions. For example, in Figure

The results of our experiments indicate that dividing the set of all genes and uniting the network results can estimate more accurately than the GHC method. With the GHC method, the maximum number of actions, i.e., adding, deleting, or reversing a directed edge, and the number of restarts can be adjusted. If these parameters are increased as much as possible, the estimation accuracy can be made comparable to that of the exhaustive search. However, this would spoil the advantage of the GHC method that it can estimate with high speed. The GHC method selects the action that increases the network score the most; therefore, a regulation that increases the network score only slightly is rarely selected. In this sense, the search of the GHC method is considerably biased. This aspect becomes pronounced when the limiting parameters are set strictly. With the proposed method, regulations that have a positive effect will be selected independently of whether that effect is slight or strong. For example, in Figure

We verified that the proposed method can estimate networks as large-scale as those estimated using the GHC method. The proposed method spends at most 0.1 second estimating the network of one partial problem with three genes and repeats the estimation ${}_nC_3$ times, where $n$ is the number of genes. Because each partial problem is small, the proposed method needs little memory compared with the GHC method, which, like the exhaustive search, requires much memory. When we estimate a network from a data set with a large number of genes using the GHC method, it is easy to run out of memory, making the actual computational time longer than the theoretical time.
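The cubic growth follows from the number of partial problems, which is the binomial coefficient $n(n-1)(n-2)/6$. A short Python sketch makes the arithmetic concrete. It takes the per-problem bound of 0.1 second quoted above at face value and ignores the cost of the uniting step, so the result is only a rough upper estimate for the estimation step, not a measured time; the function names are illustrative.

```python
from math import comb


def partial_problem_count(n_genes):
    """Number of three-gene partial problems: n(n-1)(n-2)/6."""
    return comb(n_genes, 3)


def step2_time_bound(n_genes, seconds_per_problem=0.1):
    """Upper estimate of the time spent scoring all partial problems."""
    return partial_problem_count(n_genes) * seconds_per_problem


# For 70 genes, the largest size considered in Experiment 1.
count_70 = partial_problem_count(70)
bound_70 = step2_time_bound(70)
```

For 70 genes this gives 54740 partial problems and a bound of about 5474 seconds. Since the partial problems are independent of one another, only one three-gene problem needs to be held in memory at a time.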

Conclusions

In this study, we present a novel BN-based deterministic method with reduced computational time. We confirmed experimentally that the proposed method can reduce the computational time drastically without degrading the solution accuracy. The proposed method can estimate networks as large-scale as those estimated by the GHC method. Furthermore, the proposed method can estimate more accurately than the GHC method, even if the computational time of the GHC method is increased to more than 20 times that of the proposed method.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YW implemented the algorithm and performed the analyses. YW, SS, YT, and HM conceived and designed the experiments and wrote the paper.

Acknowledgements

This work was partially supported by Grant-in-Aid for Scientific Research (22680023 and 22310125) from the Japan Society for the Promotion of Science (JSPS), and by the HPCI STRATEGIC PROGRAM Computational Life Science and Application in Drug Discovery and Medical Development from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT).

This article has been published as part of