Abstract
Background
Understanding gene interactions is a fundamental question in systems biology. Currently, modeling of gene regulations using the Bayesian Network (BN) formalism assumes that genes interact either instantaneously or with a certain amount of time delay. However in reality, biological regulations, both instantaneous and time-delayed, occur simultaneously. A framework that can detect and model both these two types of interactions simultaneously would represent gene regulatory networks more accurately.
Results
In this paper, we introduce a framework based on the Bayesian Network (BN) formalism that can represent both instantaneous and time-delayed interactions between genes simultaneously. A novel scoring metric having firm mathematical underpinnings is also proposed that, unlike other recent methods, can score both interactions concurrently and takes into account the reality that multiple regulators can regulate a gene jointly, rather than in an isolated pair-wise manner. Further, a gene regulatory network (GRN) inference method employing an evolutionary search that makes use of the framework and the scoring metric is also presented.
Conclusion
By taking into consideration the biological fact that both instantaneous and time-delayed regulations can occur among genes, our approach models gene interactions with greater accuracy. The proposed framework is efficient and can be used to infer gene networks having multiple orders of instantaneous and time-delayed regulations simultaneously. Experiments are carried out using three different synthetic networks (with three different mechanisms for generating synthetic data) as well as real life networks of Saccharomyces cerevisiae, E. coli and cyanobacteria gene expression data. The results show the effectiveness of our approach.
Background
In any biological system, various genetic interactions occur concurrently amongst different genes. While some genes interact almost instantaneously, other genes could have time delayed interactions (see Figure 1). From a biological perspective, instantaneous regulations represent the scenarios where the effect of a change in the expression level of a regulator gene is carried on to the regulated gene (almost) instantaneously. In such cases, the effect is reflected almost immediately in the regulated gene’s expression levela. On the other hand, in cases where regulatory interactions are time-delayed, its effect will be seen on the regulated gene after a finite time delay.
Figure 1. Example of network structure with both instantaneous and time-delayed interactions.
Bayesian network and its extension, dynamic Bayesian network (DBN), has found significant
applications in the modeling of genetic interactions [1,2]. To the best of our knowledge, prior works on inter and intra-slice connections in
the dynamic probabilistic network formalism [3,4] have modelled a DBN using an initial network and a transition network employing the
1st-order Markov assumption, where the initial network exists only during the initial
period of time and subsequently the dynamics is expressed using only the transition
network. Realising that a d-th order DBN has variables replicated d times, a 1st-order DBN for this taskb is therefore usually limited to around 10 variables. Alternately, if a 2nd-order
DBN model is chosen, it can mostly deal with 6-7 variables [5]. Thus, prior works on DBNs were either unable to discover these two interactions
simultaneously or were unable to fully exploit its potential, thereby restricting
studies to simpler network configurations. However, since our proposed approach does
not replicate variables, we can study any complex network configuration without limitations
on the number of nodes. Zou et al. [2], while highlighting the existence of both instantaneous and time-delayed interactions
among genes while considering the parent-child relationships of a particular order,
did not account for the regulatory effects of other parents (having different order
of regulation than the current one) on that particular child. This is in violation
of the biological reality that parents with various orders of regulation can jointly
regulate a child. Our proposed method supports multiple parents to regulate a child
simultaneously, with different orders of regulation. Moreover, the limitation of detecting
genetic interactions like
, which are prevalent in genetic networks [6], is also overcome with the proposed method. Experiments conducted using both synthetic
and real-life GRNs show the effectiveness of our approach.
Results and discussion
We evaluate our proposed method by studying both synthetic networks and real-life biological networks of Saccharomyces cerevisiae (yeast), E. coli and cyanobacteria. The overall accuracy of the inference method and correctness of the modeling approach is evaluated by four widely accepted performance measures given below. The terms, TP, FP, TN and FN, used in the following expressions respectively mean the number of true positives, number of false positives, number of true negatives and number of false negatives.
1. Sensitivity(Se): It measures the proportion of true connections which are correctly inferred. It is defined as follows:
2. Specificity (Sp): Specificity is defined by the following equation:
3. Precision (Pr): Precision is proportional to the inferred connections which are correct. It is defined as follows:
4. F-score (F): Biologically, a good reconstruction algorithm should infer as many correct arcs as possible, in addition to the criteria that most of the inferred arcs should be correct. The F-score measure is the harmonic mean of Se and Pr[7] and represents a compromise between these two objectives:
Since our method uses discrete data for the statistical significance tests embedded in the scoring function, we applied the Persist [8] algorithm to discretize the data into 3 levels. The confidence level (α) is set to 0.9. We will use a local search in the DAG space with the classical operators of arc addition, arc deletion and arc reversal. The starting point of the search is always an empty graph. The parameters for all the other methods that are used for comparison are set to their default values mentioned in their user manuals.
Synthetic network
Synthetic network using differential equation based models
For performing studies using synthetic networks, we generated 3 random networks of size 10, 25 and 50 using the genenetweaver tool [9]. This tool has been used to generate in silico benchmarks in the DREAM (both DREAM3 [10] and DREAM4 [11]) challenge initiative. The tool is able to obtain biologically plausible network topologies (and also biologically plausible network dynamics) of a given size by extracting random sub-networks of Saccharomyces cerevisiae and E. Coli[9,12]. We used the tool to generate time series data as in the DREAM4 challenge with ten different perturbations for each experiment. Initial and final timestamps for the simulations were 0 and 1000, respectively, and the time step was 50. One of the objectives of this experiment was to test the usefulness of the proposed approach in the presence of noise in mRNA expression levels. Since microarray experiments can incur a wide range of noise levels depending on the technology, environment and the subject under study, we experimented under various noise levels that are likely to be present in the expression data. To mimic a real-life noisy environment, as in [13,14], we added 5 different noise levels to the data samples (random Gaussian noise with zero mean and variance, σ2 = 0.0, 0.01, 0.02, 0.05, 0.10). The performance, measured by the four performance measures, corresponding to the three different sized networks is reported in Figure 2. Figure 2(A) shows the performance variation as a function of network size and noise level. The X-axes represent the noise levels while the Y-axes represent the corresponding performance measures (Se, Sp, Pr, F). In Figure 2(B)-(D), we compare our approach with three other methods, namely TDARACNE, BANJO and BNFinder (BDe and MDL) using the F-Score (results corresponding to other measures are available in Additional file 1). It is evident from the results that there is no clear winner in all the cases. Some methods perform good in some cases, while others outperform it in other cases. However, it is clear that our proposed approach, albeit not always the best, it is always among the top performers and has consistently superior performance.
Figure 2. Reconstruction of synthetic networks generated using differential equation based methods. (A) How performance of our method varies with network size and noise (Red(*)-10 gene;
Green(o)-25 gene; Blue(square)-50 gene). The X-axes represent the 5 levels of noise
used, whereas the Y axes represent the corresponding performance measures (see text).
(B)-(D) Comparison of performance with 3 other methods for the 10, 25 and 50-gene network.
Red(+)-Proposed, Green(o)-BANJO, Blue(x)-BNFinder+BDe, Cyan(square)-BNFinder+MDL,
Magenta(diamond)-TDARACNE. X axes-noise levels, Y axes-F-Score. See text for details.
Additional file 1. Comparison of performance with 3 other methods for the 10, 25 and 50-gene differential equation based synthetic network.
Format: PDF Size: 43KB Download file
This file can be viewed with: Adobe Acrobat Reader
Probabilistic network of yeast
We use a sub-network from the yeast cell cycle, shown in Figure 3, taken from Husmeier et al. [15]. The network consists of 12 genes and 11 interactions. For each interaction, we randomly assigned a regulation order of 0, 1, 2 or 3. We used two different conditional probabilities for the interactions between the genes, namely, the noisy regulation according to a binomial distribution and the noisy XOR-style co-regulation. For the binomial distribution dependent noisy regulation, the parameters were set as follows: excitation: P(on|on) = 0.9, P(on|off) = 0.1; inhibition: P(on|on) = 0.1, P(on|off) = 0.9. For the noisy XOR-style co-regulation the parameters were set as: P(on|on, on) = P(on|off, off) = 0.1, P(on|on, off) = P(on|off, on) = 0.9 [15]. Eight confounder nodes were also added, resulting in the total number of nodes to be 20.
We used 30, 50 and 100 samples, generated 5 datasets in each case and compared our approach with two other DBN based methods, namely BANJO [16] and BNFinder [17]. Since these methods detect only regulations of order 1, while calculating performance measures for these methods, we ignored the exact orders for the time-delayed interactions in the target network. We could not apply TDARACNE [7] to this network since the generated data has two levels of discrete values and TDARACNE returns error when applied to such discrete datasets. We show the results for this network in Table 1, where we observe that our method, coupled with a high precision, outperforms the other two in terms of both sensitivity and specificity. The F-score is also the best in all the cases. This points to the strength of our method in discovering complex interaction scenarios where multiple regulators may jointly regulate target genes with varying time-delays.
Table 1. Comparison of proposed method with BANJO and BNFinder on the yeast sub-network
Synthetic network of glucose homeostasis
In higher eukaryotes, glucose homeostasis is maintained via a complex system involving many organs and signaling mechanisms. The liver plays a crucial role in this system by storing glucose as glycogen when blood glucose levels are high, and releasing glucose into the bloodstream when blood glucose levels are low. To accomplish its task, the liver responds to circulating levels of hormones, mainly insulin, epinephrine, glucagon, and glucocorticoids [18].
Le et al. [18] conducted an extensive review of the literature regarding the biological components affecting perinatal glucose metabolism. Based on the study, a Bayesian Network model of glucose homeostasis containing 35 nodes and 52 interactions (shown in Figure 4) was constructed. We used the model for generating datasets of varying size (50, 75 and 100 samples), having first and second-order regulations using the Bayes Net Toolbox [19]. The random multinomial CPDs used by this approach of data generation were obtained by sampling from a Dirichlet distribution with hyper-parameters chosen by the methodc described in [20] with a corresponding Equivalent Sample Size (ESS) value of 10. The choice of this prior distribution for the conditional parameters ensures a reasonable level of dependence between d-connected variables in the generative structure [20].
Figure 4. Synthetic network of glucose homeostasis.
We compare our method with the three other methods that were used previously for comparison, namely BANJO [16] and BNFinder [17](using BDe and MDL). While calculating performance measures for these methods, we ignored the exact orders for the time-delayed interactions in the target network. Similar to the probabilistic network of yeast, we could not apply TDARACNE for this network due to error occurring because TDARACNE is unable to cope with the discrete data. The results are shown in Table 2. We observe that, both in terms of specificity and precision, our method outperforms others. The F-score is the highest in all the cases, indicating a good balance between sensitivity and precision.
Table 2. Comparison of proposed method with BANJO and BNFinder on the glucose homeostasis network
Real-life biological data of saccharomyces cerevisiae (IRMA)
To validate our method with a real-life biological gene regulatory network, we investigate a recent network reported in [21]. In that significant work, the authors built a network, called IRMA, of the yeast Saccharomyces cerevisiae[21]. They tested the transcription of network genes by culturing the cells in presence of galactose and glucose. The network is composed of five genes regulating each other; it is also negligibly affected by endogenous genes. It is one of the first attempts at building a reference data set having an accurately known target network [7]. There are two sets of gene profiles called Switch ON and Switch OFF for this network, each containing 16 and 21 time series data points, respectively. A ’simplified’ network, ignoring some internal protein level interactions, is also reported in [21]. To compare our reconstruction method, we consider 3 other methods, namely, TDARACNE [7], BANJO [16] and BNFinder [17].
IRMA ON dataset
The performance comparison amongst various method based on the ON dataset is shown in Table 3. We observe that our method clearly outperforms the others. There are no false predictions and precision is highest. The sensitivity and F-score measures are also very high.
Table 3. Performance comparison based on IRMA ON dataset
IRMA OFF dataset
Due to the lack of ’stimulus’, it is relatively difficult to reconstruct the exact network from the OFF dataset [7]. As a result, the overall performances of all the algorithms suffer to some extent. The comparison is shown in Table 4. Again, we observe that our method reconstructs the gene network with high precision. Specificity is also quite high, implying that the inference of false positives is low.
Table 4. Comparison based on IRMA OFF dataset
Yeast KEGG pathway reconstruction
In order to test the proposed method’s performance on yeast S. cerevisiae cell cycle, we selected a eleven gene network of the G1-phase: Cln3, Cdc28, Mbp1, Swi4, Clb6, Cdc6, Sic1, Swi6, Cln1, Cln2, Clb5. The data used was obtained from the cdc28 experiment of Spellman et al. [22]. In the later stage of the G1-phase, the Cln3-Cdc28 protein kinase complex activates two transcription factors, MBF and SBF, and these promote the transcription of some genes important for budding and DNA synthesis [7,23]. Entry into the S-phase requires the activation of the protein kinase Cdc28p through binding with Clb5 or Clb6, and also the destruction of Sic1 [24]. Also, Swi4 becomes associated with Swi6 to form the SCB complex that activates CLN1 and CLN2 in late G1. Mbp1 forms the MCB-binding factor complex with Swi6, which activates DNA synthesis genes and S-phase cyclin genes CLB5 and CLB6 in late G1 [7]. In budding yeast, commitment to DNA replication during the normal cell cycle requires degradation of the cyclin-dependent kinase (CDK) inhibitor Sic1. The G1 cyclin-CDK complexes Cln1-Cdk1 and Cln2-Cdk1 initiate the process of Sic1 removal by directly catalyzing Sic1 phosphorylation at multiple sites [7,25].
In Figure 5(B)-(F), we report network graphs reconstructed by our proposed approach, TDARACNE, BNFinder(BDe and MDL) and BANJO. We also report the KEGG pathway [26] of the cell-cycle in yeast in 5(A). Since the ground truth for this network is not known, instead of applying performance measures as a means of determining network accuracy, we refer to the available correct interactions obtained from the KEGG pathway [26] and identify which of the predicted interactions are correct or otherwise. We observe from the results that our approach correctly identifies the regulation of SWI4-SWI6 and MBP1-SWI6 complex by the CLN3-CDC28 complex. Also, the proposed approach infers that the SWI4-SWI6 complex regulates the CLN1-CLN2-CDC28 complex, which is correct. Two more interactions inferred by our approach (CLN1→CLN2 and CLB5-CLB6-CDC28→CDC6) are also correct based on the KEGG pathway. Overall we observe that none of the methods perform particularly well on this network. However, the number of correct predictions by our method (5) is higher than the other methods.
SOS DNA repair network of E. coli
We analyze the well-known SOS DNA repair network in E. coli as shown in Figure 6(A). This GRN is well known for its responsibility of repairing the DNA if it gets damaged. It is the largest, most complex, and best understood DNA damage-inducible network to be characterized to date.
Figure 6. Reconstruction of SOS DNA Repair Network (A) Target Network. (B) Network Inferred by proposed approach. (C) Network Inferred by TDARACNE. (D) Network Inferred by BANJO. (E) Network Inferred by BNFinder+BDe. (F) Network Inferred by BNFinder+MDL.
The expression of the genes in the SOS regulatory network is controlled by a complex circuitry which involves the RecA and LexA proteins [27]. Normally LexA acts as the master repressor of more than 20 genes, including lexA and recA genes. This repression is done by its binding to the interaction sites in the promoter regions of these genes. When DNA damage occurs, one of the SOS proteins, RecA, acts as a sensor. By binding to single-stranded DNA, it becomes activated, senses the damage and mediates LexA autocleavage [27]. The drop in LexA levels in turn stops the repression of the SOS genes and activates them. When the damage has been repaired, the level of activated RecA drops and it stops mediating LexA autocleavage. LexA level in turn increases, starting repression of the SOS genes, and the cell then returns to its normal state.
The expression data sets of the SOS DNA repair system were obtained from Uri Alon Lab [28]. These data are expression kinetics of 8 genes namely uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA and polB. Four experiments were done for various UV light intensities (Exp. 1 and 2:5Jm−2, Exp. 3 and 4:20Jm−2). In each experiment, the above 8 genes were monitored at 50 instants which are evenly spaced by 6 minutes intervals.
The results corresponding to Experiment 1 is presented in Figure 6(B). Along with our result, we include the results from BANJO, TDARACNE and BNFinder in Figure 6(C)-(F) and the target network in 6(A). The results corresponding to the other experiments are available in Additional file 2, Additional file 3, Additional file 4, Additional file 5 and Additional file 6. From the results, we observe that our method correctly identifies lexA and recA as the ’hub’ genes for this network. Again, the exact ground truth for this network is not precisely known, and hence it is not possible to calculate the well-known performance measures. Instead, using the known interactions obtained from literature [13,14], an analysis of correct and incorrect predictions by our method is obtained and shown in Table 5. We observe that most of the interactions inferred by our proposed method are correct. It successfully infers lexA as the regulator of uvrA, uvrD, umuD and recA. Also, considering the indirect regulation of RecA through LexA, two more interactions, namely recA→uvrY and recA→polB can also be considered correct. In contrast, 3 of the 5 identified interactions by TDARACNE [7] are correct. Most of the interactions identified by BANJO and BNFinder+MDL are incorrect. BNFinder+BDe successfully identifies regulation of ruvA, polB and uvrA by lexA. In addition, the regulation of umuD by recA can also be considered correct. However, compared to these methods, our proposed method infers the highest number of correct predictions. Number of incorrect predictions is also very low for our method.
Additional file 2. Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BANJO.
Format: PDF Size: 23KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 3. Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BNFinder+BDe.
Format: PDF Size: 23KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 4. Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BNFinder+MDL.
Format: PDF Size: 22KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 5. Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using the proposed approach.
Format: PDF Size: 22KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 6. Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using TDARACNE.
Format: PDF Size: 22KB Download file
This file can be viewed with: Adobe Acrobat Reader
Table 5. Analysis of individual interactions inferred by proposed method
Network analysis of strongly cycling genes in cyanobacteria, Cyanothece sp. ATCC 51142
To study our approach on a large scale network, we use a network of a strain of cyanobacteria, namely Cyanothece sp. strain ATCC 51142 [29]. The microarray data corresponding to the genes were collected from two publicly available genome-wide microarray data sets of Cyanothece, performed in alternating light-dark (LD) cycles with samples collected every 4h over a 48h period: the first one starting with 1h into dark period followed by two DL cycles (DLDL), and the second one starting with two hours into light period, followed by one LD and one continuous LL cycle (LDLL) [30]. In total, there were 24 samples. Using a threshold filter with a 2-fold change cutoff, 730 genes were selected for the analysis. The genes are responsible for performing the major tasks of energy metabolism and respiration, nitrogen fixation, protein translation and folding, and photosynthesis, along with several other tasks. Result obtained using our method is shown in Figure 7. The degree distribution is shown in Figure 8. To compare our result with the other methods, we applied BANJO, BNFinder(BDe and MDL) and TDARACNE. The results of all the three except BNFinder(BDe) was not satisfactory. As a result, we compare our method only with BNFinder+BDe.
Figure 7. Network inferred by proposed approach.
Figure 8. Degree distribution analysis of the resultant network of cyanothece. We used a power law fit which yields R2 = 0.93. The result confirms that the inferred network is scale-free.
Similar to other large scale datasets (e.g. the Human HeLa cell data [31], the Arabidopsis L. Heynth dataset [32]), the microarray data set for cyanobacteria also has very few samples. Moreover, being not a well-studied organism, it requires caution in the interpretation of results. We note that GRN reconstruction studies of cyanobacteria reported earlier (e.g. [29,33,34]) commonly emphasise an evaluation criteria, namely “functional enrichment” analysis of sub-networks. Further, another common feature noted for genetic networks [35-37] is that transcriptional regulatory networks possess the scale free nature of the network topologyd. Since we have limited samples and also because the ground truth is unknown, we have carried out the evaluation of the inferred network using both: (i) statistical means i.e. GO functional enrichment analysis (using both p = 0.05 and p = 0.10), and (ii) the R2 measure of the power-law fit of the network to establish its scale-free nature.
The enrichment analysis was done by using gene ontology (GO) database (compiled using two sources: one from the Cyanobase database [38], and another from genome-wide amino sequence matching using the Blast2GO software suite [39]; the the compiled database is available in Additional file 7), where every GO terms appearing in each sub-network is assessed to find out whether a certain functional category is significantly over-represented in a certain sub-network/cluster, more than what would be expected by chance. The Cytoscape [40] plugin BiNGO [41] was used for GO functional category enrichment analysis. For BiNGO, we use the combined and filtered gene set as the reference set, the hypergeometric test as the test for functional over-representation, and False Discovery Rate (FDR) as the multiple hypothesis testing correction scheme.
First, we present the results corresponding to p = 0.05. The network obtained by BNFinder+BDe has 16 sub-networks each containing at least 3 genes. Of these, 6 sub-networks have significantly enriched functionalities (as determined by the GO functional enrichment test). Of the other 10, we compute the 3 most densely connected hubs for each sub-network, and in 2 of 10 such sub-networks, the hubs have defined significantly enriched functionalities. On the other hand, in our result, there are 14 sub-networks in total having at least 3 genes. Of these, 3 sub-networks have defined enriched functions (the largest sub-network has the role of nitrogen fixation according to the enrichment test). Of the other 11, we compute the 3 most densely connected hubs for each sub-network, and in 5 of the 11 such sub-networks, the hubs have defined significantly enriched functionalities.
The results corresponding to p = 0.10 show that for BNFinder+BDe, 7 sub-networks have enriched functionalities (as determined by the test). Of the other 9, we compute the 3 most densely connected hubs for each sub-network, and in 2 of the 9 such sub-networks, the hubs have defined enriched functionalities. On the contrary, the result using our approach has 5 sub-networks with defined significantly enriched functions (the largest sub-network has the role of nitrogen fixation, similar to the p = 0.05 case). Of the other 9, we compute the 3 most densely connected hubs for each sub-network, and in 6 of the 9 such sub-networks, the hubs have defined significantly enriched functionalities.
Additional file 7. The compiled gene ontology annotation database used for the functional enrichment analysis.
Format: TXT Size: 654KB Download file
We also test the networks to assess whether they are scale free, using a power law fit. The R2 value of the fit corresponding to our network is 0.93, which is a better fit compared to BNFinder+BDe (0.62).









Figure 3.
Figure 5.



Figure 9.
Figure 10.

























Figure 11.

