Abstract
Background
A knowledgebased network, which is constructed by extracting as many relationships identified by experimental studies as possible and then superimposing them, is one of the promising approaches to investigate the associations between biological molecules. However, the molecular relationships change dynamically, depending on the conditions in a living cell, which suggests implicitly that all of the relationships in the knowledgebased network do not always exist. Here, we propose a novel method to estimate the consistency of a given network with the measured data: i) the network is quantified into a loglikelihood from the measured data, based on the Gaussian network, and ii) the probability of the likelihood corresponding to the measured data, named the graph consistency probability (GCP), is estimated based on the generalized extreme value distribution.
Results
The plausibility and the performance of the present procedure are illustrated by various graphs with simulated data, and with two types of actual gene regulatory networks in Escherichia coli: the SOS DNA repair system with the corresponding data measured by fluorescence, and a set of 29 networks with data measured under anaerobic conditions by microarray. In the simulation study, the procedure for estimating GCP is illustrated by a simple network, and the robustness of the method is scrutinized in terms of various aspects: dimensions of sampling data, parameters in the simulation study, magnitudes of data noise, and variations of network structures.
In the actual networks, the former example revealed that our method operates well for an actual network with a size similar to those of the simulated networks, and the latter example illustrated that our method can select the activated network candidates consistent with the actual data measured under specific conditions, among the many network candidates.
Conclusion
The present method shows the possibility of bridging between the static network from the literature and the corresponding measurements, and thus will shed light on the network structure variations in terms of the changes in molecular interaction mechanisms that occur in response to the environment in a living cell.
Background
The knowledgebased approach to construct biological network models is recognized as one of the most promising advances in computational biology [1]. In this approach, the causal relations between biological molecules are described as a directed graph, based on the interaction information extracted from a large number of previous reports, in a manual or automatic manner [2,3]. Since each relation has been identified by experimental studies, the existence of edges in the network model is supported by strong evidence. Due to the high reliability of each relation, many network models, even those with large, complex structures, have been constructed for various biological phenomena by a knowledgebased approach [46]. Note that a network generated by a knowledgebased approach is a mixture of molecular relationships identified by experimental studies under different conditions. Indeed, it is well known that the relationships between the molecules in a living cell change dynamically, depending on the cellular environment. Fortunately, an abundance of such information about molecular interactions under different conditions has been obtained by measuring them on a genomic scale, due to recent advances in experimental techniques, and the information about the interactions is available at various web sites [7]. Thus, we can evaluate the consistency of the knowledgebased network structure by the available information about the data measured under the different conditions. Although the inference of static network structures from the data has been intensively studied by various approaches, such as the Bayesian network [8], the dynamic Bayesian network [9], the Boolean network [10], and the graphical Gaussian model [11], the consistency evaluation will be useful to trace the dynamic network structure variations reflecting the molecular relationships that change coordinately in response to the cellular environment.
The consistency evaluation between the network structure and the measured data is well known in statistics as the test for causal hypotheses by using the measured data. The origin of the test for causal hypotheses is attributed to path analysis [12]. Unfortunately, the importance of this cornerstone research was not recognized for a long time, but the natural extension of path analysis has been established as the wellknown structural equation model (SEM) [13]. Indeed, the SEM has been utilized recently in various fields, in accordance with increased computer performance. However, the SEM without any latent variables, which is a natural assumption for its application to biological networks, sometimes has difficulties in the numerical calculation of the maximum likelihood for the observed data. To overcome the problem with this calculation, the dsep test [14] has been developed, based on the concept of dseparation in a directed acyclic graph (DAG) [15]. Note that the graph consistency with the data in the dsep test is considered by focusing on the absence of edges in the graph [16,17].
Recently, linear regression was applied to reconcile the gene regulatory network with the corresponding data [18]. This application is based on the concept that the entire network of gene regulation can be divided into a few network motifs, with a twolayer relationship between the transcription factors and their regulated genes [19]. Indeed, the division of the entire network into a small and simple network enables us to utilize the standard statistical tests in linear regression for the consistency of the gene relationships with the measured data. Unfortunately, the linear regression is limited to the twolayer relationships, and subsequently, its application is constrained to the simple structures of gene regulatory networks.
In this study, we propose a new method for estimating the consistency of a causal graph with the measured data, in combination with the Gaussian network (GN) [20] and the generalized extreme value distribution (GEV) [21]. The present study partly exploits the previous study [18] about the consistency between the network motif with twolayer gene relationships and the measured data. However, instead of the network motifs with simple structures, here we consider rationally complicated network structures based on the graphical model, and its consistency with the data is expressed as a probability, referred to as the graph consistency probability (GCP). The performance of the present method is examined by artificial networks with various structures and actual data measured in Escherichia coli. Furthermore, the merits and pitfalls of our method are discussed in terms of its possible utility with various actual issues and methodologies, in comparison with previous methods.
Results and discussion
Calculation of Graph Consistency Probability (GCP)
We will illustrate the procedure for calculating the graph consistency probability (GCP) with a simple graph, G_{0}, which is a directed acyclic graph with ten nodes and nine edges, and with the corresponding data that are artificially generated on the assumption that the data noise follows the normal distribution. The procedure for calculating the GCP is composed of five steps, as schematically shown in Fig. 1 (see details of the mathematical description in the Materials and Methods and the additional file 1: Details of the schematic description of the procedure).
Additional file 1. Details of the schematic description of the procedure. The graph factorization at Step 1 and the four GEVdiagnostic plots of the probability plot, the quantile plot, the returnlevel curve, and the density plot at Step 4 (PDF file) are shown.
Format: PDF Size: 45KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 1. Flow of the calculation of graph consistency probability. The calculation is composed of five steps (see details in the text).
At the first step, the given graph, G_{0}, is recursively factorized into the subgraphs, according to the parentdescent relationships in DAG [15]. By recursive factorization, G_{0 }is rationally divided into 10 subgraphs, based on the parentdescent relationships in the directed graph.
At the second step, we calculated the loglikelihood of the given graph, l(G_{0}), with the corresponding data by a Gaussian network model [20]. To calculate l(G_{0}), here, we generated the data {X_{kl}} for each node with 50 sample dimensions, i.e., for k = 1,2,...,10 and l = 1,2,...,50, instead of the actual data, by the structural equations (see details in Methods). The l(G_{0}) of the given graph was then calculated to 31.14. We will estimate the probability of l(G_{0}), GCP, by the following three steps.
At the third step, we generated the graphs based on the given graph, and then calculated the loglikelihoods of the generated graphs according to the two preceding steps.
(1) We generated 50 random graph sets, {G_{i}}, to form a data set, in which each graph has the same number of nodes and edges, but with different connections from those of G_{0}.
(2) 50 corresponding loglikelihoods of {G_{i}} were calculated according to the first and second steps. Among the 50 loglikelihoods, the maximum of the loglikelihood, l(G_{max}), is selected.
(3) The above procedure is iterated 1000 times to finally obtain 1000 values of l(G_{max}). In this step, the dimensions of the sampling data, the number of graphs in one set, and that of the iterations to select l(G_{max}) are changeable parameters, and the robustness of our method with them will be evaluated in the following sections.
At the fourth step, we fit the loglikelihoods calculated in the third step to the GEV model. The maximization of the GEV loglikelihood leads to the following estimate:
for which the GEV loglikelihood is 4063.59. Although the maximum likelihood estimate for is negative, corresponding to a bounded distribution, the ξ value larger than 0.5 indicates that the maximum likelihood functioned well for the estimation [21,22]. Furthermore, the goodness of fit can be visually diagnosed, using the three diagnostic plots for assessing the accuracy of the GEV model, fitted to the 1000 loglikelihoods data by the three parameterizations. Neither the probability plot nor the quantile plot gave any cause to doubt the validity of the fitted model: each set of plotted points was nearly linear. The returnlevel curve asymptotes to a finite level as a consequence of the negative estimate , and also provides a satisfactory representation of the empirical estimates. In addition, the corresponding density estimate seems consistent with the histogram of the data. Consequently, the four diagnostic plots lend support to the fitted GEV model.
At the final step, we estimated the GCP of the loglikelihood of the given graph based on the fitted GEV distribution. According to the GEV distribution, the GCP corresponding to l(G_{0}) (= 31.14) was calculated to be less than 10^{10}. As a result, it is natural that the examined given graph was highly consistent with the data generated according to the graph structure.
Robustness of the Present Method
The high performance of the present method described in the preceding subsection depends on a few parameters. By using the same network structure as in Fig. 1, we tested the robustness of the present method in terms of the dimensions of the analyzed data, the two parameters in generating artificial graphs for GEV, and the degree of noise in the data. Furthermore, the robustness of the network structure variation is tested by using the typical network structure in biological interactions in the following four subsections.
Robustness in Terms of the Dimensions of the Analyzed Data
We test the robustness of our method in terms of the number of data samples for one variable (data dimension) that is smaller than the data dimension ({X_{i}} for i = 1,2,...,50) in Fig. 1. This is because the experimental conditions are frequently limited, due to the technical difficulty of performing experiments for different growth conditions. Thus, small data dimensions are expected in the actual data.
We performed the same estimation of GCP as that in Fig. 1, by using the data with 15 and 30 dimensions, and in both cases, the present method operated well. The GEV fit well to the data: the estimated was larger than 0.5: the estimated values are 0.1132 for the 15dimension data and 0.1332 for the 30dimension data. In addition, the four GEVdiagnostic plots for assessing the accuracy of the GEV model show the validity of the fitted model in each case (see additional file 2: Robustness in terms of data dimensions). By the estimated GEV distributions, the GCPs in the two cases were less than 10^{4 }and 10^{8}, respectively. The probability for the 30dimension data was smaller than that for the 15dimension data. Considering that the probability was 10^{10 }in the 50dimension data in the preceding section, this indicates that the resolution degree about the consistency is higher with larger dimensions.
Additional file 2. Robustness in terms of data dimensions. Four GEVdiagnostic plots of the probability plot, the quantile plot, the returnlevel curve, and the density plot (PDF file) are shown for the 15 and 30dimension data, respectively.
Format: PDF Size: 104KB Download file
This file can be viewed with: Adobe Acrobat Reader
Robustness in Terms of Parameters in Generating the GEV Model
The GCP depends on two parameters in the graph generation for GEV: the number of graphs for selecting the maximum of the likelihood in one set of the generated graphs, l, and the number of iterations for sampling the maximum values from each set of generated graphs, n. In Fig. 1, l and n were set to 50 and 1000, and a total of 50,000 graphs were generated for GEV. Here, we examined the fitness of the loglikelihoods to GEV based on the graph shown in Fig. 1, with nine pairs of l and n: l was set to 25, 50 and 100, and n was set to 100, 500, and 1000. The total numbers of graphs for GEV ranged from 2500 to 100000, and all of the examinations with the above parameter pairs are provided in an additional file (see additional file 3: Robustness in terms of the parameters). Here, we focused on the case when fewer graphs are generated than the number in the default case. This is because a small number of generated graphs in each set and iterations may tend to violate the distribution of GEV, due to some biases in the graph generation.
Additional file 3. Robustness in terms of the parameters. Four GEV plots (PDF file) are shown when two parameters were set as follows: l was set to 25, 50 and 100, and n was set to 100, 500, and 1000.
Format: PDF Size: 227KB Download file
This file can be viewed with: Adobe Acrobat Reader
In the comparison of (l, n) = (50, 100) and (25, 1000) with (50, 1000) in Fig. 1, the loglikelihoods calculated in the two cases were fitted to the GEV model. Indeed, the two values were larger than 0.5: the estimated values were 0.1545 in (l, n) = (50, 100) and 0.1670 in (l, n) = (25, 1000), respectively. The two sets of diagnostic plots for assessing the accuracy also showed the validity of the fitted model in each case (see additional file 3). A closer inspection revealed that the value in the (50, 100) case was slightly less than that in the (25, 1000) case. This indicated that the number of graphs in each set, l, is more sensitive to the goodness of fitness than the number of iterations, n, regardless of the total number of generated graphs. At any rate, the present method operates well, even in the case of a relatively small number of generated graphs.
In general, the optimized values of l and n depend on the size of the examined graph, and may be expressed as the fraction to the total number of possible graphs with the same numbers of nodes and edges as those of the examined graph. Although the number of possible graphs composed of arbitrary numbers of labeled nodes and edges can be estimated asymptotically under some constraints on the edge connectivity [23], unfortunately, the total number of possible graphs, in which all of the nodes are connected to form one graph, is not still obtained. In the present stage, we should heuristically define l and n by diagnosing the goodness of fit to the GEV model.
Robustness in Terms of the Magnitude of Noise in the Analyzed Data
We estimated the GCP in various noise ranges. For this purpose, the value of the standard deviation in the structural equations for data generation (σ = 0.1 in Fig. 1) was changed to three values (σ = 0.5, 1.0, and 2.0). By the same procedure as that in Fig. 1, we calculated 100 GCPs for the three ranges of standard deviations. Finally, the probabilities of the generated graphs were calculated.
The histograms of the GCPs in the three ranges of standard deviations are shown in Fig. 2. In this figure, 100 GCPs were plotted against the number of connections in the generated graphs that were different from those in the examined graph in the respective cases of standard deviations. In the cases of the two small standard deviations (σ = 0.5 and 1.0), less than 10^{10 }of the GCPs emerged most frequently, but the most frequent GCP was found at 10^{4 }in the case of the largest standard deviation (σ = 2.0). In the former two cases, the largest GCP was 10^{6 }in σ = 0.5 and 10^{3 }in σ = 1.0. Although some exceptional GCPs were also found, the present method operates well within the range of the two noise levels. In contrast, the last case shows the limitation of our method, in terms of the noise of the measured data. Careful preprocessing of the measured data may be required to apply our method to actual data. Note that the noise is amplified as the number of parents grows in the present simulation. For example, the standard deviation is (α_{1}^{2 }+ α_{2}^{2 }+ 1)σ, when a descent has two parents and α_{i }is the path coefficient between the descent and the ith parent. At any rate, the limitation of the present method in terms of the data noise can be examined by describing the histogram of the GCP, and was estimated between 1.0 and 2.0 for the graph in Fig. 1. In addition, we assumed that the distribution of the data noise also follows uniform and gamma distributions, and obtained similar results in terms of the robustness about the data noise (see additional file 4: Robustness in terms of the noise according to the gamma and uniform distributions).
Additional file 4. Robustness in terms of the noise according to the gamma and uniform distributions. GCP(=P(l(G_{0}))) for the graph in Fig. 1 was calculated with simulated data according to the gamma and uniform distributions, and the frequencies of GCPs are plotted against the probability degree. The horizontal axis indicates the log(GCP) value, and the vertical axis is its frequency: blackcolored bar, λ = 1 in gamma distribution; graycolored bar, λ = 3; striped bar, λ = 5; and boxed bar, between 0 and 1 in uniform distribution.
Format: PDF Size: 15KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 2. Robustness in terms of the noise in measured data. GCP(=P(l(G_{0}))) for the graph in Fig. 1 was calculated with simulated data with distinct standard deviations, and the frequencies of GCPs are plotted against the probability degree. The horizontal axis indicates the log(GCP) value, and the vertical axis is its frequency: blackcolored bar, σ = 0.5; graycolored bar, σ = 1.0; and boxed bar, σ = 2.0.
Robustness Regarding the Variation of the Network Structure
We applied the present method to the three network structures shown in Fig. 3. The three networks are analogous to the typical structures of biological networks; the first is analogous to part of a chain reaction in a metabolic pathway, the second represents the simple structure of a gene regulatory network, and the third depicts a cascade in a signal transduction pathway. According to the connectivity in the network, the data were generated with the corresponding structural equations, and the present method was applied to estimate the graph consistency with the generated data.
Figure 3. Robustness regarding graph structure variation. The calculation is composed of five steps (see details in the text). Three networks with typical structures in biology are examined in (A), (B), and (C). To generate the simulation data by structural equations, we set the standard deviation to 0.1 in all three graphs, and the path coefficients between the variables are as follows: (A) α_{1,2 }= 0.6, α_{2,3 }= 0.3, α_{3,4 }= 0.1, α_{4,5 }= 0.7, α_{5,6 }= 0.8, α_{6,7 }= 0.9, α_{7,8 }= 0.2, α_{8,9 }= 0.5, and α_{9,10 }= 0.4; (B) α_{1,2 }= 0.1, α_{1,3 }= 0.2, α_{1,4 }= 0.3, α_{1,5 }= 0.4, α_{1,6 }= 0.5, α_{1,7 }= 0.6, α_{1,8 }= 0.7, α_{1,9 }= 0.8, and α_{1,10 }= 0.9; and (C) α_{1,2 }= 0.5, α_{1,3 }= 0.7, α_{2,4 }= 0.4, α_{2,5 }= 0.8, α_{3,6 }= 0.6, α_{3,7 }= 0.3, α_{4,8 }= 0.2, α_{5,9 }= 0.1, α_{6,9 }= 1.0, and α_{7,10 }= 0.9. The value of loglikelihood and the parameters of GEV distribution in the respective networks are as follows: (A) l(G_{0}) = 163.4805, μ = 89.8375, σ = 12.9694, and ξ = 0.1743; (B) l(G_{0}) = 61.6096, μ = 3.0217, σ = 12.5220, and ξ = 0.1314; and (C) l(G_{0}) = 124.8894, μ = 46.9002, σ = 12.1395, and ξ = 0.1406. See also the corresponding GEV plots at additional file 5: Robustness regarding the network structure variation.
The present method operated well in all of the network structures. Indeed, the loglikelihoods in the three networks fit well to the GEV (see statistics in the legend of Fig. 3, and additional file 5: Robustness regarding the network structure variation). In addition, the GCPs were very small: The GCPs of the three networks were less than 10^{11}, 10^{4}, and 10^{7}, respectively. Interestingly, the magnitudes of the GCPs may be related to the network structures. The GCP in Fig. 3B is relatively larger than the GCPs in Figs. 3A and 3C. This is because the present path coefficients between the 10 nodes were set at different values, but in the same order of digits. This indicates that the most similar data for respective variables were generated in Fig. 3B, and caused pseudocorrelations between the variables with no edges in the network in Fig. 3B. Although the performance for estimating the graph consistency may slightly decrease, depending on the number of two layer relationships in the examined data, this simulation shows that the present method can be applied to various structures of networks.
Additional file 5. Robustness regarding the network structure variation. GEV plots (PDF file) are shown for the three types of network structures in Fig. 3.
Format: PDF Size: 27KB Download file
This file can be viewed with: Adobe Acrobat Reader
Examinations of Actual Graphs
We examined the performance of the present method with two sets of actual networks in Escherichia coli and the corresponding actual measured data. One set is a regulatory network for the SOS response system with the expression degrees of the constituent genes measured by fluorescence [24], and the other is 29 networks classified by gene functions, with the expression degrees under anaerobic conditions measured by microarray [25]. The former examination is a verification of the present method for an actual network with a size similar to the networks shown in Fig. 1, and the latter is a demonstration of a highthroughput search of network candidates, consistent with the data measured under particular conditions.
Verification for a Simple Network
The gene network in the SOS system is schematically shown in Fig. 4A. The SOS DNA repair system in Escherichia coli is a wellcharacterized transcriptional network [26,27]. One of the SOS proteins, RecA, acts as a sensor of DNA damage, and a master repressor (LexA) binds sites in the promoter regions of these operons. The corresponding data to the constituent molecules in the network are the transcriptional activity of genes measured with realtime monitoring by means of lowcopy reporter plasmids, in which a promoter controls green fluorescent protein [24].
Figure 4. Evaluation of the transcriptional network of the SOS DNA repair system in Escherichia coli. The network is schematically shown in (A), and the corresponding GEV plots and the boxplot are also shown in (B). The value of loglikelihood between the examined network and the measured data is 1168.453, and the parameters of GEV distribution are as follows: μ, 1179.079, σ, 4.957; ξ, 0.236. The data for the promoter activities of eight genes in the SOS system are cited from [24].
The GEV plots with the likelihood values and the statistics are shown in Fig. 4B. The value of was larger than 0.5, and the GEV plots were quite similar to those in Fig. 1. Indeed, each set of plotted points was nearly linear, and the returnlevel curve asymptotes to a finite level. In addition, the corresponding density estimate seems consistent with the histogram of the data. Consequently, the goodness of fitness in the actually measured data lends support to the GEV model.
The GCP of the SOS network with the corresponding measured data was estimated as 0.049, and the network structure was estimated to be consistent with the data measured from the examined network. However, the GCP was large in comparison with the GCPs in the simulation studies in the preceding sections. This is partly because the cyclic relationship of RecA is neglected in the examined network, and partly because most of the relationships in the examined network are composed of 2layer relationships, due to the production of similar degrees of expression data, as in the situation in Fig. 3B. At any rate, the performance of the present method was verified by a wellknown network, with a size similar to that in the simulation, and with the corresponding data measured by an experimental study.
Demonstration for an Actual Network Set
We further tested the performance of our method for selecting the networks consistent with the data measured under specific conditions from many network candidates. Here, we arranged 29 regulatory networks in Escherichia coli and the corresponding gene expression profiles measured under anaerobic conditions (for details about the examined network reconstruction and the profile data, see Methods).
Table 1 shows the analyzed networks and the corresponding graph consistency probabilities of the 29 networks (see additional file 6: the 29 network structures analyzed in the present study). When we set the significance probability to 5%, only two networks (Nos. 14 and 28) in Fig. 5, which are composed of regulatory gene pairs related to carbon compounds and anaerobic respiration, were selected among the 29 networks. As seen in the figure, the network structures are quite different. The network related to the carbon compounds in Fig. 5A shows a relatively simple structure that is a twolayer relationship between one TF and its nine regulated genes. In contrast, the other network related to anaerobic respiration in Fig. 5B has a highly complicated form, with 89 nodes and 161 edges. The selection of the two networks can be interpreted in terms of biological functions, as described below.
Additional file 6. The 29 network structures analyzed in the present study. The 29 regulatory networks of Escherichia coli with more than 8 edges (PDF file) are shown, as constructed from the information on the regulatory relationships between two genes in EcoCyc [44].
Format: PDF Size: 609KB Download file
This file can be viewed with: Adobe Acrobat Reader
Table 1. Consistency of the twentynine networks with expression profiles measured under anaerobic conditions in Escherichia coli
Figure 5. Networks with 5% significance probability in graph consistency search. By corresponding between the regulatory relationships and the gene functions in EcoCyc [44], 29 regulatory networks were reconstructed, and their consistency with the expression profiles measured under anaerobic conditions (accession number GSE1107 in NCBI Gene Expression Omnibus (GEO); http://www.ncbi.nlm.nih.gov/geo/ webcite) [25] was examined. Among the 29 regulatory networks, two networks showed 5% significance probability: the network related with carbon compounds (EcoCyc ID: C9449_11) (A) and that with anaerobic respiration (EcoCyc ID: C9490_1) (B). The details of the network structures of the 29 regulatory networks are shown in the additional file 6: the 29 network structures analyzed in the present study.
The first network (No. 14) is composed of malT and its regulated genes. malT is involved in maltose transport [28]. Besides the malTregulating network, four networks related to carbon compounds (Nos. 3, 7, 12 and 15) were also included in the examined networks, but they showed no significant probability: the TFs in each network are araC, galR, gutM, and exuR, and they regulate products related to the transport of arabinose, galactose, glucitol, and hexuronate, respectively [2932]. Among the four networks, the galR and exuRregulating networks (Nos. 12 and 15) are coordinated in terms of their products: the exuR regulatory gene product controls the expression of the galacturonate pathway operons (exuT, uxaC, uxuA, and uxaB) [33]. Interestingly, galactose was the least efficiently utilized under anaeorbic conditions, among glucose, lactose, galactose, maltose, maltotriose, and maltohexaose [34]. This fact may be one of the reasons why our method revealed the consistency of network No. 14 with the data, and the lack of consistency of two of the networks, Nos. 12 and 15. In the remaining two networks, Nos. 3 and 7, there are no reasons for their lack of consistency with the present data. Thus, the detection of the network related to maltose metabolism is reasonable, at least in comparison with the galactose and hexuronaterelated networks.
As for the second network (No. 28), the biological function is defined as anaerobic respiration, and its detection is clearly reasonable. The gene encoding the transcription factor in the network is fnr, one of the seven global regulators in E. coli [35], and the modular controlled by its product, Fnr, encodes proteins involved in cellular adaptation to growth in anoxic environments [3638]. Since the network is related to adaptations to environmental changes, many genes are comprehensively associated with each other, and the network structure is complex, as seen in Fig 5B. Thus, the consistency of the fnrregulating network with the present data demonstrates the validity of the present method for searching a largesize network consistent with data measured under particular cellular conditions.
It may seem overly strict to estimate the network consistency by the present method. Some other networks besides the two detected networks might be operating under anaerobic conditions. However, the strictness of the consistency estimation is one of the prerequisites for exploring unknown networks. The falsely detected networks should be excluded as much as possible, and the detection of a few definite network candidates may serve as the initial step for investigating the unknown networks that are unexpected, in terms of biological knowledge. In addition, the strictness for consistency estimation is easily modified by setting the selection degree with a significance probability. As a result, the present method reveals the strictly consistent networks with the expression profiles measured under specific conditions, and will be useful to find the activated network candidates among many given networks.
Merits and Pitfalls of the Present Method
The present method successfully evaluates the consistency of a network with the artificial and actual data, which is expressed as a probability, GCP. The GCP of each known network is estimated from one set of data in which the constituent molecules of the network were measured under one particular condition. Although a large amount of noise prevents a confident estimation of the GCP, the present method is robust in terms of the data sampling dimensions, the parameters in the method, and the network structure variation. The plausibility of the structure variation and scale is illustrated by the detection of actual networks for the simple network of the SOS response and the large and complicated network for anaerobic respiration. Thus, the present method is feasible to evaluate the consistency of the networks with a set of data measured under particular conditions.
The present method may be further applied to various analyses of biological issues. One example is a simple extension of the demonstration shown in the preceding section, as follows. Assume that we know more than two distinctive cell stages, and that we can measure the data of the constituent molecules in different stages. Then, we evaluate the consistency of a set of known networks with the respective data. By this evaluation, we may detect the activated networks, among the known networks that are specific to the respective cell stages. For example, the present method may address the problem of which known networks are activated in progressive diseases and in cell differentiation processes. Thus, the present method will be useful to investigate the network variation in various cell stages responding to different environments. Another example is a utilization of the graphs generated in GEV modeling. Assume that we know a network model for a biological phenomenon, and that a few molecules have been newly detected, and are responsible for the phenomenon. Then, we face the issue of how the newly detected molecules should be connected to the previous network. In this situation, our method may present a solution. A new network is tentatively constructed, by connecting the newly detected molecules into the previous network with the full use of biological knowledge, and then the consistency of the tentative network is estimated with the data measured under the conditions where the relationship of the new molecules with the phenomenon was found. If the GCP shows the significance probability, then the network is a promising model for the phenomenon. If not, then we can list some network candidates with the significance probability that commonly share the structure of the previous network, among the generated networks for the GEV distribution. Note that the present method aims to evaluate the consistency between the known network structures and the measured data. Thus, the network inference without any given network structures is beyond the present study. At any rate, these two examples will be demonstrated by appropriate networks and data in the near future.
In terms of the methodology, the present method is a rational extension of the previous study based on linear regression [18], by the combination of the Gaussian network and the extreme value distribution. Indeed, the application range on the network structure is expanded, from simple networks with twolayer relationships to more complex networks with multiplelayer relationships. In addition, the present method is complementary with the dsep test; the graph consistency is estimated for the associations between variables (existence of edges in the graph) in our method, and in contrast, no associations between variables (no edges in the graph) are considered in the dsep test [14]. However, the dsep test failed to select the activated networks: when we set the significance probability to 5%, 27 networks among the 29 networks were consistent with the data measured under specific conditions, and only two networks were not (see additional file 7: dsep test and SEM for 29 network structures). Interestingly, SEM also failed (see also the additional file 7): 27 networks among the 29 networks were consistent with the data, and one of the two remaining networks could not be evaluated, due to a numerical calculation violation. Thus, our method may be appropriate for tightly estimating the graph consistency in comparison with the dsep test and SEM. Furthermore, our method differs from the dsep test in a strict sense. The present method is based on the generation of artificial graphs in the estimation of graph consistency with the measured data, while the dsep test is based on the direct hypothesis of a population distribution [14,16]. Thus, our method is an asymptotic approach, and is similar to various methods for model selection in network inference, such as various Bayesian network models [8,9]. Note that the present GCP is an occurrence probability, and definitely differs from the model selection procedure by using the scores that show a relative difference.
Additional file 7. SEM and dsep test for 29 network structures. The 29 regulatory networks of Escherichia coli were also tested by SEM and the dsep test.
Format: PDF Size: 21KB Download file
This file can be viewed with: Adobe Acrobat Reader
The consistency of a model with the observed data also reminds us of the identifiability problem in the compartmental models for tracer kinetics [39]. The identifiability problem addresses the issue of whether the unknown parameters can be determined uniquely or nonuniquely from the tracer data. Although a systematic algorithm for the identifiability problem was proposed regardless of the model structure [40], its application is limited to the ideal context of noisefree data. Recently, we have partially exploited the identifiability problem algorithm to treat data including noise [41]. Indeed, a network including a cyclic relationship has been examined to estimate the consistency with noisy data. Although this method has a limitation of the network size to smaller than 10 nodes and 15 edges, another method with a symbolic approach may partly compensate for the statistical approach presented here for the limitation of the network structure.
Conclusion
We have proposed a novel method to estimate the consistency of a given network with the measured data as a probability (GCP: graph consistency probability), based on the Gaussian network and the generalized extreme value distribution. The performance of the present method was validated by application to artificial graphs with simulated data and actual graphs with measured data from Escherichia coli. The plausible evaluation of the consistency between the network structures and the corresponding measured data promises to help reveal the network structure variations depending on the environments in a living cell, as well as to form a bridge between the static network from the literature and the corresponding measurements.
Methods
Data Generation for Simulation
We generates the numerical data according to a standard statistical procedure [16]. The data for 10 nodes with 50 sampling dimensions, {X_{kl}, for k = 1,2,...10, and l = 1,2,...50}, are generated by using the following structural equations that correspond to the parentdescent relationships in Fig. 1:
where N(0, σ) means a value that follows a normal distribution with a zero mean and a standard deviation of σ, and α_{i,j }is a path coefficient relating variables i and j. Here, we set σ to 0.1, and the following parameterization was used: α_{i,j }= 0.5. Thus, we obtain a graph and examine the corresponding data to estimate their consistency with the graph. Note that the above data generated by linear equations may not precisely reflect the measured data underlying various nonlinear relationships. Here, we adopted the linear relationships as the first approximation to test the performance of the present method. The performance for the complex relationships will be tested by actually measured data.
Recursive Factorization of Causal Graph
Suppose a causal graph is a directed acyclic graph (DAG), G(V_{i}, E_{j}), where V_{i }is a vertex (i = 1, 2, ..., n_{v}) and E_{j }is an edge (j = 1, 2, ..., n_{e}) in the graph. The DAG can be factorized into subgraphs according to the parentdescent relationships [15]. Then, the joint density function f(X_{i}), corresponding to V_{i }for the graph G, can be factorized into the conditional density functions according to the graph, as follows:
where pa{X_{i}} is the set of variables corresponding to the parents of V_{i }in the graph.
Gaussian Network (GN)
The causal graph meets the measured data based on the Gaussian network model [20]. On the assumption that the probability variable X_{i }is subjected to a multiple normal distribution, each conditional function in equation (2) is obtained by linear regression for the measured data of the constituent nodes (molecules) measured at m points, i.e.,
where x_{ik }is the measured value of X_{i}, at the kth point, and n_{i }is the number of variables corresponding to the parents of V_{i}. Thus, the joint density function in equation (2) is expressed by the regression for the measured data in equation (3). Finally, the logarithm of the likelihood of the equation (3) is calculated for the measured data as
Thus, the GN allows us to quantify a given network into the corresponding numerical value from the measured data, according to the network form. Note that the calculation of likelihood itself requires no assumptions on the relationships between variables. Indeed, the likelihood can be calculated in the case of nonlinear regressions, such as spline regression.
Generalized Extreme Value Distribution (GEV)
Next, we estimate the probability of l(G_{0}) by using the generalized extreme value distribution [21]. First, the loglikelihoods of an ensemble of n networks generated according to the GN are calculated, and then the maximum loglikelihood is selected from them. The above procedure is iterated l times, i.e.,
The distribution of the maximum values by l iterations is expected asymptotically to be a generalized extreme value distribution, i.e.,
defined on the set,
where the parameters satisfy ∞ <μ < ∞, σ > 0, and ∞ <ξ < ∞. The model has three parameters: μ, σ, and ξ are a location parameter, a scale parameter, and a shape parameter, respectively. Maximization of the loglikelihood of equation (6) with respect to the parameter vector (μ, σ, ξ) leads to the maximum likelihood estimate for any given dataset, using standard numerical optimization. In the present study, the R extRemes package [42] was used to fit the data to the GEV distribution.
Note that the standard likelihood ratio test [43] cannot be applied straightforwardly to a Gaussian network in the present case. This is because the density function of the population and the degrees of freedom in the likelihood ratio test are unclear when maximizing the likelihoods of the generated graphs. In the present method, the GEV distribution of the maximum values of likelihoods in the blocks of generated graphs is adopted analogically, instead of the maximum likelihood in the likelihood ratio test. The utilization of the GEV distribution requires the model fitting to the data, but allows us to set the significance probability arbitrarily, as usual in statistical tests.
Graph Consistency Probability (GCP)
If the goodness of fitness of the maximum values from the generated graphs is ascertained, then the occurrence probability of a given graph (GCP: graph consistency probability) can be directly estimated by corresponding the l(G_{0}) in equation (1) to the probability density function of GEV obtained in (6), i.e.
Thus, the present method expresses the consistency in the form of a probability. The probability examines the possibility of whether the tested known networks are activated in the environment where the data were measured. If the probability is small, which corresponds to a large likelihood value, then the data are generated, according to the molecular relationships in the network.
Actual Networks and Data for HighThroughput Consistency Search
We first classified a transcription factor (TF) and its regulated genes compiled in EcoCyc [44], according to the classification scheme of gene functions http://biocyc.org/ECOLI/classtree?object=Genes webcite. Using the gene sets of the TF and the regulated genes in each function, we next reconstructed the networks: respective networks were reconstructed, so as to form the network structure with as many connections between the genes as possible. Thus, we obtained 130 regulatory networks that are characterized by biological functions. Since some networks were characterized by more than two functions, the 130 regulatory networks were redundant in terms of the connectivity and the constituent genes. Then, 29 networks were kept, after excluding the redundancy and the small networks with less than 8 edges (see Table 1 and additional file 6: 29 network structures analyzed in the present study).
The consistency of each of the 29 networks was estimated with one set of expression profiles measured under 22 different anaerobic conditions (GSE1107) [25] cited from NCBI GEO [45]. The expression profiles were standardized by the average and the standard deviation in each condition, as preprocessing of the measured data. In a few nodes (genes) in the original network constructed from the information in EcoCyc, the corresponding expression profiles were not found in the analyzed data (GSE1107), and the corresponding parts in the network were excluded.
Authors' contributions
SS carried out the implementation and the calculations, and participated in the design of the study. SA participated in the design of the study, and helped to draft the manuscript. KH conceived of the study, participated in its design and coordination, and drafted the manuscript. All authors read and approved of the final manuscript.
Acknowledgements
S.A. was supported by a GrantinAid for Scientific Research (grant 18681031), and K.H. was partly supported by a GrantinAid for Scientific Research on Priority Areas "Systems Genomics" (grants 18016008 and 20016028) and by a GrantinAid for Scientific Research (A) (grant 19201039), from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References

Yuryev A, Mulyukov Z, Kotelnikova E, Maslov S, Egorov S, Nikitin A, Daraselia N, Mazo I: Automatic pathway building in biological association networks.
BMC Bioinformatics 2006, 7:171. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse M, Paley S, Rhee SY, Shearer AG, Tissier C, Walk TC, Zhang P, Karp PD: The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases.
Nucl Acids Res 2008, 36:D623D631. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tschoeke SK, MillerGraziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF: Inflammation and Host Response to Injury Large Scale Collaborative Research Program. A NetworkBased Analysis of Systemic Inflammation in Humans.
Nature 2005, 437:10321037. PubMed Abstract  Publisher Full Text

Rudd MF, Webb EL, Matakidou A, Sellick GS, Williams RD, Bridle H, Eisen T, Houlston RS, GELCAPS Consortium: Variants in the GHIGF axis confer susceptibility to lung cancer.
Genome Res 2006, 16:693701. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, Silliman N, Szabo S, Dezso Z, Ustyanksky V, Nikolskaya T, Nikolsky Y, Karchin R, Wilson PA, Kaminker JS, Zhang Z, Croshaw R, Willis J, Dawson D, Shipitsin M, Willson JKV, Sukumar S, Polyak K, Park BH, Pethiyagoda CL, Krishna Pant PV, Ballinger DG, Sparks AB, Hartigan J, Smith DR, Suh E, Papadopoulos N, Buckhaults P, Markowitz SD, Parmigiani G, Kinzler KW, Velculescu VE, Vogelstein B: The Genomic Landscapes of Human Breast and Colorectal Cancers.
Science 2007, 318:11081113. PubMed Abstract  Publisher Full Text

Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data.

Ghahramani Z: Learning Dynamic Bayesian Networks.
Adaptive Processing of Sequences and Data Structures 1998, 168197.

Akutsu T, Miyano S, Kuhara S: Algorithms for inferring qualitative models of biological networks.

Toh H, Horimoto K: Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling.
Bioinformatics 2002, 18:287297. PubMed Abstract  Publisher Full Text

Joreskog KG: A general method for analysis of covariance structures.

Shipley B: A new inferential test for path models based on directed acyclic graphs.

Pearl J: Probabilistic Reasoning in Intelligent Systems. California, Kaufmann Morgan Publishers; 1988.

Shipley B: Cause and Correlation in Biology: A User's Guide to Path Analysis, Structural Equations, and Causal Inference. Oxford, Oxford University Press; 2000.

Bisits AM, Smith R, Mesiano S, Yeo G, Kwek K, MacIntyre D, Chan EC: Inflammatory aetiology of human myometrial activation tested using directed graphs.
PLoS Comput Biol 2005, 1:132136. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Herrgard MJ, Covert MW, Palsson BO: Reconciling gene expression data with known genomescale regulatory network structures.
Genome Research 2003, 13:24232434. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

ShenOrr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli.
Nat Genet 2002, 31:6468. PubMed Abstract  Publisher Full Text

Whittaker J: Graphical Models in Applied Multivariate Statistics. New York, John Wiley and Sons; 1990.

Coles S: An Introduction to Statistical Modeling of Extreme Values. London, SpringerVerlag; 2001.

Smith RL: Maximum likelihood estimation in a class of nonregular cases.

Bender EA, Canfield ER, McKay BD: The asymptotic number of labeled graphs with n vertices, q edges, and no isolated vertices.

Ronen M, Rosenberg R, Shraiman BI, Alon U: Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics.
Proc Natl Acad Sci 2002, 99:1055510560. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO: Integrating highthroughput and computational data elucidates bacterial networks.
Nature 2004, 429:9296. PubMed Abstract  Publisher Full Text

Kenyon CJ, Walker GC: DNAdamaging agents stimulate gene expression at specific loci in Escherichia coli.
Proc Natl Acad Sci 1980, 77:28192823. PubMed Abstract  PubMed Central Full Text

Little JW, Mounta DW: The SOS regulatory system of Escherichia coli.
Cell 1982, 29:1122. PubMed Abstract  Publisher Full Text

Chapon C: Expression of malT, the regulator gene of the maltose region in Escherichia coli, is limited both at transcription and translation.
EMBO J 1982, 1:369374. PubMed Abstract  PubMed Central Full Text

Lee NL, Gielow WO, Wallace RG: Mechanism of araC autoregulation and the domains of two overlapping promoters, Pc and PBAD, in the Larabinose regulatory region of Escherichia coli.
Proc Natl Acad Sci 1981, 78:752756. PubMed Abstract  PubMed Central Full Text

HugovieuxCottePattat N, RobertBaudouy J: Regulation and transcription direction of exuR, a selfregulated repressor in Escherichia coli K12.
J Mol Biol 1982, 156:221228. PubMed Abstract  Publisher Full Text

Yamada M, Saier MH: Positive and negative regulators for glucitol (gut) operon expression in Escherichia coli.
J Mol Biol 1988, 203:569583. PubMed Abstract  Publisher Full Text

Weickert MJ, Adhya S: Control of transcription of gal repressor and isorepressor genes in Escherichia coli.
J Bacteriol 1993, 175:251258. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Portalier RC, RobertBaudouy J, Stoeber F: Regulation of Echerichia coli K12 hexuronate system genes: exu regulon.
J Bacteriol 1980, 143:10951107. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Muir M, Williams L, Ferenci T: Influence of Transport Energization on the Growth Yield of Escherichia coli.
J Bacteriol 1985, 163:12371242. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

MartinezAntonio A, ColladoVides J: Identifying global regulators in transcriptional regulatory networks in bacteria.
Curr Opin Microbiol 2003, 6:482489. PubMed Abstract  Publisher Full Text

Lynch AS, Lin EC: Responses to molecular oxygen. In Escherichia coli and Salmonella typhimurium. In Cellular and Molecular Biology. 2nd edition. Washington DC; 1996:15261539.

Unden G, Schirawski J: The oxygenresponsive transcriptional regulator FNR of Escherichia coli: the search for signals and reactions.
Mol Microbiol 1997, 4:205210. PubMed Abstract

Unden G, Achebach S, Holighaus G, Tran HG, Wackwitz B, Zeuner Y: Control of FNR function of Escherichia coli by O_{2 }and reducing conditions.
J Mol Microbiol Biotechnol 2002, 4:263268. PubMed Abstract

Cobelli C, Foster D, Toffolo G: Tracer Kinetics in Biomedical Research: From Data to Model. New York, Kluwer Academic/Plenum Publishers; 2000.

Buchberger B: An Algorithmic Criterion for the Solvability of a System of Algebraic Equations. In Peer review in Gröbner Bases and Applications. Volume 251. Edited by Buchberger B, Winkler F. London, Mathematical Society Lecture Notes Series; 1998::535545.

Yoshida H, Nakagawa K, Anai H, Horimoto K: An algebraicnumeric algorithm for the model selection in kinetic networks.

Gilleland E, Katz RW: Analyzing seasonal to interannual extreme weather and climate variability with the extremes toolkit (extRemes).
18th Conference on Climate Variability and Change, 86th American Meteorological Society (AMS) Annual Meeting 2006, 215.

Lehmann EL: Testing Statistical Hypotheses. 2nd edition. New York, John Wiley and Sons; 1986.

Karp PD, Keseler IM, Shearer A, Latendresse M, Krummenacker M, Paley SM, Paulsen I, ColladoVides J, GamaCastro S, PeraltaGil M, SantosZavaleta A, PenalozaSpinola MI, BonavidesMartinez C, Ingraham J: Multidimensional annotation of the Escherichia coli K12 genome.
Nucl Acids Res 2007, 35:75777590. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: Mining tens of millions of expression profiles – database and tools update.
Nucleic Acid Res 2007, 35:D760D765. PubMed Abstract  Publisher Full Text  PubMed Central Full Text