Abstract
Background
Transcription factors regulate numerous cellular processes by controlling the rate of production of each gene. The regulatory relations are modeled using transcriptional regulatory networks. Recent studies have shown that such networks have an underlying hierarchical organization. We consider the problem of discovering the underlying hierarchy in transcriptional regulatory networks.
Results
We first transform this problem to a mixed integer programming problem. We then use existing tools to solve the resulting problem. For larger networks this strategy does not work due to rapid increase in running time and space usage. We use divide and conquer strategy for such networks. We use our method to analyze the transcriptional regulatory networks of E. coli, H. sapiens and S. cerevisiae.
Conclusions
Our experiments demonstrate that: (i) Our method gives statistically better results than three existing state of the art methods; (ii) Our method is robust against errors in the data and (iii) Our method’s performance is not affected by the different topologies in the data.
Background
Genes are the smallest functional units of an organism. They carry out vital functions in cells by interacting with each other and with other molecules. Biological networks model such interactions among genes. Using biological networks, researchers are able to take a holistic approach on the analysis of cellular functions. Such analysis has shown that biological networks have a number of global properties. One of these properties is their hierarchical organization. Hierarchical organization defines a partial ordering of the underlying genes. Recent studies have shown that directed interactions between transcription factors (TFs) in transcriptional regulatory networks (TRNs) impose a hierarchy on TRNs [15]. Analysis of the hierarchies of TRNs helps researchers better understand the flow of controlling signals through the transcription machinery [1,3]. TFs are special types of proteins that control the expression of other genes by binding to specific regions of the DNA [6]. Since each protein is coded by genes, we will use the terms transcription factor and gene to refer to TFs throughout this paper. One way to model hierarchy in TRNs is to assign levels to the interacting TFs. Figure 1 shows a sample network with its level assignments. In this paper, we consider the problem of finding the hierarchical organization of a TRN. The formal definition of our problem in this paper is as follows:
Figure 1. Hierarchical decomposition of a sample network with seven nodes denoted by n_{1}, n_{2}, ⋯, n_{7}to three levels. Directed edges represent the interactions. Dashed line splits the nodes into different levels. Each of the seven nodes are assigned one of the three existing. levels
Problem definition
Let us denote a TRN with . Here, N denotes the set of TFs and E denotes the set of directed interactions (i.e. edges) between the TFs in N. We refer to each TF in N as a node. We name the ith node in N as n_{i}. We represent an edge from the node n_{i}to n_{j} with (n_{i},n_{j}). Also, we denote the maximum possible number of levels in with M. We denote the hierarchy level assigned to a node n_{i }with t_{i} where t_{i} is an integer in {1,2,,3,…,M}. Let ϕ(n_{i},n_{j})→{0,1} be a binary function that describes the key topological relationship between n_{i} and n_{j}. (We elaborate on the ϕ function below.) We compute a penalty score p_{ij }for each pair of nodes as follows:
Our aim is to find an assignment of hierarchies to the nodes of N which minimizes .
In this paper, we use two different ϕ functions describing two key topological properties.
1. Adjacency. We define ϕ(n_{i},n_{j}) = 1 if (n_{i},n_{j}) ∈ E and ϕ(n_{i},n_{j}) = 0 otherwise.
2. Reachability. We define ϕ(n_{i},n_{j}) = 1 if there exists a path from n_{i }to n_{j }in by traversing the edges in E. We set ϕ(n_{i},n_{j}) = 0 otherwise.
Depending on the choice of the two ϕ functions, we name the resulting distance function adjacency distance or reachability distance respectively. In summary, using adjacency distance we aim to assign levels such that every TF is above the others it directly regulates, and below its every direct regulator. On the other hand, using reachability distance, we consider any direct or indirect regulation relation between two TFs when assigning levels.
There has been attempts to devise methods to reveal the underlying hierarchies of TRNs. Yu and Gerstein developed BFSlevel method to carry out this task [1]. This method uses breadth first search to assign hierarchies to TFs in a network. Although their method works for most networks, it fails to assign accurate levels for networks that contain cycles. Jothi et al. developed vertex sort method [2]. This method incorporates topological sort algorithm for addressing the network hierarchy problem. Vertex sort method does not have any restrictions on network motifs or cycles. However, rather than a certain hierarchy, it assigns a range of possible levels for the TFs. Hartsperger et al. devised an algorithm based on breadth first search method to solve the problem [3]. Their solution improves the BFSlevel method, and outputs a hierarchy for every network regardless of its topological features. However all these algorithms fail to minimize the number of edges that violate the hierarchy. We name such edges as conflicting edges. We will elaborate on the quality of results calculated by existing methods in Section Comparison with existing hierarchical decomposition methods.
Contributions
In this paper, we develop a novel approach to tackle the problem of discovering underlying network hierarchy. We first consider the topology of the network as a set of constraints. Then, we define two different objective functions using adjacency and reachability penalty functions. We define the minimization of total penalty as the objective of the problem. Using the above explanations, we transform this problem to a mixed integer programming problem(MIPP) [7]. We solve the resulting problem using existing MIPP solvers. We name our method HIerarchical DEcomposition of regulatory Networks (HIDEN). The main advantage of HIDEN is it introduces a sound mathematical formulation to the network hierarchy problem. Our formulation can work with any objective function that is a linear combination of the edges. One drawback of HIDEN is that it does not scale well to very large networks due to the growing size of the MIPP with increasing number of TFs. In order to address this issue we develop a divide and conquer approach.
The rest of this article is organized as follows: In Section Algorithm, we describe the methods we developed in this paper. In Section Results and discussion, we discuss the results of HIDEN in detail. Finally, in Section Conclusion, we briefly conclude the paper.
Method
In this section, we describe the hierarchical decomposition method we developed. Section HIDEN describes our method. Section Example demonstrates HIDEN on a simple example. Section Divide and Conquer method describes divide and conquer method we employ to scale HIDEN to larger networks.
HIDEN
HIDEN transforms the hierarchical network decomposition problem to a MIPP [7]. Given a TRN, HIDEN first constructs a set of linear constraints and a linear optimization function that collectively describe the penalty of the decomposition. Then it uses existing optimizer software to solve the resulting problem. Next, we will explain how we formulate the MIPP.
Let us denote the given network that will be decomposed with . Let us denote the nodes (i.e., TFs) of this network with n_{1}, n_{2}, …, n_{m}, where m is the total number of nodes of . HIDEN, allows the user to set a limit on the maximum number of allowed levels for hierarchical decomposition. Let us denote this number with M. Also, let us represent the level assigned to node n_{i} with t_{i} for all i∈{1,2,…,m}, (i.e. ∀i, t_{i }∈ {1,2,3,…,M}). We aim to find the level assignment T = {t_{1},t_{2},…,t_{m}} that minimizes the total penalty resulting from this level assignment. Therefore, the objective of our problem is the sum of individual penalty scores for each pair of nodes:
Next, we set a limit on the number of levels in the hierarchy. We do this by limiting the variables t_{i} as follows:
We, then, represent each p_{ij} as a linear constraint. Remember that p_{ij }is a binary function in the following form:
We can rewrite this function as follows:
Let us only consider the cases where ϕ(n_{i},n_{j}) = 1. We can represent the rest of this function using two linear inequalities. The following set of constraints represent the function p_{ij}:
In order to prove that these inequalities model the function p_{ij }correctly, we need to inspect all possible scenarios:
1. if t_{i }> t_{j }and p_{ij }= 0, then −1 ≥ t_{j }− t_{i }≥ −(M−1) and M × p_{ij }= 0. Therefore both (4) and (5) holds.
2. if t_{i }≤ t_{j }and p_{ij }= 0, then t_{j}−t_{i }≥ 0 and M × p_{ij }= 0. Therefore, (4) holds, however, (5) does not hold.
3. if t_{i }> t_{j }and p_{ij }= 1, then −1 ≥ t_{j}−t_{i }≥ −(M−1) and M × p_{ij }= M. Therefore the expression t_{j}−t_{i }−M×p_{ij }is smaller than or equal to −M−1. This implies that (4) does not hold but (5) holds.
4. if t_{i }≤ t_{j }and p_{ij }= 1, then (M−1) ≥ t_{j}−t_{i }≥ 0 and M × p_{ij }= M, therefore both (4) and (5) holds.
Therefore, enforcing the constraints (3), (4) and (5) implies:
This corresponds to the latter definition of the function p_{ij} except the condition of ϕ(n_{i},n_{j}) = 1. Since we choose the function ϕ prior to the construction of the MIPP, we know the value of ϕ(n_{i},n_{j}) for every pair (n_{i},n_{j}). Therefore, we can manually ensure this property, by only considering p_{ij} where ϕ(n_{i},n_{j}) = 1 and excluding p_{ij} completely from our calculations where ϕ(n_{i},n_{j}) = 0.
Based on the constraints above, the MIPP we construct to solve the network hierarchy problem is as follows:
Example
In this section, we show the application of HIDEN on the network in Figure 1. We will use adjacency penalty function in this example. Therefore:
Using this ϕ function, the objective of the MIPP is to minimize the following function:
Now we go over to the constraints. First set of constraints limit t_{i}:
Then, we write the remaining functions as follows:
In the resulting problem, M is left as a user defined parameter. When we run the above problem with M = 4, HIDEN returns the following result:
Figure 2 shows the result of HIDEN on the given network. Note that HIDEN computes the level decomposition successfully despite the existence of a cycle in the network.
Divide and Conquer method
HIDEN works well for networks that have up to 100 nodes. For larger networks, however, it becomes difficult to solve the resulting MIPP using current hardware. This is mainly because the number of integer variables of the MIPP that describe the problem for the given network increases. This increases the memory consumption and the running time significantly.
In order to solve our problem for networks that have more than 100 nodes we adopt a divide and conquer approach. Given a large TRN, we randomly divide this network into fixed size partitions. We do this by first randomly selecting a node from the given network. This node is the seed of the first partition, and thus it is a member of that partition. We then chose the remaining nodes in that partition iteratively by randomly growing the partition one node at a time. More specifically, at each iteration, we randomly select a node that is not selected so far and that is interacting with at least one of the nodes in the partition. We repeat these iterations until the number of nodes in the partition reaches to a predefined threshold or all the nodes in the TRN are assigned to a partition. Then, we use HIDEN to decompose the subnetwork defined by the nodes and the edges in this partition into hierarchical levels. Once we determine the levels of all the nodes in the current partition, we store those values as they will remain unchanged in the rest of our solution. Next we randomly pick another node from the given TRN among those that have not been considered yet as the seed of the next partition. We grow the next partition similarly and use HIDEN to decompose it into hierarchical levels. We repeat these steps until we exhaust all the nodes in the given TRN.
This method greatly reduces the running time of HIDEN on large networks. Since MIPP is NPhard, depending on the size and the connectivity of the given TRN, the divide and conquer strategy can be orders of magnitude faster than the unpartitioned HIDEN. However, due to random selection of the nodes, it is possible for us to not achieve the optimal result. This is possible if the partition of the network we start with does not intersect with one or more of the levels in its underlying hierarchy. It is worth mentioning that this probability is usually very low. We can explain this as follows. Consider an N node network which contains n nodes belonging to a specific level x. If we select k nodes among these N nodes randomly, the probability that none of the k nodes belong to level x is (Nn choose k)/(N choose k). As k or n increases, this expression quickly converges to zero. In order to reduce this probability further, we repeat the divide and conquer strategy multiple times, each time starting from a randomly selected node. In our implementation, we repeat this process 1000 times for real TRNs. After 1000 iterations, the probability of all the trials starting with an undesired (i.e. does not intersect with all the final levels) partition becomes very small (i.e. if for 1 iteration, the probability is as high as 0.9, after 1000 iterations, the probability becomes 0.9^{1000}∼10^{−46}). Since the running time of partitioned HIDEN is orders of magnitude less than that of the unpartitioned HIDEN, 1000 repetitions remains to be practical. It took less than 10 minutes for the largest dataset (S. cerevisiae). Our experiments showed that on the average, the results of the divide and conquer method reach its optimum in less than 500 iterations.
Results and discussion
In this section, we evaluate HIDEN using a number of computational tests. In our tests, we let the underlying MIPP solver to handle the case of multiple optimal results. We only consider the unique result reported by the solver in our discussions. In the rest of this paper, we will use the term experiment to refer to in silico experiments for simplicity.
Dataset In our experiments, we used TRNs of E. coli, H. sapiens and S. cerevisiae. We used the previously constructed networks, used by existing methods to test our method [13,5]. Earlier studies used existing interaction data to construct these three networks [817]. For the experiments that require gene function information, we used the information included in the Gene Ontology Database [18]. We downloaded the list of essential genes for S. cerevisiae from the database of essential genes [19].
In the rest of this section, we first compare HIDEN with other existing hierarchical decomposition methods in Section Comparison with existing hierarchical decomposition methods. In Section Biological evaluation of network hierarchies we evaluate the results our method using a number of biological properties of TFs. Finally in Section Effects of input on HIDEN, we analyze the behavior of our algorithm with respect to different quantitative properties of the data.
Comparison with existing hierarchical decomposition methods
The objective of hierarchical decomposition is to arrange the TFs of a given network to levels so that the gene that alter the activity of the other appears at a higher level than the other throughout the network as frequently as possible. The two ϕfunctions described at the beginning of this paper model this relationship in terms of the adjacency and the reachability of the nodes in the given network. In this experiment we evaluate how well our method, HIDEN, compares against three state of the art methods, namely vertex sort[2], HiNO[3] and BFSlevel[1], in achieving this objective. To perform this comparison, we compute the penalty values obtained by HIDEN when it is applied on S. cerevisiae, E. coli and H. sapiens networks. We compute the same penalty values for the vertex sort, HiNO and BFSlevel methods on the same three datasets for which their hierarchical decompositions are available.
The penalty is a quantitative value that can be used to compare different methods on the same dataset. However, since the size (number of genes and interactions) and the topology of these networks deviate significantly, the resulting penalties will differ significantly across datasets. In order to report a statistically sound value that describes the success of a method independent of the network size and topology, we also compute the Zscores of the resulting penalty values.
Let us denote the level assignment obtained by a specific method for an m node network with T = {t_{1},t_{2},⋯,t_{m}}. Let γ denote the penalty of T according to a specific ϕ function. In order to compute the Zscore for T, we randomly produce many level assignments using the same level distribution as that of T. For each such assignment, we compute the resulting penalty value using the same ϕ function. Let μ and σ denote the mean and standard deviation of the resulting penalty values of all these random level assignments respectively. We calculate the Zscore as follows,
A higher Zscore implies a better level assignment. Typically, a Zscore of four or higher is very significant as they indicate a result which is 4 or more standard deviations more extreme than the mean Table 1 summarizes the penalties and the corresponding Zscore values. For HIDEN, we reported the results for each of the six values of maximum number of levels (M = {3,4,⋯,8}). For other methods the number of levels is not a configurable parameter. Hence, we reported the level that we obtained after execution of that method. We discuss the results for each organism next.
Table 1. Comparison of HIDEN with three other methods on different networks
S. cerevisiae
We compared HIDEN with all the three competing methods for this dataset. Our method outperformed all the three methods in terms of both adjacency and reachability penalty values as well as the Zscores regardless of the number of levels. As the number of levels allowed increases, the penalty incurred by HIDEN monotonically decreases. This, however, is not true for the Zscore as it depends on the distribution of nodes to levels. For instance HIDEN attains the highest Zscore for adjacency penalty at level eight whereas it attains that using only six levels for the reachability penalty. The biggest drop of penalty takes place when the number of allowed levels increases from three to four. We observe further, yet, smaller improvement in the penalty as the number of allowed levels increases beyond four.
Among the competing methods, the vertex sort method of Jothi et al. incurs the lowest penalty. It, however, uses significantly more levels than the HiNO and BFSlevel methods. Furthermore, although it uses more levels than HIDEN as well, it performs worse than HIDEN in terms of both penalty and Zscore measures. Among the remaining two methods, HiNO and BFSlevel, there is no clear winner. BFSlevel leads to slightly less penalty at the expense of an additional level. As a result, HiNO produces slightly better Zscores.
E. coli
For this dataset, we compared HIDEN with all three existing methods. The penalty values of all the methods for E.coli are smaller compared to those of S. cerevisiae. This is mainly because E. coli network is much smaller. HIDEN performs the best among all methods for four or more levels according to both penalty and Zscore values. We did not observe any improvement for HIDEN beyond seven levels. Vertex sort attains statistically better results than HiNO and BFSlevel methods.
H. sapiens
We compared HIDEN with vertex sort and BFSlevel methods for this dataset. We omitted HiNO in this experiment because we could not run it on this dataset. The results follow a similar pattern as those of the two other datasets. HIDEN outperformed vertex sort and BFSlevel even when it used fewer levels. The gap between the Zscores of HIDEN and the other methods was even more significant than the previous datasets. HIDEN led to the highest drop of penalty of from three to four levels and continued to improve with increased number of levels.
We conclude that, HIDEN performs significantly better than the competing methods for all the three major datasets.
Biological evaluation of network hierarchies
In this section, we analyze HIDEN using biological evidence. First, we check functional properties of genes across different levels. Then, we evaluate the locations of essential genes in the hierarchy.
Functions of genes
TRNs regulate the expression of genes that take part in many processes in an organism [13]. Earlier works suggest that the concentration of genes participating in certain functions are closely related to the level in the hierarchy [1]. However, majority of cellular functions in the cell take place through multiple reactions happening in succession. Therefore, we expect a uniform distribution of functions among different levels. In order to confirm this theory, we calculated the functional enrichment score of every single level in the hierarchies we discovered. We first decomposed each of the H. sapiens, E. coli and S. cerevisiae TRNs to each of the three to eight levels. Then, for the resulting 18 combinations (i.e., 3 organisms and 6 levels), we calculated a pvalue for each gene ontology term and level pair. We obtain the gene ontology terms from the Gene Ontology database [18]. We calculate these pvalues assuming a hypergeometric distribution of the gene ontology term over all the TFs in the network. We observed that the pvalues were similar at all levels of the hierarchy (see Figure 3). This supports our initial theory that majority of the functions the TFs in our network participate are not enriched at any level. One example among many is the wound healing function in human network [20]. However, in rare instances, we observed some functions being moderately enriched in some levels. For example, third of the eighth level (the third level when we decompose the network into eight levels) of human TRN is enriched with the signal transduction function. However, we could not detect any other levels in any other network enriched with this function.
Figure 3. The pvalues for the observed number of genes annotated with the wound healing process at each level for the H. sapiens TRN. The network is divided into six levels using HIDEN with reachability as penalty scheme.
Each gene in an organism takes part in at least one metabolic function. A gene participating in a large number of reactions is a common phenomena in many organisms. In this experiment, we compare the level of each gene with the number of functions they participate in. By doing so, we aim to discover any existing relation between the two. In order to do this, we use the gene ontology database [18]. The participation of a gene in a reaction is represented using gene ontology annotations in the literature. For each gene in our networks, we first count the number of gene ontology terms it is annotated with. We also decomposed each network into six layers using HIDEN. Then, we visualized the networks using hierarchy information as location and functional information as color of each node. Figures 4, 5 and 6 show our results. Our results suggest that there is no strong correlation between the number of functions of each gene and the level of the gene in the hierarchy. However, in all three organisms, we observed that the genes with the highest number of annotations tend to lie at the middle levels (i.e. 2,3 or 4). This result indicates that regulatory hubs in the TRNs are not at the top levels. They are rather at the middle levels of the hierarchy.
Figure 4. Illustration of the distribution of the number of functions that each gene participates in for the TRN of H. sapiens. Each circle represents a TF. The network is divided into six levels using the reachability as the penalty function and placed in relevant levels. The horizontal lines separate the TFs to different levels. The genes are colored according to the number of Gene ontology terms they are annotated with in gray scale. The least number of functions is assigned the color black, where the largest number of functions is assigned the color white.
Figure 5. Illustration of the distribution of the number of functions that each gene participates in for the TRN of S. cerevisiae. Each circle represents a TF. The network is divided into six levels using the reachability as the penalty function and placed in relevant levels. The horizontal lines separate the TFs to different levels. The genes are colored according to the number of Gene ontology terms they are annotated with in gray scale. The least number of functions is assigned the color black, where the largest number of functions is assigned the color white.
Figure 6. Illustration of the distribution of the number of functions that each gene participates in for the TRN of E. coli. Each circle represents a TF. The network is divided into six levels using the reachability as the penalty function and placed in relevant levels. The horizontal lines separate the TFs to different levels. The genes are colored according to the number of Gene ontology terms they are annotated with in gray scale. The least number of functions is assigned the color black, where the largest number of functions is assigned the color white.
Gene Essentiality
The genes which an organism cannot survive without are called essential genes [19]. Such genes take part in vital functions in the cell. Earlier works proposed that the number of essential genes is strongly correlated to its location in the hierarchy [2]. In this experiment, we aim to find out if there exists any such relation. In order to do this, we count the number of essential genes at each level of hierarchy in a five level decomposition of S. cerevisiae TRN. We then report the ratio of number of essential genes to total number of genes in a level in the hierarchy. We also calculate Pvalues for the number of genes observed in each level to show how significant the observations are. Figure 7 shows our results for this experiment. We observe that there is a higher density of essential genes at the middle levels of the hierarchy. We also observe that, the Pvalues we calculated show that the level three (of maximum four) is highly enriched in essential genes. This result, combined with the results of the previous experiment support our theory that regulatory hubs of an organism are often at the middle levels of the hierarchy, rather than the top level. Indeed strong correlation between hubs and essentiality has been observed in the literature that supports our results [21].
Figure 7. The ratio of essential genes(solid boxes) and the Pvalues(dashed line) for the number of essential genes observed in S. cerevisiae TRN in each level of the hierarchy. The network is divided into five different levels using the reachability penalty. The Pvalues are calculated based on the hypergeometric distribution.
Figure 8 shows a subnetwork of the human TRN. The highlighted TFs are shown to have abnormalities in many types of cancers. cMyc is a TF which has a key role in cell proliferation [22]. Overexpression of cMyc may result in development of different types of cancers. TP53 is an essential gene which is regulated by cMyc. The expression of this gene prevents formation of tumors by activating DNA repair, inhibiting cell growth and finally inducing apoptosis [23]. TP53 executes apoptosis by activating caspases (i.e. CASP8, CASP3) [24]. FLI1 is another protein regulated indirectly by cMyc. The fusion of proteins EWSR1/FLI1 and EWSR1/ERG due to a mutation creates a master regulator for the development in Ewing’s Sarcoma[25,26]. EWSR1/FLI1 causes tumor formation by upregulating genes that are involved in cell proliferation (i.e. IGF1) and downregulating genes that control apoptosis and growth inhibition (i.e. IGFBP3, TGFBR2) [27]. These small scale observations support our previous justifications that regulatory hubs and essential genes are more likely to be situated in the middle layers of the TRNs.
Figure 8. The TRN of H. sapiens with a subnetwork related to cancer highlighted. In this subnetwork, external signals (i.e. Growth factors, other proteins and molecules) regulate or affect the proteins cMyc, FLI1 and ERG. Many other regulatory connections and transcription factors are omitted for simplicity.
Effects of input on HIDEN
In this section, we analyze HIDEN by changing the input of the algorithm. In order to do this, we first change the number of layers we decompose the network into. Then, we assume errors and uncertainties in input networks. Using our results, we explain how reliable our method is under different conditions. Finally, we discuss the quality of our results for different subnetworks.
Navigation of genes across levels in varying hierarchies
The location of a gene in the hierarchy depends highly on the total number of levels. This leads to the following important question: How much can we rely on the relative levels of genes? One key feature of our method is that it allows the user to specify the number of levels in the hierarchical decomposition of the given network. By exploiting this feature, next, we answer this question. Particularly, we show how the change the number of levels affect the locations of the nodes in the hierarchy. In order to do this, we first calculate the levels of every node for S. cerevisiae, E. coli and H. sapiens networks in a six level hierarchy. We use these calculations to place every node in their respective positions. We then decompose the same networks to five levels. We use the result of the second decomposition to assign colors to each node in the network. Figure 9 shows the results of this experiment for S. cerevisiae. Our results demonstrate that for the majority of the genes, the relative position between different genes is preserved. In different decompositions, discovering genes in the same relative positions with respect to other genes suggest the accuracy of our method for the relevant genes. However, there exists some genes that violate this relationship. For example, in Figure 9, nodes YGL013C, YMR280C and YKL109W are at least two levels away from where they were earlier. Therefore, we conclude that the predicted levels of these genes not as reliable as the others. This approach can be used for evaluating the reliability of our results. Figures 10 and 11 present similar results for E. coli and H. sapiens.
Figure 9. Illustration of the navigation of genes across levels for the TRN of S. cerevisiae. Each circle represents a gene. The locations represent the levels of the genes in a 6level decomposition, whereas colors of the genes represent their locations in a 5level decomposition. The color red represents the bottom level in the hierarchy, green represents the topmost level and the gradient of colors in between is used to color the nodes in between.
Figure 10. Illustration of the navigation of genes across levels for the TRN of E. coli. Each circle represents a gene. The locations represent the levels of the genes in a 6level decomposition, whereas colors of the genes represent their locations in a 5level decomposition. The color red represents the bottom level in the hierarchy, green represents the topmost level and the gradient of colors in between is used to color the nodes in between.
Figure 11. Illustration of the navigation of genes across levels for the TRN of H. sapiens. Each circle represents a gene. The locations represent the levels of the genes in a 6level decomposition, whereas colors of the genes represent their locations in a 5level decomposition. The color red represents the bottom level in the hierarchy, green represents the topmost level and the gradient of colors in between is used to color the nodes in between.
Robustness of HIDEN
One weakness of all hierarchical decomposition methods arises from the nature of the biological network datasets that they are incomplete and imprecise. As a result, the actual network topology observed can be slightly different than what is given in existing network databases [28]. Furthermore, studies demonstrate that the network topologies can vary due to many other factors such as external perturbations [29] and varying genetic profiles and disorders [30] even within the same species. This raises the question that how much can we rely on a hierarchical decomposition if the topology of the given network is not accurate?
This section evaluates the robustness of HIDEN with respect to inaccuracies in the given network. In order to do this, we carry out the following steps. Given a network, we first find its hierarchical decomposition, denoted by T. We then create many mutant networks from this network for a given mutation percentage using the degree preserving edge shuffling model [31]. This is the state of the art network alteration method that preserves the degree distribution of the network. We elaborate on this method later in this section. Thus, each mutant network denotes a potential precise network that is different than the original network. Using the topology of each mutated network, we compute the penalty of the hierarchical decomposition T we found at the first step. Thus, this penalty shows how bad our results are if our network is inaccurate. We repeat this for many mutant networks and report the average of their penalties.
Briefly, we mutate a given network as follows. Let (u, v) and (s, t) denote two randomly selected edges from this network such that (i) u and v are different; s and t are different, and (ii) the edges (u, t) and (s, v) do not exist in . We remove edges (u, v) and (s, t) and add edges (u, t) and (s, v). Let η denote the number of times we repeat this edge shuffling process. Then the mutation percentage of the original network is rounded to the nearest integer.
We conducted the experiments on S. cerevisiae, E. coli and H. sapiens and on both penalty metrics adjacency and reachability for different number of levels of hierarchy. Figure 12 summarizes the results for S. cerevisiae, E. coli and H. sapiens using the adjacency and reachability penalties when three, six or eight levels are allowed.
Figure 12. (Evaluation of the robustness of HIDEN) The Zscore of HIDEN’s hierarchical decomposition of the S. cerevisiae, E. coli and H. sapiens network using adjacency and reachability penalties. Level assignment is done on the original network. The Zscore is computed on the mutant network where the network is mutated at increasing mutation percentages. The results are reported for three different highest allowed levels, namely three, six and eight.
The most important observation that follows from our results is that the Zscore remains high even after we mutate the network by 20%. We observe a slight drop as the mutation rate increases, yet the results remain statistically significant. This observation holds for small (3), medium (6) and large (8) number of allowed hierarchical levels. This result has two major implications. First, HIDEN is extremely robust with respect to network mutations. It was able to identify hierarchical structure using the clues that remain in the topology of the given network after all mutations take place. Thus, even if the original network may be imprecise, the decomposition found by HIDEN will be a true decomposition with a high probability. Second, the degree preserving edge shuffling does not affect the decomposability of the network. The fact that even the original level assignment T yielded statistically significant penalties on the mutant network proves that it is possible to find another decomposition T’ of the mutant network that is statistically at least as significant as (possibly more significant than) T.
Stability of HIDEN to network mutations
So far, we have observed that HIDEN was able to decompose the networks of the given three organisms successfully. This observation along with our last conclusion from the previous section begs the following question: Can HIDEN decompose the mutant networks or was there a bias in topology of these three networks in favor of HIDEN? In other words, how stable is HIDEN with respect to alterations in the network topology?
In order to evaluate the stability of HIDEN with respect to network alterations, we carry out the following steps. Given a network , we create many mutant networks ’ from for a given mutation percentage using the degree preserving edge shuffling. We then use HIDEN on each such ’ to find a new hierarchical level assignment T’ specifically for that topology. Thus, this penalty shows how much the performance of HIDEN is affected from network alterations. We repeat this for many mutant networks and report the average of their results.
Tables 2, 3 summarize the penalties and the corresponding Zscores for varying mutation percentages as well as varying maximum number of allowed hierarchical levels with according to adjacency and reachability penalties respectively. For all the three organisms, we observe similar patterns in our experiments. The most important observation is that HIDEN achieves very high Zscores at all mutation rates. Furthermore, these Zscores are comparable to those of the original network (i.e., mutation percentage = 0%). The adjacency penalty values are also comparable to those for the original network. This coincides with the observation we made in the robustness test in Section Robustness of HIDEN that the degree preserving edge shuffling does not alter the decomposability of the given network. As the mutation percentage increases, Zscore and the adjacency penalties do not show a clear trend to increase or decrease. We, thus, reach to the conclusion that HIDEN is stable with respect to network alterations.
Table 2. Stability Experiment for increasing mutation percentages: The numbers in parenthesis is the average adjacency penalty
Table 3. Stability Experiment for increasing mutation percentages: The numbers in parenthesis is the average reachability penalty
Local versus global hierarchy of subnetworks
The entire biological network of an organism can be considered as a (possibly overlapping) collection of smaller subnetworks where each subnetwork corresponds to a coherent functional group. For instance, cell cycle network describes the interactions that take place during the division and replication of a cell to produce new cells. Similarly, meiosis network describes a special type of cell division only observed in reproductive cells. These smaller subnetworks may follow a hierarchical structure as well within their local topologies. Clearly, we can use HIDEN on each of these subnetworks to find their hierarchical structure by isolating them from the rest of the network one by one. We call such hierarchical decomposition as local hierarchy since it only depends on the topology of the subnetwork. We call the hierarchical decomposition we obtain for a subnetwork from the entire network’s topology as its global hierarchy. In this experiment, we evaluate how well the global hierarchy of a subnetwork fits to its local hierarchy.
Let us denote the entire network with and a subnetwork of with ’. Let us denote the level assignments for the networks and ’ by HIDEN with T and T’ respectively. Let be the global hierarchy of ’ induced from T. We calculate the adjacency penalty and Zscore of and T’ using the topology of ’. Table 4 summarizes the results for S. cerevisiae for two major subnetworks, namely cell cycle and meiosis taken from the KEGG database [32] with different values of maximum number of allowed levels.
Table 4. Comparison of the global hierarchy of subnetworks to their local hierarchy
The results demonstrate that the local hierarchy is better than the global one. This is not surprising as the global hierarchy is determined based on the entire network. Thus, the levels are determined with the goal of optimizing all the interactions in the network. On the other hand, local hierarchy is determined only based on the restrictions asserted by the corresponding subnetwork. We observe that the gap between the local and the global hierarchy is small for the cell cycle network. It is, however, significant for the meiosis network. In order to understand the factors that contribute to this gap, we performed a detailed analysis of the topology of the entire S. cerevisiae network as well as these two subnetworks. Cell cycle contained 54 genes and 108 interactions. Meiosis was smaller, containing 44 genes and 62 interactions. We define an interaction from a gene that is not in the subnetwork to a gene that is in the subnetwork as an incoming edge. If the interaction points the opposite direction, we define it as an outgoing edge. We computed the number of incoming and outgoing edges to each subnetwork. The number of incoming edges per node was 1.9 and 3.6 for cell cycle and meiosis respectively. That for the outgoing edges was 20.6 and 18.8 respectively. This suggests that as the number of incoming edges increase, the chance that the global hierarchy deviates from the local one increases. This is indeed intuitive as the incoming edges influence the hierarchy much more than the outgoing edges. For the local hierarchy, we observe that a small number of levels is sufficient to get a good hierarchical decomposition. For instance, HIDEN resolved all conflicts for cell cycle in only five levels. It resolved all but one conflict for meiosis in four levels.
These results demonstrate that the local and global hierarchies can deviate significantly depending on the topological relationship between the subnetwork and the rest of the network. Thus, detailed analysis of both decompositions can yield useful information regarding how the functions of a given subnetwork is depends on the other genes. HIDEN is capable of revealing such information.
Conclusion
In this paper, we took a novel approach to the problem of discovering underlying network hierarchy. We first transformed our problem to a MIPP. Then, we solved this problem using existing optimizers. We named this method HIerarchical DEcomposition of gene regulatory Networks. However, due to the growing size of the MIPP with increasing number of genes, we encountered scalability issues. We proposed a divide and conquer approach to tackle such problems. Later, we experimentally showed that our algorithm outperformed existing solutions in terms of minimizing conflicting edges in hierarchy. We also evaluated our method using biological and statistical tools. Then, we discussed the relation between the hierarchy of a gene in a TRN and its location in cell, essentiality and function, based on our experimental results and biological evidence.
Availability and requirements
The source code for HIDEN can be found in Additional file 1. The code is written in C++. The code requires IBM ILOG CPLEX version 12 or higher to compile and run. Please refer to the documentation of CPLEX for platform specific instructions on how to compile and run applications that use CPLEX libraries. The results of our code using the penalty schemes described in this paper for TRNs of E. coli, H. sapiens and S. cerevisiae can be found in Additional file 2.
Additional file 1. A c++ implementation of the algorithm developed in this paper.
Format: TAR Size: 494KB Download file
Additional file 2. The resulting level assignments for the transcriptional regulatory networks ofS. Cerevisiae,E. coliandH. sapiensusing adjacency and reachability penalties.
Format: XLS Size: 126KB Download file
This file can be viewed with: Microsoft Excel Viewer
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GG and TK designed the method. GG implemented the method. GG and NB gathered experimental results. GG and TK wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
This work was supported partially by NSF under grants CCF0829867 and IIS0845439.
References

Yu H, Gerstein M: Genomic analysis of the hierarchical structure of regulatory networks.
Proc Natl Acad Sci U S A 2006, 103(40):1472414731.
[http://dx.doi.org/10.1073/pnas.0508637103 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Jothi R, Balaji S, Wuster A, Grochow JA, Gsponer J, Przytycka TM, Aravind L, Babu MM: Genomic analysis reveals a tight link between transcription factor dynamics and regulatory network architecture.
Mol Syst Biol 2009, 5:294.
[http://dx.doi.org/10.1038/msb.2009.52 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Hartsperger ML, Strache R, Stumpflen V: HiNO: an approach for inferring hierarchical organization from regulatory networks.
PLoS One 2010, 5(11):e13698.
[http://dx.doi.org/10.1371/journal.pone.0013698 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Bhardwaj N, Kim PM, Gerstein MB: Rewiring of transcriptional regulatory networks: hierarchy, rather than connectivity, better reflects the importance of regulators.
Sci Signal 2010, 3(146):ra79.
[http://dx.doi.org/10.1126/scisignal.2001014 webcite]
PubMed Abstract  Publisher Full Text 
Bhardwaj N, Yan KK, Gerstein MB: Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels.
Proc Natl Acad Sci U S A 2010, 107(15):68416846.
[http://dx.doi.org/10.1073/pnas.0910867107 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Latchman SD: Transcription factors: an overview.
Int J Biochem and Cell Biol 1997, 29(12):13051312.
[http://www.sciencedirect.com/science/article/pii/S135727259700085X webcite]
Publisher Full Text 
Garfinkel RS, Nemhauser GL: Integer programming. New York: John Wiley & Sons; 1972.

Balaji S, Babu MM, Iyer LM, Luscombe NM, Aravind L: Comprehensive analysis of combinatorial regulation using the transcriptional regulatory network of yeast.
J Mol Biol 2006, 360:213227.
[http://dx.doi.org/10.1016/j.jmb.2006.04.029 webcite]
PubMed Abstract  Publisher Full Text 
Balaji S, Iyer LM, Aravind L, Babu MM: Uncovering a hidden distributed architecture behind scalefree transcriptional regulatory networks.
J Mol Biol 2006, 360:204212.
[http://dx.doi.org/10.1016/j.jmb.2006.04.026 webcite]
PubMed Abstract  Publisher Full Text 
GamaCastro S, Salgado H, et al.: RegulonDB version 7.0: transcriptional regulation of escherichia coli K12 integrated within genetic sensory response units (Gensor Units).
Nucleic Acids Res 2011, 39(Database issue):D98—105.
[http://dx.doi.org/10.1093/nar/gkq1110 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Harbison CT, Gordon DB, et al.: Transcriptional regulatory code of a eukaryotic genome.
Nature 2004, 431(7004):99104.
[http://dx.doi.org/10.1038/nature02800 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Horak CE, Luscombe NM, Qian J, Bertone P, Piccirrillo S, Gerstein M, Snyder M: Complex transcriptional circuitry at the G1/S transition in saccharomyces cerevisiae.
Genes Dev 2002, 16(23):30173033.
[http://dx.doi.org/10.1101/gad.1039602 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Jiang C, Xuan Z, Zhao F, Zhang MQ: TRED: a transcriptional regulatory element database, new entries and other development.
Nucleic Acids Res 2007, 35(Database issue):D137—D140.
[http://dx.doi.org/10.1093/nar/gkl1041 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Lee TI, Rinaldi NJ, et al.: Trascriptional regulatory networks in saccharomyces cerevisiae.
Science 2002, 298(5594):799804.
[http://dx.doi.org/10.1126/science.1075090 webcite]
PubMed Abstract  Publisher Full Text 
Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes.
Nature 2004, 431(7006):308312.
[http://dx.doi.org/10.1038/nature02782 webcite]
PubMed Abstract  Publisher Full Text 
Svetlov VV, Cooper TG: Review: compilation and characteristics of dedicated transcription factors in saccharomyces cerevisiae.
Yeast 1995, 11(15):14391484.
[http://dx.doi.org/10.1002/yea.320111502 webcite]
PubMed Abstract  Publisher Full Text 
Teichmann SA, Babu MM: Gene regulatory network growth by duplication.
Nat Genet 2004, 36(5):492496.
[http://dx.doi.org/10.1038/ng1340 webcite]
PubMed Abstract  Publisher Full Text 
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The gene ontology consortium.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang R, Lin Y: DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes.
Nucleic Acids Res 2009, 37(Database issue):D455—D458.
[http://dx.doi.org/10.1093/nar/gkn858 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Galko MJ, Krasnow MA: Cellular and genetic analysis of wound healing in drosophila larvae.
PLoS Biol 2004, 2(8):E239.
[http://dx.doi.org/10.1371/journal.pbio.0020239 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks.
Nature 2001, 411(6833):4142.
[http://dx.doi.org/10.1038/35075138 webcite]
PubMed Abstract  Publisher Full Text 
Dang CV, Resar LM, Emison E, Kim S, Li Q, Prescott JE, Wonsey D, Zeller K: Function of the cMyc Oncogenic Transcription Factor.
Experimental Cell Research 1999, 253:6377.
[http://www.sciencedirect.com/science/article/pii/S0014482799946864 webcite]
PubMed Abstract  Publisher Full Text 
Cell 2000, 103(5):691694.
[http://www.sciencedirect.com/science/article/pii/S0092867400001719 webcite]
PubMed Abstract  Publisher Full Text 
Schuler M, Green DR: Mechanisms of p53dependent apoptosis.
Biochem Soc Trans 2001, 29(Pt 6):684688. PubMed Abstract  Publisher Full Text

Kauer M, Ban J, Kofler R, Walker B, Davis S, Meltzer P, Kovar H: A molecular function map of Ewing’s Sarcoma.
Plos One 2009, 4(4):e5415.
[http://dx.doi.org/10.1371/journal.pone.0005415 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Owen LA, Kowalewski AA, Lessnick SL: EWS/FLI mediates transcriptional repression via NKX2.2 during oncogenic transformation in Ewing’s Sarcoma.
PLoS ONE 2008, 3(4):e1965.
[http://dx.plos.org/10.1371 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Riggi N, Stamenkovic I: The biology of Ewing sarcoma.
Cancer Lett 2007, 254:110.
[http://www.sciencedirect.com/science/article/pii/S0304383506006811 webcite]
PubMed Abstract  Publisher Full Text 
Sun N, Carroll RJ, Zhao H: Bayesian error analysis model for reconstructing transcriptional regulatory networks.
Proc Natl Acad Sci U S A 2006, 103(21):79887993.
[http://dx.doi.org/10.1073/pnas.0600164103 webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Bandyopadhyay N, Somaiya M, Kahveci T, Ranka S: Modeling perturbations using gene networks.
Proceedings of Life Sciences Society Computational Systems Bioinformatics (CSB) 2010, 2637.
[http://www.lifesciencessociety.org/CSB2010/toc/26.2010.html webcite]

Bandyopadhyay N, Somaiya M, Ranka S, Kahveci T: Identifying differentially regulated genes.
Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on 2011, 1925.
[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5729877&tag=1 webcite]

Mcsweeney PJ, Ashkenazi M, States D: Random network plugin.
Google Summer of Code 2008, 2008.
[https://sites.google.com/site/randomnetworkplugin/Home webcite]

Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes.
Nucleic Acids Res 1999, 27:2934. PubMed Abstract  Publisher Full Text  PubMed Central Full Text