Centre for Advanced Computational Solutions, Dept WF & Molecular Bioscience, Lincoln University, Ellesmere Junction Road, Christchurch, New Zealand

Abstract

Background

Compared to more general networks, biochemical networks have some special features: while generally sparse, there are a small number of highly connected metabolite nodes; and metabolite nodes can also be divided into two classes: internal nodes with associated mass balance constraints and external ones without. Based on these features, reclassifying selected internal nodes (separators) to external ones can be used to divide a large complex metabolic network into simpler subnetworks. Selection of separators based on node connectivity is commonly used but affords little detailed control and tends to produce excessive fragmentation.

The method proposed here (Netsplitter) allows the user to control separator selection. It combines local connection degree partitioning with global connectivity derived from random walks on the network, to produce a more even distribution of subnetwork sizes. Partitioning is performed progressively and the interactive visual matrix presentation used allows the user considerable control over the process, while incorporating special strategies to maintain the network integrity and minimise the information loss due to partitioning.

Results

Partitioning of a genome scale network of 1348 metabolites and 1468 reactions for

Conclusions

For the examples studied the Netsplitter method is a considerable improvement on the performance of connection degree partitioning, giving a better balance of subnet sizes with the removal of fewer mass balance constraints. In addition, the user can interactively control which metabolite nodes are selected for cutting and when to stop further partitioning as the desired granularity has been reached. Finally, the blocking transformation at the heart of the procedure provides a powerful visual display of network structure that may be useful for its exploration independent of whether partitioning is required.

Background

The genome scale metabolic network of small molecule reactions for cells (particularly eukaryotic cells) is sufficiently complex that it is hard to visualize, let alone interpret. Using conventional biochemical pathways is a bottom-up approach that helps to bridge the complexity gap between individual reactions and the complete network. But this still leaves scope for an intermediate level of granularity, namely subnets. A subnet allows the study of the interplay between pathways and reactions in a broader context, while still focussing attention on a limited biological functionality of interest.

This line of thought has been pursued by many authors in the recent literature, together with algorithms that use a top-down approach utilising the inherent structure of the complete network to determine its natural subdivision points. In addition to the conceptual argument, there are also practical considerations that motivate this endeavour in particular contexts. The use of structural analysis tools such as elementary modes and extreme pathways

Another significant context is flux balance analysis (FBA). There, knowledge of at least some measured fluxes is needed in order to calculate others by applying stoichiometric and other constraints. Current technology allows simultaneous measurement of about a dozen flux values or several hundred metabolite concentrations

Depending on the priority allocated to these three sets of considerations, different approaches have been advocated. A review that also covers the application of more general network theory approaches to biological networks can be found in a recent article by Nayak and De

The conceptual network simplification problem is typically addressed by clustering or community-finding algorithms. A typical example is the Markov clustering (MCL) algorithm

However, clustering of this kind is not really appropriate for metabolic subnetworks. The most highly connected metabolites are commodity or currency compounds such as H_{2}O and NADH, but generally (depending on the context) they are of least interest in terms of function. Conversely, the conventional pathways of biochemistry that should form the core of a functionally oriented partitioning are typically linear or circular and only weakly connected in terms of graph structure.

An alternative approach to the conceptual clarification of biochemical network structure is as hierarchy trees, an approach advocated in the work of Holme, Huss and Jeong

An approach that prioritises the appropriateness of a biochemical subnet for use in practical applications was demonstrated by Schuster et al

Using a threshold connectivity of 5, the metabolic network of

The network splitting procedure presented in this article aims to incorporate the insights outlined above. In addition it provides flexibility to interactively guide how the splitting proceeds, based on the purpose and biochemical knowledge of the user, within the limits set by the inherent network structure.

The formulation adopts internal/external reassignment as the splitting paradigm, but only uses the connectivity degree as a preliminary coarse filter to identify the most obvious external metabolites. This is optionally supplemented or refined by an explicit listing of metabolites that are/are not taken as external. The main algorithm uses random walks to explore long range network structure, in a similar way as MCL clustering

Methods

General overview

A metabolic network is supplied as an unordered list of chemical reactions, specified in the standard way by a matrix of stoichiometric coefficients. Processing proceeds through four computational stages:

1. Generating a matrix representation of the network connectivity structure from random walks, which expresses each internal metabolite as a distinct source or sink node in an associated directed acyclic graph (DAG).

2. Using hierarchical clustering and a blocking transformation to rearrange the DAG matrix into latent blocks that express the underlying partially separated subnets.

3. Proposing prospective separator nodes for approval to the user, implementing the decision and recalculating the DAG with improved blocking, leading to the next round of separator selection.

4. Post-processing to consolidate subnets by reincorporating superfluous externals and to reconstitute a stoichiometry matrix specification of each subnet from the DAG matrix blocks.

Each of these stages is described in more detail in the following subsections, followed by introduction of a quantitative measure of effectiveness. Fuller justifications for some of the steps are supplied in a separate subsection at the end of Methods.

Matrix representation of biochemical networks

Random walks and probability matrices

The procedure is based on representing the network as a matrix of probabilities that reflect random walks on a simple graph, similar to that used in the well-known Markov Clustering (MCL) algorithm

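For a simple graph, one starts from a probability matrix **P**_{1}, where the elements in row i give the probabilities of a single-step jump from node i to each of its neighbours. It is obtained from the adjacency matrix **C** of the graph, by dividing each element by the sum of all elements in the row. The probability matrix **P**_{N} for a random walk of N steps is obtained by raising **P**_{1} to the N-th power.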

If we start from a state where there is a single "random walker" on each node of the network at step 0, the probability associated with each walker has the value 1 for being localised on its starting node. Then **P**_{1} represents propagation of this probability to nearest neighbour nodes in step 1, and generally the potentiating of the matrix can be visualised as the flow of probability through the network after increasing numbers of steps. This is expressed in MCL terminology by referring to potentiating as the "expansion" operation.
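The expansion operation on a simple graph can be illustrated with a minimal sketch. The four-node directed graph, the helper names (`row_normalise`, `mat_mul`, `expand`) and the choice to leave all-zero rows as zeros are illustrative assumptions, not part of the published procedure.

```python
def row_normalise(matrix):
    """Divide each row by its sum; all-zero rows are left as zeros (assumption)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total else 0.0 for x in row])
    return result


def mat_mul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def expand(p1, steps):
    """Return P_N, i.e. P_1 raised to the power `steps` (steps >= 1)."""
    result = p1
    for _ in range(steps - 1):
        result = mat_mul(result, p1)
    return result


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, plus a self-loop on
# node 3 so that it acts as a sink.
C = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
P1 = row_normalise(C)   # one-step probabilities
P2 = expand(P1, 2)      # two-step probabilities
```

In this toy graph a walker starting on node 0 is split equally between nodes 1 and 2 after one step, and all of its probability has reached the sink after two steps, while every row of the expanded matrix still sums to 1.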

As constructed, the matrix **P**_{1} has non-negative elements with a row sum = 1, which makes it an example of a **P**_{∞}. In practice, for metabolic networks numerical convergence to an approximation of **P**_{∞} is obtained for values of **P**_{∞}, obtained by replacing all non-zero elements by 1, can be interpreted as the adjacency matrix of a new graph, containing the same nodes as the original network, but in which all links connect sources directly to sinks in a star-like configuration. This is formally described as a directed acyclic graph, and is directed irrespective of whether any links in the original network were directed. In what follows, either **P**_{∞} or its binary version is referred to as the DAG-matrix. Qualitatively the features described above are quite similar to those in the MCL, but note that in MCL terminology only the "expansion" operation (raising **P**_{1} to a power) is applied while the "inflation" operation that is key to the MCL is not used. Consequently the DAG obtained here does not usually separate into disconnected clusters, and needs to be further manipulated by the algorithm to extract subnetworks.

The generalisation of the procedure outlined above to the bipartite metabolic network case, starts by defining two separate adjacency matrices **CR **and **RC**. For a network of **CR **is similarly (**RC **is (**S **by the relations

Here the Sign function takes the values -1, 0 or 1 and serves to ensure that **CR** and **RC** are nonnegative binary matrices. From these the probability matrix for the reduced metabolites-only network is calculated by

Here the function RowNorm normalises each matrix row by a simple row sum, and converts the adjacency matrices to probability matrices. The summation implied in the matrix multiplication accumulates the probabilities for a random walk jump from metabolite node **I **represents an (
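The following sketch illustrates how a metabolites-only probability matrix can be built from a stoichiometry matrix. The defining relations for **CR** and **RC** are not reproduced here, so the construction below is an assumption: a metabolite is linked to a reaction it can enter as a substrate, a reaction is linked to each metabolite it can deliver, and a reversible reaction is treated as acting in both directions. The function names, the `reversible` flag list and the three-metabolite example are hypothetical, and structural externals are not removed in this toy case.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def adjacency_from_stoichiometry(S, reversible):
    """Build CR (metabolites x reactions) and RC (reactions x metabolites).

    Assumed convention: negative coefficients mark substrates, positive
    coefficients mark products, reversible reactions count both ways.
    """
    n_met, n_rxn = len(S), len(S[0])
    CR = [[0] * n_rxn for _ in range(n_met)]
    RC = [[0] * n_met for _ in range(n_rxn)]
    for i in range(n_met):
        for r in range(n_rxn):
            s = sign(S[i][r])
            if s < 0 or (s != 0 and reversible[r]):
                CR[i][r] = 1  # metabolite i can enter reaction r
            if s > 0 or (s != 0 and reversible[r]):
                RC[r][i] = 1  # reaction r can deliver metabolite i
    return CR, RC


def row_norm(matrix):
    """Normalise each row by its sum; all-zero rows stay zero (assumption)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def mat_mul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Hypothetical network: R0: A -> B (irreversible), R1: B <-> C (reversible).
S = [
    [-1, 0],   # A
    [1, -1],   # B
    [0, 1],    # C
]
CR, RC = adjacency_from_stoichiometry(S, reversible=[False, True])
P1 = mat_mul(row_norm(CR), row_norm(RC))
```

Under these assumptions the reversible reaction produces self-loops: a walker on B or C returns to its starting metabolite with probability 0.5 in one step.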

Calculation of the DAG matrix proceeds by straightforward iterative potentiation of **P**_{1}, using convergence of the Frobenius norm of the matrix to within an absolute value of 10^{-10} as the criterion.
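A minimal sketch of this iteration is given below. It assumes that the criterion is applied to the Frobenius norm of the difference between successive powers, and that elements below a small tolerance `eps` are treated as zero when forming the binary version; both choices, like the function names and the four-node example, are assumptions made for illustration.

```python
import math


def mat_mul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def frobenius_distance(a, b):
    """Frobenius norm of the elementwise difference a - b."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))


def dag_matrix(p1, tol=1e-10, max_steps=10000):
    """Multiply by P_1 repeatedly until successive powers differ by < tol."""
    current = p1
    for _ in range(max_steps):
        following = mat_mul(current, p1)
        if frobenius_distance(following, current) < tol:
            return following
        current = following
    raise RuntimeError("random walk did not converge")


def binarise(p, eps=1e-8):
    """Binary version: 1 where the limiting probability exceeds eps."""
    return [[1 if x > eps else 0 for x in row] for row in p]


# Hypothetical network: nodes 0 and 1 exchange probability and leak it to
# the sinks 2 and 3, which carry self-loops.
P1 = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
P_inf = dag_matrix(P1)
DAG = binarise(P_inf)
```

In the limit the source nodes 0 and 1 have non-zero entries only in the two sink columns (with probabilities 2/3 and 1/3 for node 0), which is the star-like source-to-sink structure described above.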

Matrix implementation of partitioning

Reclassifying an internal node as external to produce network partitioning is implemented by deleting the corresponding row from **CR** and column from **RC**. This implies that at any stage the DAG matrix only represents internal metabolites; as the set of internal metabolites changes during the course of the partitioning, the DAG is regularly updated. A detailed account of this implementation is given in the justifications section further below.
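As a small illustration, the deletion can be written as follows; `make_external` and the example matrices are hypothetical, and any list of metabolite names would have to be updated in step with the matrices.

```python
def make_external(CR, RC, index):
    """Return copies of CR and RC with metabolite `index` removed.

    CR is metabolites x reactions (a row is dropped);
    RC is reactions x metabolites (a column is dropped).
    """
    new_CR = [row[:] for k, row in enumerate(CR) if k != index]
    new_RC = [[x for k, x in enumerate(row) if k != index] for row in RC]
    return new_CR, new_RC


# Hypothetical adjacency matrices for three metabolites and two reactions.
CR = [[1, 0], [0, 1], [0, 1]]
RC = [[0, 1, 0], [0, 1, 1]]

# Reclassify the middle metabolite (index 1) as external.
CR_cut, RC_cut = make_external(CR, RC, 1)
```

After the cut, the probability matrix and the DAG are recalculated from the reduced matrices.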

Preprocessing the DAG matrix

The first step in processing the DAG matrix is to sort its columns so that all non-zero columns are collected on the left, to sort the rows in the same order, and then to delete the zero columns. In this way, by definition only sink nodes remain in the column sequence and rows are sorted with all sinks appearing first, followed by all source nodes.
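The reordering can be sketched as below for a binary DAG matrix; the function name and the four-node example are illustrative assumptions.

```python
def preprocess_dag(dag):
    """Collect sink columns on the left, order rows the same way, drop zero columns.

    Returns the reordered matrix (rows: sinks then sources, columns: sinks)
    together with the original indices of the sinks and sources.
    """
    n = len(dag)
    sinks = [j for j in range(n) if any(dag[i][j] for i in range(n))]
    sources = [j for j in range(n) if j not in sinks]
    order = sinks + sources
    reordered = [[dag[i][j] for j in sinks] for i in order]
    return reordered, sinks, sources


# Hypothetical binary DAG matrix in which nodes 1 and 3 are sinks.
DAG = [
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
reordered, sinks, sources = preprocess_dag(DAG)
```

The top square of `reordered` relates sinks to sinks, and the rows below it relate each source to the sinks it drains into.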

To demonstrate the method, an example network consisting of 137 metabolites and 117 reactions is used in what follows. This network happens to be a subnet for flavonoid metabolism in

**Demonstration model**. Specification of the network model used for demonstration in the Methods section, as an SBML file.


Figure

**DAG matrix for demo network**. Non-zero columns of the DAG matrix (a) with only structural externals recognised (b) after reclassifying 4 high connectivity internals as external. Colour scaling expresses random walk probabilities between source nodes (rows) and sink nodes (columns) of the network; comparison of (a) and (b) shows how connectivity structure is revealed by an appropriate high connectivity cutoff.

The top square 16 × 16 submatrix is seen to be (block) diagonal. For the majority of single diagonal elements, this merely indicates the finite probability that a random walk starting from a sink node will end there as a result of a self-loop, while it will not terminate at any other sink. There are also a few small blocks; they represent small clusters of nodes that are fully connected and hence jointly act as a "supersink". This top square does not reflect much of the overall network structure and further manipulation centres on the lower part, i.e. the DAG matrix is further truncated to contain just the (

Inspection of the lower 55 × 16 submatrix in Figure

Reclassifying just the four highest connectivity internal metabolites (Water, Coenzyme A, NADP and NADPH) in the demonstration network produces the drastic change shown by Figure

A useful strategy to determine such a set of a priori ubiquitous metabolites is to simply choose a fixed threshold value and reclassify all internal metabolites with connectivities higher than the threshold, in order to reveal the connection structure. A threshold of 8 was found to work well for networks over a wide range of sizes, from about 100 metabolites upwards. Manual adjustment of the threshold can also be done as its effect is easily monitored by visual inspection of the truncated DAG as in Figure

Alternatively, an explicit list of commonly occurring ubiquitous metabolites can be used instead of a threshold, to avoid inadvertent reclassifications. The most efficient strategy was found to be a combination: using a threshold to automatically reclassify the most "obvious" carrier metabolites, and supplementing this with an explicit list of less obvious ones.
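The combined strategy can be sketched as follows. Connectivity is taken here to be the number of reactions in which a metabolite participates, which is an assumption about the definition used; the function name, its parameters and the toy network are hypothetical, and the default threshold of 8 is the value quoted in the text.

```python
def select_externals(S, names, threshold=8, always_external=(), never_external=()):
    """Pick metabolites to reclassify as external before partitioning.

    S is the stoichiometry matrix (metabolites x reactions), names the
    metabolite names in row order. A metabolite is selected when its
    connectivity exceeds `threshold` or when it is explicitly listed,
    unless it appears in `never_external`.
    """
    selected = []
    for name, row in zip(names, S):
        connectivity = sum(1 for coeff in row if coeff != 0)
        if name in never_external:
            continue
        if connectivity > threshold or name in always_external:
            selected.append(name)
    return selected


# Toy network, small enough that a threshold of 2 is used for illustration.
names = ["H2O", "A", "B", "CoA"]
S = [
    [-1, 1, -1, 0],   # H2O takes part in 3 reactions
    [-1, 0, 0, 0],    # A
    [1, -1, 0, 0],    # B
    [0, 0, 1, -1],    # CoA takes part in 2 reactions
]
by_threshold = select_externals(S, names, threshold=2)
combined = select_externals(S, names, threshold=2, always_external={"CoA"})
protected = select_externals(S, names, threshold=2,
                             always_external={"CoA"}, never_external={"H2O"})
```

The `never_external` list gives the user a way to protect a highly connected metabolite that is of functional interest in a particular study.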

Rearranging the DAG matrix to identify subnets

Subnetworks and matrix blocks

The key insight needed to use the mathematical infrastructure described so far for network partitioning is that separated subnets can be made to appear in the truncated DAG matrix as non-overlapping blocks.

A block is defined as a rectangular submatrix, formed by the intersection of a horizontal band of rows and a vertical band of columns, and where any non-zero matrix elements in either band occur only inside the intersection (so elements in the bands outside of the block are all zero). It follows that the row and column ranges of a block do not overlap with those of any other block. So if it exhibits a non-trivial block structure, the full set of rows in the truncated DAG matrix will be partitioned with no overlap into two or more bands, and similarly the columns into the same number of bands. This definition does not require that blocks are arranged diagonally.
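The definition translates directly into a test on a candidate pair of row and column bands. The sketch below is only an illustration of the definition, with a hypothetical function name and example matrix; it is not the block recognition heuristic used by Netsplitter.

```python
def is_block(matrix, rows, cols):
    """Check the block definition for a band of rows and a band of columns.

    Every non-zero element lying in the row band must also lie in the
    column band and vice versa; elements outside both bands are ignored.
    """
    rows, cols = set(rows), set(cols)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value and ((i in rows) != (j in cols)):
                return False
    return True


# Hypothetical 3 x 3 matrix with two blocks that are not arranged diagonally.
M = [
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
lower_left = is_block(M, rows=[1, 2], cols=[0, 1])
upper_right = is_block(M, rows=[0], cols=[2])
not_a_block = is_block(M, rows=[0, 1], cols=[0, 1])
```

The zero element inside the lower-left block does not violate the definition, whereas the third candidate fails because row 0 has a non-zero element outside the chosen column band.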

The connection with disjoint subnets is established by noting that a non-zero element (

The truncated DAG as constructed so far will not show such block structure, but two operations are available to produce the block structure:

• Rows and columns may be reordered. There is no penalty to this, as the ordering of internal metabolites inherited from the S-matrix is arbitrary.

• Internal metabolites may be reclassified as external and deleted from the adjacency matrices. This carries a penalty, as information is lost - the mass balance of the metabolite is not enforced any more. The DAG matrix needs to be recalculated in this case and usually has a different allocation of sinks and sources.

Rearrangement of rows and columns

The first step is to rearrange rows and columns so that metabolites belonging to a block are grouped together. For computational efficiency, operations described here are performed on a binary version of the truncated DAG matrix, on the grounds that it is the connectivity of the network that is relevant rather than detailed probabilities from the random walk. In a binary matrix with simple rectangular blocks, all rows/columns in a particular block are identical but are orthogonal to those in any other block. However, the definition of a block given previously allows zero elements inside a block as well, so this is relaxed to say that rows/columns belonging to a block need to be similar to each other but dissimilar from those in other blocks, i.e., it reduces to a vector clustering problem. The Sokal-Sneath vector dissimilarity is used to quantify this, as discussed in more detail in the justifications section.

Using this measure the rearrangement problem reduces to one of finding row and column sequences that give optimal clustering. Of various standard clustering methods that were considered the hierarchical clustering method

Hierarchical clustering, as expressed in a dendrogram representation, has the advantage that - unlike most other clustering methods - it gives a definite sequence (the ordering of leaves in the dendrogram) while not committing to a fixed number or size of clusters. These can be subsequently determined by choosing a cutoff level in the dendrogram, a property exploited in the next stage of the procedure.
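To make the role of the leaf ordering concrete, the sketch below derives row and column sequences by naive single-linkage agglomeration. It assumes the commonly used form of the Sokal-Sneath dissimilarity, 2m / (n_11 + 2m) with m the number of mismatched pairs, and defines the dissimilarity of two all-zero vectors as 0; the function names and the example matrix are hypothetical, and a library clustering routine would normally replace the explicit loop.

```python
def sokal_sneath(u, v):
    """Sokal-Sneath dissimilarity between two equal-length binary vectors."""
    n11 = sum(1 for a, b in zip(u, v) if a and b)
    mismatch = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    denom = n11 + 2 * mismatch
    return 2 * mismatch / denom if denom else 0.0


def single_linkage_order(vectors):
    """Merge the two closest clusters repeatedly; return the final leaf order."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(c1, c2):
        # Single linkage: minimum dissimilarity over all cross-cluster pairs.
        return min(sokal_sneath(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        _, a, b = min((dist(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical binary truncated DAG matrix with two interleaved groups of rows.
M = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
row_order = single_linkage_order(M)
col_order = single_linkage_order([list(col) for col in zip(*M)])
rearranged = [[M[i][j] for j in col_order] for i in row_order]
```

Only the ordering is used at this stage; the number and size of clusters are left open until a cutoff level is chosen.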

Figure

**Rearranged binary**. Binary truncated DAG matrix, reordered according to hierarchical clustering of rows and columns. Clustering groups nodes with similar long range connections together, so that black areas that form the cores of latent block structures appear.

Blocking transformation

The next challenge is to identify latent blocks that can be separated in further processing. A crucial decision to be made is the optimal number, size and shape of blocks. Reordering alone as in Figure

The decision is facilitated by introducing a

1. Truncating the column dendrogram at a particular chosen level, defines a collection of consecutive column clusters such as _{ij}, we make the replacement

Here _{ij }are the binary matrix elements and _{j }is the column cluster to which column

In the case of a perfectly blocked matrix, all non-zero elements in a row will belong to the same unique cluster and their values are left unchanged at 1. Any zero element in the same cluster is replaced by 1, i.e. any gaps inside the cluster are filled in. All row elements in the remaining clusters will be, and remain, zero.

However, for an imperfectly blocked matrix, any non-zero element outside the range of a particular cluster will serve to dilute the common value of elements inside the cluster to a fractional value. Hence in a gray-scale representation the row appears as a sequence of bands in different shades of grey; the darkest grey identifies the cluster containing the largest fraction of non-zero elements.
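The behaviour described for perfectly and imperfectly blocked rows is reproduced by the following sketch. It assumes that each element is replaced by the fraction of the row's non-zero elements that fall inside the column cluster of that element; this reading is inferred from the two cases described above, and the function name, the cluster labels and the example are hypothetical.

```python
def block_rows(B, col_cluster):
    """Row blocking of a binary matrix B.

    col_cluster[j] is the label of the column cluster containing column j.
    Each element becomes (non-zeros of the row inside that cluster) divided
    by (all non-zeros of the row); all-zero rows stay zero (assumption).
    """
    result = []
    for row in B:
        total = sum(row)
        per_cluster = {}
        for value, c in zip(row, col_cluster):
            per_cluster[c] = per_cluster.get(c, 0) + value
        result.append([per_cluster[c] / total if total else 0.0
                       for c in col_cluster])
    return result


# Two column clusters: columns 0-2 and columns 3-4.
col_cluster = [0, 0, 0, 1, 1]
B = [
    [1, 0, 1, 0, 0],   # perfectly blocked row with a gap inside the cluster
    [1, 1, 0, 1, 0],   # one non-zero element strays into the second cluster
]
blocked = block_rows(B, col_cluster)
```

The first row is filled in to a solid band of ones, while the stray element in the second row dilutes its main band to 2/3 and leaves a lighter band of 1/3 in the other cluster.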

Applying this transformation to all rows of the matrix in Figure

2. Truncating the row dendrogram at a particular chosen level, defines a collection of consecutive row clusters _{i}. In column

**Blocking matrices**. Transformed versions of the truncated binary matrix, constructed by (a) blocking rows (b) blocking columns (c) superimposing row and column blocking matrices (d) reordering rows and columns to consolidate blocks. Grey shades in effect express the degree to which rows conform to column grouping and vice versa. Combining these and optimising the clustering expresses subnet cores visually as dark areas and overlaps as lighter shades.

Application of this transformation to all columns similarly gives the column blocking matrix shown in Figure

In a perfectly blocked matrix, blocks based on grouping rows or columns are identical, but the demonstration example shows that for imperfect blocking the row and column blocking matrices are similar but not identical. The next step superimposes the information from the two separate hierarchies.

3. Combine row and column blocking matrices by elementwise averaging:

The combined blocking matrix obtained in this way is shown in Figure
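A sketch of the superposition is given below. The elementwise average is taken from the text; the per-row transformation is an assumption (each element replaced by the fraction of the row's non-zero elements that lie in the same cluster), the column version is obtained by applying the same function to the transpose, and all names and the example are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def block_rows(B, col_cluster):
    """Assumed row blocking: fraction of a row's non-zeros in each cluster."""
    result = []
    for row in B:
        total = sum(row)
        per_cluster = {}
        for value, c in zip(row, col_cluster):
            per_cluster[c] = per_cluster.get(c, 0) + value
        result.append([per_cluster[c] / total if total else 0.0
                       for c in col_cluster])
    return result


def combined_blocking(B, row_cluster, col_cluster):
    """Elementwise average of the row and column blocking matrices."""
    by_rows = block_rows(B, col_cluster)
    by_cols = transpose(block_rows(transpose(B), row_cluster))
    return [[(r + c) / 2 for r, c in zip(r_row, c_row)]
            for r_row, c_row in zip(by_rows, by_cols)]


# Hypothetical matrix with two row clusters and two column clusters.
B = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
combined = combined_blocking(B, row_cluster=[0, 0, 1], col_cluster=[0, 0, 1])
```

In the result, elements on which both groupings agree keep the extreme values 1 and 0, while the overlap caused by the non-zero element in row 2, column 1 shows up as intermediate values.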

An important aspect of the algorithm has been glossed over above. The dendrograms used in steps 1 and 2 define hierarchical lists of distances between subclusters. It is by choice of a particular cutoff value in each list (defining the minimal distance for subclusters to be recognised as separate) that one can choose between many smaller clusters or fewer larger ones.

To exploit that, a quantitative criterion

4. Calculate

The blocking matrices shown in Figure

However, there is still one noticeable deficiency in Figure

5. Consolidate blocks by reordering rows and columns according to hierarchical clustering now applied to the combined blocking matrix.

As Figure

Finally, in order to computationally process individual subnets, automated recognition of separated blocks is required. This is a fairly straightforward image processing problem, and a heuristic procedure based on the block definition given above is described in Additional file

**Heuristic for block recognition**. A description of the heuristic employed by Netsplitter for automated recognition of non-overlapping matrix blocks as defined in the text.


Selection of separation nodes

Having prepared the DAG matrix to express any underlying partial block structure, the procedure now enters an iterative loop in which in each round, a small number of nodes are identified that when "cut" (i.e., the corresponding metabolite is reclassified as external), will lead to separation into subnets. The goal is to keep this set of separation nodes as small as possible, both to minimise the loss of mass balance information and to preserve as far as possible the local structure of the full network.

For example, applying block recognition scanning to the matrix of Figure

For a number of reasons, it is postulated that the lighter grey cells in the figure are the most promising candidates for removal to induce separation. One rationale is that by construction they reflect a status as exceptions, while the majority of cells in their row or column belong to the same group and end up as dark grey. Also, they tend to result from cases where there is already separation from the perspective of the row grouping and only weak overlap from the column grouping, or vice versa. Middle grey, on the other hand, reflects either strong evidence from one of the groupings, or moderate consensus that may solidify once the weakest overlaps are removed. There is also some analogy to the effects of the "inflation" step of the MCL. In that method, "inflation" is produced by taking the Hadamard power of a probability matrix; this tends to suppress low probabilities and causes the "weakest" links between node clusters to be removed first. These arguments can be made more elaborate, but in the final analysis the justification lies in the result obtained. As detailed in the justifications section, linear programming is used to select a small number of metabolites that optimally cover the lighter grey cells in the blocking matrix, and these are proposed to the user as candidate externals.

An example is illustrated in Figure **RC **and **CR**, the DAG matrix recalculated and the blocking transformation repeated to identify further candidates in a second round, and this iteration is continued until either sufficiently fine-grained splitting has been achieved, or no more separation nodes are found.

**Eliminating separation nodes**. Blocking matrices as presented to the user for selecting separation nodes in subsequent rounds. Rows and columns proposed for cutting are highlighted in colour. (a) First round (b) Round 5 (c) Final result, after seven rounds and restoring all blocks and reincorporation of superfluous externals. The four non-overlapping blocks represent separation into four subnetworks, the largest two still showing minor internal structure.

Figure

Another case of such irreducible blocks that is usually encountered is the appearance in the DAG matrix of isolated sinks or "orphans". These appear as entries in the top, diagonal section of the DAG matrix with no accompanying source node entries in the corresponding column of the truncated matrix used for blocking. Such an orphan metabolite node signifies the simplest possible subnet, with only a single internal metabolite, and typically containing only two reactions. As these obviously cannot be split further and the presence of an empty column complicates block recognition, they are best eliminated in each round from the adjacency matrices along with single row/column irreducible blocks.

Postprocessing and reconstruction of subnetworks

Once the iterative process of progressively selecting separation nodes has terminated, the main outcome is a list of internal metabolites, partitioned into disjoint subsets that belong to each block. The remaining metabolites constitute a list of external metabolites. This list may contain entries that are not, in fact, essential for block separation. For example, a metabolite may have been made external during initialisation on the grounds that it participates in a large number of reactions, but if all of those reactions belong to the same subnet it should be reinstated as an internal metabolite in this subnet. Also, it can happen that the effects of a metabolite selected early on in the progressive selection process, are superseded by one selected later. In the interests of maintaining maximal network integrity compatible with the separation, all such superfluous externals need to be reincorporated before finalising the subnets.

This is done in a loop that inspects the stoichiometry matrix for each external metabolite on the list, to determine all internal metabolites to which it connects by reaction links in either direction in the bipartite representation. If all those belong to a single block, the external metabolite is reincorporated into that block. If they belong to a single block, except for a connection to one or more orphan metabolite nodes, those orphan nodes are also reincorporated into the block as detailed below. As this reincorporation loop changes the composition of the lists of internal and external metabolites, the loop is repeated iteratively until there is no further change in the composition of the lists.
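The core of this loop can be sketched as follows. The special handling of orphan metabolite nodes is deliberately left out, an external with no internal neighbours is simply left external, and the data layout (a dictionary from internal metabolite index to block label, a list of external indices) is a hypothetical choice made for illustration.

```python
def reincorporate(S, block_of, externals):
    """Return updated (block_of, externals) after reincorporating superfluous externals.

    S is the stoichiometry matrix (metabolites x reactions); block_of maps
    each internal metabolite index to its block; externals lists the
    indices of metabolites currently treated as external.
    """
    block_of = dict(block_of)
    externals = list(externals)
    changed = True
    while changed:                      # repeat until the lists are stable
        changed = False
        for e in list(externals):
            # All metabolites sharing at least one reaction with e.
            neighbours = set()
            for r, coeff in enumerate(S[e]):
                if coeff != 0:
                    neighbours.update(m for m in range(len(S))
                                      if m != e and S[m][r] != 0)
            blocks = {block_of[m] for m in neighbours if m in block_of}
            if len(blocks) == 1:        # connects to a single block only
                block_of[e] = blocks.pop()
                externals.remove(e)
                changed = True
    return block_of, externals


# Hypothetical network with five metabolites and three reactions.
S = [
    [-1, 0, 0],    # m0, block A
    [1, -1, 0],    # m1, block A
    [0, 0, -1],    # m2, block B
    [-1, 1, 0],    # m3, external, touches only block A
    [0, -1, 1],    # m4, external, bridges blocks A and B
]
block_of, externals = reincorporate(S, {0: "A", 1: "A", 2: "B"}, [3, 4])
```

In this example m3 is reinstated as an internal metabolite of block A, whereas m4 must remain external because it links reactions of both blocks.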

The approach that was chosen to select separation nodes progressively, a few at a time, has the advantage that it allows the user to steer the network splitting by accepting or rejecting proposed separation nodes and terminating the process at the desired level of granularity. However, a disadvantage is that the results may become dependent on the order in which separation nodes are identified. That is counteracted by performing a one-off blocking step in which the full list of external metabolites is applied simultaneously. This step is performed as part of the post-processing done after the selection process is finished; but the question arises whether it should be done before or after the reincorporation step. Each choice has some advantages, and the most robust result is in fact achieved by repeating the reincorporation step. So the full post-processing procedure consists of the 3-step sequence: a first reincorporation step, then the one-off blocking, followed by a second reincorporation. Figure

Once the partitioned list of internal metabolites is finalised by this post-processing, the individual subnets can be reconstructed in a straightforward way from the original stoichiometry matrix **S**. For each subnet, all reactions in which its internal metabolites participate are extracted from **S** and allocated to this subnet. All metabolites that participate in these reactions are collected; those not appearing on the list of internals for the subnet are by definition the external metabolites of the subnet. The submatrix of **S** pertaining to the reactions and metabolites so identified is extracted and saved in appropriate format as a full specification of the subnetwork, which can be further analysed by standard network analysis or FBA software tools.
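The reconstruction described here amounts to a few selections on **S**, as in the following sketch; the function name, return layout and example network are hypothetical.

```python
def extract_subnet(S, internals):
    """Extract the sub-stoichiometry matrix for one subnet.

    S is metabolites x reactions; internals holds the row indices of the
    internal metabolites of the subnet. Returns the submatrix together
    with the metabolite indices, reaction indices and subnet externals.
    """
    internals = sorted(internals)
    # Every reaction in which an internal metabolite participates.
    reactions = [r for r in range(len(S[0]))
                 if any(S[m][r] != 0 for m in internals)]
    # Every metabolite taking part in one of those reactions.
    metabolites = [m for m in range(len(S))
                   if any(S[m][r] != 0 for r in reactions)]
    sub_externals = [m for m in metabolites if m not in internals]
    sub_S = [[S[m][r] for r in reactions] for m in metabolites]
    return sub_S, metabolites, reactions, sub_externals


# Hypothetical network: metabolites 0, 1 and 3 are the internals of one subnet.
S = [
    [-1, 0, 0],
    [1, -1, 0],
    [0, 0, -1],
    [-1, 1, 0],
    [0, -1, 1],
]
sub_S, metabolites, reactions, sub_externals = extract_subnet(S, {0, 1, 3})
```

Metabolite 4 enters the subnet only because it takes part in one of its reactions, so it appears as an external of this subnet.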

By construction, the internal metabolites of different subnets are mutually exclusive sets. External metabolites, on the other hand, are often shared between subnets. In the vast majority of cases, there is also no overlap between external metabolites of any subnet and the internals of any other.

There are, however, rare exceptions where an external of one subnet is in fact an internal of another. This phenomenon can be considered an artefact of the way that the algorithm mainly operates on a reduced metabolites-only simple graph. At this level where the blocking procedure is carried out, there is a strict distinction between internal and external metabolites; they form non-overlapping sets. However, when translated back to the underlying bipartite graph representation, cutting all metabolite nodes that were identified as external, can sometimes still leave subnets connected by a shared reaction node.

A typical case is shown in Figure

**Example of internal-external overlap between subnets**. Subnets A and B, connected by a common product of reaction R1. Metabolite nodes are shown as squares and the reaction node as an octagon. (a) Bipartite representation of the network (b) Reduced metabolites only network. In (b) subnets are fully separated by making

The existence of this kind of limited overlap between two subnets does not compromise the integrity of either as a coherent subnet: it remains true that for all internal metabolites in a subnet, all reactions in which they participate are included in the subnet, and so the mass conservation constraints of all internal metabolites are identical in the subnet and in the full network. However, it does uniquely create the complication that the same reaction is present in both subnets, which can lead to conflicting values for the flux through this reaction in separate FBA calculations for each subnet. To avoid that, it may be preferred to merge the two subnets into a larger one when this exceptional case arises.

It should also be noted that for a similar reason the reincorporation of orphan metabolite nodes is slightly more complicated than outlined above. By definition, an orphan node is isolated from all other internal nodes in terms of probability flow, but it could still be connected by a unidirectional link towards the orphan. Consequently, incorporation of an orphan takes place in two steps. When an external connected to an orphan metabolite node is incorporated into a block, the orphan is first promoted to an external of that block. In the next round of the incorporation loop, it is then tested for links to internals of other blocks and only incorporated as an internal if no such links in either direction are found.

Detailed justifications

Internal and external metabolites and network partitioning

Conventionally, external nodes are placed on the periphery when drawing a network to indicate that they form the interface between the metabolic system that the network represents and its environment. However, the distinction between nodes that are associated with mass balance constraints (internal metabolites), and those that are not (external metabolites) is not apparent when the network topology is simply specified as a list of reactions. Most external metabolites can be recognised computationally by the fact that an external metabolite is either taken up or delivered to the environment so that all network links impinging on an external node are directed away from or towards the node; but in cases where the metabolite is exchanged with the environment that distinction is lost.

A convention commonly used in FBA of metabolic networks

Another feature of representing a chemical network by a bipartite graph, is that as reaction nodes represent a chemical transformation of one or more reactants, reaction nodes can never be external.

These issues become relevant for partitioning a network, because in isolating a subnetwork a new periphery is created for it. When the connection between the subnet and the rest of the network is severed, some metabolites are received from and/or delivered to the rest of the network. Their mass balance can no longer be guaranteed by the subnet alone; in other words, the status of these metabolites is changed from internal to external. From a graph theory perspective, partitioning corresponds most naturally to deleting a link of a graph. However, that will not do for the biochemical network; in the bipartite representation, that would make a reaction node external, and it makes even less sense in the metabolites-only simple graph representation where a link represents a sum over several reactions. In clustering methods such as MCL, each node is allocated to a particular cluster, but that would not make sense here either, as a metabolite that is made external by partitioning belongs to both subnets: as a product of one, and a substrate of the other. Clearly the appropriate way to represent partitioning is to split the metabolite node into two, one becoming an external node in each subnet. This leaves all reaction nodes as internal and uniquely assigned to a subnet. The effect of splitting a node is to stop probability flow through the node, and the simplest way to implement that in the matrix representation is to delete the corresponding row from **P**_{1} and hence ultimately from the DAG.

The problem of partitioning the network hence reduces to finding a suitable (by criteria to be formulated) subset of internal metabolites such that when deleted, the network divides into self-contained subnets with no probability flow between them.
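The reduction described above can be illustrated with a minimal sketch. It is not the Netsplitter implementation: it assumes a plain undirected adjacency dictionary over internal metabolites (a hypothetical toy network) and uses ordinary connected components in place of the probability-flow criterion, simply to show how deleting a separator node makes a network fall apart into self-contained subnets.

```python
from collections import deque

def split_by_separators(adjacency, separators):
    """Connected components left after deleting the separator nodes.

    adjacency: dict mapping each internal metabolite to the set of its
    neighbours (assumed undirected for this illustration).
    """
    removed = set(separators)
    seen = set()
    subnets = []
    for start in adjacency:
        if start in removed or start in seen:
            continue
        # breadth-first search that never steps onto a separator
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in removed and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subnets.append(sorted(component))
    return subnets

# Hypothetical toy network: two triangles joined only through metabolite "S".
toy = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "S"},
    "S": {"C", "D"},
    "D": {"S", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
subnets = split_by_separators(toy, ["S"])
```

With "S" reclassified as external, the toy network separates into the two triangles; with no separators it stays a single block.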

Recognising that the algorithmically found externals are due to be deleted in this way, it follows that metabolites that are already external in the full network should similarly be deleted from **P**_{1} even before the partitioning starts. This step in fact corresponds to the restriction of FBA calculations to the internal rows of the S-matrix as mentioned before. However, in the procedure presented here rows are deleted from the adjacency matrices **CR** and **RC** while **S** is left intact, so that the reaction stoichiometries can be used to restore the externals to each subnet once partitioning is complete.

Vector clustering

A quantitative measure of dissimilarity

Here, corresponding elements in two equal length binary vectors are paired, and _{ij }is the number of pairs with value (

Hierarchical clustering in addition needs a measure for the distance between clusters (a "linkage criterion"), and again several common measures were tried: single linkage (minimum intercluster dissimilarity), complete linkage (maximum intercluster dissimilarity), average dissimilarity, dissimilarity of cluster centroids or medians, and finally the Ward minimum variance criterion. Single linkage was found to be fast to calculate and to give good contrast; average, centroid and median linkage are slower but give similar results, while complete linkage and the Ward criterion lead to excessive fragmentation of the network.

Clustering of the combined blocking matrix in step 5 of the blocking transformation is performed broadly as described for the DAG. However since the combined blocking matrix can contain fractional values, the binary dissimilarity measure described by equation (1.6) is replaced by a generalisation of the Dice dissimilarity to real values and known as the Bray-Curtis distance between vectors
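The relation between the two measures can be checked with a short sketch. It assumes the binary measure is the standard Dice dissimilarity and uses the standard Bray-Curtis definition (sum of absolute differences divided by sum of absolute sums); the function names and example vectors are illustrative only.

```python
def dice_dissimilarity(u, v):
    """Standard Dice dissimilarity between two equal-length binary vectors."""
    n11 = sum(1 for a, b in zip(u, v) if a and b)
    n10 = sum(1 for a, b in zip(u, v) if a and not b)
    n01 = sum(1 for a, b in zip(u, v) if b and not a)
    denominator = 2 * n11 + n10 + n01
    return (n10 + n01) / denominator if denominator else 0.0

def bray_curtis(u, v):
    """Bray-Curtis distance between two equal-length real vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

# On binary vectors the two measures coincide.
u_bin = [1, 1, 0, 1, 0]
v_bin = [1, 0, 0, 1, 1]
d_dice = dice_dissimilarity(u_bin, v_bin)
d_bc = bray_curtis(u_bin, v_bin)

# Bray-Curtis also accepts the fractional values of a combined blocking matrix.
d_frac = bray_curtis([1.0, 0.5, 0.0], [0.5, 0.5, 0.25])
```

Because the two agree on 0/1 data, switching to Bray-Curtis for the fractional matrix leaves the treatment of binary rows unchanged.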

Blocking quality

Experimentation with various possibilities yielded the following scoring formula for the blocking quality

Here, W is the total number of zero (white) elements in the matrix. The two factors in this formula express distinct features that seem qualitatively reasonable for judging the quality of the blocking matrix. The first, squared, factor would be 1 if all non-zero elements were 1 (black) and decreases when there are more and lighter grey cells, so it is a measure of "how black" the block parts of the matrix are. On its own, however, maximising this factor tends to favour a small number of large blocks, because that makes it easier to capture all the non-zero values inside blocks. To counteract that, the second factor represents the fraction of cells that are white, so it tends to be maximised by keeping blocks as compact as possible. It is clear that for a perfectly blocked matrix, a maximum blocking quality below 1 will be achieved if the cutoff produces clustering that coincides exactly with the blocks.

In an imperfectly blocked matrix, adjusting the dendrogram cutoff gives unpredictable fluctuations in the

Optimal selection of separator nodes

To implement the recognition of light grey cells in the blocking matrix as most promising for eliminating block overlap while keeping the metabolites taken as external to a minimum, the strategy is to select the smallest set of rows and columns that together cover all matrix cells with values below a chosen threshold. The threshold is determined as the value that selects a total number of light grey cells, no more than a low multiple of the column dimension of the matrix. This gives a flexible threshold value adapted to the size and nature of the matrix, which will lead to only a few metabolites eliminated at a time before checking for adequate subnet separation.
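One way to read the threshold rule is sketched below. The value of the "low multiple" (here a hypothetical parameter `multiple`), the use of "less than or equal" for cell selection, and the example matrix are all assumptions made for illustration, not values taken from the method.

```python
def light_grey_threshold(matrix, multiple=2):
    """Largest cell value t such that the non-zero cells with value <= t
    number at most multiple * (number of columns).

    Returns None if even the smallest non-zero value exceeds the budget.
    """
    n_cols = len(matrix[0])
    budget = multiple * n_cols
    values = sorted(v for row in matrix for v in row if v > 0)
    threshold = None
    for t in sorted(set(values)):
        count = sum(1 for v in values if v <= t)
        if count > budget:
            break
        threshold = t
    return threshold

def light_grey_cells(matrix, threshold):
    """(row, column) positions of non-zero cells at or below the threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if 0 < v <= threshold]

# Hypothetical 4 x 4 blocking matrix with two blocks and some overlap.
blocking = [
    [1.0, 0.9, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.2],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.3, 0.6, 1.0],
]
t = light_grey_threshold(blocking, multiple=1)
cells = light_grey_cells(blocking, t)
```

Tying the cell budget to the column dimension keeps the number of candidate cells, and hence the number of metabolites cut per round, small whatever the size of the matrix.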

As any light grey cell could be eliminated by taking either its row or column metabolite external, the optimal selection from both sets is determined by reformulating this as an integer linear programming (ILP) problem. To set that up mathematically, introduce a binary column vector **x** whose dimension equals the number of internal metabolites. Each vector element is 1 or 0 according to whether the corresponding metabolite is selected. The total number selected is obtained by premultiplying **x** with a row vector **b** of the same dimension with all elements equal to 1. The constraints are that for each light grey element included, with row index i and column index j, either its row or its column or both need to be present in **x**. That is codified by a constraints matrix **A** in which each row corresponds to a light grey cell, and in such a row the only non-zero elements are 1's for the columns corresponding to

This problem is to be solved in the domain of binary vectors, and is guaranteed to be feasible, since all constraints are satisfied by **x** = **b**. Solution by standard methods typically yields small sets of selected metabolites.
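The structure of this optimisation can be made concrete with a small sketch. It assumes the constraints take the form **A** **x** ≥ 1 row by row, and it replaces a real ILP solver by exhaustive search over binary vectors in order of increasing size, which is only practical for tiny examples. The helper names and the list of light grey cells are hypothetical.

```python
from itertools import combinations

def constraint_matrix(n_metabolites, light_cells):
    """One row per light grey cell, with 1's in the columns of its row and
    column metabolites (a single 1 for a diagonal cell)."""
    rows = []
    for i, j in light_cells:
        row = [0] * n_metabolites
        row[i] = 1
        row[j] = 1
        rows.append(row)
    return rows

def is_feasible(A, x):
    """True if every constraint row has at least one selected metabolite."""
    return all(sum(a * xi for a, xi in zip(row, x)) >= 1 for row in A)

def smallest_selection(n_metabolites, light_cells):
    """Brute-force stand-in for an ILP solver: minimise the number of 1's in x."""
    A = constraint_matrix(n_metabolites, light_cells)
    for size in range(n_metabolites + 1):
        for chosen in combinations(range(n_metabolites), size):
            x = [1 if k in chosen else 0 for k in range(n_metabolites)]
            if is_feasible(A, x):
                return x
    # Not reached: the all-ones vector always satisfies the constraints.
    return [1] * n_metabolites

# Hypothetical light grey cells in a matrix over 4 internal metabolites.
light_cells = [(0, 2), (1, 3), (3, 1), (3, 2)]
x_opt = smallest_selection(4, light_cells)
```

For realistic matrix sizes the same constraint matrix would be handed to a standard ILP solver rather than enumerated.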

A quantitative measure of overall splitting effectiveness

The goal of subnet splitting is to reduce the complexity of interpretation (mentally or by further computation) by reducing the size of networks that need to be considered. It is shown here that a robust quantitative measure of how effective a particular splitting procedure is in achieving this goal can be developed under quite general assumptions.

The original network constitutes the obvious lower limit of simplification. In the opposite extreme where the network is fragmented into subnets consisting of a single node each, no overall simplification has been achieved either: while the subnets are simple, their interconnections reconstitute exactly the original network. This suggests that to judge overall effectiveness, the subnets should be considered together with a "metanetwork" which is derived from the original network by contracting the internal nodes of each subnetwork to a single meta-node. It should then be possible to construct a measure that evaluates to zero at the two extremes, and reaches at least one maximum at a suitable intermediate network partitioning.
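The metanetwork construction and its two extremes can be sketched as follows. This is a simplified illustration: every node is assumed to be assigned to exactly one subnet (separator nodes and their external copies are ignored), and the edge list and subnet labels are hypothetical.

```python
def contract_to_metanetwork(edges, subnet_of):
    """Contract each subnet to one meta-node.

    edges: iterable of (source, target) node pairs.
    subnet_of: dict mapping every node to its subnet label.
    Returns the set of meta-nodes and the set of links between them;
    links inside a subnet disappear in the contraction.
    """
    meta_nodes = set(subnet_of.values())
    meta_edges = set()
    for source, target in edges:
        a, b = subnet_of[source], subnet_of[target]
        if a != b:
            meta_edges.add((a, b))
    return meta_nodes, meta_edges

# Hypothetical five-node network.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "C")]
nodes = ["A", "B", "C", "D", "E"]

# Extreme 1: no partitioning; the metanetwork is a single node.
whole = contract_to_metanetwork(edges, {n: "all" for n in nodes})

# Extreme 2: one node per subnet; the metanetwork reproduces the original.
singletons = contract_to_metanetwork(edges, {n: n for n in nodes})

# Intermediate case: two subnets joined by a single meta-link.
two_parts = contract_to_metanetwork(
    edges, {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2})
```

In both extremes one of the two parts (the subnets or the metanetwork) is as large as the original network, which is why a useful measure should vanish there.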

To quantify the concept of simplification, it is useful to introduce a monotonically increasing function _{i }

and this is subject to the constraint

The term "effort" is used to emphasize that this is not about network complexity as such. Many sophisticated measures of network or graph complexity have been defined by various authors, and network size usually does not play an important part in this - for example, both a square lattice and a fully connected network are conceptually simple, irrespective of size. Also, biochemical networks are known to be scale-free (having a power law distribution of node connectivity) and so complexity measures should give a similar value when applied to the full network and its subnetworks.

For a given _{i }=

Moreover, setting _{i }= _{i }

When _{u}(_{l}
_{u}. This leads to defining a performance measure (designated as the

This may be interpreted as the percentage of the distance between the upper and lower limits that has been achieved by a given partitioning, as measured on a logarithmic plot. Use of logarithmic scaling is not conceptually essential but helps to smooth the distribution of efficacy values when

It is easily checked that _{i }=

To get concrete values, a power law assumption produces the required concave up behaviour while still allowing the actual rate of increase to be adjusted:

A value

The efficacy curves calculated from equations (1.12) and (1.13) for equal-sized subnets, at subnet counts


**Efficacy curves**. Efficacy % as function of subnet count

The main significance of the

This formula is merely a calibration of the efficacy scale and has no fundamental significance. The results below illustrate its effects.

It is finally noted that the efficacy measure is constructed quite independently of the Netsplitter method; as its only required input is a list of subnet sizes, it can equally well be applied to diverse partitioning algorithms.

Results

The results obtained from the Netsplitter procedure are illustrated by considering the problem of investigating the flavonoid metabolism of the model plant

**Genome scale Arabidopsis model**. Specification of the network model extracted from Aracyc 4.5 and used for demonstration in the Methods section, as an SBML file.


For comparison, Figure


**Matrix visualisation of simple connection degree network partitioning**. DAG matrix for genome-scale network of

For a large threshold value of 20, Figure

Figure


**Stages in partitioning the network by the netsplitter procedure**.

**External Metabolites**. Listing of default external metabolites, specified as Biocyc compound ID's.


The reincorporation step is important to keep the number of externals as low as possible. For example, in the full

It is also instructive to see the action of the netsplitter procedure in an explicit network diagram. The actual layout of the flavonoid demonstration network, for which stages in the procedure were traced out in matrix form in Figure


**Example flavonoid network split into four subnetworks**. Simplified layout omitting commodity and currency metabolites, to show partitioning into four subnetworks plus two orphan fragments (six parts in all) by converting the two separator metabolites identified by Netsplitter from internal to external. Reactions are shown as arrows or small circles. Metabolites are shown as rectangles or ovals, colour coded as follows: white - external; yellow, green, blue, purple - subnetwork internals; red - separation nodes; light blue - orphan metabolite nodes. The reaction indicated by "X" is eliminated from the network because after conversion it only involves external metabolites. A fully labelled version of this figure is available in Additional File

**Demonstration network layout**. The network layout shown in Figure


The algorithm identifies two separation nodes in this case - the metabolites trans-cinnamate and coumaroyl-CoA (shown in red); cutting these, the network falls apart into four natural subnets, plus two small fragments or "orphans". By inspection of the metabolite names (not shown in the figure) the subnets can be identified as synthesis of flavonoids (purple), lignin precursors (green), benzenoids (blue) and coumarin (yellow). While in this relatively small network it may have been possible (although not easy) to identify these separators by inspection, it should be borne in mind that much of the work to group nodes coherently has already been done in the manual construction of the two-dimensional layout displayed. In a realistic example, the input to the algorithm is merely arbitrarily ordered lists of metabolites and reactions, making the task much harder.

To place the results in a more general perspective, Table

Efficacy values

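| | **Flavonoid** | **M. pneumon** | **Arabidopsis** | **M. musculus** |
|---|---|---|---|---|
| S-matrix size | 137 × 117 | 189 × 229 | 1468 × 1348 | 2016 × 2158 |
| p-value | 2.1 | 2.8 | 7.2 | 7.6 |
| Threshold 20 | 0 | 33 | 18 | 25 |
| Threshold 10 | 19 | 33 | 24 | 32 |
| Threshold 8 | 19 | 33 | 27 | 37 |
| Threshold 6 | 31 | 37 | 49 | **41** |
| Threshold 5 | **80** | 45 | **54** | 33 |
| Threshold 4 | 76 | **60** | 43 | 27 |
| Threshold 3 | 29 | 32 | 32 | 21 |
| Netsplitter | 88 | 85 | 70 | 48 |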

Values are shown as percentages, and peak values highlighted in bold. The p-value increases with network size as described in the text.

Considering first the connectivity based splitting, Table

In judging efficacy percentages, its increased sensitivity near the optimum as illustrated in Figure

While a single numerical score can hardly be expected to capture all the varied considerations (some subjective) of what constitutes the best partitioning, the more detailed graphical representation in Figure


**Stacked bar chart representation of network splits**. Subnet sizes for different partitionings of genome-scale networks of three organisms. Yellow bars represent connectivity partitioning with the indicated cutoff values C, magenta bars the Netsplitter partitioning, followed by a cyan reference bar that shows the theoretical maximal efficacy partitioning of the original network. In each bar, each subnet is represented by a segment with height proportional to subnet size, and subnets have been sorted in order of increasing size towards the top. Fragmentation is indicated by dense stacking at the bottom. Decreasing the cutoff to split the monolithic bar at the top increases fragmentation, but Netsplitter improves both aspects.

In that figure, each bar segment corresponds to a subnet and the total height of each bar represents the total number of internal metabolites for that partitioning. Thus the height difference from the reference bar on the right, indicates the total number of internal metabolite mass balance constraints that have been sacrificed to achieve a particular split. The reference bar also shows the theoretical maximal efficacy

Figure

In the case of

A rather similar situation is shown in Figure

In all cases, the efficacy score based on equation (1.14) accords quite well with observations from the more detailed graphical display.

The analysis above of the performance of the Netsplitter algorithm for the larger networks shows that efficacy declines with network size. This decline is not due to reduced efficiency in splitting, but rather to fragmentation becoming an increasing problem as networks grow. A direct way to address that is to introduce controlled merging of subnets, which will be explored further in a subsequent article.

Discussion

Previous work

A general counterargument is that removal of constraints cannot reduce the number of solutions to a problem. More specifically, consider for example a single mode that in the full network traverses two of the subsequent subnets. When the subnets are separated by reclassifying the metabolite node on their interface as external, the mode is correspondingly split into two parts. Since the original mode satisfied all constraints set by mass balances of internal compounds along its path, the two parts must separately continue to be viable. One part will now belong to the first subnet and terminate at the boundary node, which has become an unconstrained external sink and cannot affect its viability. The other will start at the corresponding unconstrained boundary source node in the second subnet and similarly remain viable, because by construction the network context of internal metabolite nodes in each subnet is identical to that in the full network.

A direct demonstration of this is obtained from a comparison of the null space of the internal stoichiometry matrix in the full network and in the subnetworks. As the flux vector of a mode lies in the null space, a reaction can only be active (i.e., participate in any of the modes) if there is a non-zero entry in at least one basis vector of the null space. For example, in the flavonoid demonstration network shown in Figure

Performing this null space analysis for genome scale networks such as those shown in Figure

The reduction of the flux space is another perspective on the desirability of keeping the set of external metabolites as small as possible, as is implemented in Netsplitter. Nevertheless, it is observed that the reactions eliminated are (like the one in Figure

Computing efficiency has been taken into account in several aspects of the Netsplitter procedure. Performing the main computation on a metabolites only simple graph rather than the bipartite representation, reduces matrix dimensions roughly by a factor of two, since the numbers of metabolites and reactions are usually similar. As about half of the metabolites are typically external, including only internals gives a further dimension reduction by a factor of about two. Focussing on the (

This is borne out by moderate computing times. The total running time observed for the demonstration network of 117 reactions × 137 metabolites is 1.25 seconds, while for a genome scale network of 2037 reactions and 2179 metabolites this increased to 59 seconds, on a Core 2 Duo PC with 4 Gb of memory running at 2.66 GHz. These values appear quite acceptable and indicate that the algorithm scales better than quadratic with network dimension. It may be possible to achieve better performance by rewriting the code in a compiled language, but as extensive use is made of sophisticated graph theory and user interface functions built into

The procedure as presented is quite elaborate and requires considerable programming for its implementation. To facilitate its practical use, a software application "Netsplitter" has been developed as a

An intriguing observation made in applying Netsplitter, is the radical change in resolving network structure that results from excluding high connectivity metabolites. An example is seen by comparing Figure

It is surmised that the reason for this behaviour can be understood from percolation theory _{c}. Long range paths that penetrate the entire network only exist for occupancies greater than _{c }. Values _{c }for a variety of lattices have been derived mathematically or numerically; for a simple infinite 2-dimensional square lattice this is 0.5 and typical values range between 0.3 and 0.6, including some non-regular or randomized lattices, while lower values are obtained in higher dimensions. The values also depend strongly on the coordination number

where a simple sum over all matrix elements is taken, and N is the total number of nodes. Applying this to the example network, it is found that the removal of the four high connectivity metabolites produces a sharp drop in the value of the occupancy from a value of 0.630 for the network in Figure

Based on this understanding, an automated strategy could be pursued to progressively reclassify the highest connectivity internal metabolites as external until there is a sharp drop in the

Regarding the proposed efficacy measure, it was indicated above that it relies mainly on a general framework and even where a particular functional form such as the power law in equations (1.13) and (1.14) was postulated, its parameters merely readjust relative sensitivities to detailed features of the partitioning. This was further tested by experimenting with different functional forms such as an exponential dependence instead of a power law, taking

The efficacy score measures the degree of simplification achieved by a given network partitioning. As shown in its derivation, this is mathematically maximised for equal sized subnets. That does not mean that equal sized subnets are the ideal partitioning outcome; simplification is not the only criterion by which to judge success. Clearly there would be no special functional or biological relevance to an equal sized partitioning. On the other hand, the low efficacy opposite extreme (towards which simple degree-based partitioning tends for large networks) of a large monolithic block and small fragments, or even complete fragmentation, is also functionally meaningless. As the

Conclusions

The modularization of a large, complex biochemical network into subnets that can be associated with recognisable biological functions, can be helpful both in the conceptual understanding and interpretation of the network, and to reduce practical problems that arise in the application of analysis methods such as constraint-based modelling. The challenge in constructing an algorithm for this task is to accommodate both the objective structural properties of the network, and more subjective requirements such as the desire for a manageable number of subnets of roughly similar sizes. Also, while it is inevitable that some information will be lost when a subnet is isolated from its larger context, it is desirable to restrict this loss to information that is not subjectively relevant for a particular study.

In the procedure proposed here, dealing with the information aspect is facilitated by selecting metabolite node cutting as the partitioning operation, since this pinpoints the nature of the information loss as removal of a mass balance constraint. The subjective requirements are then met by allowing the flexibility to veto the selection of particular nodes to be cut, or to terminate partitioning at a suitable subnet size. Both local and long range network structure are taken into account by the use of random walks and clustering strategies, and finally information loss is minimised by using optimisation techniques in selecting candidate separation nodes and by explicit reincorporation of nodes not essential for the separation.

The combination of these strategies succeeds in moderating the extremes of the subnet size distribution that results from partitioning based simply on connectivity degree.

This point is illustrated by considering Figure

The efficacy measure

At present, it seems that the most promising applications of subnet splitting would be to studies and interpretation of network structure, such as those based on elementary mode analysis, rather than for the more quantitative FBA. In this context, subnetworks can play an important role in reducing the often very large number of elementary modes in a large network. The use of subnets for FBA would similarly simplify the problem and allow the elimination of extraneous detail not relevant for study of a particular aspect of metabolism. However, the obstacle that arises is that it would usually be more difficult to fix the boundary conditions (i.e. flux values for metabolite exchange with the environment) for a subnet than for the full network. At least for a single cell organism, full network boundary fluxes reflect overall nutrient uptake or waste elimination rates that are relatively easy to measure. Externals of a subnet are likely to include metabolites shared with another subnet, and measuring the associated fluxes may require much more detailed metabolic measurements. In special cases, such as when the subnet is spatially localised, e.g. to a particular cellular organelle, this might present less of a problem.

A by-product of the matrix-oriented approach used by the Netsplitter algorithm is the visually powerful display of network structure. Even for large networks for which a network layout diagram is totally unintelligible, features of network connectivity can be recognised at a glance from the colourscale plot of the truncated DAG matrix.

Even more striking is the characterisation of fully and partially resolved subnetworks afforded by grayscale plots of the blocking matrices. The blocking transformation that was introduced as the basis for computational recognition and optimisation of blocks and their overlaps, serves this second purpose to visualise rather subtle structural network properties. Quite apart from the purpose to separate subnets, this visualisation should be a useful tool e.g. to explore the structure of large networks or to compare how related networks differ from one another.

Competing interests

The author declares that they have no competing interests.

Acknowledgements

The author gratefully acknowledges the hospitality of the University of Queensland, Brisbane where part of this work was completed.