Abstract
Background
Metabolic reconstructions contain detailed information about metabolic enzymes and their reactants and products. These networks can be used to infer functional associations between metabolic enzymes. Many methods are based on the number of metabolites shared by two enzymes, or the shortest path between two enzymes. Metabolite sharing can miss associations between nonconsecutive enzymes in a serial pathway, and shortestpath algorithms are sensitive to highdegree metabolites such as water and ATP that create connections between enzymes with little functional similarity.
Results
We present new, fast methods to infer functional associations in metabolic networks. A local method, the degreecorrected Poisson score, is based only on the metabolites shared by two enzymes, but uses the known metabolite degree distribution. A global method, based on graph diffusion kernels, predicts associations between enzymes that do not share metabolites. Both methods are robust to highdegree metabolites. They outperform previous methods in predicting shared Gene Ontology (GO) annotations and in predicting experimentally observed synthetic lethal genetic interactions. Including cellular compartment information improves GO annotation predictions but degrades synthetic lethal interaction prediction. These new methods perform nearly as well as computationally demanding methods based on flux balance analysis.
Conclusions
We present fast, accurate methods to predict functional associations from metabolic networks. Biological significance is demonstrated by identifying enzymes whose strong metabolic correlations are missed by conventional annotations in GO, most often enzymes involved in transport vs. synthesis of the same metabolite or other enzyme pairs that share a metabolite but are separated by conventional pathway boundaries. More generally, the methods described here may be valuable for analyzing other types of networks with longtailed degree distributions and highdegree hubs.
Background
High quality metabolic reconstructions are available for many organisms and provide a rich scaffold for interpreting data from highthroughput biological experiments. The topology of a metabolic network, defined by connections between enzymes and metabolites, can be used to predict genetic interactions, transcriptional correlations and disease comorbidity [13].
Previous studies have used the topology of the metabolic network to predict coexpression of transcripts for yeast metabolic enzymes [4]. This study first removed highdegree metabolites from the bipartite metabolic network, generated an enzymeonly network by connecting enzymes that shared at least 1 remaining metabolite, and calculated the shortestpath distance between all pairs of enzymes. Shorter distances were correlated with stronger coexpression. Similar procedures, also excluding highdegree metabolites from consideration, were used recently in a study linking diseases to metabolic enzymes [3].
Methods that involve calculation of optimal fluxes subject to constraints, such as flux coupling [5], have performed better than local topological metrics based on shared neighbours in predicting transcript coexpression. Flux coupling methods are much more computationally expensive than topological analysis, however. Furthermore, flux coupling methods suffer from the disadvantage that reactions with small flux values (and hence the enzymes involved in those reactions) are typically removed from the network. This is a problem if an enzyme of interest is removed from the network based on low reaction flux.
Our goal is to provide improved topological measures for enzyme functional associations from metabolic networks without the need for expensive calculations of optimal fluxes or sampling over feasible flux space. The motivation of our approach is that methods that count shared metabolites, or methods that generate a pvalue for shared metabolites based on a hypergeometric distribution, essentially assume a flat degree distribution for metabolites. Highdegree metabolites violate the assumption of a flat degree distribution, and the hypergeometric distribution is inappropriate for calculating pvalues for metabolite sharing. Randomization methods based on rewiring, which maintain the observed degree distribution, are robust to highdegree metabolites but unfortunately are computationally expensive.
In this work, we provide a series of scores that are the Bayesian equivalent of the hypergeometric distribution, but adjusted for the known metabolite degree distributions. These scores are fast to calculate, essentially no more expensive than a hypergeometric pvalue, and much faster than any methods that require rewiring permutations, flux sampling, or flux optimization. Results from applying these methods to metabolic networks in yeast demonstrate performance better than previous methods based on local connectedness. The results also reveal functional associations that are not captured by conventional metabolic pathway definitions, but which are inherent in the network structure.
Results
Overview
Metabolic networks can be represented as bipartite graphs with edges between enzymes and metabolites. An enzyme can use a metabolite in multiple unique reactions involving distinct subsets of other metabolites, and the number of unique reactions defines an integervalued edge weight.
A graphical overview shows how metabolic network information is incorporated to yield increasingly sophisticated models (Figure 1). The models discussed here are all designed to rank pairs of enzymes for functional association. Methods termed "Local" are capable only of producing rankings for enzymes directly connected by at least one metabolite. The raw number of shared metabolites [3,6] and the hypergeometric distribution that corrects the shared count for the enzyme degree (Figure 1A) are both local methods.
Figure 1. Methods overview. A bipartite graph representation of a metabolic network connects enzymes (squares) to metabolites (circles). (A) The number of metabolites shared by two enzymes can be used to rank the functional association of the pair. This raw count can be corrected to account for enzyme degree, using a hypergeometric distribution under a null hypothesis. (B) Hub metabolites, such as M1, can be down weighted when assessing shared neighbours. One approach is to remove these highdegree metabolites from the network. An alternative thresholdfree approach is to incorporate the metabolite degree distribution, as carried out here with a Poisson distribution. (C) One failing of shared neighbour scores is that they are inherently local, providing information only about enzymes that share at least one metabolite. Global methods provide information about all enzyme pairs, no matter how far separated. An improved approach to shortest path between two enzymes is to count paths of all length, with decreasing weight given to longer paths. This method is termed a graph diffusion kernel, with a single parameter used to determine the discount rate for longer paths. (D) Edges in metabolic networks correspond to metabolic fluxes, which have directions and are constrained by maximum capacities and reaction stoichiometry. Functionally related enzymes may have correlated fluxes. The flux correlations can be calculated by repeated sampling, which is computationally intensive but provides a more realistic model for the metabolic network than simpler topological models.
In this work, we introduce a more sophisticated local method that also corrects for metabolite degree, discounting the contribution of highly connected metabolites like water, protons, and ATP (Figure 1B). Our new local methods are motivated by Bayesian model selection using the loglikelihood ratio of a null model (random connectivity between enzymes and metabolites) to an alternative model. The number of shared metabolites is modelled as a Poisson distribution for both the null and alternative models. For the alternative model, the Poisson parameter is the maximum likelihood estimate for the observed network, which is the observed number of shared metabolites. For the null model, the Poisson parameter is estimated from a random network model (virtually identical to the leading contribution to the hypergeometric distribution). We present results for an improved Poisson model that uses knowledge of the observed metabolite degree distribution.
Methods termed "Global" are capable of generating rankings for enzymes that are not directly connected by metabolites by using the full network topology. Examples are shortest paths and the more robust graph diffusion kernel (GDK) (Figure 1C). GDKs are nonlocal in that they sample over all paths between two enzymes, rather than just the shortest paths defined by shared metabolites. GDKs have been successfully applied to functional inference in metabolic networks [7]. Recently parityspecific kernels been used to analyze genetic interaction networks [8]. We also used a method based on the Pearson correlation of the weighted metaboliteenzyme edge connectivity structure between two enzymes.
Yet more elaborate flux balance analysis methods sample flux states that are feasible under steadystate constraints. Flux correlations can then be used to rank enzyme pairs for functional associations (Figure 1D). Other flux balance methods have generated functional associations by predicting synergistic or buffering epistatic interactions for deleting pairs of enzymes from the network.
Performance of local methods
Performance is assessed primarily by the ability to predict synthetic lethal genetic interactions between metabolic enzymes, and secondarily by the ability to identify classes of enzymes with similar Gene Ontology (GO) annotations. The synthetic lethal interactions provide a direct link to testable experiments. The database annotations are not necessarily testable, but instead show whether inference from computational models is consistent with known biology.
We generated rank ordered lists of enzyme pairs based on the local methods. Performance was assessed from the receiver operating characteristic (ROC) curve using the area under the curve (AUC), and from the precisionrecall (PR) curve using the maximum Fscore, the harmonic mean of precision and recall. Known positives were taken from experimentally reported synthetic lethal/growth defect interactions recorded in the BioGRID database. There were 170 growth defect/lethal interactions in which both genes involved were part of the metabolic network model. Known negatives were defined as gene pairs where each gene has at least one synthetic lethal interaction, and one of the two has at least 5 synthetic lethal interactions as a query in a highthroughput screen, to exclude pairs that might not have been tested experimentally.
Performance metrics for the local methods in predicting synthetic lethal genetic interactions are the AUC and Fscore (Figure 2). Raw counts of shared metabolites perform the worst. Thresholding the network to remove highdegree metabolites introduces changes in the performance only slightly, and results remain inferior to global methods described below (Figure 3, see [Additional file 1] for details of metabolites removed). The methods that account for the degree of an enzyme, but assume a flat metabolite degree distribution, improve on the raw counts. The new method, which accounts for the metabolite degree as well, performs at least as well as previous methods, and the precision degrades the most slowly of all methods at higher recall.
Figure 2. Performance comparison, local methods. Methods are listed, from bottom to top, roughly in order of improving performance in predicting synthetic lethal genetic interactions. Shared metabolites indicates raw counts of shared metabolites, and "no hubs" indicates that the most highly connected metabolites were removed before calculating sharing. The Poisson score, hypergeometric pvalue, and Bayes score are based on the number of shared metabolites, but do not use the full metabolite degree distribution. The method "Poisson score (met. deg.)" uses the metabolite degree distribution for an additional correction. Performance is assessed using synthetic lethal gene pairs as known positives, and nonsynthetic lethal pairs as known negatives. (A) Receiver operator characteristic (ROC) curve. (B) Precisionrecall (PR) curve.
Figure 3. Performance of raw neighbour count with varying threshold for removing highdegree metabolites. The degree of each metabolite (the number of reactions it participates in as reactant or product) was calculated, and metabolites were progressively removed. After each removal, the performance of the raw neighbour count for predicting synthetic lethal interactions was assessed by the area under the receiver operating characteristic curve (AUC, panel A) and by the FScore, defined the maximum harmonic mean of precision and recall along the precisionrecall curve (FScore, panel B). The results in the main text used a nominal threshold of the top 5% of all metabolites, corresponding to 53 metabolites with degree ≥13 of the entire set of 1061 metabolites. The results using a graph diffusion kernel (GDK) are shown to be superior to the raw neighbour regardless of the threshold used.
Additional file 1. Performance of raw neighbor count. The data from Figure 3 are replotted with the addition of the name, cellular localization (c = cytoplasm, m = mitochondrion, n = nucleus, x = extracellular), and degree of each metabolites at the position it is removed.
Format: PDF Size: 27KB Download file
This file can be viewed with: Adobe Acrobat Reader
Performance of global topology methods
We then compared the performance of the best local method (Poisson score using metabolite degree distribution) with methods that use global topology and incorporate local edge weights (Figure 4). Global topology is incorporated using a graph diffusion kernel that sums paths of all lengths between enzyme pairs, rather than just the number of shortest paths involving shared metabolites. Local weights for enzymemetabolite edges are taken from the integer number of unique reactions involving each enzymemetabolite pair; these weighted edges were then used to construct a graph diffusion kernel.
Figure 4. Performance comparison, local vs. global methods. The best local method, the Poisson score incorporating metabolite degree, is compared with global methods based on graph diffusion with and without edge weights accounting for multiple reactions for each enzyme. The global methods perform better, and weighted and unweighted diffusion kernels have equivalent performance. (A) ROC curve. (B) PR curve.
Overall, incorporating global information through the graph diffusion kernel improves the performance in identifying synthetic lethal pairs. Adding edge weights to the graph diffusion kernel does not appear to improve performance. Further comparisons with the graph diffusion kernel use the unweighted model only, as it is simpler.
Global topology with metabolic constraints
We next considered possible improvements that use the knowledge that the network edges represent a flux balance model for metabolism. While others have investigated models that investigate the robustness of metabolism to pairwise gene deletions [9], correlations of fluxes through enzymes provide improved predictions of genetic interactions [2]. We therefore used the flux sampling approach to calculate enzyme correlations, whose absolute values were used to rank enzyme pairs. The flux sampling excludes reactions with negligible flux, which reduces the model to 477 metabolites, 582 reactions and 469 enzymes and reduces the known positive pairs to 69 genetic interactions.
Comparison of the flux sampling method to the best local and global topology methods considers only the enzymes present in the reduced model (Figure 5). The flux sampling method has the best precision in the highrecall region. Enzyme pairs that are used in exactly the same reactions, or which are coupled in an obligate serial pathway, have a flux correlation of 1. These pairs are all ranked identically by the flux sampling method, but may be ranked in a definite order by the local and global topology methods. In the lowrecall region, consequently, the local and global methods appear to provide improved precision over the flux sampling approach.
Figure 5. Performance comparison, incorporating metabolic constraints. The best local and global methods are compared with methods that use the knowledge that the network models steadystate metabolism. These methods predict synthetic lethal interactions based on decreases to a fitness function ("Scaled epistasis", ref. [10]) and based on correlations of steadystate fluxes sampled over feasible space ("Flux corrln.", ref. [2]). (A) ROC curve. (B) PR curve.
The epistatic estimates, obtained from previous work [9], do not perform as well in predicting synthetic lethality.
Compartmentalized metabolism
Reactions in the metabolic network are localized to specific compartments (cytoplasm, mitochondria, extracellular, peroxisome, nucleus, Golgi apparatus, endoplasmic reticulum and vacuole). Of the total 1266 metabolic reactions in the model, 47% are entirely contained in the cytosol, roughly 10% are either mitochondrial or extracellular, and 24% couple metabolites from different compartments (Table 1). If one metabolite appears in two different compartments, it is represented as two unique metabolites; enzymes in different compartments that process this metabolite are not scored as sharing it, reducing the ability to detect functional associations for enzymes in different compartments. On the other hand, functional association as measured by synthetic lethality is roughly 3× higher for enzyme pairs that share at least one compartment (Table 2, twosided pvalue = 3 × 10^{6}), and removing compartment information neglects this information.
Table 1. Compartment distribution of reactions.
Table 2. Synthetic lethality and shared compartments.
To test these competing possibilities, we applied the local and global methods to a network in which compartment information was removed. The yeast metabolic network we used in this study specifies compartments for enzymes and metabolites [10]. We reasoned that identical reactions occurring in different compartments can functionally compensate for each other due to diffusion or transport of metabolites across compartment boundaries. We therefore generated a simplified network that ignores the cellular compartments of the metabolites. Removing the compartments reduced the number of metabolites from 1061 to 646.
The general result for both the local method (Poisson score incorporating metabolite degree) and the global method (graph diffusion kernel with an unweighted network) is that removing compartments improves the precision at high recall at the expense of worse precision at low recall (Figure 6).
Figure 6. Performance comparison, compartments. Local and global methods are tested for the original metabolic network, and for a reduced network in which metabolite compartment information has been removed. (A) ROC curve. (B) PR curve.
Assessment based on database annotations
We used published methods to assess the ability of different ranking methods to identify enzyme pairs with similar database annotations [11]. This assessment tests the significance of the hypothesis that the average coupling score between all pairs of genes associated with a GO term is higher than between pairs of genes associated with different GO terms. All 835 GO terms mapping to at least 5 and less than 100 enzymes in the network were considered, comprising 538 biological process (BP), 209 molecular function (MF), and 88 cellular compartment (CC). If the null is rejected by the BenjaminiHochberg procedure at a starting pvalue of 0.05/(number of GO terms tested) [12], then the GO term is termed consistent. The fraction of consistent GO terms was computed by this procedure for each method. The assessment was performed for the complete compartmentbased network, the reduced compartmentbased network with enzymes with negligible fluxes removed, and the network with compartments removed. The category size of 5100 matches the original publication. Assessment with category sizes of 2100 and 2200 yield similar results, but with more categories overall [Additional file 2].
Additional file 2. Consistent GO terms and category size. GO consistency analysis results are provided for category sizes of 2100 and 2200, compared with 5100 in the main text.
Format: PDF Size: 58KB Download file
This file can be viewed with: Adobe Acrobat Reader
Of all the methods, the graph diffusion kernel clearly performs best overall in producing consistent GO annotations (Table 3). The performance of the methods generally tracks the assessment based on synthetic lethal interaction prediction. Of the local methods, the raw count of shared metabolites performs worst; better are the improved methods that account for enzyme degree; and best of all is the new method that accounts for metabolite degree as well. The graph diffusion method also outperforms the flux sampling method.
Table 3. Performance summary statistic (AUC and Fscore) and percentage consistent GO terms.
We investigated whether the models with or without compartments perform better using a binomial test under the null hypothesis that each of the two models has equal probability of performing better for a particular method. Tests were performed separately for AUC/ROC, PR/Fscore, and the three GO categories, and separately for the full and reduced network. In all cases except GO/Cellular Compartment, twotailed tests showed no significant deviations from the null hypothesis at p = 0.05. The test for Cellular Compartment is significant (p = 0.0396), with improved consistency for models that include compartment information.
Flux correlation performs best in the Biological Process category. A surprising result here is that overall flux correlation performs worse than GDK and Poisson (metabolite degree, with or without metabolite associations) with only 70% consistent GO terms. It is possible the flux correlation method could be improved with longer runs that reduce the statistical noise in the correlations, although the flux calculations already require over 100× more computer time than any of the other methods (see Computational cost below).
The different methods generate a similar set of consistent GO term, and terms missed by the graph diffusion kernel are almost always missed by the other methods as well (Figure 7). For cellular compartment, most annotation terms (72%) are consistent by all three methods. In the case of biological process and molecular function only 50% and 71% of all the annotations are captured by all the three methods.
Figure 7. Consistency of Gene Ontology annotations. Venn diagrams indicate the number of consistent Gene Ontology (GO) terms identified for Biological Process, Molecular Function, and Cellular Compartment annotations. GO terms that are inconsistent in all three methods are shown as 'GO terms missed'.
Discussion
Performance of local, global, and fluxbased methods
We have investigated the performance of three classes of methods for predicting functional associations in metabolic networks: (i) local methods, based primarily on the metabolites shared by two metabolic enzymes; (ii) global methods, based on the probability that a random walk started at one enzyme will visit a second enzyme; (iii) fluxbased methods that use flux balance to identify enzymes with correlated fluxes. The local and global methods are fast and generally applicable to other types of networks, whereas the fluxbased methods are computationally expensive and dedicated to metabolism (or other networks that have similar conservationofmass constraints). In terms of performance in predicting functional associations, however, the dedicated fluxbased methods have typically been superior. Developing fast methods with performance similar to expensive fluxbased methods has been a challenge.
Previous local and global methods have had difficulties with highdegree metabolites. For local methods, metabolites such as water and ATP are often shared by enzymes with very different functions. For global methods, these metabolites introduce many short paths through the network. Often, highdegree metabolites are removed from a network prior to analysis. This approach is undesirable because it introduces an ad hoc tuning parameter, which can lead to overfitting, and it excludes potentially interesting metabolites from the analysis.
The hypothesis motivating this work is that the difficulties from highdegree metabolites arise from an implicit assumption of a narrow metabolite degree distribution, as opposed to the known longtailed degree distribution. The hypergeometric distribution for shared metabolites corrects for enzyme degree, but not for metabolite degree. Consequently a highdegree metabolite such as water is given the same weight as a lowdegree metabolite when counting shared metabolites. Intuitively, highdegree metabolites should be downweighted. Our improved local method uses the known enzyme and metabolite degrees to generate a degreecorrected score with excellent performance.
The global method we examine, a graph diffusion kernel on the bipartite enzymemetabolite network, also includes a degree normalization that downweights the contribution of highdegree metabolites (and highdegree enzymes). This method is somewhat more expensive than the local methods, requiring a full matrix inverse rather than sparse matrix multiplication.
The graph diffusion kernel explores the topology of the metabolic network using random walks that visit metabolites and enzymes. Enzymemetabolite edges are treated as undirected, permitting random walkers to traverse both directions even for a unidirectional reaction. There are no constraints on the flux of random walkers through any enzyme, and the stoichiometry of a metabolite as a reactant or product is ignored.
Fluxbalance methods go beyond graph diffusion by adding constraints specific to metabolic networks. Enzyme fluxes are coupled by mass balance and reaction stoichiometry, and correlations between enzyme fluxes can propagate through the network. These additional constraints capture more of the biological reality of metabolism than either shared metabolites or graph diffusion. Predictive performance is also better, presumably because of the biological constraints. A curious point is that flux sampling, with a uniform sample over the feasible space, performs better than calculations of epistatic effects based on reductions to an optimized fitness objective function. This may indicate errors in the assumed objective function for cellular fitness. The main drawback of fluxbalance methods is the high computational cost.
In summary, graph diffusion methods have become a method of choice for analyzing many types of networks. While the degreecorrected local methods provide a substantial improvement over previous local methods, the graph diffusion kernel using the entire network topology performs somewhat better. Dedicated fluxsampling methods are slightly better for predicting genetic interactions, but take over 100× longer to calculate and are much more difficult to implement.
Disagreement between networkbased predictions and database annotations
As the graph diffusion kernel (GDK) method provides a good tradeoff between performance (assessed by genetic interaction prediction) and computational efficiency, it is likely to be a method of choice. The average semantic similarity for gene pairs highly ranked by the GDK provides an additional assessment of performance (Figure 8). GDK scores of 10 and more have an average semantic similarity of 4 or greater, corresponding to no more than 116 genes annotated to the parent category. Thus, on average, a high GDK score indicates a similar database annotation.
Figure 8. Cumulative semantic similarity. The graph diffusion kernel was used to calculate scores for gene pairs (compartment model, diffusion parameter γ = 1). For each gene pair, the semantic similarity was also calculated for Biological Process, Cellular Component, and Molecular Function, and the largest of the three values was retained. The cumulative average of this maximum value was then calculated for GDK thresholds of decreasing stringency. (A) The threshold is shown as the GDK score, from least stringent (score = 5 × 10^{6}, the smallest GDK score) to most stringent (617.2, the largest GDK score). (B) The threshold is shown as the rank order, from most stringent (rank = 1) to least stringent (rank = 279,000, the total number of pairs).
These average results, however, do not always hold for individual gene pairs. A crosstabulation of semantic similarity vs. GDK score demonstrates that many gene pairs with high GDK scores nevertheless have essentially no semantic similarity (Figure 9). These GDK predictions of functional association would essentially be scored as falsepositive predictions based on the lack of similarity in database annotations. Because of the overall good performance of the GDK method, we systematically investigated the most extreme cases in an attempt to suggest a reason for the disagreement.
Figure 9. Putative false positive functional associations. A density plot of gene pairs binned by semantic similarity and graph diffusion kernel (GDK) score indicates putative false positive functional associations. The semantic similarity for each gene pair was calculated as the maximum of the values for Biological Process, Cellular Component, and Molecular Function. The average of this value over all gene pairs is 2.52, and gene pairs with semantic similarity below this value have essentially no semantic similarity. Gene pairs with GDK scores of 50 and above have an average semantic similarity of 5.6, corresponding to 23 genes annotated to the parent category. Putative false positives were defined for exploratory purposes as having GDK scores of 50 and above but semantic similarity of 2 or below, corresponding to 850 genes annotated to the parent category. The putative falsepositive region is indicated and contains 101 gene pairs.
We therefore selected examples with high GDK scores and low semantic similarity. A GDK threshold of score ≥50 focused attention on the topranked GDK pairs, where on average the semantic similarity corresponds to only 23 genes annotated to the parent category of a pair. We then set a threshold of semantic similarity ≤2. This value is substantially below the semantic similarity of gene pairs selected at random, and was chosen to yield a number of example that was feasible for casebycase analysis, a data set of 101 pairs denoted putative false positives [Additional file 3].
Additional file 3. Putative false positives. The 101 gene pairs with high network association and low semantic similarity are tabulated and the reasons for disagreement are annotated.
Format: TXT Size: 78KB Download file
Manual inspection suggests that these cases can be classified into 4 main categories: unannotated, transportsynthesis, pathwayboundary, and secondary activity (Table 4). The unannotated category comprises 10% of the cases and indicates that a gene lacks database annotations for Biological Process, Cellular Component, and Metabolic Function, despite being included in a metabolic reconstruction. The secondary activity category, with only 3 cases, is a related discrepancy in which the secondary metabolic activity of an enzyme is present in the metabolic reconstruction but not noted in GO.
Table 4. Categories of gene pairs with high network association and low semantic similarity.
The transportsynthesis category is the largest, with about 65% of the cases. In these examples, the GDK predicts an association between enzymes responsible for synthesis and transport (usually extracellular to intracellular) of the same metabolite. Thus, both enzymes are fulfilling the same role of increasing the intracellular concentration of a metabolite. The structure of the GO annotations does not reflect this close functional association, however. Examples include transport of choline/ethanolime (HNM1CKI1), allantoate (DAL5DAL1), sterols (AUS1ERG27), and uridine (FUI1URK1).
The final category, pathwayboundary, arises when the boundary between two wellaccepted pathways cuts through a metabolite. Enzymes that connect to this metabolite are then annotated to very different pathways, despite a close networklevel association. These cases are responsible for 20% of the total. Examples include associations between enzymes in the TCA cycle and those using TCA metabolites for amino acid synthesis, enzymes with different roles in glycoprotein synthesis, and enzymes responsible for quinone metabolism [Additional file 3].
Conclusions
In analyzing large networks, it has become common to delete highdegree vertices. This practice is questionable. It depends on an arbitrary highdegree cutoff, usually without any clear break in a vertex degree distribution. It can remove vertices that are of interest, and it can introduce unknown biases into the analysis.
Here we have introduced methods that are readily applied to networks with highdegree hubs. Local methods use known degree distributions to correct for highdegree enzymes and metabolites, and global methods use graph diffusion kernels to rank the association between pairs of enzymes. We show that these methods outperform previous methods that eliminate highdegree vertices from the networks. The context is cellular metabolism, where highdegree metabolites like water and ATP are shared by many enzymes. Our methods are able to infer functional associations between enzymes, without being misled by sharing of these highdegree metabolites.
In several cases, enzymes predicted by network analysis to have high functional association have very little similarity in database annotations. Some of these cases are due to a discrepancy between the metabolic reconstruction, which records a reaction for an enzyme, and the annotation database, which lacks information or omits a secondary activity. Two additional patterns were observed, however, which relate to the structure of Gene Ontology hierarchies. First, enzymes that are responsible for synthesis and transport of the same metabolite often have little annotation similarity. Second, conventional pathway definitions may place two enzymes with strong networklevel associations on opposite sides of a pathway boundary.
The methods developed here should be applicable in general to other bipartite networks, particularly those with highdegree hubs.
Methods
Yeast Metabolic network reconstruction
The Yeast metabolic network used in our study was obtained from the database maintained by systems biology group, University of California, San Diego [10,13]. The file "Sc_{}iND750_{}GlcMM.xml" corresponding to the minimal media condition was obtained from http://gcrg.ucsd.edu/Downloads/Cobra_Toolbox webcite. This network has 1061 metabolites, 1266 reactions and 750 genes. The stoichiometry matrix S(m, r) provides the number of metabolites m consumed or produced in reaction r. The reactiongene association matrix E(r, e) in the metabolic network indicates whether reaction r can be catalyzed by enzyme e.
Coupling measures based on metabolic bipartite network
A bipartite network has two disjoint sets of vertices with edges only between vertices of different sets. In the case of the metabolic network, we consider enzymes e and the metabolites m as disjoint vertices in a bipartite graph. We use various metabolic coupling measures between two enzymes in this graph to predict synthetic lethal genetic interactions. Towards this goal, we use both methods from literature (based on shared metabolites [6], shared metabolites after removing high degree metabolites [3,4]) and other methods proposed here.
Shared metabolite count (with and without hubs)
The coupling between metabolites can be calculated using the stoichiometry matrix as [6]. The elements of , denote the participation of a metabolite i in a reaction j with a value 1 and 0 otherwise ( is binary version of the stoichometry matrix, S). This idea extended to the coupling of genes based on the bipartite metabolic network could be represented as , where M^{T }= E, and is the binary version of the matrix M. The element C_{ij }of the matrix C now represents number of metabolites shared between enzyme i and enzyme j.
The metabolite degree is defined as the number of reactions in which a metabolite participates. In previous work, highdegree metabolites have been excluded from metabolite sharing (equivalent to ignoring the rows corresponding to metabolite hubs in the stoichiometry matrix) [3]. In our calculations of shared metabolites, the top 5% of metabolites were excluded (53 metabolites participating in 13 or more reactions).
Bayesian score
We used our previous method based on 2 × 2 contingency table for calculating the metabolites shared between two enzymes [2]. Briefly, this measure is obtained as a log likelihood ratio of alternative to null hypothesis. Under the null, the probability of connection of a metabolite to both enzymes is product of the individual probabilities of connections. Let n, n_{1}, n_{2 }and n_{12 }represent the total number of metabolites in the network, the number connected to enzyme 1, the number connected to enzyme 2, and the number connected to both. The score is then
The combinatorial factor C(n, k) is n!/k!(n  k)!. This score increases when n_{12 }is either larger or smaller than the value n_{1}n_{2}/n expected under the null hypothesis (analogous to a twosided test). For n_{12 }smaller than the null expectation, we used score(n_{12})  score(n_{12})  score(n_{1}n_{2 }/ n) to restrict attention to enrichment.
Hypergeometric pvalue score
The score based on the hypergeometric pvalue for characterizing the metabolic coupling between the enzymes 1 and 2 is
Poisson score
This score is also obtained as the loglikelihood ratio of an alternative to null model for the observed number of shared partners. Both the alternative and the null employ a Poisson distribution with a single parameter λ. For the alternative, λ_{alt }= n_{12}, the observed count; for the null, λ_{null }= n_{1}n_{2}/n_{tot}. The total number of metaboliteenzyme edges is n_{tot}; n_{1 }and n_{2 }are the numbers of metabolites connected to each enzyme; and n_{12 }is the intersection of the metabolites in n_{1 }and n_{2}. The Poisson score is
The absolute value for the second term in Eq. 3 ensures that large scores come from enrichment rather than depletion of shared metabolites.
Poisson score with metabolite degree
The null model in the Poisson score was further improved by considering a different value for λ_{null }using the degree distribution of the metabolites connected to the enzymes. Let k_{i1}, k_{i2},..., be the degree of the metabolites m_{i1}, m_{i2},..., m_{in }connected to enzyme i in the enzyme pair (i = 1,2). The probability of the metabolite m_{11 }to be connected to enzyme 2 is
The average number of metabolites connected to enzyme 1 that are also connected to enzyme 2 is then
The values of and are in general different due to different degree distribution of the metabolites connected to enzymes 1 and 2. The value of λ_{null }used in the improved null model is obtained by arithmetic mean of and . The final expression for the improved Poisson score is
Poisson score with metabolite associations and degree
A further improved Poisson score can be generated using a more elaborate λ_{null}, accounting for the full metabolite degree distribution. The probability of k connections given the status of the metabolites m_{1},..., m_{i }connected to enzyme E, P^{E}(km_{1},..., m_{i}) is
The probability of enzyme connected and not connected to the metabolite m_{i }are estimated as
We take P^{E}(k = 1m_{1},..., m_{i}) = 0, P^{E}(k = 0) = 1 in the calculations and obtain P^{E}(km_{1},..., m_{i}) for the enzymes A and B in the pair (and respectively connected to n_{1 }and n_{2 }metabolites) to obtain the λ_{null},
We use λ_{null }(Eq. 9) to obtain the Poisson score that takes metabolite associations and degree into account,
Results from this more complicated model are included in Table 1, but not discussed in the text as the method is more complicated yet performs no better than simpler Poisson score methods.
Graph diffusion kernel
Graph diffusion kernels are solutions to the steadystate density distribution for continuoustime random walk or diffusive process on a graph with sources and sinks [8]. The adjacency matrix for the calculation of the GDK measure (Eq. 11) has a block structure whose dimension is given by the sum of number of metabolites and enzymes in the yeast metabolic network (i.e. 1061+750 = 1811),
The matrix is the binary version of the matrix M defined as M^{T }= E and captures the direct links from metabolites to enzymes. The degrees of the nodes are summarized by the diagonal matrix D (D_{ii }= ∑_{j }A_{ij }, with A_{ij }the elements of the adjacency matrix, A). The graph diffusion kernel score with normalization for the node degrees is
The parameter γ controls the extent of diffusion, or equivalently the length of the random walks. These lengths are distributed exponentially, with the probability of a dstep walk proportional to e^{γd}. The results shown in this work are for a value of γ = 1. Results were not sensitive to the value of γ, with similar results over a range from 0.5 to 120. The entries in the kernel corresponding to the enzymeenzyme relationships were then extracted to predict genetic interactions. For readability, GDK scores displayed in the figures are multiplied by 10^{4}.
Graph diffusion kernel with weighted edges
We also considered a version of the graph diffusion kernel with weighted edges. Here we used the full version of the matrix M(M^{T }= E) in the adjacency matrix A rather than a binary version as used in the nonweighted case (Eq. 11). The element (M^{T })_{ij }of the matrix M^{T }= E represents the number of times a metabolite i is associated with the enzyme j through various reactions. Then kernel score with weighted edges is obtained using the same procedure as described above. Results are shown for γ = 1 and remained the same for higher γ values.
Scores without compartments
The model obtained from the BIGG database is a fully compartmentalized model with same metabolites localized to different compartments represented separately. Some metabolites may move between compartments freely, others through ionchannels by chemical gradients or through transporters. This may bring the reactions that use the metabolites that diffuse freely in different compartments closer in that they share either the substrates or products. To investigate the effect of this in our analysis, we combined same metabolites localized to different compartments (by adding the rows of the stoichometric matrix). Then we calculated all the metabolic coupling measures (except enzyme flux correlation measure) with the compartmentfree metabolic network. There were 646 metabolites in the compartmentfree network.
Enzyme flux correlation
The performance of various scores considered in this work were compared with enzyme flux correlation score used in our previous study [2]. Briefly, the method is based on feasible reaction fluxes obtained under stoichometric and reaction flux constraints at steady state [14]. The set of feasible reaction fluxes were sampled using a Markov random sampling algorithm under in silico medium similar to YPD [15]. Then reaction fluxes were transformed to enzyme fluxes. The enzyme flux correlation between two enzymes is then obtained by calculating the Pearson correlation coefficient over the various enzyme flux samples. The details of the calculations are available elsewhere [2]. This entire procedure from random sampling to calculating correlations was repeated 3 times with different random seeds, with no evidence of nonergodic sampling among the three runs. Final predictions used absolute value of correlation averaged over three runs. A preliminary step before random sampling removes all blocked reactions which carry no flux. The reduced model used for flux sampling had 477 metabolites, 582 reactions and 469 enzymes. The flux sampling procedure was carried out with the COBRA MATLAB toolbox [16]. The absolute value of the flux correlation is used for ranking.
Scaled epistasis
The scaled epistasis values corresponding to the minimal media were obtained from a previous study [9]. The file "fitness_data_nominal.txt" containing the fitness of insilico single and double gene yeast knockouts were obtained from http://kishony.med.harvard.edu/prism/index.html webcite. For calculating the AUC and Fscore from scaled epistasis score, only enzyme pairs with epistasis score were considered. We did not calculate GO consistency scores for this method because it performed poorly for predicting genetic interactions.
Synthetic lethality data sources
Synthetic lethality data for this study was obtained from the BioGRID database (version 2.0.46) [17]. There were 97 synthetic lethal and 73 synthetic growth defect interactions in the BioGRID database that had both the genes in the yeast metabolic network. There were 39 synthetic lethal and 30 growth deflect interactions in the BioGRID database that had both the genes in the reduced model used in the flux sampling procedure.
Performance metrics
The ROC curves, AUC, Fscore and PR curves were generated in R using the ROCR package [18]. The ROC and PR curves are shown with a downsampling option in the plot set to 5000.
Gene Ontology assessment
We used a procedure proposed in a previous study for validating the functional gene similarity measures [11]. We used gene ontology (GO) annotation terms from all the three categories biological process, cellular compartment and molecular function. A GO term is termed consistent if the average metabolite coupling score between all pairs of metabolic genes associated with the GO term is greater than genes that do not share a same annotation.
The percent of consistent GO terms were calculated for each bipartite coupling measure. We considered only GO terms associated with 5 though 100 genes. For each GO term, the pairwise coupling score between metabolic genes associated with it are averaged. The statistical significance of this averaged score is assessed by random shuffling of gene GO annotation associations, maintaining both the annotation and gene distribution. We calculated an empirical pvalue based on 10,000 iterations for each GO term. These empirical pvalues were corrected for multiple testing of many GO terms to control for false discovery rates [12]. The consistency score was obtained by the proportion of GO terms that were significant with a false discovery rate of 0.05. The GO gene associations of yeast corresponding in the file gene_{}association.sgd was obtained from the Saccharomyces genome database, http://downloads.yeastgenome.org/literature_curation/ webcite.
Semantic Similarity
Semantic similarity was calculated as l_{n}(N_{T}/N_{P }), where the total number of genes N_{T }= 6310 for yeast, and the number of genes annotated to the closest parent category of two genes is N_{P }[19,20]. Semantic similarity values were calculated separately for the three main Gene Ontology hierarchies: Biological Process, Cellular Component, and Metabolic Function. These three values were then summarized by the maximum of the three to identify functional associations inferred from network structure that do not match any known annotation similarity.
Computational cost
The methods proposed in this work are computationally less expensive as compared to the flux correlation based approaches. The computation of flux correlation takes about 15 hours (each sampling run taking about 5 hours). Computing the other scores was performed as a single calculation that required only 9 minutes. The graph diffusion kernel, part of this single calculation, was computed directly as the inverse of the graph Laplacian rather than as a repeated matrix multiplication.
Authors' contributions
BV and JSB conceived the study and wrote the manuscript. BV performed the work. All authors read and approved the final manuscript
Acknowledgements
BV and JSB acknowledge funding from NIH R24 DK082840. JSB acknowledges support from NSF CAREER MCB 0546446. We acknowledge help from Yasir Suhail in performing semantic similarity calculations.
References

Ihmels J, Levy R, Barkai N: Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae.
Nat Biotechnol 2004, 22:8692. PubMed Abstract  Publisher Full Text

Veeramani B, Bader JS: Metabolic flux correlations, genetic interactions, and disease.
J Comput Biol 2009, 16(2):291302. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lee DS, Park J, Kay KA, Christakis NA, Oltvai ZN, Barabási AL: The implications of human metabolic network topology for disease comorbidity.
Proc Natl Acad Sci USA 2008, 105(29):98809885. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kharchenko P, Church GM, Vitkup D: Expression dynamics of a cellular metabolic network.
Mol Syst Biol 2005, 1:2005.0016. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Notebaart RA, Teusink B, Siezen RJ, Papp B: Coregulation of metabolic genes is better explained by flux coupling than by network distance.
PLoS Comput Biol 2008, 4:e26. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Becker SA, Price ND, Palsson BO: Metabolite coupling in genomescale metabolic networks.
BMC Bioinformatics 2006, 7:111. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tsuda K, Noble WS: Learning kernels from biological networks by maximizing entropy.
Bioinformatics 2004, 20(Suppl 1):i326i333. PubMed Abstract  Publisher Full Text

Qi Y, Suhail Y, yi Lin Y, Boeke JD, Bader JS: Finding friends and enemies in an enemiesonly network: a graph diffusion kernel for predicting novel genetic interactions and cocomplex membership from yeast genetic interactions.
Genome Res 2008, 18(12):19912004. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Segrè D, Deluna A, Church GM, Kishony R: Modular epistasis in yeast metabolism.
Nat Genet 2005, 37:7783. PubMed Abstract  Publisher Full Text

Duarte NC, Herrgård MJ, Palsson BO: Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genomescale metabolic model.
Genome Res 2004, 14(7):12981309. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rokhlenko O, Shlomi T, Sharan R, Ruppin E, Pinter RY: Constraintbased functional similarity of metabolic genes: going beyond network topology.
Bioinformatics 2007, 23(16):213946.. PubMed Abstract  Publisher Full Text

Benjamini YHY: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Schellenberger J, Park J, Conrad T, Palsson B: BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions.
BMC Bioinformatics 2010, 11:213. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Price ND, Reed JL, Palsson BO: Genomescale models of microbial cells: evaluating the consequences of constraints.
Nat Rev Microbiol 2004, 2(11):886897. PubMed Abstract  Publisher Full Text

Harrison R, Papp B, Pál C, Oliver SG, Delneri D: Plasticity of genetic interactions in metabolic networks of yeast.
Proc Natl Acad Sci USA 2007, 104(7):23072312. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Becker SA, Feist AM, Mo ML, Hannum G, Palsson BO, Herrgard MJ: Quantitative prediction of cellular metabolism with constraintbased models: the COBRA Toolbox.
Nat Protoc 2007, 2(3):727738. PubMed Abstract  Publisher Full Text

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets.
Nucleic Acids Res 2006, (34 Database):D535D539. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR visualizing classifier performance in R.
Bioinformatics 2005, 21(20):39403941. PubMed Abstract  Publisher Full Text

Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.
Bioinformatics 2003, 19(10):12751283. PubMed Abstract  Publisher Full Text

Resnik P: Semantic Similarity in a Taxonomy An InformationBased Measure and its Application to Problems of Ambiguity in Natural Language.
Journal of Artificial Intelligence Research 1999, 11:95130.