Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Germany

Faculty of Biology, Ludwig-Maximilians-Universität, Planegg-Martinsried, Germany

Institute of Epidemiology, Helmholtz Zentrum München, Germany

Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, Germany

Department of Mathematics, Technische Universität München, Germany

Abstract

Background

With the advent of high-throughput targeted metabolic profiling techniques, the question of how to interpret and analyze the resulting vast amount of data becomes increasingly important. In this work we address the reconstruction of metabolic reactions from cross-sectional metabolomics data, that is, without the requirement for time-resolved measurements or specific system perturbations. Previous studies in this area mainly focused on Pearson correlation coefficients, which, however, generally cannot distinguish between direct and indirect metabolic interactions.

Results

In our new approach we propose the application of a Gaussian graphical model (GGM), an undirected probabilistic graphical model estimating the conditional dependence between variables. GGMs are based on partial correlation coefficients, that is, pairwise Pearson correlation coefficients conditioned against the correlation with all other metabolites. We first demonstrate the general validity of the method and its advantages over regular correlation networks with computer-simulated reaction systems. Then we estimate a GGM on data from a large human population cohort, covering 1020 fasting blood serum samples with 151 quantified metabolites. The GGM is much sparser than the correlation network, shows a modular structure with respect to metabolite classes, and is robust to the choice of samples in the data set. Using the example of human fatty acid metabolism, we demonstrate for the first time that high partial correlation coefficients generally correspond to known metabolic reactions. This feature is evaluated both manually, by investigating specific pairs of high-scoring metabolites, and systematically, on a literature-curated model of fatty acid synthesis and degradation. Our method detects many known reactions along with possibly novel pathway interactions, representing candidates for further experimental examination.

Conclusions

In summary, we demonstrate strong signatures of intracellular pathways in blood serum data, and provide a valuable tool for the unbiased reconstruction of metabolic reactions from large-scale metabolomics data sets.

Background

Metabolomics is a newly arising field aiming at the measurement of all endogenous metabolites of a tissue or body fluid under given conditions

A major drawback of correlation networks, however, is their inability to distinguish between direct and indirect associations. Correlation coefficients are generally high in large-scale

In this manuscript we now study the capabilities of GGMs to recover metabolic pathway reactions solely from measured metabolite concentrations. First, we discuss the quality of the method and possible problems and pitfalls on computer-simulated systems. We then apply GGMs to a lipid-focused targeted metabolomics data set of 1020 blood serum samples with 151 measured metabolites from the German population study KORA

Results and Discussion

GGMs delineate direct relationships in artificial reaction systems

Computer-simulated reaction systems are a valuable tool for the evaluation of correlation-based measures prior to their application to real metabolomics data sets. Previous works focused on the modeling of biological replicates with intrinsic noise on the metabolite levels

**Further results on computer-simulated networks**.

The first network we analyzed consists of a linear chain of three metabolites with different variants of reaction reversibility (Figure

**Evaluation of correlation networks (CN) and Gaussian graphical models (GGM) on artificial systems**. Line widths represent relative edge weights in the respective networks (scaled to the strongest edges). **A: **Linear chain of three metabolites with reversible intermediate reactions. While the standard Pearson correlation network (CN) is fully connected, implying an overall high correlation of all metabolites, the GGM correctly discriminates between direct and indirect interactions. **B: **Linear chain with irreversible intermediate reactions. Neither CN nor GGM can distinguish direct from indirect effects, as metabolite A equally determines the levels of both B and C. **C: **Linear chain with irreversible reactions and input/output reactions for each metabolite. Although the edge weights for both CN and GGM are generally lower, the GGM now correctly predicts the network topology. **D: **Branched-chain first-order networks are correctly reconstructed by the GGM. **E: **End-product inhibition modules. When modeled as an open system, **F: **Cofactor-driven network resembling the first three reactions from the glycolysis pathway. A correlation network fails to predict the correct pathway relationships. **G: **Non-linear system with a bi-molecular reaction. The GGM predicts only a weak interaction between B and C. This is due to the counteracting processes of isomerization and substrate participation in the same reaction.

Interestingly, for some reaction setups, the accuracy of the method improves drastically with an increasing amount of external noise. Specifically, if the metabolite transport towards a pathway is subject to higher fluctuations, the GGM edge weight difference between directly and indirectly connected metabolites becomes larger. For a detailed discussion of this finding we refer the reader to Additional file _{max }were set twice as fast as the backward reactions in order to ensure a directed mass flow. We found that the usage of Michaelis-Menten-type enzyme kinetics instead of mass-action kinetics does not alter our general findings. When forward reaction rates exceed backward reactions by far, the GGM discrimination quality is impaired. This is in line with the observation that purely irreversible reactions cannot be distinguished in the mass-action case (see above). Other specific parameters, like the Michaelis constant _{M }

Next, we studied the influence of cofactor-driven reactions on the reconstruction. Cofactors are ubiquitous substances usually involved in the transfer of certain molecular moieties or redox potentials

The drawbacks of correlation-based methods discussed in this section, especially inhibitory mechanisms with exchange reactions and antagonistic mechanisms, have to be kept in mind when attempting to reconstruct metabolic reactions from steady-state data. For the present study, however, we assume that the primarily linear lipid pathways do not contain such problematic reaction motifs.

A GGM inferred from a large-scale population-based data set displays a sparse, modular and robust structure

In the following we estimated a Gaussian graphical model using targeted metabolomics data from the German population study KORA

**Network properties of the correlation network (CN) and Gaussian graphical model (GGM) inferred from a targeted metabolomics population data set (1020 participants, 151 quantified metabolites)**. **A+B: **Graphical depiction of significantly positive edges in both networks, emphasizing local clustering structures. Each circle color represents a single metabolite class. **C+D: **Histograms of **E+F: **Modularity between metabolite classes measured as the relative out-degree from each class (rows) to all other classes (columns). The GGM (right) shows a clear separation of metabolite classes, with some overlaps for the different phospholipid species diacyl-PCs, lyso-PCs, acyl-alkyl-PCs and sphingomyelins. Values range from white (0.0 out-degree towards this class) to black (1.0). PCs = phosphatidylcholines.

**Effects of genetic variation on GGM calculation**.

Pearson correlation coefficients show a strong bias towards positive values in our data set (Figure

The GGM displays a modular structure with respect to the seven metabolite classes in our panel, while the class separation in the correlation network appears rather blurry (Figure ^{5 }randomized GGM networks (random edge rewiring). For the original GGM we obtain a modularity of

**Modularity: Optimized partitioning and weighted calculation**.

An important question for a multivariate statistical measure such as partial correlations is the robustness with respect to changes in the underlying data set. Furthermore, the dependence of the measure on the size of the data set needs to be addressed. To answer these questions, we performed two types of perturbations of our data set. First, we applied sample bootstrapping with 1000 repetitions and compared the resulting partial correlations to the original data set (Additional file ^{-4}). This indicates that for a large data set with ^{-4}. Only when the number of samples gets close to the number of variables (

**Stability of the GGM with respect to changes in the underlying data set**.

Strong GGM edges represent known metabolic pathway interactions

The next step in our analysis was the manual investigation of metabolite pairs displaying strong partial correlation coefficients. Remarkably, we are able to provide pathway explanations for most metabolite pairs in the top 20 positive partial correlations (Table

Top 20 positive GGM edge weights (i.e. partial correlation coefficients, PCC) in our data set along with proposed metabolic pathway explanations

| **Metabolite 1** | **Metabolite 2** | **PCC** | **Comment** |
|---|---|---|---|
| Val | xLeu | 0.821 | Branched-chain amino acids |
| SM C18:0 | SM C18:1 | 0.767 | SCD/SCD5 desaturation |
| SM C16:1 | SM C18:1 | 0.765 | ELOVL6 |
| PC ae C34:2 | PC ae C36:3 | 0.752 | 2 reaction steps |
| SM (OH) C22:1 | SM (OH) C22:2 | 0.743 | sphingolipid-specific desaturation? |
| PC aa C34:2 | PC aa C36:2 | 0.735 | ELOVL1/ELOVL6 elongation |
| C10:0-carn | C8:0-carn | 0.735 | |
| lysoPC a C16:0 | lysoPC a C18:0 | 0.731 | ELOVL6 elongation |
| PC aa C38:6 | PC aa C40:6 | 0.709 | ACOX1/3 + various ELOVLs |
| SM (OH) C14:1 | SM (OH) C16:1 | 0.686 | sphingolipid-specific elongation? |
| PC aa C36:4 | PC aa C38:4 | 0.672 | ACOX1/3 + various ELOVLs |
| PC aa C32:1 | lysoPC a C16:1 | 0.661 | C16:0/C16:1 phospholipid association |
| PC aa C38:5 | PC aa C40:5 | 0.653 | various ELOVLs |
| PC ae C34:3 | PC ae C36:5 | 0.607 | at least 3 reaction steps |
| PC aa C36:5 | PC aa C38:5 | 0.596 | ACOX1/3 + various ELOVLs |
| SM C24:0 | SM C24:1 | 0.577 | sphingolipid-specific desaturation? |
| PC ae C32:1 | PC ae C32:2 | 0.574 | SCD/SCD5 desaturation |
| SM (OH) C22:2 | SM C24:1 | 0.567 | possible elongation intermediate |
| C18:1-carn | C18:2-carn | 0.561 | |

Most metabolite pairs can be directly linked to reactions in the fatty acid biosynthesis pathway, the

The highest partial correlation in the data set with ζ = 0.821 is found for the two branched-chain amino acids Valine and xLeucine, where the latter compound represents both Leucine and Isoleucine (which have equal masses and are not distinguishable by the present method). The three metabolites are in close proximity in the metabolic network concerning their biosynthesis and degradation pathways. Further related amino acid pairs that display significant partial correlations are Histidine and Glutamine (ζ = 0.383), Glycine and Serine (ζ = 0.326) as well as Threonine and Methionine (ζ = 0.298).

Clear-cut signatures of the desaturation and elongation of long chain fatty acids can be seen for various sphingomyelins and lyso-PCs (Figure

**Biochemical subnetworks identified by the GGM**. Line widths correspond to partial correlation coefficients. **A: **Elongation and desaturation signatures, most likely mediated by ELOVL6 and SCD, for C16 and C18 fatty acids incorporated in lyso-PCs and sphingomyelins. **B: **Top: Diacyl-phosphatidylcholine (PC aa) species with elongation and peroxisomal **C: **Recovered **D: **Two high-scoring triads, where metabolite pairs with a pathway distance of two constitute strong partial correlations. This feature of partial correlations aids in the reconstruction of the network topology beyond the direct neighborhood of each metabolite.

We identify a variety of strong GGM edges between diacyl-PC (lecithins, PC aa) and acyl-alkyl-PC (plasmalogens, PC ae) metabolite pairs (Figure

For the acyl-carnitine group we observe a remarkably high partial correlation of ζ = 0.735 for C8-carn and C10-carn and further acyl-carnitine pairs with a carbon atom difference of two (Figure _{2 }units are continuously split off from the shrinking fatty acid chain. Four

We observe several associations that were not directly attributable to enzymatic interactions in the fatty acid biosynthesis or degradation pathways. For instance, lysoPC a 18:1 and lysoPC a 18:2 share a strong GGM edge (ζ = 0.543) although the Δ12-desaturation step from oleic acid to linoleic acid is known to be missing in humans

Negative values play a particular role in the interpretation of partial correlation coefficients. On the one hand, they obviously occur whenever regular negative correlations are involved. Mechanisms giving rise to negative correlations are, for example, coparticipation in the same reaction (cf. Figure

Partial correlation coefficients discriminate between directly and indirectly connected metabolites in a literature-curated fatty acid pathway model

The analyses from the previous section strengthened our conception that a GGM inferred from blood serum metabolomics data represents true metabolite associations. To systematically assess how GGM edges and pathway proximity between our lipid metabolites are related, we generated a literature-based model of fatty acid biosynthesis (Figure

**Fatty acid biosynthesis model and pathway distance calculation method**. **A: ****B: **Exemplary distance calculation on two lyso-PCs. We projected lipid side chain compositions onto the respective fatty-acid biosynthesis reactions. Reaction reversibility is not taken into account in our calculation, i.e. distances are always symmetric. **C: **If no known pathway connection between two fatty acids exists, we assign a formal distance of infinity. **D: **For phospholipids that contain two fatty acid residues we need to take into account all combinatorial variants. We here show three variants for the connection between PC aa C38:4 and PC aa C38:5. In these examples, PC aa C38:4 could either consist of C18:0+C20:4 or C16:0+C22:4, while PC aa C38:5 could be C18:0+C20:5 or C16:0+C22:5. The shortest possible distance, one in this case, will be used for further calculations.
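The distance rules in panels B to D can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the toy reaction list, the helper names `chain_distance` and `lipid_distance`, and the choice to sum the distances of the two matched side chains are assumptions made for the example.

```python
import math
from collections import deque

# Toy undirected reaction graph between fatty acid chains.
# Illustrative assumption only, not the curated pathway model.
REACTIONS = [
    ("C16:0", "C18:0"),  # elongation
    ("C20:4", "C20:5"),  # desaturation
    ("C22:4", "C22:5"),  # desaturation
    ("C20:4", "C22:4"),  # elongation
    ("C20:5", "C22:5"),  # elongation
]


def build_graph(reactions):
    """Build a symmetric adjacency map (reversibility is ignored)."""
    graph = {}
    for a, b in reactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def chain_distance(graph, start, goal):
    """Number of reaction steps between two chains (breadth-first search).

    Returns math.inf if no pathway connection exists.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf


def lipid_distance(graph, variants_a, variants_b):
    """Shortest distance over all side-chain variants of two lipids.

    Each variant is a pair of chains. Both ways of matching the chains
    are tried, and the distances of the matched chains are summed
    (an assumption made for this sketch).
    """
    best = math.inf
    for a1, a2 in variants_a:
        for b1, b2 in variants_b:
            straight = chain_distance(graph, a1, b1) + chain_distance(graph, a2, b2)
            crossed = chain_distance(graph, a1, b2) + chain_distance(graph, a2, b1)
            best = min(best, straight, crossed)
    return best


GRAPH = build_graph(REACTIONS)
PC_AA_C38_4 = [("C18:0", "C20:4"), ("C16:0", "C22:4")]
PC_AA_C38_5 = [("C18:0", "C20:5"), ("C16:0", "C22:5")]
print(lipid_distance(GRAPH, PC_AA_C38_4, PC_AA_C38_5))  # prints 1
```

With this toy graph the shortest connection is the variant C18:0+C20:4 to C18:0+C20:5, which differs by a single desaturation step, matching the distance of one stated in panel D.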

We observe a strong tendency towards significantly positive partial correlations for a pathway distance of one, i.e. directly connected metabolite pairs, for all five metabolite classes (Figure

**Systematic evaluation of partial correlation coefficients versus pathway distances**. Dashed lines in A and B indicate a significance level of 0.01 with Bonferroni correction. **A: **Pathway distances from our consensus model against partial correlation coefficients for the five lipid-based metabolite classes in our data set. We observe an enrichment of significant partial correlations for a pathway distance of one, which rapidly drops for an increasing number of pathway steps. **B: **Comparison of partial correlation coefficients and Pearson correlation coefficients. Pearson correlation coefficients are generally high, independent of the actual pathway distance, indicating systemic coregulation effects throughout lipid metabolism. **C: **Wilcoxon rank sum test p-values between the partial correlation distributions of directly and indirectly connected pairs, and sensitivity/specificity/$F_1$

A direct comparison of both partial and Pearson correlation coefficients for the diacyl-phosphatidylcholine class is shown in Figure

The significantly different correlation value distributions between directly and indirectly linked metabolites (Figure $F_1$ score, which is defined as the harmonic mean of both quantities $F_1$ for all 5 metabolite classes along with an evaluation of partial correlation distribution differences between directly and indirectly linked metabolites (determined by Wilcoxon's rank sum test). $F_1$ values over 0.75 and significant p-values for the rank sum test indicate a strong discrimination effect of partial correlation coefficients concerning direct vs. indirect pathway interactions. Possible reasons for imperfect sensitivity and specificity values will be discussed in detail at the end of this text.

Low-order partial correlations

The data set from our present study contained enough samples to calculate full-order partial correlations, that is to calculate pairwise correlations conditioned against all other _{1 }values close to those displayed in Figure

**Comparison with low-order partial correlation approaches**.

Conclusions

In this paper we addressed the reconstruction of metabolic pathway reactions from high-throughput targeted metabolomics measurements. Previous reconstruction approaches employed pairwise association measures, primarily standard Pearson correlation coefficients, to infer network topology information from metabolite profiles

From computer simulations of metabolic reaction networks we deduced a set of important aspects to be considered when interpreting partial correlation coefficients in reaction systems: (a) Metabolites in equilibrium due to reversible reactions can readily be recovered, whereas irreversible reactions pose a substantial problem to association-based reconstruction attempts (in concordance with

In the next step we inferred both a GGM and a regular correlation network from a large-scale metabolomics data set with 1020 strictly standardized samples from overnight fasting individuals measured by state-of-the-art metabolomics technologies

Manual investigation of high-scoring substructures in the GGM revealed groups of metabolites that could be directly attributed to reaction steps from the human fatty acid biosynthesis and degradation pathways. We detected effects of ELOVL-mediated elongations and FADS-mediated desaturations of fatty acids as well as signatures of the catabolic _{1 }measure. Interestingly, we could show that the discrimination quality of low-order partial correlations

Taken together, our results demonstrate that GGMs inferred from metabolomics measurements in blood plasma samples reveal strong signatures of intracellular and even inner-mitochondrial processes. Previous studies on blood plasma samples detected similar relationships with cellular processes based on genetic associations

However, GGMs can never provide a perfect reconstruction of the underlying system. There are several factors that lead to the absence of high partial correlations between interacting metabolites, that is false negative edges in the GGM: (a) Counterantagonistic correlation-generating processes and bimolecular reactions (see above) might lead to the elimination of pairwise association; cf.

In conclusion, this study presents Gaussian graphical models as a valuable tool for the recovery of biochemical reactions from high-throughput targeted metabolomics data. The present work could be extended by comparing high partial correlation coefficients with enzyme activity or expression data, or by experimentally validating promising interaction candidates. We suggest using GGMs as a standard tool in future metabolomics studies, utilizing the upcoming wealth of metabolic profiling data to form a more comprehensive picture of cellular metabolism.

Methods

In silico simulation of artificial reaction networks

Let _{1},..., _{r}) be a vector of metabolite concentrations and ^{
m×r
}the stoichiometry matrix of a dynamical system with ^{e}, which only contains the negative values from _{1},...,_{r}) represents a vector of elementary rate constants and _{1 }+ _{2 }→ _{3 }we obtain _{1}
_{2}, and 2_{1 }+ 3_{2 }→ 2_{3 }yields _{1}
^{2}
_{2}
^{3}. For enzyme-catalyzed reactions

Where

with [_{i }
_{ii }
_{i }= K_{ii}

The ordinary differential equations describing the temporal evolution of the system are now given as

To introduce variability each parameter is subject to fluctuations according to a log-normal distribution with mean 1 and changing variances:
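As an illustration of such fluctuations, the following sketch draws multiplicative log-normal factors with mean 1 using NumPy. The function name `lognormal_factors`, the log-scale spread `sigma`, the nominal rate and the seed are assumptions for the example, not the values used in the simulations. A log-normal variable with log-scale parameters $\mu$ and $\sigma$ has mean $\exp(\mu + \sigma^2/2)$, so choosing $\mu = -\sigma^2/2$ yields mean 1.

```python
import numpy as np


def lognormal_factors(sigma, size, rng):
    """Draw log-normal factors whose expected value is 1.

    sigma is the standard deviation on the log scale; the log-scale
    mean is set to -sigma**2 / 2 so that E[factor] = 1.
    """
    mu = -0.5 * sigma ** 2
    return rng.lognormal(mean=mu, sigma=sigma, size=size)


rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
factors = lognormal_factors(sigma=0.5, size=100_000, rng=rng)

nominal_rate = 2.0  # hypothetical rate constant
perturbed_rates = nominal_rate * factors  # one perturbed value per simulated sample
print(float(factors.mean()))  # close to 1
```

Larger values of `sigma` correspond to stronger fluctuations of the respective parameter across simulated samples.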

Computation of correlation network and Gaussian graphical model

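Let $X = (x_{kl}) \in \mathbb{R}^{n \times m}$ be the matrix of logarithmized metabolite concentrations (either measured data samples or computer-simulated steady states), where $n$ is the number of samples and $m$ the number of metabolites. Pearson product-moment correlation coefficients $P = (\rho_{ij})$ between metabolites are calculated as

$$\rho_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \, \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}$$

where $\bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}$ is the mean of metabolite $i$ over all samples.

A partial correlation value $\zeta_{ij}$ between metabolites $i$ and $j$ is calculated as

$$\zeta_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$$

where $\omega_{ij}$ denotes the entries of the inverse correlation matrix $P^{-1}$.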
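Both the Pearson and the partial correlation matrix can be computed in a few lines with NumPy. The sketch below is an illustration under stated assumptions rather than the code used for the study: the function names are hypothetical, and the toy data are a linear Gaussian chain A → B → C, not a simulated reaction system. In this toy chain A and C are related only through B, so their Pearson correlation is clearly nonzero while their partial correlation should be close to zero.

```python
import numpy as np


def pearson_matrix(X):
    """Pearson correlation matrix; rows of X are samples, columns variables."""
    return np.corrcoef(X, rowvar=False)


def partial_correlation_matrix(X):
    """Full-order partial correlations from the inverse correlation matrix."""
    omega = np.linalg.inv(pearson_matrix(X))
    d = np.sqrt(np.diag(omega))
    Z = -omega / np.outer(d, d)  # normalize and flip the sign
    np.fill_diagonal(Z, 1.0)     # diagonal set to 1 by convention
    return Z


# Toy data (assumption): B depends on A, C depends on B only.
rng = np.random.default_rng(1)
n = 5000
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
X = np.column_stack([a, b, c])

corr = pearson_matrix(X)
pcor = partial_correlation_matrix(X)
print(np.round(corr, 2))  # A-C entry is clearly positive
print(np.round(pcor, 2))  # A-C entry is close to zero
```

Inverting the correlation matrix requires that it is non-singular, which in practice means having clearly more samples than variables.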

Bootstrapping was performed by randomly drawing 1020 samples with replacement from the original data set. For the second stability analysis, the investigation of different data set sizes, the respective number of samples was randomly drawn from the original data set. The whole procedure was repeated 100 times to get a stable estimate of the deviation.
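A sample bootstrap of this kind can be sketched as follows. The helper names, the synthetic stand-in data and the small number of repetitions are placeholders chosen for the example; only the resampling scheme (drawing as many samples as the data set contains, with replacement) follows the description above.

```python
import numpy as np


def partial_correlation_matrix(X):
    """Partial correlations from the inverse correlation matrix."""
    omega = np.linalg.inv(np.corrcoef(X, rowvar=False))
    d = np.sqrt(np.diag(omega))
    Z = -omega / np.outer(d, d)
    np.fill_diagonal(Z, 1.0)
    return Z


def bootstrap_partial_correlations(X, repetitions, rng):
    """Recompute partial correlations on resampled rows of X.

    Returns an array of shape (repetitions, variables, variables).
    """
    n = X.shape[0]
    results = []
    for _ in range(repetitions):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        results.append(partial_correlation_matrix(X[idx]))
    return np.stack(results)


# Synthetic stand-in data (assumption): 1020 samples, 5 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(1020, 5))
X[:, 1] += X[:, 0]  # make two variables dependent

boot = bootstrap_partial_correlations(X, repetitions=50, rng=rng)
spread = boot.std(axis=0)  # per-edge standard deviation across bootstraps
print(boot.shape)  # prints (50, 5, 5)
```

The per-edge spread gives a simple summary of how strongly each partial correlation depends on the choice of samples.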

Network modularity calculation

We define the adjacency matrix _{ij }
_{ij}

where _{1},...,_{6}) be the partitioning of the metabolites into the six metabolite classes: acyl-carnitines, diacyl-PCs, lyso-PCs, acyl-alkyl-PCs, sphingomyelins and amino acids (the hexose is left out as only a single metabolite belongs to that class). We calculated the _{ij }
^{6×6 }from each class to the other classes, (i.e. the proportion of its edges each class shares with the other classes) as:

where _{i }

Intuitively, this measure compares the within-class edges with the edges to the rest of the network. The more edges there are within each class in comparison to the other classes, the higher
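The relative out-degree between metabolite classes described above can be sketched as follows. This is one plausible reading of the definition (for each class, the fraction of the edge endpoints of its members that lead into each class); the function name, the binary symmetric adjacency matrix and the toy class labels are assumptions for the example.

```python
import numpy as np


def relative_out_degree(adjacency, labels):
    """Class-by-class matrix of edge proportions.

    adjacency: symmetric 0/1 matrix without self-loops.
    labels: class label of each node, in the same order.
    Entry (i, j) is the share of class i's edge endpoints that connect
    to class j; rows of classes with at least one edge sum to 1.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    labels = list(labels)
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    membership = np.array(
        [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]
    )
    counts = membership.T @ adjacency @ membership
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        rel = np.where(totals > 0, counts / totals, 0.0)
    return classes, rel


# Toy network (assumption): two classes with one edge between them.
labels = ["carn", "carn", "SM", "SM"]
A = np.zeros((4, 4))
for u, v in [(0, 1), (2, 3), (1, 2)]:
    A[u, v] = A[v, u] = 1.0

classes, rel = relative_out_degree(A, labels)
print(classes)
print(np.round(rel, 2))
```

In this toy network each class keeps two thirds of its edge endpoints within the class, so the diagonal dominates, which is the pattern described for the GGM.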

Study cohort and metabolite panel

KORA (Kooperative Gesundheitsforschung in der Region Augsburg) is a research platform in southern Germany with a primary focus on cardiovascular diseases, Diabetes mellitus type 2, and genetic epidemiology

A total of

Sensitivity and specificity

In order to objectively evaluate the discrimination between directly and indirectly connected metabolites, we calculated sensitivity and specificity as:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}$$

with TP true positives, FP false positives, TN true negatives and FN false negatives.

A metabolite pair is considered a true positive if it exhibits a partial correlation above the threshold and has a direct pathway connection; a false positive is a metabolite pair also above the threshold but with no direct pathway connection; a false negative pair lies below the threshold but does have a direct pathway connection; and finally a true negative pair lies below the threshold and also has no direct pathway connection. The $F_1$ score was calculated as the harmonic mean of both quantities:

$$F_1 = \frac{2 \cdot \text{sensitivity} \cdot \text{specificity}}{\text{sensitivity} + \text{specificity}}$$
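A minimal sketch of this evaluation in plain Python is given here. The function name `evaluate_threshold`, the toy partial correlations and the threshold of 0.2 are hypothetical and serve only to illustrate the counting rules and the harmonic mean.

```python
def evaluate_threshold(partial_correlations, is_direct, threshold):
    """Sensitivity, specificity and their harmonic mean for one threshold.

    partial_correlations: one value per metabolite pair.
    is_direct: True if the pair has a direct pathway connection.
    A pair is predicted as connected if its value lies above the threshold.
    """
    tp = fp = tn = fn = 0
    for value, direct in zip(partial_correlations, is_direct):
        predicted = value > threshold
        if predicted and direct:
            tp += 1
        elif predicted and not direct:
            fp += 1
        elif not predicted and direct:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = sensitivity + specificity
    f1 = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f1


# Toy example (assumption): six metabolite pairs, three of them direct.
values = [0.8, 0.6, 0.05, 0.3, 0.02, 0.01]
direct = [True, True, True, False, False, False]
sens, spec, f1 = evaluate_threshold(values, direct, threshold=0.2)
print(sens, spec, f1)
```

Here two of three direct pairs lie above the threshold and two of three indirect pairs lie below it, so sensitivity, specificity and their harmonic mean all equal 2/3.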

Pathway model

Pathway reactions in the human fatty acid metabolism were drawn from three independent databases: (1)

**Literature-curated pathway model of human fatty acid biosynthesis and degradation**.

Fatty acid residues with identical masses, which cannot be distinguished by our mass spectrometry technology, are merged into a single metabolite in the reaction set. For instance, the polyunsaturated fatty acids C20:4 Δ8,11,14,17 from the omega-3 pathway and C20:4 Δ5,8,11,14 from the omega-6 pathway have identical numbers of carbon atoms and double bonds and are thus merged into a single metabolite C20:4.

Authors' contributions

JK, KS and FJT conceived this data analysis project. TI and JA performed the sample preparation and data acquisition. JK performed the analysis and wrote the primary manuscript. All authors approved the final manuscript.

Acknowledgements

The authors thank the anonymous reviewers for valuable comments and suggestions to improve the original manuscript. This research was partially supported by the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology (project CoReNe), by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.), and by the BMBF-funded "Medizinische Systembiologie - MedSys" initiative (subproject SysMBo, project label 0315494A). Jan Krumsiek is supported by a PhD student fellowship from the "Studienstiftung des Deutschen Volkes". Thanks to Harold Gutch for critically proofreading and correcting this manuscript.