Institute of Genetic Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany

Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Neuherberg, Germany

Department of Genome-oriented Bioinformatics, Life and Food Science Center Weihenstephan, Technische Universität München, Freising, Germany

Institute of Epidemiology I, Helmholtz Zentrum München, Neuherberg, Germany

Institute of Medical Informatics, Biometry and Epidemiology, Chair of Epidemiology, Ludwig-Maximilians-Universität, München, Germany

Klinikum Grosshadern, Munich, Germany

Faculty of Biology, Ludwig-Maximilians-Universität, Planegg-Martinsried, Germany

Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Education City - Qatar Foundation, Doha, Qatar

Abstract

Background

Genome-wide association studies (GWAS) with metabolic traits and metabolome-wide association studies (MWAS) with traits of biomedical relevance are powerful tools to identify the contribution of genetic, environmental and lifestyle factors to the etiology of complex diseases. Hypothesis-free testing of ratios between all possible metabolite pairs in GWAS and MWAS has proven to be an innovative approach in the discovery of new biologically meaningful associations. The p-gain statistic was introduced as an ad-hoc measure to determine whether a ratio between two metabolite concentrations carries more information than the two corresponding metabolite concentrations alone. So far, only a rule of thumb was applied to determine the significance of the p-gain.

Results

Here we explore the statistical properties of the p-gain through simulation of its density and by sampling of experimental data. We derive critical values of the p-gain for different levels of correlation between metabolite pairs and show that B/(2*α) is a conservative critical value for the p-gain, where α is the level of significance and B the number of tested metabolite pairs.

Conclusions

We show that the p-gain is a well-defined measure that can be used to identify statistically significant metabolite ratios in association studies, and we provide a conservative significance cut-off for the p-gain for use in future association studies with metabolic traits.

Background

With the advent of modern metabolomics techniques, hundreds of endogenous organic compounds (metabolites) from tissue samples, cell cultures and body fluids can now be measured in a highly standardized and often non-targeted manner. Current technologies are based on liquid chromatography–mass spectrometry (LC-MS), gas chromatography–mass spectrometry (GC-MS), flow injection analysis mass spectrometry (FIA-MS/MS) or nuclear magnetic resonance spectroscopy (NMR).

Specific ratios between selected pairs of metabolite concentrations (metabolite ratios) have been introduced in the past as biomarkers in many biomedical applications. For instance, medium-chain acyl-CoA dehydrogenase deficiency (MCADD) is detected in systematic “newborn screens” on the basis of elevated blood concentrations of octanoylcarnitine (C8) and other acylcarnitines, in combination with ratios between acylcarnitine concentrations, including hexanoylcarnitine (C6), decanoylcarnitine (C10), decenoylcarnitine (C10:1), C8/C6, C8/C10, and C8/C12 (dodecanoylcarnitine)
_{2}) exposure

With modern high-throughput technologies, the concept of metabolite ratio analysis has been scaled up to systematically analyzing all possible combinations of ratios between metabolite pairs in a hypothesis-free approach. A number of recently published papers highlight the power of this approach: Altmaier

| **Metabolite ratio** | **Association** | **Interpretation** | **Reference** |
|---|---|---|---|
| SM(OH)C28:0/SM(OH)C26:0 | Diabetic (db/db) versus wild type mice | Increased beta-oxidation in diabetic mice | Altmaier |
| PC aa C36:3/PC aa C36:4 | | Genetic variance in delta-5 fatty acid desaturation | Gieger |
| PC aa Cx:y/PC ae Cx:y | Smoking | Reduced or lack of activity of the enzyme alkyl-DHAP in smokers | Wang-Sattler |
| PC aa C40:3/PC aa C42:5 | | Genetic variance in elongation of fatty acids | Illig |
| Medium chain fatty acids/long chain fatty acids | Diabetes state | Perturbed lipid metabolism associated with diabetes | Suhre |
| PC aa C40:5/PC aa C40:6 | Self-reported nutritional intake of polyunsaturated fatty acids | Confirmation of questionnaire-based lifestyle parameters | Altmaier |
| Ratios between phospholipids with lipid side chains from the C16:0, C16:1, C18:0, C18:1 pool and C20:3, C20:4, C22:4 PUFAs | Plasma, tissue (mouse) and cell lines (human) treated with FABP4 inhibitor | Molecular inhibition of FABP4 activity | Suhre |
| Formate/acetate in human urine | | Genetic variance in N-acetylase activity | Suhre |
| Ratio between phosphorylated and unphosphorylated fibrinogen peptides | | Genetic variance in fibrinogen phosphorylation | Suhre |

In all studies, pairs of metabolites were identified by a high increase in the strength of association when ratios were used. Note that all of these metabolite pairs are found to be biochemically related to the concrete biological questions of these studies (Interpretation). However, they were singled out from the large number of all possible metabolite pair combinations on the basis of the p-gain without any prior hypotheses.

Several reasons explain why metabolite ratios provide additional information in these association studies: (1) Ratios between related metabolite pairs reduce the overall biological variability in the dataset and thereby increase statistical power. For instance, study participants may have strongly varying nutritional habits, which introduce high variance in the concentration of a given nutrient and also in the concentrations of its biochemical breakdown products. However, individuals who consume more of a certain nutrient also exhibit higher levels of its breakdown products. Ratios between these metabolites can thus be regarded as a form of internal normalization. (2) Systematic experimental errors, such as variance in the concentration of external standards, result in errors that are comparable for certain metabolite pairs. Such errors cancel out in ratios, which reduces the overall noise in the dataset. (3) Probably most importantly, when a metabolite pair is connected by a biochemical pathway, metabolite ratios approximate the corresponding reaction rate under idealized steady-state assumptions. Metabolite ratios then represent a biologically highly relevant entity, namely the flux through a biochemical pathway. For example, in Suhre ^{-21} and an explained variance of 5.2 % with concentrations of the omega-6 fatty acid 20:4, whereas the p-value of association with ratios between the fatty acids 20:4 and 20:3 was p = 9.987 × 10^{-66} with an explained variance of 15.3 %.
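Reason (1) can be illustrated with a toy simulation. The sketch below assumes, purely for illustration, that two log-scaled metabolite concentrations share a common "intake" component plus small independent noise. The function name, the variance values and the Gaussian model are assumptions and are not taken from the studies discussed here. Under these assumptions the shared component cancels in the log ratio, so the ratio varies much less than either concentration.

```python
import random
import statistics


def simulate_log_concentrations(n_subjects, seed=0):
    """Simulate log-concentrations of two metabolites sharing one intake factor.

    Hypothetical model: log M1 = intake + noise1, log M2 = intake + noise2.
    """
    rng = random.Random(seed)
    log_m1, log_m2 = [], []
    for _ in range(n_subjects):
        intake = rng.gauss(0.0, 1.0)  # shared nutritional variation (assumed)
        log_m1.append(intake + rng.gauss(0.0, 0.2))
        log_m2.append(intake + rng.gauss(0.0, 0.2))
    return log_m1, log_m2


log_m1, log_m2 = simulate_log_concentrations(2000)
# log(M1 / M2) = log M1 - log M2, so the shared intake term cancels.
log_ratio = [a - b for a, b in zip(log_m1, log_m2)]

var_single = statistics.variance(log_m1)
var_ratio = statistics.variance(log_ratio)
print(f"variance of log M1:      {var_single:.3f}")
print(f"variance of log (M1/M2): {var_ratio:.3f}")
```

A smaller residual variance means that a trait effect of a given size on the ratio is easier to detect, which is the sense in which the ratio acts as an internal normalization.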

Results and discussion

Formal definition of the p-gain

Testing ratios between two metabolite concentrations

The p-gain was introduced to measure whether the association with a genetic locus is significantly stronger for a metabolite ratio than for the corresponding metabolite concentrations. As notation, we use ‘p-value(M_{1} | X)’, or ‘P(M_{1})’ for short, to denote the p-value of a test for association between a trait X (in a GWAS a genetic locus represented by a SNP, in an MWAS a phenotypic trait) and the metabolite M_{1}. With this definition, the p-gain for the association of the ratio M_{1}/M_{2} of metabolites M_{1} and M_{2} with a trait X is defined as

$$\text{p-gain}(M_{1}/M_{2}) = \frac{\min\big(P(M_{1}),\, P(M_{2})\big)}{P(M_{1}/M_{2})} \qquad (1)$$
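As a minimal sketch, assuming the p-gain is the smaller of the two single-metabolite p-values divided by the p-value of the ratio, the statistic can be computed as follows. The function name and the example p-values are hypothetical and chosen only for illustration.

```python
def p_gain(p_m1: float, p_m2: float, p_ratio: float) -> float:
    """Return min(P(M1), P(M2)) / P(M1/M2) for one trait X.

    All three arguments are p-values from association tests with the same
    trait; they must lie in the interval (0, 1].
    """
    for p in (p_m1, p_m2, p_ratio):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    return min(p_m1, p_m2) / p_ratio


# Hypothetical p-values: the ratio is associated far more strongly than
# either single metabolite, so the p-gain is large.
example_gain = p_gain(p_m1=1e-6, p_m2=1e-3, p_ratio=1e-12)
print(f"p-gain = {example_gain:.3g}")
```

A p-gain near one indicates that the ratio adds little beyond the better of the two single concentrations, whereas a large value indicates that the ratio carries additional information.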

Conservative critical p-gain values for common statistics

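Although the p-gain is now frequently used in MWAS and in GWAS with metabolic traits, only a rule of thumb has so far been applied to determine critical values: the p-gain was considered significant when its value exceeded the number of analyzed metabolite concentrations, that is, the number of additionally performed tests. To derive critical values, we consider the universalized p-gain, which has a single metabolite p-value in the numerator:

$$\text{p-gain}(M_{1}/M_{2}) = \frac{P(M_{1})}{P(M_{1}/M_{2})} \qquad (2)$$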

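Critical values of the distribution of this p-gain are conservative with respect to the critical values of the distribution of the p-gain given in equation (1), because

$$\min\big(P(M_{1}),\, P(M_{2})\big) \le P(M_{1})$$

and therefore

$$\frac{\min\big(P(M_{1}),\, P(M_{2})\big)}{P(M_{1}/M_{2})} \le \frac{P(M_{1})}{P(M_{1}/M_{2})}$$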

The variation of the distribution of the p-gain defined in equation (2) depends on the correlation between M_{1} and M_{1}/M_{2}. For example, highly correlated metabolic traits contain mainly the same information and have similar p-values in association tests. This results in p-gain values that are close to one, so the variation of the distribution is small. In contrast, weakly correlated metabolic traits contain different information and may have different p-values in association tests. This results in p-gain values that are broadly distributed around one. Therefore, assuming

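In the situation of the universalized p-gain (equation (2)) we can use the convolution formula for density ratios, which yields a split density (see Methods):

$$f(z) = \begin{cases} \dfrac{1}{2}, & 0 < z \le 1 \\[1ex] \dfrac{1}{2z^{2}}, & z > 1 \end{cases}$$

where z denotes the value of the p-gain, as displayed in Figure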


**Distribution of the p-gain.** This Figure shows the distribution of the p-gain for the calculated conservative p-gain of uncorrelated traits as well as for four loci which were significant in Suhre

Hence, for a level of significance α (with α ≤ 1/2), the critical value becomes

$$\frac{1}{2\alpha}$$

**Supplementary Figure S1 and Tables S1-S3.** This file contains supplementary information.


Critical values for multiple testing

In MWAS and in GWAS with metabolomics, a large number of ratios are tested in parallel, so a correction for multiple testing has to be applied. We select the Bonferroni correction as the most conservative method. When admitting a type I error rate of α and applying a correction for B tests, i.e. aiming at a level of significance of α/B, the critical value for the p-gain becomes

$$\frac{B}{2\alpha}$$
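The conservative cut-off B/(2α) stated in the abstract is simple to evaluate. The following sketch wraps it in a small helper; the function name and the example numbers of tests are hypothetical, and the restriction α ≤ 0.5 is an assumption added here because the underlying tail formula for the ratio of two independent uniform p-values only holds for p-gain values of at least one.

```python
def critical_p_gain(alpha: float, n_tests: int = 1) -> float:
    """Conservative critical p-gain B / (2 * alpha) with Bonferroni correction.

    alpha   : admitted type I error rate (0 < alpha <= 0.5)
    n_tests : number B of tested metabolite pairs
    """
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return n_tests / (2.0 * alpha)


# Hypothetical study sizes, for illustration only.
for n_pairs in (1, 100, 10_000):
    print(n_pairs, critical_p_gain(alpha=0.05, n_tests=n_pairs))
```

An observed p-gain would be called significant at level α only if it exceeds the returned threshold, which grows linearly with the number of tested pairs.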

P-gain for correlated metabolites

The case of uncorrelated metabolites (equation (2)) is conservative with respect to the p-gain as defined in equation (1). Here we analyze the density of the p-gain as defined in equation (1) for selected correlation settings. In the situation of correlated metabolic traits the convolution formula cannot be applied anymore. Thus, we simulate the density using a copula to generate the correlation among the metabolic traits. A copula is a joint probability distribution whose one-dimensional marginal distributions are uniformly distributed over the interval [0,1]. It takes the dependency among the marginal distributions into account (see Methods). Quantiles for the p-gain densities of correlated metabolic traits are provided in Table S1 (Additional file […] M_{1} and M_{2} leads to an increase in the values for the p-gain quantiles when the correlation between M_{1} and M_{2} is not close to 0. Extending these observations to the most extreme case of having fully correlated metabolite concentrations which are uncorrelated with their ratio (i.e. […]

**R-script for simulation of the distribution of the p-gain.** This file contains supplementary information.


Dependence on sample size in real data

In order to examine the behavior of the p-gain in the situation of real data, we compute the observed correlation structure among metabolite ratios which were published in Suhre ^{3} and 1.68 × 10^{66} for the 20 loci published in Suhre ^{2} for a sample size of N = 100, of 1.1 × 10^{5} for N = 500, of 2.8 × 10^{10} for N = 1000, of 3.1 × 10^{15} for N = 1500 and of 1.4 × 10^{21} for N = 2000.

Conclusions

We derived critical values for the p-gain to determine significance in various situations. We recommend using metabolite ratios and the p-gain statistic when analyzing large-scale metabolomics data sets, and applying the critical values with correction for multiple testing as provided in this paper. Given the success of the approach in the metabolomics field, hypothesis-free testing of ratios between biologically related quantitative traits should also be considered for association studies with other ‘omics datasets.

Methods

Study description

The KORA (Cooperative Health Research in the Region of Augsburg) study is a series of independent population-based epidemiological surveys and follow-up studies of participants living in the region of Augsburg, Southern Germany.

Blood collection

We collected blood samples between 2006 and 2008 during the KORA F4 examinations. To avoid variation due to circadian rhythm, blood was drawn in the morning between 8:00 a.m. and 10:00 a.m. after a period of overnight fasting. Blood was drawn into serum gel tubes, gently inverted two times and then allowed to rest for 30 min at room temperature (18 − 25 °C) to obtain complete coagulation. The material was then centrifuged for 10 min and 2,750

Metabolomics measurements

For 1,768 fasting serum samples of the KORA F4 study for which genotypes were already available, metabolic profiling was performed using ultrahigh performance liquid-phase chromatography and gas chromatography separation coupled with tandem mass spectrometry.

Statistical analyses

Density of p-gain for uncorrelated metabolites (calculation)

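The p-gain for two uncorrelated metabolites is defined as:

$$\text{p-gain} = \frac{P(M_{1})}{P(M_{1}/M_{2})}$$

We calculated the density of the p-gain of two uncorrelated metabolites by using the convolution formula for ratios:

$$f(z) = \int_{-\infty}^{\infty} |y| \, f_{P(M_{1})}(zy) \, f_{P(M_{1}/M_{2})}(y) \, dy$$

with P(M_{1}) and P(M_{1}/M_{2}) having a uniform distribution on the interval [0,1]. Transformations lead to

$$f(z) = \int_{0}^{\min(1,\,1/z)} y \, dy = \begin{cases} \dfrac{1}{2}, & 0 < z \le 1 \\[1ex] \dfrac{1}{2z^{2}}, & z > 1 \end{cases}$$

The corresponding cumulative distribution is

$$F(z) = \begin{cases} \dfrac{z}{2}, & 0 < z \le 1 \\[1ex] 1 - \dfrac{1}{2z}, & z > 1 \end{cases}$$

Therefore,

$$F(z_{1-\alpha}) = 1 - \alpha$$

with

$$z_{1-\alpha} = \frac{1}{2\alpha}, \qquad 0 < \alpha \le \tfrac{1}{2}$$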
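The closed-form result can be checked numerically. The sketch below is a Monte Carlo check and not part of the original analysis: it draws two independent uniform p-values, forms their ratio, and compares the empirical tail probability at 1/(2α) with α. The function names, the seed and the number of draws are arbitrary choices made here.

```python
import random


def simulate_universal_p_gain(n_draws, seed=0):
    """Draw P(M1) / P(M1/M2) for independent uniform p-values."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_draws):
        p_m1 = 1.0 - rng.random()     # uniform on (0, 1], avoids zero
        p_ratio = 1.0 - rng.random()  # uniform on (0, 1], avoids zero
        gains.append(p_m1 / p_ratio)
    return gains


def exceedance_fraction(gains, threshold):
    """Fraction of simulated p-gains strictly above the threshold."""
    return sum(g > threshold for g in gains) / len(gains)


ALPHA = 0.05
gains = simulate_universal_p_gain(200_000)
tail = exceedance_fraction(gains, 1.0 / (2.0 * ALPHA))  # theory: alpha
above_one = exceedance_fraction(gains, 1.0)             # theory: 1/2
print(f"P(p-gain > 1/(2*alpha)) ~ {tail:.4f}")
print(f"P(p-gain > 1)           ~ {above_one:.4f}")
```

Under independence, half of the probability mass lies below one and half above, which is the split visible in the density.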

Density of the p-gain (simulation)

To determine the density of the p-gain we assumed a given correlation structure among the metabolic traits. This translates into a correlation structure among the p-values corresponding to these metabolic traits. With these correlated p-values the density of the p-gain can be derived. For simulation of the variables with a given correlation structure we chose the “copula” package
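The idea of a copula-based simulation can be sketched as follows. This is a simplified illustration under assumptions that go beyond the text: it uses a Gaussian copula (the copula family is not specified here), it imposes the correlation directly on two p-values rather than on the metabolic traits, and it uses the two-p-value form P(M_{1})/P(M_{1}/M_{2}) of the p-gain. It is written in Python with the standard library rather than with the “copula” package, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_p_values(rho, n_draws, seed=0):
    """Draw pairs of uniform p-values coupled by a Gaussian copula.

    rho is the correlation of the underlying standard normal variables.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # maps a standard normal draw to a uniform value
    pairs = []
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((phi(z1), phi(z2)))
    return pairs


def p_gain_quantile(rho, level, n_draws=100_000, seed=0):
    """Empirical quantile of P(M1) / P(M1/M2) for a given copula correlation."""
    pairs = gaussian_copula_p_values(rho, n_draws, seed)
    gains = sorted(p1 / p2 for p1, p2 in pairs if p2 > 0.0)
    return gains[min(int(level * len(gains)), len(gains) - 1)]


for rho in (0.0, 0.5, 0.9):
    print(rho, round(p_gain_quantile(rho, 0.95), 2))
```

For rho = 0 the 95 % quantile should lie close to the analytical value 1/(2α) = 10, and it shrinks towards one as the correlation between the two p-values increases.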

Dependence of p-gain values on sample size

We determined the dependence of the p-gain on the sample size by randomly drawing between 100 and 2000 samples from the KORA data (with replacement). For each sample size, we repeated this analysis 1500 times. For all sample subsets we calculated the p-gain. We then determined the median p-gain as well as the 1^{st} and 3^{rd} quartile of the p-gains for each sample size.
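The resampling scheme can be sketched on synthetic data. Everything specific in the example below is an assumption made for illustration: the toy cohort (a genotype that shifts two log-scaled metabolites in opposite directions), the effect sizes, the use of a Fisher z test on the Pearson correlation as the association test, the p-gain taken as the smaller single-metabolite p-value over the ratio's p-value, and the small number of repetitions. It does not use KORA data.

```python
import math
import random
import statistics


def assoc_p_value(x, y):
    """Two-sided p-value for a Pearson correlation (Fisher z, normal approx.)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 1.0  # no variation, hence no evidence of association
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return max(math.erfc(abs(z) / math.sqrt(2.0)), 1e-300)  # avoid exact zero


def make_toy_cohort(n_subjects, rng):
    """Hypothetical cohort: genotype g shifts log M1 up and log M2 down."""
    rows = []
    for _ in range(n_subjects):
        g = sum(rng.random() < 0.3 for _ in range(2))  # allele count 0, 1 or 2
        shared = rng.gauss(0.0, 1.0)                   # common variation
        log_m1 = shared + 0.15 * g + rng.gauss(0.0, 0.3)
        log_m2 = shared - 0.15 * g + rng.gauss(0.0, 0.3)
        rows.append((g, log_m1, log_m2))
    return rows


def p_gain_by_sample_size(rows, sizes, n_repeats, rng):
    """Return {size: (1st quartile, median, 3rd quartile)} of resampled p-gains."""
    summary = {}
    for size in sizes:
        gains = []
        for _ in range(n_repeats):
            # Draw subjects with replacement.
            sample = [rows[rng.randrange(len(rows))] for _ in range(size)]
            g = [row[0] for row in sample]
            m1 = [row[1] for row in sample]
            m2 = [row[2] for row in sample]
            ratio = [a - b for a, b in zip(m1, m2)]  # log of M1/M2
            p_ratio = assoc_p_value(g, ratio)
            gains.append(min(assoc_p_value(g, m1), assoc_p_value(g, m2)) / p_ratio)
        summary[size] = tuple(statistics.quantiles(gains, n=4))
    return summary


rng = random.Random(1)
cohort = make_toy_cohort(2000, rng)
summary = p_gain_by_sample_size(cohort, sizes=(100, 500, 1000), n_repeats=50, rng=rng)
for size, (q1, median, q3) in summary.items():
    print(size, f"{q1:.3g}", f"{median:.3g}", f"{q3:.3g}")
```

In this toy setting the median p-gain grows by many orders of magnitude with the sample size, which is the qualitative behavior the resampling analysis is designed to expose.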

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AKP designed the study, performed the statistical analysis and wrote the manuscript. JK provided data and critically reviewed the manuscript. BW and FJT provided data. HEW provided material. CG and KS designed the study and critically reviewed the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The KORA research platform was initiated and financed by the Helmholtz Center Munich, German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Research (BMBF) and by the State of Bavaria. Part of this work was financed by the German National Genome Research Network (NGFN-2, NGFNPlus 01GS0823, and NGFNPlus 01GS0834) and through additional funds from the University of Ulm. Our research was supported within the Munich Center of Health Sciences (MC Health) as part of LMUinnovativ and by a grant from the BMBF to the German Center for Diabetes Research (DZD e.V.), as well as from the BMBF funded German Network for Mitochondrial Disorders (mitoNET 01GM0862) and Systems Biology of Metabotypes (SysMBo 0315494A). Furthermore, the study received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013), ENGAGE project, grant agreement HEALTH-F4-2007-201413. BW is funded by ERA-NET grant 0315442A (project PathoGenoMics). JK is supported by a PhD student fellowship from the "Studienstiftung des Deutschen Volkes". KS is supported by Qatar Foundation.