Universitaet Potsdam, D-14415 Potsdam, Germany

Institute of Pathology, Charité University Hospital and provitro GmbH, D-10117 Berlin, Germany

Universitaet Potsdam, Inst. f. Mathematik, D-14415 Potsdam, Germany

University of California Davis, Genome Center, Davis CA 95616, USA

Abstract

Background

The size and magnitude of the metabolome, the ratios between individual metabolites and the response of metabolic networks are controlled by multiple cellular factors. A tight control over metabolite ratios will be reflected by a linear relationship between pairs of metabolites, due to the flexibility of metabolic pathways. Hence, unbiased detection and validation of linear metabolic variance can be interpreted in terms of biological control. For robust analyses, criteria for rejecting or accepting linearities need to be developed despite technical measurement errors. The entirety of all pairwise linear metabolic relationships then yields insights into the network of cellular regulation.

Results

The Bayesian law was applied to detect linearities, which are validated by explaining the residues by the degree of technical measurement errors. Test statistics were developed and the algorithm was tested on simulated data using 3–150 samples and 0–100% technical error. Under the null hypothesis of the existence of a linear relationship, type I errors remained below 5% for data sets consisting of more than four samples, whereas the type II error rate rose quickly with increasing technical errors. Conversely, a filter was developed to balance the error rates in the opposite direction. A minimum of 20 biological replicates is recommended if technical errors remain below 20% relative standard deviation and if both error rates are to remain below 5%. The algorithm was shown to be robust against outliers, unlike Pearson's correlations.

Conclusion

The algorithm facilitates finding linear relationships in complex datasets, which is radically different from estimating linearity parameters from given linear relationships. Without the filter, it provides high sensitivity and fair specificity. If the filter is activated, it yields high specificity but only fair sensitivity. Total error rates are more favorable with the filter deactivated, and hence, metabolomic networks should be generated without the filter. In addition, Bayesian likelihoods facilitate the detection of multiple linear dependencies between two variables. This property of the algorithm enables its use as a discovery tool to generate novel hypotheses on the existence of otherwise hidden biological factors.

Background

In recent years, time course analyses of metabolic perturbations have become more important to understand metabolic networks based on experimental data

(I) concentrations alter and hence increase variance due to intentionally changing the experimental conditions, for example by altering environmental parameters like external nutrients or by using different genotypes

(II) metabolite data will be found to vary in a stochastic manner caused by the imprecision of the analytical method

(III) interestingly, even under very controlled environmental conditions, a high degree of biological variation is found for metabolite levels due to stochastic biological events that trickle through the biochemical network and thus reflect the underlying control structure at this particular biological condition

Therefore, if enough biological replicates are analyzed for a given organism at a given physiological situation, the metabolic phenotype can be investigated not only by its corresponding average metabolic values, but also by a snapshot of its corresponding metabolic network. However, biologists often do not know the inherent biological variability in advance and hence tend to use just a few independent biological replicates based on preliminary power analysis. Resulting data may be sufficient to estimate arithmetic means of metabolic levels but do not enable analyzing the linear control structure between different biological conditions. One of the challenges for calculating linearity networks is to compute the likelihood or significance of the presence of a truly linear relationship, with the aim of excluding both false negative and false positive detections of linearities.

The estimation of optimal linearity parameters was solved decades ago for cases in which linear dependence of variables could be reasoned based on background knowledge. However, in metabolic data sets, the control structure of metabolites is unknown

(a) For which pairs of variables can a linear relationship be hypothesized?

(b) Are there subsets of data that reflect differences in linear behavior of variables? For example, linearity may be present in one group of data but absent in another group, or the linearity parameters may differ between these groups.

An unbiased analysis of linear relationships between pairs of variables needs to test whether there are one or more valid linear hypotheses that could explain data in complex data sets. This procedure defines a novel approach for testing biological data: instead of testing pre-defined hypotheses

(1) Linear relationships must be detected in an unbiased and observer-independent manner.

(2) Subsets of data need to be grouped according to the presence of (multiple) linear relationships.

(3) Criteria have to be applied that verify linear hypotheses based on test statistics.

(4) Technical errors: varying degrees of analytical-chemical measurement error and missing data have to be accounted for.

In particular, the potential presence of multiple linear relationships and the independence of both variables pose problems for simple regression analyses. As a substitute for regression, the degree of correlation has been used for detecting linear relationships, despite the fact that correlation only relates the covariance to the total variance but does not verify genuine linearities. Moreover, Pearson's correlation coefficients lack robustness against outliers, especially for multivariate datasets, and a number of different approaches have been suggested to link estimates to better test statistics
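The sensitivity of Pearson's correlation to outliers can be illustrated with a small constructed example. The data below are hypothetical and chosen only so that the effect is easy to check by hand: 20 points with exactly zero correlation, to which a single distant point is added.

```python
import numpy as np

# Four corner points repeated five times: 20 points with zero covariance.
corners = np.array([(-1, -1), (1, -1), (-1, 1), (1, 1)], dtype=float)
clean = np.tile(corners, (5, 1))

# The same data plus one hypothetical outlier far from the point cloud.
with_outlier = np.vstack([clean, [20.0, 20.0]])

def pearson(points):
    """Pearson correlation between the two columns of an (m, 2) array."""
    return float(np.corrcoef(points[:, 0], points[:, 1])[0, 1])

r_clean = pearson(clean)
r_outlier = pearson(with_outlier)
print(f"r without outlier: {r_clean:.3f}")
print(f"r with one outlier: {r_outlier:.3f}")
```

In this example a single point out of 21 raises the coefficient from zero to roughly 0.95, although the remaining 20 points show no linear relationship at all.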

A further approach has been taken using partial correlations that deconvolute contributions by additional parameters in order to reduce the list of correlations to basic dependencies

We here present a different approach. Using the Bayesian law

Results and Discussion

(1) A model for the technical error in metabolomic data

Let {x_{ij}} denote the entirety of the metabolomic data x_{ij} with rows i = 1, ...,

The technical errors ε_{ij} include the chemical-analytical error, but can also include a contribution from different storage conditions or storage times of the biomaterial after its extraction. The technical variance of the j-th sample is described by a variance matrix Σ_{j} that reflects knowledge about sample storage and data acquisition. For missing data, it is only known that the values can be expected in a defined range with a uniform probability distribution. For non-missing data, the technical error is modeled by a multivariate normal distribution that is centered around zero. More precisely, the probability density for the technical error ε_{j} = (ε_{1j}, ..., ε_{nj})^{t} of the j-th sample is the density of a multivariate normal distribution with mean zero and variance matrix Σ_{j}.
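Written out, a zero-centered multivariate normal density with variance matrix Σ_{j} has the standard form below. The symbols ε_{j} for the technical error vector of the j-th sample and n for the number of metabolites are assumptions made for this illustration.

```latex
p(\varepsilon_j)
  = \frac{1}{\sqrt{(2\pi)^{n} \det \Sigma_j}}
    \exp\!\left( -\tfrac{1}{2}\, \varepsilon_j^{t}\, \Sigma_j^{-1}\, \varepsilon_j \right)
```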

In principle, the variance matrices Σ_{j }can be estimated from the covariance matrix of replicated measurements of the same biomaterial. In practice, correlations between the technical errors of different metabolites are often disregarded, leading to a model with diagonal variance matrices

respectively.
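The estimation of a technical variance matrix from replicated measurements can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `technical_variance` and the toy replicate values are hypothetical, and the diagonal option simply discards the off-diagonal covariances, as described above.

```python
import numpy as np

def technical_variance(replicates, diagonal=True):
    """Estimate a technical variance matrix from replicate measurements.

    replicates: array of shape (n_replicates, n_metabolites) holding repeated
    measurements of the same biomaterial (hypothetical input layout).
    """
    sigma = np.atleast_2d(np.cov(replicates, rowvar=False))
    if diagonal:
        # Disregard correlations between the errors of different metabolites.
        return np.diag(np.diag(sigma))
    return sigma

# Three replicate measurements of two metabolites (made-up numbers).
replicates = np.array([[1.0, 2.0],
                       [3.0, 2.0],
                       [5.0, 5.0]])
sigma_full = technical_variance(replicates, diagonal=False)
sigma_diag = technical_variance(replicates, diagonal=True)
print(sigma_full)
print(sigma_diag)
```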

(2) Maximum Likelihood (ML) function for a general linear problem

Using the Bayesian law, the likelihood for parameters of a given linear hypothesis can be calculated. The Bayesian law allows the conditional probabilities of cause (linear relationship) and effect (measured data value) to be interconverted.

The general form of a linear relationship in the metabolomics data is

α_{1} x_{1} + ... + α_{n} x_{n} = β

In what follows we collect the coefficients of the above equation in a vector α = (α_{1}, ..., α_{n})^{t} and express the linear relationship as α^{t} x_{•j} = β. By x_{•j} we denote the metabolic profile of the j-th sample.

The constant value

_{1}, ..., _{n})^{t }

α_{k1} x_{1} + ... + α_{kn} x_{n} = β_{k}

(α_{ki}) and a vector β = (β_{1}, ..., β_{N})^{t}. The matrix elements can also be arranged in vectors α_{k} = (α_{k1}, ..., α_{kn})^{t}.

α_{k}^{t} Σ α_{l} = 0

The theorem is proven in Additional File _{1}, _{2 }and technical covariance _{12}. Then, the likelihood for the metabolite concentration to lie on the straight line _{1 }_{1 }+ _{2 }_{2 }=

Returning to the general line of the text and the Bayesian reasoning, we obtain for the likelihood of a linear relationship described by {α_{ki}, β_{k}} after measurement of the metabolomic data {x_{ij}} the result

with

The corresponding likelihood function is a sum of contributions from each of the biological samples,

Maximizing

(3) An adapted Maximum Likelihood estimator for robust verification of linear hypotheses

The product

Additional File _{1}(

The adapted likelihood function is a sum, to which every data point adds a contribution between zero and a maximum value of ln 2 if it coincides with the linear hypothesis that is under investigation. This step alters the impact of the Bayesian law. It results in assessing each individual variable pair by a likelihood of contribution to a (linear) hypothesis, and not by assessing the entirety of all variable pairs. Consequently, the contribution of outliers is evanescent as demonstrated in figure
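The bounded contribution of each data point can be illustrated with a minimal sketch. It assumes, for illustration only, a per-point term ln(1 + exp(-r²/(2s²))), where r is the residual of a data point from the hypothesized line and s² = α^{t}Σα is the technical variance projected onto the direction of α. This form has the stated range from zero to ln 2, but it is not claimed to be the exact function used in the paper. The function names and the toy data are hypothetical.

```python
import numpy as np

def residuals(alpha, beta, X):
    """Residuals alpha^t x_j - beta for every sample (column) of X."""
    return alpha @ X - beta

def simple_loglik(alpha, beta, X, Sigma):
    """Gaussian log-likelihood up to a constant: unbounded penalty per point."""
    s2 = alpha @ Sigma @ alpha
    r = residuals(alpha, beta, X)
    return float(np.sum(-0.5 * r**2 / s2))

def adapted_loglik(alpha, beta, X, Sigma):
    """Assumed adapted form: every point adds between 0 and ln 2."""
    s2 = alpha @ Sigma @ alpha
    r = residuals(alpha, beta, X)
    return float(np.sum(np.log1p(np.exp(-0.5 * r**2 / s2))))

# Hypothesis x2 = 2 * x1, written as 2*x1 - 1*x2 = 0.
alpha, beta = np.array([2.0, -1.0]), 0.0
Sigma = 0.01 * np.eye(2)                      # assumed technical variances

X_clean = np.array([[0.0, 1.0, 2.0, 3.0, 4.0],
                    [0.0, 2.0, 4.0, 6.0, 8.0]])
X_outlier = np.column_stack([X_clean, [2.0, 14.0]])  # one point far off the line

for name, X in [("clean", X_clean), ("with outlier", X_outlier)]:
    print(name, simple_loglik(alpha, beta, X, Sigma),
          adapted_loglik(alpha, beta, X, Sigma))
```

In this sketch the outlier lowers the simple log-likelihood by a large amount, whereas the adapted sum is practically unchanged because the outlier's contribution is close to zero.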

Comparison between simple and adapted maximum likelihood estimation

Additional considerations are outlined for the case of missing data (NANs, not-a-number) which are often found in metabolomic data sets. In such cases, the probability function

In summary, the following properties are observed for the adapted ML-estimator:

(i) The adapted ML-estimator considers technical errors.

(ii) The adapted ML-estimator detects linear patterns and groups subsets of data accordingly.

(iii) The adapted ML-estimator is robust against outliers.

(iv) The adapted ML-estimator relies on background information on missing values and therefore does not distort interpretations.

Therefore, the adapted ML-estimator provides a solution to several of the challenges of unbiased and robust detection of multiple linear hypotheses in complex data sets.

(4) Algorithm for the detection of linear relationships

In order to assign measured data to a hypothetical linear relationship without contradictions, corresponding residues have to be analyzed. One condition is that these residues are randomly distributed; otherwise, additional systematic errors would have to be assumed. Secondly, the residues have to be explainable by the technical errors in a statistical manner. The adapted ML-estimator already realizes a measure for agreement between (linear) model and data under consideration of the corresponding technical errors. Thresholds can now be determined for rejecting specific linear hypotheses using the distribution of _{return}. The likelihood is subsequently normalized to the number _{return }of this data. The parameter m_{return }comprises the number of samples that were returned to belong to a linear function despite deviation that is due to the contribution of unrelated variance. Each data point contributes a value of ln 2 to the likelihood function, resulting in the normalized likelihood

We now have two parameters, _{return }and _{return}, for which test statistics can be determined based on randomly selected true linear relationships. _{max }denotes the parameters for which the maximum likelihood is assumed. The distributions of _{return }and _{return }were assessed by Monte Carlo simulations: For each sample size ranging from 3 to 150 data points we have generated 25,000 random data sets, and test statistics were derived for each sample size _{1 }one determines all samples which belong to the corresponding linear function. Based on _{1 }and _{return}, the value of _{return }is determined as given above. The frequency distributions of _{return }for different values of _{return }are shown for the example of _{return }distributions varied for different _{return }values, and consequently, corresponding test statistics were established that set the limits for rejecting the null-hypotheses at a false negative error rate of ≤ 5% for each of the _{return }values.
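The Monte Carlo step can be sketched as follows, under simplifying assumptions that are not taken from the original text: the statistic is evaluated at the true line rather than at the maximum-likelihood parameters, all samples are treated as returned, the per-point term is assumed to be ln(1 + exp(-r²/(2s²))), and the number of simulated data sets is reduced from 25,000 to keep the example fast. The function names are hypothetical.

```python
import numpy as np

def normalised_adapted_loglik(r, s):
    """Adapted likelihood per data set, divided by its maximum m * ln 2."""
    m = r.shape[-1]
    contributions = np.log1p(np.exp(-0.5 * (r / s) ** 2))
    return contributions.sum(axis=-1) / (m * np.log(2.0))

def rejection_threshold(m, n_sets=2000, s=1.0, level=0.05, seed=0):
    """Lower quantile of the statistic under the null hypothesis of linearity.

    Under the null hypothesis the residuals from the true line consist of
    technical noise only, modeled here as Gaussian with standard deviation s.
    """
    rng = np.random.default_rng(seed)
    r = rng.normal(0.0, s, size=(n_sets, m))
    stats = normalised_adapted_loglik(r, s)
    return float(np.quantile(stats, level))

for m in (5, 20, 150):
    print(m, round(rejection_threshold(m), 3))
```

A linear hypothesis whose normalized likelihood falls below the threshold for its sample size would be rejected at a false negative rate of about 5% in this simplified setting. The threshold moves closer to the mean of the statistic as the sample size grows, because the statistic is an average over more data points.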

Determination of the linearity rejection region by Monte Carlo simulations

3–150 samples were used from linear functions which were imposed by additional Gaussian noise. The example for _{return }values, adapted maximum likelihood limits were determined for which the null hypothesis, the existence of a linearity, would need to be rejected.

(5) Determination of false positive and false negative error rates

The degree of noise can be described in terms of the reliability, which is defined as the ratio of biological variance to total variance. The latter is just the sum of biological and technical variance if both variances are not correlated. In that case the reliability of the measurement of metabolite
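In symbols, the definition of reliability given above can be written as follows. The notation r_i, σ²_{bio,i} and σ²_{tech,i} for the reliability, biological variance and technical variance of metabolite i is assumed here for illustration.

```latex
r_i = \frac{\sigma^2_{\mathrm{bio},i}}{\sigma^2_{\mathrm{bio},i} + \sigma^2_{\mathrm{tech},i}},
\qquad
1 - r_i = \frac{\sigma^2_{\mathrm{tech},i}}{\sigma^2_{\mathrm{bio},i} + \sigma^2_{\mathrm{tech},i}}
```

The second expression is the technical variance as a fraction of the total variance, which is the scale on which technical errors are reported in the simulations.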

The average reliability can easily be obtained from the simulations, and hence, the degree of noise can well be described as

We here assume that linear relationships between two metabolites are only obscured by technical errors, but not by other biological factors, so the degree of noise here is only induced by technical errors. In order to test the algorithm described above, a data set was simulated that closely describes the problem. This model data set comprised 200 variables which were grouped into 20 clusters of equal size. All variables within a cluster were described by a linear relationship
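The structure of such a model data set can be sketched as follows. The sketch assumes that each cluster is driven by one latent biological factor and that every variable is a linear function of its cluster's factor plus Gaussian technical noise. The slope and intercept ranges, the function name `simulate` and the seed are hypothetical choices, not taken from the paper.

```python
from itertools import combinations
import numpy as np

N_VARIABLES, N_CLUSTERS = 200, 20
CLUSTER_SIZE = N_VARIABLES // N_CLUSTERS      # 10 variables per cluster

def simulate(n_samples, tech_sd, seed=0):
    """Return an (N_VARIABLES, n_samples) matrix of simulated measurements."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(N_CLUSTERS, n_samples))      # biological variation
    slopes = rng.uniform(0.5, 2.0, size=N_VARIABLES)       # assumed slope range
    intercepts = rng.uniform(-1.0, 1.0, size=N_VARIABLES)  # assumed intercepts
    cluster = np.arange(N_VARIABLES) // CLUSTER_SIZE
    true_values = slopes[:, None] * latent[cluster] + intercepts[:, None]
    noise = rng.normal(scale=tech_sd, size=true_values.shape)  # technical error
    return true_values + noise

# Count the pairs: all variable pairs versus pairs within the same cluster.
all_pairs = list(combinations(range(N_VARIABLES), 2))
linear_pairs = [(a, b) for a, b in all_pairs
                if a // CLUSTER_SIZE == b // CLUSTER_SIZE]
print(len(all_pairs), len(linear_pairs))
```

The pair counts follow directly from the design: 20 clusters of 10 variables give 20 × 45 = 900 truly linear pairs among the 200 × 199 / 2 = 19,900 possible pairs.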

The error rate of the algorithm is exemplified for selected sample numbers in figure _{return }and _{return}, the 5% threshold for the false positive rates would be reached at higher technical error rates. However, simultaneously, the minimal error rate for false negatives would increase. Consequently, type I and type II error rates could in principle be balanced by adapting the thresholds for _{return }and _{return }in a qualitative manner. Nevertheless, the total error rate can only be influenced by decreasing the technical error or increasing the number of samples taken into account.

False negative and false positive error rates of the algorithm tested on simulated data in relation to the number of samples and the assumed technical errors, in % of the total variance

900 pair-wise linear relationships between 200 metabolites were defined that were tested against the total of 19,900 potential linearities. Upper panel: Error rates without filter. Lower panel: error rates with filter.

As outlined above, increasing levels of technical errors cause higher false positive error rates of detections of linearities. However, the number of false positive detections can be shifted towards false negative error rates, if desired for a specific biological study. Therefore, a filter has been developed that filters out all potential false positives (Additional File

Number of samples required in relation to the assumed relative technical errors, if both false positive and false negative error rates are to remain below 0.05 (i.e. 5%)

900 pair-wise linear relationships between 200 metabolites were defined that were tested against the total of 19,900 potential linearities.

The robustness of the algorithm was tested on a model dataset with 20 samples (figure _{return }and _{return }were adjusted to tolerate an outlier rate of 5% (one of 20 samples). Outliers were modelled with a distance from 2

Influence of outliers on false negative and false positive error rates on a sample size of

Conclusion

Use of the technical error together with a maximum likelihood assessment of linearity parameters and verification by simulated test statistics enables a robust detection and verification of linear relationships in complex data sets. An implementation of this algorithm will enable biologists to calculate and compare linearity networks in metabolomic or other multivariate data sets, from which biological hypotheses may be derived. The algorithm can be modified with respect to the ratio of type I and type II errors depending on the biological focus of a study. It is highly advised to use more than 20 biological replicates for each condition that is to be tested in a biological experimental design of

Authors' contributions

FK worked out, tested and implemented the algorithm. MH initially advised on the mathematics of likelihood estimation. JB revised and improved the mathematical description of the algorithm and contributed to writing the paper. OF conceived the study, participated in developing and testing the algorithm, and drafted and wrote the manuscript.

Acknowledgements

The work was funded by the NIEHS through the R01 project ES13932 granted to OF and by a fellowship granted to FK by the Max-Planck Society, Germany. Helpful comments by Joachim Selbig are appreciated.