TNO Quality of Life, P.O. Box 360, 3700 AJ Zeist, The Netherlands

Biosystems Data Analysis, Swammerdam Institute for Life Sciences, Universiteit van Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands

Abstract

Background

Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, a metabolomics data set can contain 5000-fold differences in concentration between metabolites, yet these differences are not proportional to the biological relevance of the metabolites. Data analysis methods, however, cannot make this distinction on their own. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data set and thus improving their biological interpretability.

Results

Different data pretreatment methods, i.e. centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the rank of the metabolites that are most important from a biological point of view. Furthermore, the stability of the rank, the influence of technical errors on data analysis, and the preference of data analysis methods for selecting highly abundant metabolites were affected by the data pretreatment method used prior to data analysis.

Conclusion

Different pretreatment methods emphasize different aspects of the data and each pretreatment method has its own merits and drawbacks. The choice for a pretreatment method depends on the biological question to be answered, the properties of the data set and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods. That is, range scaling and autoscaling were able to remove the dependence of the rank of the metabolites on the average concentration and the magnitude of the fold changes and showed biologically sensible results after PCA (principal component analysis).

In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects the metabolites that are identified to be the most important.

Background

Functional genomics approaches are increasingly being used for the elucidation of complex biological questions with applications that range from human health

In metabolomics research, there are several steps between the sampling of the biological condition under study and the biological interpretation of the results of the data analysis (Figure

The different steps between biological sampling and ranking of the most important metabolites

**The different steps between biological sampling and ranking of the most important metabolites.**

In this paper, we discuss different properties of metabolomics data, how pretreatment methods influence these properties, and how the effects of the data pretreatment methods can be analyzed. The effect of data pretreatment will be illustrated by the application of eight data pretreatment methods to a metabolomics data set of

Properties of metabolome data

In metabolomics experiments, a snapshot of the metabolome is obtained that reflects the cellular state, or phenotype, under the experimental conditions studied

However, other factors are also present in metabolomics data:

1. Differences in orders of magnitude between measured metabolite concentrations; for example, the average concentration of a signal molecule is much lower than the average concentration of a highly abundant compound like ATP. However, from a biological point of view, metabolites present in high concentrations are not necessarily more important than those present at low concentrations.

2. Differences in the fold changes in metabolite concentration due to the induced variation; the concentrations of metabolites in the central metabolism are generally relatively constant, while the concentrations of metabolites that are present in pathways of the secondary metabolism usually show much larger differences in concentration depending on the environmental conditions.

3. Some metabolites show large fluctuations in concentration under identical experimental conditions. This is called uninduced biological variation.

Besides these biological factors, other effects present in the data set are:

4. Technical variation; this originates from, for instance, sampling, sample work-up and analytical errors.

5. Heteroscedasticity; for data analysis, it is often assumed that the total uninduced variation resulting from biology, sampling, and analytical measurements is symmetric around zero with equal standard deviations. However, this assumption is generally not true. For instance, the standard deviation due to uninduced biological variation depends on the average value of the measurement. This is called heteroscedasticity, and it results in the introduction of additional structure in the data

The variation in the data resulting from a metabolomics experiment is the sum of the induced variation and the total uninduced variation. The total uninduced variation is all the variation originating from uninduced biological variation, sampling, sample work-up, and analytical variation. Data pretreatment focuses on the biologically relevant information by emphasizing different aspects in the clean data, for instance, the metabolite concentration under a growth condition relative to the average concentration, or relative to the biological range of that metabolite. In metabolomics, data pretreatment relates the differences in metabolite concentrations in the different samples to differences in the phenotypes of the cells from which these samples were obtained

Data pretreatment methods

The choice for a data pretreatment method does not only depend on the biological information to be obtained, but also on the data analysis method chosen since different data analysis methods focus on different aspects of the data. For example, a clustering method focuses on the analysis of (dis)similarities, whereas principal component analysis (PCA) attempts to explain as much variation as possible in as few components as possible. Changing data properties using data pretreatment may therefore enhance the results of a clustering method, while obscuring the results of a PCA analysis.

In this paper, we discuss three classes of data pretreatment methods: (I) centering, (II) scaling and (III) transformations (Table

Overview of the pretreatment methods used in this study. In the Unit column, the unit of the data after the data pretreatment is stated.

| **Class** | **Method** | **Formula** | **Unit** | **Goal** | **Advantages** | **Disadvantages** |
| --- | --- | --- | --- | --- | --- | --- |
| I | Centering | x̃_{ij} = x_{ij} − x̄_{i} | Original unit | Focus on the differences and not the similarities in the data | Remove the offset from the data | When data is heteroscedastic, the effect of this pretreatment method is not always sufficient |
| II | Autoscaling | x̃_{ij} = (x_{ij} − x̄_{i})/s_{i} | (−) | Compare metabolites based on correlations | All metabolites become equally important | Inflation of the measurement errors |
| II | Range scaling | x̃_{ij} = (x_{ij} − x̄_{i})/(x_{i,max} − x_{i,min}) | (−) | Compare metabolites relative to the biological response range | All metabolites become equally important. Scaling is related to biology | Inflation of the measurement errors and sensitive to outliers |
| II | Pareto scaling | x̃_{ij} = (x_{ij} − x̄_{i})/√s_{i} | √(original unit) | Reduce the relative importance of large values, but keep data structure partially intact | Stays closer to the original measurement than autoscaling | Sensitive to large fold changes |
| II | Vast scaling | x̃_{ij} = ((x_{ij} − x̄_{i})/s_{i})·(x̄_{i}/s_{i}) | (−) | Focus on the metabolites that show small fluctuations | Aims for robustness, can use prior group knowledge | Not suited for large induced variation without group structure |
| II | Level scaling | x̃_{ij} = (x_{ij} − x̄_{i})/x̄_{i} | (−) | Focus on relative response | Suited for identification of e.g. biomarkers | Inflation of the measurement errors |
| III | Log transformation | x̃_{ij} = log(x_{ij}) | Log of original unit | Correct for heteroscedasticity, pseudo scaling. Make multiplicative models additive | Reduce heteroscedasticity, multiplicative effects become additive | Difficulties with values with large relative standard deviation and zeros |
| III | Power transformation | x̃_{ij} = √(x_{ij}) | √(original unit) | Correct for heteroscedasticity, pseudo scaling | Reduce heteroscedasticity, no problems with small values | Choice for square root is arbitrary |

Here, x_{ij} is the value of metabolite *i* in experiment *j*, x̄_{i} and s_{i} are the mean and standard deviation of metabolite *i* over all experiments, and x_{i,max} and x_{i,min} are its maximal and minimal values.

Class I: Centering

Centering converts all the concentrations to fluctuations around zero instead of around the mean of the metabolite concentrations. In doing so, it adjusts for differences in offset between high- and low-abundance metabolites. It is therefore used to focus on the fluctuating part of the data
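As a minimal sketch of this operation (assuming numpy and an illustrative toy matrix, not the study's data set), centering a metabolite-by-sample matrix looks like:

```python
import numpy as np

# Toy data matrix: rows = metabolites, columns = samples (peak areas, a.u.).
# Values are illustrative only.
X = np.array([[100.0, 120.0,  80.0],   # abundant metabolite
              [  1.0,   3.0,   2.0]])  # low-abundance metabolite

def center(X):
    """Mean-center each metabolite (row) so values fluctuate around zero."""
    return X - X.mean(axis=1, keepdims=True)

Xc = center(X)
# Each row of the centered data now sums to zero; the offset between
# high- and low-abundance metabolites is removed.
```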

Class II: Scaling

Scaling methods are data pretreatment approaches that divide each variable by a factor, the scaling factor, which is different for each variable. They aim to adjust for the differences in fold changes between the different metabolites by converting the data into differences in concentration relative to the scaling factor. This often results in the inflation of small values, which can have an undesirable side effect: the influence of the measurement error, which is usually relatively large for small values, is increased as well.

There are two subclasses within scaling. The first class uses a measure of the data dispersion (such as, the standard deviation) as a scaling factor, while the second class uses a size measure (for instance, the mean).

Scaling based on data dispersion

Scaling methods tested that use a dispersion measure for scaling were autoscaling

Pareto scaling

Vast scaling

The scaling methods described above use the standard deviation or an associated measure as scaling factor. The standard deviation is, within statistics, a commonly used entity to measure the data spread. In biology, however, a different measure for data spread might be useful as well, namely the biological range. The biological range is the difference between the minimal and the maximal concentration reached by a certain metabolite in a set of experiments. Range scaling
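The four dispersion-based scalings discussed in this subsection can be sketched with numpy (toy data, not the study's; the formulas follow the standard definitions, with s_i the sample standard deviation of metabolite i):

```python
import numpy as np

# Toy matrix: rows = metabolites, columns = samples (illustrative values).
X = np.array([[100.0, 120.0,  80.0],
              [  1.0,   3.0,   2.0]])

mean = X.mean(axis=1, keepdims=True)
sd   = X.std(axis=1, ddof=1, keepdims=True)   # sample standard deviation s_i

auto   = (X - mean) / sd                  # autoscaling: unit variance per row
pareto = (X - mean) / np.sqrt(sd)         # pareto scaling: sqrt(s_i) as factor
vast   = ((X - mean) / sd) * (mean / sd)  # vast: autoscaling weighted by mean/sd
range_scaled = (X - mean) / (
    X.max(axis=1, keepdims=True) - X.min(axis=1, keepdims=True)
)                                         # range scaling: biological range
```

After autoscaling, every metabolite has unit standard deviation, which is what makes all metabolites "equally important"; range scaling instead expresses each value relative to the metabolite's min-max span.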

Scaling based on average value

Level scaling falls in the second subclass of scaling methods, which use a size measure instead of a spread measure for the scaling. Level scaling converts the changes in metabolite concentrations into changes relative to the average concentration of the metabolite by using the mean concentration as the scaling factor. The resulting values are changes in percentages compared to the mean concentration. As a more robust alternative, the median could be used. Level scaling can be used when large relative changes are of specific biological interest, for example, when stress responses are studied or when aiming to identify relatively abundant biomarkers.
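Level scaling can be sketched in the same style (toy data, numpy assumed):

```python
import numpy as np

X = np.array([[100.0, 120.0,  80.0],
              [  1.0,   3.0,   2.0]])

mean  = X.mean(axis=1, keepdims=True)
level = (X - mean) / mean   # changes relative to the mean concentration
# Multiplying by 100 would express each entry as a percentage change from
# the mean; np.median(X, axis=1, keepdims=True) is the more robust variant.
```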

Class III: Transformations

Transformations are nonlinear conversions of the data like, for instance, the log transformation and the power transformation (Table

Since the log transformation and the power transformation reduce large values in the data set relatively more than the small values, the transformations have a pseudo scaling effect as differences between large and small values in the data are reduced. However, the pseudo scaling effect is not determined by the multiplication with a scaling factor as for a 'real' scaling effect, but by the effect that these transformations have on the original values. This pseudo scaling effect is therefore rarely sufficient to fully adjust for magnitude differences. Hence, it can be useful to apply a scaling method after the transformation. However, it is not clear how the transformation and a scaling method influence each other with regard to the complex metabolomics data.

A transformation that is often used is the log transformation (Table

A transformation that does not show these problems and also has positive effects on heteroscedasticity is the power transformation (Table
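Both transformations are one-liners in numpy (toy data; the base-10 logarithm is chosen here only for illustration, since the choice of base merely rescales the result):

```python
import numpy as np

X = np.array([[100.0, 120.0,  80.0],
              [  1.0,   3.0,   2.0]])

log_X   = np.log10(X)   # undefined for zeros: zero peak areas must be
                        # handled (e.g. by an offset) before log transformation
power_X = np.sqrt(X)    # square-root transform: defined at zero
```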

Methods

Background of the data set

GC-MS analysis

Lyophilized metabolome samples were derivatized using a solution of ethoxyamine hydrochloride in pyridine as the oximation reagent followed by silylation with N-trimethyl-N-trimethylsilylacetamide as described by

Data preprocessing

The data from the GC-MS analyses were deconvoluted using the AMDIS spectral deconvolution software package. The output of the AMDIS analysis, in the form of peak identifiers and peak areas, was corrected for the recovery of internal standards and normalized with respect to biomass. The peaks resulting from a known compound were combined. The samples N3, S2 and S3 were removed from the data set, as a different sample work-up protocol was followed. Furthermore, metabolites detected only once in the 13 remaining experiments were removed. This led to a reduced data set consisting of 13 experiments and 140 variables expressed as peak areas in arbitrary units (Figure

Experimental design

**Experimental design**. The fermentations were performed in independent triplicates. A sample of the third glucose fermentation was taken in duplicate, and the samples of G1, N1 and S1 were analyzed in duplicate by GC-MS. The samples of N3, S2 and S3 were not taken into account in this study.

Data pretreatment

Data pretreatment and PCA were performed using Matlab 7. Matrices are given in bold uppercase (**X**), vectors in bold lowercase (**t**), and scalars in lowercase italic (*x*). The element *x*_{ij} of **X** therefore holds the measurement of metabolite *i* in experiment *j*.

Vast scaling was applied unsupervised as the other data pretreatment methods were unsupervised as well.

Data analysis

PCA was applied for the analysis of the data. PCA decomposes the variation of matrix **X** (*I* metabolites × *J* experiments) into scores **T**, loadings **P**, and a residuals matrix **E**. **P** is an *I* × *A* matrix and **T** is a *J* × *A* matrix, where *A* is the number of principal components:

**X** = **PT**^{T} + **E**,

where **P**^{T}**P** = **I**_{A}, the identity matrix.

The number of components used (*A*) was set to three.

For ranking of the metabolites according to their importance for the first *A* PCs, the following criterion was used:

*S*_{iA} = Σ_{a=1..A} σ_{a}|*p*_{ia}|.

Here, σ_{a} is the singular value for the *a*^{th} PC and *p*_{ia} is the value for the *i*^{th} variable in the loading vector belonging to the *a*^{th} PC. To allow for comparison between the different data pretreatment methods, the values for *S*_{iA} were sorted in descending order, after which the comparisons were performed using the rank of the metabolite in the sorted list.
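Under this reading of the criterion (the sum of σ_{a}|p_{ia}| over the first *A* PCs), the ranking can be sketched with numpy's SVD on hypothetical data of the same size as the study's (140 metabolites × 13 experiments):

```python
import numpy as np

gen = np.random.default_rng(0)
# Hypothetical pretreated data: rows = metabolites (I), columns = samples (J).
X = gen.standard_normal((140, 13))
X -= X.mean(axis=1, keepdims=True)   # PCA assumes centered data

# SVD: columns of U are the loading vectors p_a, sigma the singular values.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
A = 3                                # number of components retained
P = U[:, :A]                         # loadings, I x A

# Ranking criterion: S_iA = sum over a of sigma_a * |p_ia|
S = (sigma[:A] * np.abs(P)).sum(axis=1)
rank = np.argsort(-S)                # most important metabolite first
```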

The measurement errors were analyzed by estimation of the standard deviation from the biological, analytical, and sampling repeats. The standard deviations were binned by calculating the average variance per 10 metabolites ordered by mean value
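The binning step can be sketched as follows (hypothetical means and variances; the quadratic mean-variance relation is only for illustration):

```python
import numpy as np

gen = np.random.default_rng(1)
means = gen.uniform(1, 1000, 140)   # hypothetical mean peak areas
variances = 0.01 * means**2         # hypothetical heteroscedastic variances

order = np.argsort(means)           # order metabolites by mean value
bins = [variances[order[i:i + 10]].mean()   # average variance per 10
        for i in range(0, len(order), 10)]
```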

The jackknife routine was performed according to the following setup: in round one, experiments F1, G1, and N1 were left out; in round two, F2, G2, and N1d; and in round three, F3 and G3A. By selecting these experiments, the specific aspects of the experimental design were maintained.
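The leave-out scheme can be sketched as below. The sample labels other than those named in the text are hypothetical placeholders, and the ranking function reuses the criterion from the previous subsection on toy data:

```python
import numpy as np

def rank_metabolites(X, A=3):
    """Rank metabolites by S_iA = sum_a sigma_a * |p_ia| (see Methods)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, sigma, _ = np.linalg.svd(Xc, full_matrices=False)
    S = (sigma[:A] * np.abs(U[:, :A])).sum(axis=1)
    return np.argsort(-S)

gen = np.random.default_rng(2)
X = gen.standard_normal((140, 13))   # hypothetical pretreated data set
# Hypothetical labels for the 13 experiments (duplicate labels are guesses
# based on the experimental design described in Figure 2):
samples = ["F1", "F2", "F3", "G1", "G1d", "G2", "G3A", "G3B",
           "N1", "N1d", "N2", "S1", "S1d"]
rounds = [["F1", "G1", "N1"], ["F2", "G2", "N1d"], ["F3", "G3A"]]

ranks = []
for left_out in rounds:
    keep = [j for j, s in enumerate(samples) if s not in left_out]
    ranks.append(rank_metabolites(X[:, keep]))
# Comparing the rank lists across rounds indicates the stability of the rank.
```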

Results and discussion

Properties of the clean data

For any data set, the total variation is the sum of the contributions of all the different sources of variation. The sources of variation in the data set used in this study were the induced biological variation, the uninduced biological variation, the sample work-up variation, and the analytical variation. The variation resulting from the sample work-up and the analytical analysis together was called technical variation. The contributions of the different sources of variation were roughly estimated from the replicate measurements by calculating the sum of squares (SS) and the mean square (MS) (Table
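A rough sketch of how a mean square can be estimated from replicates (hypothetical peak areas for a single metabolite; the study's nested design is simplified to a single level here):

```python
import numpy as np

def mean_square(replicates):
    """MS of a set of replicate measurements: the sum of squared deviations
    from the replicate mean, divided by the degrees of freedom (n - 1)."""
    r = np.asarray(replicates, dtype=float)
    return ((r - r.mean())**2).sum() / (r.size - 1)

# Hypothetical replicate peak areas for one metabolite:
analytical = [10.1, 10.3]        # the same sample injected twice
biological = [10.2, 11.5, 9.4]   # independent triplicate fermentations

ms_analytical = mean_square(analytical)
ms_biological = mean_square(biological)   # includes analytical variation
# The uninduced biological contribution can be approximated by subtracting
# the analytical MS from the biological MS.
ms_uninduced = ms_biological - ms_analytical
```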

Estimation of the sources of variation in the data set. The SS and the MS for the different sources of variation are given, based on the experimental design presented in Figure 2. *The technical source of variation consists of the analytical error and the sample work-up error.

| **Source of variation** | **SS** | **MS** |
| --- | --- | --- |
| Analytical | 0.0205 | 0.0102 |
| Technical* | 0.0482 | 0.0482 |
| Uninduced biological | 0.208 | 0.104 |
| Induced biological | 0.952 | 0.317 |
| Total SS | 1.23 | |

The effect of pretreatment on the clean data

The application of different pretreatment methods on the clean data had a large effect on the resulting data used as input for data analysis, as is depicted for sample G2 in Figure

Effect of data pretreatment on the original data

**Effect of data pretreatment on the original data**. Original data of experiment G2 (A), and the data after centering (B), autoscaling (C), pareto scaling (D), range scaling (E), vast scaling (F), level scaling (G), log transformation (H), and power transformation (I). For units refer to Table 1.

Heteroscedasticity

To determine the presence or absence of heteroscedasticity in the data set, the standard deviations of the metabolites of the analytical and the biological repeats were analyzed (Figure

Analytical and biological heteroscedasticity in the data

**Analytical and biological heteroscedasticity in the data**. A: Analytical standard deviation (experiment G1), B: Biological standard deviation (all glucose experiments), and C: Relative biological standard deviation (all glucose experiments), as a function of the metabolite concentration. To obtain a clearer overview, the standard deviations were grouped together based on average mean value of the peak area (Binning, see Jansen

The effect of the log and the power transformation on the data as a means to correct for heteroscedasticity is shown in Figure

Effect of data transformation on biological heteroscedasticity

**Effect of data transformation on biological heteroscedasticity**. A: power transformed data. B: log transformed data. The standard deviations over all glucose experiments were ordered by the mean value of the peak areas and binned per 10 metabolites. The first bin contained the metabolites whose peak area was below the detection limit.

Scaling approaches influence the heteroscedasticity as well, since the variation, and thus the heteroscedasticity, is converted into relative values to the scaling factor. It is likely that this aspect reduces the effect of the heteroscedasticity on the results.

The effect of data pretreatment on the data analysis results

PCA

The score plots were judged on two aspects by visual inspection, namely the distance within the cluster of a specific carbon source and the distance between the clusters of different carbon sources. The loading plots show the contributions of the measured metabolites to the separation of the experiments in the score plots. As cellular metabolism is strongly interlinked (e.g. see

The data pretreatment methods used largely affected the outcome of PCA analysis (Figure

Effect of data pretreatment on the PCA results

**Effect of data pretreatment on the PCA results**. PCA results of range scaled data (6A), centered data (6B), and vast scaled data (6C). For every pretreatment method the score plot (X1) (PC1 vs. PC2) and the loadings of PC 1 (X2) and PC 2 (X3) are shown. D-fructose (F, △), succinate (S, □), D-gluconate (N, ◯), D-glucose (G, *).

The application of centering led to intermediate clustering results in the score plots (Figure

In contrast to the other pretreatment methods, vast scaling of the clean data resulted in a very poor clustering of the samples (Figure

These results clearly demonstrate that the pretreatment method chosen dramatically influences the results of a PCA analysis. Consequently, these effects are also present in the rank of the metabolites.

Ranking of the most important metabolites

In functional genomics research, ranking of targets according to their relevance to the problem studied (for instance, strain improvement) is of great importance, as it is time consuming and costly to validate the, in general, dozens or hundreds of leads that are generated in these studies. Depending on the pretreatment method applied, the same metabolite could, for example, be ranked as the 1^{st}, 11^{th}, or 38^{th} most important metabolite. The pretreatment of the clean data thus directly affected the ranking of the metabolites as being the most relevant.

Rank of the most important metabolites

**Rank of the most important metabolites**. The rank was based on the cumulative contributions of the loadings of the first three PCs. Top 10 metabolites are given in white characters with a black background, the top 11 to 20 is given in white characters with dark gray background, the top 21 to 30 is given in black characters with a light gray background.

The effect of a data pretreatment method on the rank of the metabolites is also apparent when studying the relation between the rank of the metabolites and the abundance (average peak area of a metabolite), or the fold change (standard deviation of the peak area over all experiments for a metabolite) (Figure

Relation between the abundance or the fold change of a metabolite and its rank after data pretreatment

**Relation between the abundance or the fold change of a metabolite and its rank after data pretreatment**. The highest ranked metabolite after data pretreatment, based on its cumulative contributions on the loadings of the first three PCs, has position 1 on the X-axis. The metabolite that is ranked at position 1 on the Y-axis has either the highest fold change in concentration (largest standard deviation of the peak area over all the experiments in the clean data (O)); or is most abundant (largest mean concentration (□)) in the clean data.

Reliability of the rank of the metabolites

While the rank of the metabolites provides valuable information, the robustness of this rank is just as important as it determines the limits of the reliable interpretation of the rank. To test the reliability of the rank of the metabolites, a jackknife routine was applied

The results for level scaling and range scaling are shown in Figure

Stability of the rank of the most important metabolites

**Stability of the rank of the most important metabolites**. The order of the metabolites is based on the average rank.

This resampling approach showed that the reliability of the rank of the most important metabolites is also dependent on the data pretreatment method. The most stable data pretreatment methods were centering, level scaling (Figure

It must be stressed that the pretreatment method that provides the most stable rank does not necessarily provide the most relevant biological answers.

Conclusion

This paper demonstrates that the data pretreatment method used is crucial to the outcome of the data analysis of functional genomics data. The selection of a data pretreatment method depends on three factors: (i) the biological question that has to be answered, (ii) the properties of the data set, and (iii) the data analysis method that will be used for the analysis of the functional genomics data.

Notwithstanding these considerations, autoscaling and range scaling seem to perform better than the other methods with regard to the biological expectations. That is, range scaling and autoscaling were able to remove the dependence of the rank of the metabolites on the average concentration and the magnitude of the fold changes, and showed biologically sensible results after PCA analysis. The other methods either showed a strong dependence on the average concentration or the magnitude of the fold change (centering, log transformation, power transformation, level scaling, pareto scaling), or led to PCA results that were poorly interpretable in relation to the experimental setup (vast scaling).

Using a pretreatment method that is not suited for the biological question, the data, or the data analysis method, will lead to poor results with regard to, for instance, the rank of the most relevant metabolites for the biological question that is subject of study (Figure

In functional genomics data analysis, data pretreatment is often overlooked or is applied in an ad hoc way. For instance, in many software packages, such as Cluster

As far as we are aware, this is the first time that the importance of selecting a proper data pretreatment method on the outcome of data analysis in relation to the identification of biologically important metabolites in metabolomics/functional genomics is clearly demonstrated.

Authors' contributions

RAB is responsible for the idea of a comprehensive comparison of data pretreatment methods, and performed the statistical analyses. HCJH provided valuable input with regard to the mathematical soundness of the research. JAW advised on the practical issues and the interpretation of the results. AKS supplied statistical feedback and conceptual feedback on different pretreatment methods. MJW recognized the importance of data pretreatment for biological interpretation and kept the focus of the research on biological interpretability.

Acknowledgements

The authors would like to thank Karin Overkamp and Machtelt Braaksma for the generation of the biological samples and sample work up, and Maud Koek, Bas Muilwijk, and Thomas Hankemeier for the analysis of the samples and data preprocessing. This research was funded by the Kluyver Centre for Genomics of Industrial Fermentation, which is supported by the Netherlands Genomics Initiative (NROG).