Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach/Riss, Germany

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Theoretical Bioinformatics Department, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany

Department of Bioinformatics and Functional Genomics, Institute of Pharmacy and Molecular Biotechnology (IPMB) and BioQuant, University of Heidelberg, 69120 Heidelberg, Germany

Boehringer Ingelheim Pharmaceuticals Inc., Ridgefield, CT 06877, USA

Abstract

Background

Normalization of microarrays is a standard practice to account for and minimize effects which are not due to the controlled factors in an experiment. There is an overwhelming number of different methods that can be applied, none of which is ideally suited for all experimental designs. Thus, it is important to identify a normalization method appropriate for the experimental setup under consideration that is neither too negligent nor too stringent. The major aim is to derive optimal results from the underlying experiment. Comparisons of different normalization methods have already been conducted, but to our knowledge none of them compared more than a handful of methods.

Results

In the present study, 25 different ways of pre-processing Illumina Sentrix BeadChip array data are compared. Among others, methods provided by the BeadStudio software are taken into account. For several statistical measures, we contrast the ideally expected behaviour with the actually observed one. Additionally, we compare qRT-PCR measurements of transcripts from different ranges of expression intensities to the respective normalized values of the microarray data. Taking all measures together, the method best suited for our dataset is identified.

Conclusions

Pre-processing of microarray gene expression experiments has been shown to influence further downstream analysis to a great extent and thus has to be carefully chosen based on the design of the experiment. This study provides a recommendation for deciding which normalization method is best suited for a particular experimental setup.

Background

Analysing gene expression using microarrays is a well-established method.

Several studies comparing different normalization methods have already been conducted, many of them focusing on Affymetrix chips.

Here we present a strategy for an in-depth evaluation of normalization methods, aiming at identifying the most appropriate one for a given data set. Our study compares established normalization methods available in the R environment to those offered by the BeadStudio software. It focuses on the HumanHT-12 v3 Expression BeadChip, yet the underlying principles are directly transferable to other technologies measuring gene expression. The analyses described here provide the basis for the Phenocopy project (Baum et al.). In addition, we used qRT-PCR (TaqMan®) to measure the quantitative abundance, at three time points, of eight genes that are known to be deregulated.

Results and Discussion

Expression data was pre-processed in 25 different ways. First, the raw data was background corrected where applicable. Afterwards, the data was transformed, either by a simple log_{2}-transformation or by a variance-stabilizing transformation. In a last step, the data was normalized using one of the normalization methods under comparison.

Heatmap of quality scores assigned for the different pre-processing methods

**Heatmap of quality scores assigned for the different pre-processing methods**. Displayed are the quality scores for the different pre-processing methods given for the analyses conducted. Quality scores range from -2 (bad) to 2 (good). The values in parentheses display the sum over the single quality scores for the respective pre-processing procedures. Based on this sum, the pre-processing method finally used to normalize the Phenocopy data has been chosen. Manhattan distance and complete linkage were used for clustering by applying an adjusted

Different pre-processing methods were evaluated by analysing the variance of the resulting gene expression intensities via various statistical measures. Some of these have already been used in other studies

Pre-processing methods were scored from -2 to 2 based on how well they match the required criteria for the different analyses described in this section. As it is difficult to clearly categorize the methods based on the examined measures, the final decision of which score to assign remains to some extent subjective. However, it is unambiguously possible to separate better pre-processing methods from worse ones. A complete overview of the scores assigned and the final ranking is given in Figure

Analyses of variance based on expression measurements

One basic assumption of gene expression pre-processing methods is that the majority of genes do not change their expression under different conditions. Additionally, expression intensities of replicates should be very similar compared to the expression of transcripts between differently treated sample groups. Based on these principles, we looked at different statistical measures to identify the method best suited for our dataset with respect to variance.

Distribution of F-test statistics

A good normalization method should minimize the variation within a treatment group. Furthermore, the variation within a treatment group should be smaller than the variation between groups. The F-statistic is a typical measurement to compare the variation between replicates to the variation between conditions or treatment groups
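This comparison of variances can be made concrete with a small numeric sketch. The following Python snippet (illustrative only; the study's analyses were performed in R, and the gene intensities below are hypothetical) computes the one-way F-statistic for a single gene measured in three groups of four replicates:

```python
# Sketch (not the authors' code): per-gene F-statistic comparing the
# variation between treatment groups to the variation within them.

def f_statistic(groups):
    """One-way ANOVA F-statistic: MSQ_between / MSQ_within."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    msq_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    ) / (k - 1)
    msq_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    ) / (n - k)
    return msq_between / msq_within

# Hypothetical log2 intensities of one gene in three replicate groups:
groups = [[6.1, 6.0, 5.9, 6.0], [6.2, 6.1, 6.0, 6.1], [7.0, 7.1, 6.9, 7.0]]
print(round(f_statistic(groups), 2))  # → 182.0
```

Under a good normalization, replicates are tight, MSQ_within is small, and the F-statistic is correspondingly large for genes that truly change between groups.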

Cumulative Distribution Functions of F-test p-values

**Cumulative Distribution Functions of F-test p-values**. Cumulative Distribution Functions (CDFs) of FDR-corrected F-test p-values were calculated based on the gene expression measured for untreated HaCaT cells after 2, 4, and 12 hours. Displayed are the results obtained for the different pre-processing methods used. The vertical red dashed line indicates the commonly chosen p-value cut-off of 0.05. The insert displays the obtained results over the whole range of values from 0 to 1 on both axes.
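The FDR correction referred to in the caption is commonly computed with the Benjamini-Hochberg step-up procedure; the sketch below assumes this is the adjustment used (equivalent to R's p.adjust(method = "BH")):

```python
# Benjamini-Hochberg FDR adjustment (assumed to match the "FDR-corrected"
# p-values in the study; a minimal sketch, not the original code).

def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (same order as input)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```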

P-values against variance between groups

Assuming a stable variance of the within group measurements, the bigger the variance between the groups, the bigger the respective -log_{10}(p-value) should be. When plotting these parameters against each other, an appropriate normalization method should result in smoothly increasing values with not too much scattering around the fitted curve. Shown are the -log_{10}(p-values) against the respective variance between the control groups at time points 2 h, 4 h, and 12 h for three of the pre-processing methods; an overview over all results is given in Additional file. Some methods show a high -log_{10}(p-value) for a relatively high proportion of low between group variability values, leading to a high scattering of observations in these regions; this holds, for example, for the rank invariant normalization of BeadStudio.

-log_{10}(p-values) against MSQ_{between }where MSQ_{between }≤ 5

**-log _{10}(p-values) against MSQ_{between }where MSQ_{between }≤ 5**. MSQs were calculated based on the gene expression measured for the three sample groups analyzed, namely untreated HaCaT cells after 2, 4, and 12 hours. Results obtained for the different pre-processing methods used are displayed. The blue line represents a loess-curve fitted to the values.

Boxplots of MSQ_{between }and MSQ_{within}

Further indications for good normalization are the distributions of between (MSQ_{between}) and within (MSQ_{within}) group variances and their relation to each other. If genes are not differentially expressed, MSQ_{between} should be comparable to MSQ_{within}. For genes that are differentially expressed, MSQ_{between} is supposed to be higher than MSQ_{within}. The figure displays boxplots of the MSQ_{between} (red) and MSQ_{within} (blue) values. Since we expect some genes to be differentially expressed across the different time points under consideration, quantiles of MSQ_{within} values should lie below the corresponding quantiles of the MSQ_{between} values. For the differentially expressed genes, within group variance should be smaller than between group variance, whereas for the genes not differentially expressed, the respective MSQ_{between} and MSQ_{within} values should show no great difference. Small interquartile ranges (IQRs) of MSQ_{within} are indicative of a comparable variability between genes.

Boxplots of MSQ_{within }(blue) and MSQ_{between }(red)

**Boxplots of MSQ _{within }(blue) and MSQ_{between }(red)**. MSQs were calculated based on the gene expression measured for the three sample groups analyzed, namely untreated HaCaT cells after 2, 4, and 12 hours. Results obtained for the different pre-processing methods used are displayed. The grey dashed line indicates the value of the artificial MSQ_{between} described in the text.

To judge the values for MSQ, an MSQ_{between} was calculated for artificial group means of log_{2} expression values for three time points based on four replicates. The group means used were (6, 6, 7), which resulted in an MSQ_{between} of 1.33, indicated by a dashed grey line in the figure. These group means correspond to a log_{2} ratio of 1 when group 3 is compared to group 1 or group 2, reflecting a relevant difference between those groups. A good normalization method should result in similar expression values for replicates and thus in small MSQ_{within} values hardly crossing this artificial MSQ_{between}. Additionally, since we limited the whole data set to expressions measured for untreated HaCaT cells across time, we expect only few genes to be differentially expressed. Thus, only a few genes are assumed to result in an MSQ_{between} above the artificial MSQ_{between}.
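The value of this reference line follows directly from the definition of MSQ_{between}; a few lines of Python reproduce the arithmetic:

```python
# Reproducing the artificial MSQ_between used as a reference line:
# three group means (6, 6, 7) with n = 4 replicates per group.
means = [6.0, 6.0, 7.0]
n = 4                      # replicates per group
k = len(means)             # number of groups
grand_mean = sum(means) / k
msq_between = sum(n * (m - grand_mean) ** 2 for m in means) / (k - 1)
print(round(msq_between, 2))  # → 1.33
```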

Almost all boxplots representing MSQ_{within} of background normalized data cross the artificial MSQ_{between}; only those transformed using vst stay below it. Methods that meet the described behaviour, i.e. that show a low within group variability whose quantiles generally exhibit lower values than the quantiles of the between group variabilities, are the preferable ones.

Density functions of MSQ_{between }and MSQ_{within}

Density functions of MSQ_{between} and MSQ_{within} should exhibit clear differences, which renders them an additional option for investigating these values. Within group variability should be smaller than between group variability, and most of the genes should show a between group variability similar to the within group variability, i.e. they are not differentially expressed. Thus, the mode of MSQ_{within} should be smaller than the mode of MSQ_{between}, and the peak of the function for MSQ_{within} is supposed to be higher than the peak for MSQ_{between}. Narrow MSQ_{within} functions, on the one hand, reflect a comparable within group variability for many genes. On the other hand, broader MSQ_{between} functions indicate that at least some of the genes, i.e. the differentially expressed ones, show a higher between than within group variability. The ideal characteristics of density functions described here are very similar to those of the ideal boxplots mentioned in the previous section. Boxplots, however, give only a rough idea of the distribution of the values, also depicting outliers, whereas density functions deliver a more detailed view of how the values are distributed across different ranges.

The figure displays the density functions of MSQ_{between} (red) and MSQ_{within} (blue) for three of the pre-processing methods; a complete overview is given in Additional file. Density functions of MSQ_{within} generated by some methods are bimodal; since such a behaviour of MSQ_{within} is hard to explain, normalization methods leading to a unimodal distribution should be favoured.

Density plots of MSQ_{within }(blue) and MSQ_{between }(red)

**Density plots of MSQ _{within }(blue) and MSQ_{between }(red)**. MSQs were calculated based on the gene expression measured for the three sample groups analyzed, namely untreated HaCaT cells after 2, 4, and 12 hours. The grey dashed line indicates the value of the artificial MSQ_{between} described in the text.

**Density plots of MSQ _{within }(blue) and MSQ_{between }(red)**. MSQs were calculated based on the gene expression measured for the three sample groups analyzed, namely untreated HaCaT cells after 2, 4, and 12 hours. Results obtained for the different pre-processing methods used are displayed. The grey dashed line indicates the value of the artificial MSQ_{between} described in the text.

A small overlap of the two density functions, as observed for the better performing methods, reflects the desired separation of within and between group variability.

Volcano plots

Volcano plots constitute a standard visualisation of microarray results. They are generated by plotting the -log_{10}(p-value) versus the respective log_{2} ratios. Due to the tendency of larger log_{2} ratios being connected to more significant -log_{10}(p-values), a volcano-like shape is generated. Pairwise comparisons (4 hours compared to 2 hours, 12 hours compared to 2 hours, and 12 hours compared to 4 hours) using a moderated t-statistic were conducted to obtain log_{2} ratios and p-values. Our aim is to detect normalization procedures yielding estimates of log_{2} ratios that are as correct as possible, combined with p-values that are as informative as possible. As mentioned above, higher log_{2} ratios should tend to have higher -log_{10}(p-values); the loess fits of the log_{2} ratio and -log_{10}(p-value) pairs (dark blue curves) of the volcano plots illustrate this tendency. At the same time, the scattering of the -log_{10}(p-values) for similar log_{2} ratios should not be too large.

Volcano plots

**Volcano plots**. Log_{2} ratios and p-values for the comparison of untreated HaCaT cells at 4 hours compared to 2 hours, 12 hours compared to 2 hours, and 12 hours compared to 4 hours were calculated based on the gene expression measured. Three examples of different qualities are displayed showing the -log_{10}(p-value) against the log_{2} ratio comparing 4 hours to 2 hours. The blue line represents a loess-curve fitted to the values. The assigned quality values differ: for the worst of the three methods, log_{2} ratios are overestimated and at the same time p-values are not very accurate. In contrast, the best method shows moderate log_{2} ratios (not many genes are assumed to heavily change their expression between different time points) combined with informative p-values. For a complete overview over all methods and all comparisons, see Additional file

**Volcano plots**. Log_{2 }ratios and p-values for the comparison of untreated HaCaT cells at 4 hours compared to 2 hours, 12 hours compared to 2 hours, and 12 hours compared to 4 hours were calculated based on the gene expression measured. Displayed are the -log_{10}(p-value) against log_{2 }ratio for the respective comparisons and the different normalization methods used. The blue line represents a loess-curve fitted to the values.

None of the volcano plots based on rma background corrected data look very promising. The fitted curves are rather flat, i.e. even for high absolute log_{2} ratios the -log_{10}(p-values) are relatively low. Additionally, the -log_{10}(p-values) for similar log_{2} ratios tend to scatter extremely. Some volcanoes show conspicuous regions of small log_{2} ratios for which the respective -log_{10}(p-values) seem to be relatively high; in these regions the fitted curve shows a very steep, linear course. Volcano plots generated for all other methods are similar to what would be expected. Still, they differ in the variance of the p-values and in that some of the fitted curves show a flatter shape than others. This reflects the fact that some normalization methods generate a smaller variance than others, resulting in lower fold changes but more significant p-values. Ultimately, a method with a reasonable trade-off between fold change and variance has to be chosen, and cut-off parameters for interesting genes have to be defined accordingly. Volcano plots best reflecting the desired properties in the context of our experiment were generated by the methods ranked best in the overall scoring.

Residual standard deviation against mean and minimum of gene expression levels

In an optimally normalized experiment the residual standard deviation of fitted gene expression intensities should be low and independent of the expression levels, i.e. the variance over the different expression levels should be stable. This is a prerequisite for many statistical methods, for example linear model fitting and moderated t-statistics.

As displayed in Figure

Residual standard deviation against expression intensities

**Residual standard deviation against expression intensities**. The standard deviation of the residuals of the regression fitted to the expression intensities is plotted against the minimum (upper row) and mean (lower row) expression intensity of each probe. The blue line represents a loess-curve fitted to the values.

**Residual standard deviation against minimum expression intensity**. For each pre-processing method, the standard deviation of the residuals of the regression fitted to the expression intensities is plotted against the minimum expression intensity of each probe. The blue line represents a loess-curve fitted to the values.

**Residual standard deviation against mean expression intensity**. For each pre-processing method, the standard deviation of the residuals of the regression fitted to the expression intensities is plotted against the mean expression intensity of each probe. The blue line represents a loess-curve fitted to the values.

Methods which perform best with respect to variance stabilization across all expression levels are

Scatterplots of expression values

Scatterplots are an easy and straightforward visualisation tool for judging the comparability of replicates. They clearly show whether high variances are to be expected and, if this is the case, in which range of the expression data. Figure

Scatterplots between replicates

**Scatterplots between replicates**. After application of different normalization methods, expression values for the respective replicates at 12 hours are plotted against each other.

**Scatterplots between replicates**. After application of different normalization methods, expression values for the replicates are plotted against each other. The orange line indicates the main diagonal.

Pseudo-ROC curves

In order to compensate for missing spike-in and dilution data, a pseudo-ROC approach was applied.

Pseudo-ROC curves based on adjusted p-values

**Pseudo-ROC curves based on adjusted p-values**. Pseudo-ROC curves were calculated for the different pre-processing methods. FDR-adjusted p-values

**Ranking of AUC values**. AUC values as calculated for the pseudo-ROC analysis displayed in Figure

Analyses of bias based on qRT-PCR

qRT-PCR has been performed for mRNAs from eight genes that are known to be deregulated by TGF-β signalling to a varying degree (CDKN1A, CDKN2B, HAND1, JUNB, LINCR, RPTN, SERPINE1, and TSC22D1). By this means, it is possible to compare the results of the normalization methods to values that reflect the real abundance of the respective mRNA in the cells. Thus, we are able to evaluate the accuracy of the different pre-processing methods with respect to their bias. To guarantee that the comparisons of the normalization methods are not biased towards certain intensities, the mRNAs used in qRT-PCR experiments were chosen such that the respective signals on the chips cover a broad range of expression intensities (Additional file

**Results of qRT-PCR**. 2^{-ΔΔCt} values for the eight genes and time points measured.

Correlation analysis of fold changes

Based on the different normalization procedures for the gene expression experiment and based on the qRT-PCR measurements, Pearson correlations between the respective log_{2} ratios were calculated for each pre-processing method.
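For a single pre-processing method, this analysis reduces to a Pearson correlation between two vectors of gene-wise log_{2} ratios. A Python sketch with hypothetical values (not taken from the study):

```python
from math import sqrt

# Pearson correlation between two vectors of log2 ratios; the gene-wise
# values below are hypothetical placeholders for one normalization method
# and the matching qRT-PCR measurements.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

array_lfc = [1.8, -0.4, 3.1, 0.2, -1.1, 2.5, 0.0, 4.0]   # microarray
qpcr_lfc  = [2.0, -0.5, 3.5, 0.1, -1.3, 2.2, 0.3, 4.4]   # qRT-PCR
print(round(pearson(array_lfc, qpcr_lfc), 3))
```

A normalization with little bias should yield correlations close to 1 against the qRT-PCR log_{2} ratios.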

Pearson correlation of log_{2 }ratios for different normalization methods and qRT-PCR

**Pearson correlation of log _{2 }ratios for different normalization methods and qRT-PCR**. Correlations of log

Regression analysis

To investigate the linear relationship between fold changes as determined by TaqMan and gene expression data, a linear regression analysis was performed by minimizing the sum of squares of the Euclidean distances of the points to the fitted line ('orthogonal regression').

Orthogonal regression between qRT-PCR and normalization based log_{2 }ratios

**Orthogonal regression between qRT-PCR and normalization based log _{2 }ratios**. Regression of log

Results of orthogonal regression

**Results of orthogonal regression**. Ranking of slope (A) and intercept (B) of the orthogonal regression lines as displayed in Figure

**Orthogonal regression between qRT-PCR and normalization based log _{2 }ratios**. Regression of log

Conclusions

It is important to select appropriate pre-processing methods for a given data set based on the experimental setup used. On the one hand, if sample sizes of the different groups are relatively small, it is crucial to achieve a homogeneous variance for the groups. On the other hand, if sample sizes are large, variances can be estimated separately and one should focus on unbiased fold changes. Since the sample sizes for the current data set are rather small (three to four replicates per group), a stable variance is more important than an exact representation of the fold change. In general, the data should be normalized without reducing real variation too much. The heatmap of quality scores summarizes the resulting assessment for all pre-processing methods.

One has to keep in mind that, based on the individual analyses, there are several methods resulting in nearly equal quality. Therefore, it is not possible to give a well-defined rationale for using only one specific method. After excluding the methods that clearly violate the imposed criteria, the decision is still subjective. It depends, for example, on whether one would rather aim for a good estimate of the fold changes or for a small and homogeneous variance. Finally, the decision remains based on experience; yet, with the analyses and criteria described here, we provide a recommendation on how to pre-select appropriate methods. Since, for our data set, we intended to achieve a low and homogeneous variance, we provided several, to a certain degree overlapping, statistics investigating variance. In case the focus is on a good estimate of the fold change, the researcher should give more weight to statistics investigating this measure; correlation to qRT-PCR or slope and intercept of the regression between qRT-PCR and gene expression fold changes are examples of analyses that could be of higher interest in this context. Focusing on variance, the methods best suited for the data set analysed here rely on a simple log_{2}-transformation in combination with quantile normalization, a combination that has previously been reported to perform relatively well by Du

Spike-in or dilution data is frequently used for evaluating different normalization methods

In summary, we provide statistical measures based on which researchers can decide on the best suited pre-processing scenario for their own experimental design. If no spike-in data is available, we recommend conducting qRT-PCR for selected, representative transcripts. Thereby, it is possible to estimate the bias of log_{2} ratios obtained from normalized data. In conjunction with the measures for the variability of the data, this finally provides the basis for weighing well measured fold changes against a low and homogeneous variance, and thus for selecting an appropriate normalization method.

Methods

Biological experiments

Cell culture

HaCaT cells were cultured under standard conditions (REF). Cells were seeded in 96-well (ELISA) or in 24-well (RNA expression profiling) plates and grown overnight to a confluence of approximately 70%. Cells were starved for 3 hours in DMEM containing no FCS and subsequently stimulated with 5 ng/ml of TGF-β1 (R&D Systems) or left unstimulated as controls for 2, 4, and 12 hours.

RNA extraction

RNA isolation was carried out using a MagMAX™ Express-96 Magnetic Particle Processor and the MagMAX™-96 Total RNA Isolation Kit according to the manufacturer's protocol. Total RNA concentration was quantified by fluorescence measurement using SYBR Green II (Invitrogen) and a Synergy HT reader (BioTek) as previously described

Amplification, labeling and BeadChip hybridization of RNA samples

Illumina TotalPrep RNA Amplification Kit (Ambion) was used to transcribe 200 ng total RNA according to the manufacturer's recommendation. A total of 700 ng of cRNA was hybridized at 58°C for 16 hours to the Illumina HumanHT-12 v3 Expression BeadChips (Illumina). BeadChips were scanned using an Illumina BeadArray Reader and the Bead Scan Software (Illumina). Data is publicly available in ArrayExpress.

qRT-PCR

Quantitative Real-Time Polymerase Chain Reaction (qRT-PCR) was conducted for eight genes (CDKN1A, CDKN2B, HAND1, JUNB, LINCR, RPTN, SERPINE1, and TSC22D1) known to be deregulated at at least one time point by TGF-β stimulation.

mRNA expression levels of the eight genes were determined by qRT-PCR analysis using a 7900HT Fast Real-Time PCR System (Applied Biosystems) and the Universal ProbeLibrary System (Roche). Gene specific forward and reverse primer sequences were designed using the Universal Probe Library Assay Design Center (Roche). Total RNA was transcribed into cDNA using the High Capacity cDNA Reverse Transcription Kit (Applied Biosystems) according to the manufacturer's instructions. qRT-PCR was carried out in a final volume of 12 μl in three replicates for each cDNA sample. Levels of RNA polymerase II were used for normalization of the data. The ΔΔCt method was used for relative quantification of mRNA levels of treated samples compared to untreated controls (Additional file
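The ΔΔCt arithmetic can be illustrated with hypothetical Ct values (the reference gene here stands in for RNA polymerase II; none of the numbers are from the study):

```python
# Sketch of the ddCt calculation with hypothetical Ct values. Each
# sample's target-gene Ct is first normalized to the reference gene,
# then the treated sample is compared to the untreated control.
ct_target_treated, ct_ref_treated = 22.0, 18.0
ct_target_control, ct_ref_control = 25.0, 18.2

d_ct_treated = ct_target_treated - ct_ref_treated   # dCt, treated
d_ct_control = ct_target_control - ct_ref_control   # dCt, control
dd_ct = d_ct_treated - d_ct_control                 # ddCt
fold_change = 2 ** (-dd_ct)                         # 2^-ddCt
print(round(fold_change, 2))  # → 6.96
```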

Data processing

Data has been processed with BeadStudio version 3.0 and the R Language and Environment for Statistical Computing (R) 2.7.0

BeadStudio pre-processing

The normalizations executed by Illumina BeadStudio were all applied to the expression values on the original scale. If background adjustment was performed, we used the standard background normalization offered by BeadStudio. Afterwards, the data was log_{2}-transformed.

R pre-processing

Data was log_{2}-transformed using the respective R functions.

forcePositive shifts all intensities to positive values so that it is possible to log_{2}-transform the expression values. In addition, a background correction referred to as rma was applied.

For transforming the data, either a simple log_{2}-transformation or the variance-stabilizing transformation (vst) was used.

Data was normalized using, among other methods, quantile normalization.
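For illustration, quantile normalization can be sketched in a few lines of Python (a simplified version of the usual R routine; ties are not averaged here, unlike in production implementations):

```python
# Minimal sketch of quantile normalization: each array's sorted values
# are replaced by the mean of the sorted values across arrays, so all
# arrays end up with an identical value distribution while each array's
# ranks are preserved. Ties are broken by index for brevity.

def quantile_normalize(arrays):
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest value per array.
    ref = [sum(col[i] for col in sorted_cols) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        ranks = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for pos, i in enumerate(ranks):
            out[i] = ref[pos]
        result.append(out)
    return result

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))  # → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```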

All methods used are implemented in the R packages affy

Statistical measures

In the following, the statistical measures used are briefly summarized. Unless otherwise noted, all statistical calculations were performed using R. A small R-package to conduct and reproduce the described analyses is available from the authors upon request. For the visualizations displayed in Figures

Signal to noise ratios

One aim of normalization is to minimize, for each gene, the within group variability while maximizing the between group variability, also referred to as mean sum of squares within,

MSQ_{within} = (1/(N - k)) Σ_{i=1}^{k} Σ_{j=1}^{n_i} (x_{ij} - x̄_{i})^{2},

and mean sum of squares between,

MSQ_{between} = (1/(k - 1)) Σ_{i=1}^{k} n_{i} (x̄_{i} - x̄)^{2},

respectively.

Here, k represents, for a given gene, the number of groups, n_{i} the size of group i, N = n_{1} + ... + n_{k} the total number of measurements, x_{ij} the j^{th} value in group i, x̄_{i} the mean of group i, and x̄ the overall mean. The aim is to maximize the ratio MSQ_{between}/MSQ_{within}. In our case, k = 3 and n_{1} = n_{2} = n_{3} = 4.

Pseudo-ROC curves

One of the main uses of expression arrays is the identification of genes that are differentially expressed under various experimental conditions. A typical identification rule filters genes with p-values and/or fold changes exceeding a given threshold. Given a set of known true positives (TP) and false positives (FP), Receiver Operating Characteristic (ROC) curves offer a graphical representation of both specificity and sensitivity for such a detection rule. ROC curves are created by plotting the true positive rate (sensitivity) against the false positive rate (1-specificity) obtained at each possible threshold value. Since we only know about TP (20 genes known to be deregulated by TGF-β), we made use of so-called pseudo-ROC curves.
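A pseudo-ROC can be sketched as follows (Python for illustration; the gene indices and p-values are hypothetical, with the known true positives standing in for the 20 TGF-β-regulated genes and all remaining genes treated as presumed negatives):

```python
# Sketch of a pseudo-ROC: only a small set of true positives is known;
# all remaining genes are treated as (pseudo) negatives. Genes are
# ranked by p-value and the curve is traced threshold by threshold.

def pseudo_roc(pvals, true_positives):
    """Return (fpr, tpr) points for thresholds between all ranked genes."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    n_tp = len(true_positives)
    n_fp = len(pvals) - n_tp
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if i in true_positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_fp, tp / n_tp))
    return points

pvals = [0.001, 0.9, 0.02, 0.5, 0.03, 0.7]
tps = {0, 2, 4}  # indices of genes known to be deregulated
print(pseudo_roc(pvals, tps))
```

A curve that rises steeply before moving right indicates that the known true positives receive the smallest p-values under the given pre-processing; the area under this curve (AUC) summarizes the ranking quality.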

Log_{2 }ratios, residual standard deviation, and p-values

The log_{2 }ratios, residual standard deviation, and p-values were calculated using linear models in combination with the moderated t-statistic as supplied by limma

Regression analysis of fold change values and qRT-PCR measurements

To get an overall impression of how well the fold changes detected using the different normalization methods fit the qRT-PCR results, an orthogonal regression of the observations was performed.
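Unlike ordinary least squares, the orthogonal fit minimizes perpendicular distances and corresponds to the first principal axis of the centred data. A Python sketch with hypothetical fold-change pairs (not the function used in the study):

```python
from math import atan2, tan

# Sketch of an orthogonal ("total least squares") regression: the fitted
# line minimizes the squared Euclidean (perpendicular) distances of the
# points to the line. The data pairs below are hypothetical log2 ratios.

def orthogonal_regression(xs, ys):
    """Return (slope, intercept); assumes the fitted line is not vertical."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    slope = tan(theta)
    return slope, my - slope * mx

slope, intercept = orthogonal_regression([0.0, 1.0, 2.0, 3.0],
                                         [0.1, 1.1, 2.0, 3.0])
print(round(slope, 3), round(intercept, 3))
```

An unbiased normalization would give a slope near 1 and an intercept near 0 when regressed against the qRT-PCR fold changes.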

Naming conventions

The following naming conventions are used to refer to different normalization methods:

R normalizations

•

•

•

•

BeadStudio normalizations

•

•

Authors' contributions

PB did the laboratory work, KF-C and RS conducted the statistical and bioinformatics analyses, CI, WH, BB and RE supported the statistical analyses, CI supervised the statistical analyses, DM and KQ supervised the project. RS and PB wrote the manuscript and all authors proofread the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank Simon Anders, Elin Axelsson, Richard Bourgon, Bernd Fischer, Audrey Kauffmann, Gregoire Pau, and Jörn Tödling for helpful discussions and Stephen Gelling for proofreading.