Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Box 630, Rochester, New York 14642, USA

Abstract

Background

Quantile and rank normalizations are two widely used pre-processing techniques designed to remove technical noise present in genomic data. Subsequent statistical analyses, such as gene differential expression analysis, are usually based on normalized expressions. In this study, we find that these normalization procedures can have a profound impact on differential expression analysis, especially in terms of testing power.

Results

We conduct theoretical derivations to show that the testing power of differential expression analysis based on quantile or rank normalized gene expressions can never reach 100% with a fixed sample size, no matter how strong the gene differentiation effects are. We perform extensive simulation analyses and find that the results corroborate the theoretical predictions.

Conclusions

Our finding may explain why genes with well documented strong differentiation are not always detected in microarray analysis. It provides new insights into microarray experimental design and will help practitioners select proper normalization procedures.

Background

Microarray technology has been widely adopted in many genomics-related studies in the past decade. Despite its popularity, it is well known that various sources of technical noise exist in microarray experiments.

Quantile normalization is perhaps the most widely adopted method for analyzing microarray data generated by the Affymetrix GeneChip platform. Motivated by the quantile-quantile plot, it forces the empirical distribution of gene expressions in every array to be the same.
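As a minimal sketch of this idea, the following plain-Python function averages the sorted arrays into a reference array and then substitutes each value by the reference entry of the same rank. The function name `quantile_normalize`, the toy input, and the tie-breaking rule (ties are resolved by gene index) are illustrative assumptions, not part of the original study.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    m = len(arrays[0])
    if any(len(a) != m for a in arrays):
        raise ValueError("all arrays must contain the same number of genes")

    # Reference array: average of the k-th smallest values across all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(m)]

    normalized = []
    for a in arrays:
        # order[k] is the index of the gene holding the k-th smallest value.
        # Assumption: ties are broken by gene index (Python's sort is stable).
        order = sorted(range(m), key=lambda g: a[g])
        out = [0.0] * m
        for k, g in enumerate(order):
            out[g] = reference[k]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two arrays, three genes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array contains exactly the same set of values (the reference array), arranged according to that array's original gene ordering.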

Rank normalization is an alternative to quantile normalization. It replaces each observation by its fractional rank (the rank divided by the total number of genes) within each array.
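The fractional rank can be sketched in a few lines of plain Python. The function name `rank_normalize` and the toy values are hypothetical, and ties are simply broken by gene index, which is an assumption made for illustration.

```python
def rank_normalize(array):
    """Replace each value by its fractional rank (rank / number of genes)."""
    m = len(array)
    # order[k] is the index of the gene holding the k-th smallest value.
    order = sorted(range(m), key=lambda g: array[g])
    out = [0.0] * m
    for k, g in enumerate(order):
        out[g] = (k + 1) / m  # ranks run from 1 (smallest) to m (largest)
    return out


# Hypothetical toy array with four genes.
fractional = rank_normalize([5.0, 2.0, 3.0, 9.0])
```

The normalized values always lie in the interval (0, 1], whatever the scale of the raw expressions.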

After normalization, a pertinent statistical test such as Student’s

Without compromising the control of type I error, better testing power can be achieved by either increasing sample size or improving the strength of gene differentiation effect (fold changes between different phenotypes). Sometimes large expected differential effects based on biological considerations are invoked as a reason to justify a microarray study with very small sample sizes.
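To illustrate this trade-off for data that have not been normalized, the sketch below uses a normal approximation to the power of a two-sided two-sample test with known, common standard deviation. It is not the exact power function of the t-test; the function name `approx_power` and the parameter values are assumptions chosen for illustration only.

```python
from statistics import NormalDist


def approx_power(effect_size, sigma, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test.

    Assumes a known common standard deviation `sigma` and equal group sizes.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standardized mean difference with n arrays in each group.
    shift = effect_size / (sigma * (2 / n_per_group) ** 0.5)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical setting: five arrays per group, unit standard deviation.
powers = [approx_power(d, sigma=1.0, n_per_group=5) for d in (0.0, 1.0, 2.0, 4.0)]
```

Under this approximation the power equals the significance level when the effect size is zero and approaches 1 as the effect size grows, which is the behaviour one would expect in the absence of normalization.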

In this study, we find that one cannot “trade” differentiation effects for sample size. When the sample size is small, the statistical power of a gene differentiation analysis will not reach 100% even when the effect size approaches infinity. This counter-intuitive phenomenon is due to the nature of the normalization procedures, which alter both the sample mean difference and the pooled sample standard deviation of the normalized expressions. As a result, both grow at most linearly as functions of effect size and their effects cancel out. Our findings provide new insights into microarray experimental design which may help practitioners select appropriate normalization procedures.
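The cancellation can be seen in a toy calculation: if the mean difference and the pooled standard deviation are both multiplied by the same factor, the two-sample t-statistic does not change. The sketch below only illustrates this scale invariance with hypothetical numbers; it is not the derivation used in this paper, and the helper name `pooled_t` is an assumption.

```python
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample Student t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5


# Hypothetical normalized expressions of one gene in two groups.
group_a = [0.1, -0.4, 0.3, 0.0, 0.2]
group_b = [1.2, 0.1, 1.6, 0.9, 0.2]

t_base = pooled_t(group_a, group_b)
# Scaling every value by 10 scales the mean difference and the pooled
# standard deviation by the same factor, so the statistic is unchanged.
t_scaled = pooled_t([10 * v for v in group_a], [10 * v for v in group_b])
```

A statistic whose numerator and denominator grow at the same rate therefore stays bounded, no matter how large the underlying effect becomes.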

Methods

Notations and biological data

Notations

We assume that all expression levels are log-transformed. For convenience, the words “gene” and “gene expression” are used interchangeably to refer to these log-transformed random variables. These genes are indexed by

Let

The mean and standard deviation of

In practice, the true level of gene differentiation is not a constant; it depends on the biological setting. The variance of gene expressions is not constant either: it depends on the accuracy of measuring instruments and the homogeneity of biological subjects, to name just two factors. In terms of statistical power, a decrease in gene expression variance is equivalent to an increase in mean difference. For simplicity, we consider gene expression variance to be fixed and define the effect size, our analysis tuning parameter, to be the expected mean difference of the

We divide genes into three sets:

• _{0}, the set of non-differentially expressed genes (abbreviated as NDEGs). For all _{0},

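• The set of up-regulated genes, whose effect sizes are positive.

• The set of down-regulated genes, whose effect sizes are negative.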

The set of differentially expressed genes (abbreviated as DEGs) is the union of both up-regulated and down-regulated genes, which is denoted by _{0 }= |_{0}|, _{1 }= |_{1}|. Apparently _{0 }+ _{1 }=

Biological data

The biological dataset used in this study is the childhood leukemia dataset from the St. Jude Children’s Research Hospital database **HYPERDIP**), 79 patients (arrays) with a special translocation type of acute lymphoblastic leukemia (**TEL**) and 45 patients (arrays) with a T lineage leukemia (**TALL**). Each patient is represented by an array reporting the logarithm (base 2) of expression level on the set of 9005 genes.

Analytic analysis of the impact of normalization procedures on differential expression analysis

In this section, we evaluate the impact of quantile and rank normalization on

To simplify theoretical derivation, we assume that the mean expression levels in the normal phenotype (group _{
i
}, is much more important than the normal level of gene expressions. For simplicity, we also assume that the effect size is a constant ^{+ }> 0 for all up-regulated and ^{- }< 0 for all down-regulated genes. In summary,

Therefore, the expected group differences of non-normalized gene expression data are

We must point out that all these assumptions are made only for the simplification of the theoretical derivations. Our findings essentially do not depend on these assumptions. This has been confirmed in our biological simulation study in Section “Results and discussion” (**SIMU-BIO**).

For the

where

The testing power of a two-sided

Quantile normalization

With quantile normalization (**QUANT**), a reference array of empirical quantiles, denoted as $\mathbf{q} = (q_1, q_2, \ldots, q_m)$, is first computed by taking the average across all ordered arrays. Let

The original expressions are replaced by the entries of the reference array with the same rank. Denote

We refer the reader to

In group

We first investigate the asymptotic properties of sample mean difference ^{+}. More specifically, by using conditional expectation, we obtain that for

Similarly for down-regulated DEGs (

Detailed derivations can be found in Section 3 of the Additional file (Supplementary material).

Similarly, ^{+ }and ^{- }or (with positive probability) stay as a constant. Heuristically speaking, ^{+ }or ^{- }if the ranks of expressions are all in the top group (^{+ }or ^{-}. For group

More detailed derivations can be found in Section 3 in the Additional file

According to Equations (6), (7) and (8), the sample mean difference and pooled sample standard deviation both grow at most linearly as functions of ^{+ }(^{-}). As a result, the (absolute values of)

Similarly, the

To see this mixture under the normality assumption, we assume that all observed gene expressions

Here _{t}, _{T(γ)} and _{T(γ,λ)} are the density functions of central, noncentral and doubly noncentral ^{+},^{-}) is the numerator noncentrality parameter and ^{+})^{2},(^{-})^{2}) is the denominator noncentrality parameter (from noncentral ^{2})

In microarray analysis it is reasonable to assume _{1 }≪ ^{+ }(^{-}) approaches infinity. Figure ^{+ }and -^{- }vary from 0 to 3.6 and the medians of the

**Figure: Empirical density estimates of the t-statistics before and after quantile normalization.**

**Figure: Medians of the absolute values of the t-statistics.**

Empirical evidence in Section “Results and discussion” also shows that the statistical power converges to a fixed number strictly less than 1.0, and that this convergence is independent of the hypothesis testing methods and MTPs being applied. Heuristically speaking, **QUANT** “borrows” information from both NDEGs and DEGs to reduce data variation, and as a result the normalized expressions are complex

Rank normalization

With rank normalization (**RANK**), we replace each entry in one array by its position (rank) in the ordered array counted from the smallest value divided by the total number of genes. Denote

This method was proposed by

Compared with **QUANT**, **RANK **goes even further in the nonparametric direction. It removes the noise by only preserving the ordering of observations. We know

Here for simplicity, again we assume that the genes take the specified ranks with equal chances within each group. Therefore, the normalized gene expressions no longer depend on the effect size. The expected group differences for rank normalized genes are

It is easy to check that the pooled standard deviation is also independent of the effect size. As a result, the testing power with rank normalization converges to a constant strictly less than 1.0 as the effect size increases. More details can be found in Section 5 in the Additional file
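The saturation of rank-normalized values can be checked with a small deterministic example. In the sketch below, the baseline array, the choice of gene 0 as the differentially expressed gene, and the helper names are all hypothetical; the point is only that once the shifted gene holds the top rank, increasing the effect size further leaves its normalized value unchanged.

```python
def fractional_ranks(array):
    """Fractional rank of every entry: rank divided by the number of genes."""
    m = len(array)
    order = sorted(range(m), key=lambda g: array[g])
    out = [0.0] * m
    for position, gene in enumerate(order, start=1):
        out[gene] = position / m
    return out


# Hypothetical log-expressions of five genes; gene 0 plays the role of a DEG.
baseline = [0.3, -1.2, 0.8, -0.1, 1.5]


def normalized_deg(effect_size):
    """Rank-normalized value of gene 0 after adding `effect_size` to it."""
    shifted = list(baseline)
    shifted[0] += effect_size
    return fractional_ranks(shifted)[0]


ranks_by_effect = {d: normalized_deg(d) for d in (0.0, 1.0, 2.0, 10.0, 1000.0)}
```

In this toy array the normalized value rises from 0.6 to 0.8 and then stays at 1.0 for every effect size of 2 or more, so neither the group mean difference nor the pooled standard deviation can keep growing with the effect size.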

Simulation studies

Extensive simulations are conducted to verify the above theoretical predictions. We document these simulation studies in this section.

Simulation data

Two sets of simulated data are used in this study. Each set of data has two groups of

• **SIMU**: Each array has _{0 }= 900. For both groups, all genes are normally distributed with standard deviation **HYPERDIP**, 0.93 for **TEL**, and 0.91 for **TALL**. The algorithm used to generate these correlated observations is stated in **SIMU**. Details can be found in Section 6 of the Additional file

• The expectations of DEGs in group

• **SIMU-BIO**: To match the statistical properties of real gene expression more closely and to mimic other noise sources such as non-additive noise, we apply a resampling method to the biological data to construct an additional set of data.

• We apply **HYPERDIP **and **TEL **(79 arrays chosen from each set) without any normalization procedure or multiple testing adjustment. At significance level 0.05, 734 genes are detected as DEGs with an unbalanced differential expression structure (677 up-regulated and 57 down-regulated). We record the mean difference across **HYPERDIP **and **TEL **for each DEG as its effect size (_{
i
}). Then we combine **HYPERDIP **and **TEL **data and randomly permute the arrays. After that we randomly choose 2**TALL **and **TEL **(45 arrays chosen from each set) and 546 genes are defined to be DEGs with a balanced differential expression structure (259 up-regulated and 287 down-regulated). The sample size

Hypothesis testing methods

We use Student’s

Two alternative tests, namely the Wilcoxon rank-sum test and permutation

Results and discussion

We randomly generate 20 sets of data per tuning parameter for **SIMU **and **SIMU-BIO**. We apply normalization procedures first and then conduct hypothesis tests to obtain raw

**Simulation results (SIMU).** Average number of true positives as a function of effect size for **SIMU**. The error bars represent one standard deviation above and below the average. Total number of truly differentially expressed genes is 100 with

**Simulation results (SIMU-BIO).** Average number of true positives as a function of effect size for **SIMU-BIO**. The error bars represent one standard deviation above and below the average. Total number of truly differentially expressed genes is

By removing the noise from the observed gene expressions, quantile and rank normalization procedures improve the statistical power of the subsequent differential expression analyses when effect size is small. However, when

Conclusions

Microarray technology has been used in many areas of biomedical research. Biomedical researchers rely on this technology to identify differentially expressed genes. Due to the “large

High statistical power can be achieved in a study with the following properties.

1. An adequate sample size. Clearly, this is a reliable way to increase statistical power. Everyone seems to agree on it but not everyone practices it. Many years ago this was due to the high cost of conducting microarray experiments. Currently it costs only a fraction of that amount to obtain the same number of arrays. In a sense, the myth that “five arrays per group should be good enough” only reflects the fact that it takes a long time to change old, perhaps even anachronistic habits.

2. Small variance. It is well known that a large proportion of the variance of gene expression is induced by undesirable systematic variations and various technical noise. Microarray technology has been evolving very fast in the past years and we think it is not unreasonable to assume that the technical noise level is getting lower. However, variance induced by biological heterogeneity will not be affected by the advances of technology. For certain data, using a normalization procedure, such as **QUANT **or **RANK**, can reduce this variance and help detect DEGs. We must point out that these elegant variance reduction procedures can also alter the mean expression and

3. Strong true effect size. Based on our experience, this is often invoked as a reason to justify the use of small sample size in a study

One main motivation of our study is to dismiss the dangerous idea that “five arrays per group ought to be good enough for my study”. Our somewhat counter-intuitive findings suggest that if data with dramatic gene differentiation have only a limited sample size (

Although we choose to focus on the Affymetrix GeneChip platform throughout this paper, we believe our conclusions should be valid for other array platforms which require/recommend normalization, such as Affymetrix exon arrays, Illumina BeadChip arrays and many others. We hope this study can help biological researchers choose an appropriate normalization procedure in their experiments or even develop novel normalization procedures with better downstream testing power when the gene differential expression is dramatic.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All three authors contributed equally to this paper, including the original idea, study design, theoretical derivations, simulations and summary of the findings. All authors read and approved the final manuscript.

Acknowledgements

This research is supported by the University of Rochester CTSA award number UL1 RR024160 from the National Center for Research Resources and the National Center for Advancing Translational Sciences of the National Institutes of Health; NIH/NIAID HHSN272201000055C/N01-AI-50020 from the National Institutes of Health; NIH 5 R01 AI087135-02 from the National Institutes of Health; and NIH 2 R01 HL062826-09A2 from the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center for Research Resources or the National Institutes of Health. We appreciate Ms. Christine Brower’s technical assistance with computing. In addition, we would like to thank Ms. Malora Zavaglia and Ms. Jing Che for their proofreading effort.