Department of Microbiology, University of Washington, Seattle, WA 98195, USA

Department of Pathology, University of Washington, Seattle, WA 98195, USA

Abstract

Background

Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates. While increasing sample size can increase statistical power and decrease error rates, with too many samples, valuable resources are not used efficiently. The issue of how many replicates are required in a typical experimental system needs to be addressed. Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

Results

We hypothesize that if all other factors (assay protocol, microarray platform, data pre-processing) were equal, fewer individuals would be needed for the same statistical power using inbred animals as opposed to unrelated human subjects, as genetic effects on gene expression will be removed in the inbred populations. We apply the same normalization algorithm and estimate the variance of gene expression for a variety of cDNA data sets (humans, inbred mice and rats) comparing two conditions. Using one sample, paired sample or two independent sample t-tests, we calculate the sample sizes required to detect 1.5-, 2-, and 4-fold changes in expression level as a function of false positive rate, power and percentage of genes that have a standard deviation below a given percentile.

Conclusions

Factors that affect power and sample size calculations include variability of the population, the desired detectable differences, the power to detect the differences, and an acceptable error rate. In addition, experimental design, technical variability and data pre-processing play a role in the power of the statistical tests in microarrays. We show that the number of samples required for detecting a 2-fold change with 90% probability and a p-value of 0.01 in humans is much larger than the number of samples commonly used in present day studies, and that far fewer individuals are needed for the same statistical power when using inbred animals rather than unrelated human subjects.

Background

Microarray technology has become an important tool for studying gene expression levels on the whole genome scale.

In general, the required sample size depends on the magnitude of the variability of the population, the magnitude of the expression change that is biologically meaningful (or desirable to detect), the power to detect the expression change, and the P-value/significance level/false positive rate. However, power and sample size have been viewed as complicated and difficult issues for microarray studies due to the large number of genes being investigated and little knowledge of the degree of natural expression variation within a population. To date, very few studies have assessed power and sample size requirements in microarray experiments. Pan et al.

Pair-wise comparisons between conditions/groups/treatments are frequently used in microarray studies. Parametric and nonparametric statistical methods have been proposed to identify differentially expressed genes, among which t-tests are most commonly used. This paper is intended to provide some guidelines for sample size planning for pair-wise comparisons. Normalization is an essential and important pre-processing step in microarray data analysis. To our knowledge, no previous studies using multiple data sets have pre-processed the data sets in a comparable way. In addition, previous studies did not look at the effect of inbred vs outbred populations on the variation of gene expression. In order to make the results more comparable, we make use of 7 cDNA microarray data sets and apply the same normalization method (spatial lowess).

We estimate the variance by one sample t-tests, paired t-tests or two sample t-tests on a gene-by-gene basis using several large expression data sets from humans, rats and mice. We then calculate the sample size required to detect 1.5-, 2-, and 4-fold changes in expression levels for the 90^{th}, 75^{th}, 50^{th} and 25^{th} percentile of genes ranked by variability at fixed settings for false positive and false negative rates. The sample size calculation provides an approximate, not exact, number of replicates required for a given set of criteria.

Results

Data sets

We estimate the standard deviation and required sample size from 1 unpublished and 6 published cDNA data sets (Table

cDNA microarray data sets used in the study

| Data set | Reference | # Rep | # Genes | Tissue type | Description | Hybridization |
|---|---|---|---|---|---|---|
| A | Smith et al. 2003 | 20 | 15,592 | Human liver | Paired HCC tumor vs adjacent non-tumor | Direct hyb between tumor and non-tumor |
| B | Lapointe et al. 2004 | 41 | 38,627 | Human prostate | Paired prostate tumor vs adjacent non-tumor | Indirect hyb using common reference |
| C | Chen et al. 2002 | 48 | 22,618 | Human liver | Paired HCC+HBV vs HBV | Indirect hyb using common reference |
| D | Pritchard et al. 2001 | 6 | 5,281 | Mouse liver and kidney | Paired liver vs kidney | Indirect hyb using common reference |
| E | Zhao et al. 2004 | 36 ductal + 21 lobular | 44,549 | Human breast | Lobular and ductal tumor tissue | Indirect hyb using common reference |
| F | NA | 6 | 13,056 | Mouse liver | One third vs two thirds hepatectomy | Indirect hyb using individual baseline |
| G | Callow et al. 2000 | 8 | 5,548 | Mouse liver | ApoAI knock-out vs normal | Indirect hyb using common reference (pool) |

Data set A comprises data generated from 40 liver RNA samples isolated from paired liver hepatocellular carcinoma (HCC) tumor and adjacent cirrhotic non-tumor tissue from 20 HCV infected Caucasian patients.

Data set B was generated from 41 matched pairs of prostate tumor and non-tumor tissue hybridized to arrays spotted with 38627 cDNAs. All samples were labeled with Cy5 and co-hybridized with a common reference labeled in Cy3.

Data set C was generated from RNA isolated from paired HCC tumor and adjacent non-tumor liver from 41 HBV infected patients.

Data set D was generated from RNA isolated from paired liver and kidney tissue from 6 male C57BL6 mice.

Data set E was generated from RNA isolated from 36 breast ductal tumor and 16 lobular tumor tissues.

Data set F consists of data generated from 24 liver tissue samples from 12 inbred mice (unpublished data). One third or two thirds of the liver was removed from each mouse and used as the baseline samples. At 12 hours post operation, the mice were sacrificed and the remaining liver tissue was used as the experimental sample. The aim of this study was to screen for genes potentially related to liver regeneration after hepatectomy. RNA samples from the 12 hour post-operation livers were co-hybridized with their own baseline liver samples. A total of four DNA arrays were used for each sample comparison. Two sets of arrays (MOD1 and MOD2), each containing 6528 different cDNAs spotted in duplicate (A and B) on each array were used. In addition, each comparison was done with a dye flip pair of slides. This data set made use of arrays generated at the University of Washington Center for Expression Arrays.

The goal of data set G was to identify genes with altered expression in the liver tissues of two mouse models with very low HDL cholesterol levels (treatment groups) as compared to inbred control mice. The mouse model considered in this study is the Apolipoprotein AI (ApoAI) knock-out, where ApoAI is a gene known to play a pivotal role in HDL metabolism.

In summary, three of the data sets (D, F-G) are from inbred mouse and rat strains, and the other four data sets (A-C and E) are from large scale studies of gene expression in humans. If all other factors (assay protocol, microarray platform, data pre-processing) were equal, one might anticipate that fewer individuals would be needed for the same statistical power using inbred animals as opposed to unrelated human subjects.

Background adjustment and normalization

Background adjustment and normalization are necessary to remove systematic biases of non-biological origin in microarray studies. A number of methods of background correction and normalization have been proposed.

Estimates of standard deviation and sample size calculation

The distribution of the standard deviations estimated from these 7 data sets are presented in Figure _{2 }transformed prior to data analysis. Figure _{2 }ratio (sample/reference) of two independent groups.

Histogram of standard deviation. The X axis is the standard deviation, and the Y axis is the percentage of genes that have a standard deviation below the value of X. All data sets were normalized by spatial lowess. (A) Data set A: standard deviation of the log ratio of the two groups (direct hybridization); data sets B-D: standard deviation of the difference of log (sample/reference) of the two groups (indirect hybridization). (B) Data sets E-G: common standard deviation of log (sample/reference) of the two independent groups (indirect hybridization).

The required sample size of an experiment depends on the variance component (

For example, in the case of data set A (one sample t-test), if we wish to find out the approximate sample size to detect a 2 fold change (

power.t.test(n = NULL, delta = 1, sd = 0.5584, sig.level = 0.001, power = 0.9, type = "one.sample", alternative = "two.sided")

Where sd = 0.5584 is the 75^{th} percentile of the standard deviation of log ratio.

In the case of data set G (two sample t-test), if we wish to find the approximate sample size to detect a 2 fold change (

power.t.test(n = NULL, delta = 1, sd = 0.3102, sig.level = 0.001, power = 0.9, type = "two.sample", alternative = "two.sided")

Where sd = 0.3102 is the 75^{th} percentile of the common standard deviation of log (sample/reference).
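The two calls above extend directly to other fold changes, because on the log_{2} scale an n-fold change corresponds to delta = log2(n), so that delta = 1 is a 2-fold change. The following is a minimal sketch rather than the exact script used for this paper: the helper name n_for_fold is ours, the standard deviations are the 75^{th} percentile values quoted above, and fractional results are rounded up to whole replicates.

```r
# Hypothetical helper: replicates needed to detect a given fold change.
# delta is the fold change expressed on the log2 scale.
n_for_fold <- function(fold, sd, type, sig.level = 0.001, power = 0.9) {
  res <- power.t.test(n = NULL, delta = log2(fold), sd = sd,
                      sig.level = sig.level, power = power,
                      type = type, alternative = "two.sided")
  ceiling(res$n)  # round up to a whole number of replicates
}

folds <- c(1.5, 2, 4)

# Data set A: one sample t-test, sd = 0.5584 (75th percentile)
n_A <- sapply(folds, n_for_fold, sd = 0.5584, type = "one.sample")

# Data set G: two sample t-test, sd = 0.3102 (75th percentile);
# the result is the number of replicates per group
n_G <- sapply(folds, n_for_fold, sd = 0.3102, type = "two.sample")

print(data.frame(fold = folds, n_A = n_A, n_G = n_G))
```

For the two sample case the value returned by power.t.test is the number of replicates in each group, so the total number of arrays is twice that figure.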

In R, for a one sample t-test or a paired t-test to have power 1-

Power = Pr(t_{v,ncp} < t_{v,α/2}) + Pr(t_{v,ncp} > t_{v,1-α/2})

Where ncp is the noncentrality parameter of the non-central t-distribution, and is estimated by

t_{v,α/2} is the α/2 quantile of the central t-distribution with v degrees of freedom, and t_{v,ncp} follows a non-central t-distribution with v degrees of freedom and a non-centrality parameter of ncp.

For a two sample t-test with equal sample sizes, if we wish to have a large enough sample to detect a difference

Power = Pr(t_{v,ncp} < t_{v,α/2}) + Pr(t_{v,ncp} > t_{v,1-α/2})

Where ncp is the noncentrality parameter of non-central t-distribution, and is estimated by

t_{v,α/2} is the α/2 quantile of the central t-distribution with v degrees of freedom, and t_{v,ncp} follows a non-central t-distribution with v degrees of freedom and a non-centrality parameter of ncp.
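The power expression can be evaluated directly with R's non-central t-distribution functions. The sketch below assumes the noncentrality parameter used inside R's power.t.test, namely ncp = sqrt(n/k) * delta/sd, with k = 1 and v = n - 1 for a one sample or paired test, and k = 2 and v = 2(n - 1) for a two sample test with n replicates per group. The function name t_power is ours. Note that power.t.test drops the usually negligible lower-tail term unless strict = TRUE is set.

```r
# Two-sided power of a t-test from the non-central t-distribution.
# Assumes ncp = sqrt(n / k) * delta / sd, as in stats::power.t.test.
t_power <- function(n, delta, sd, sig.level = 0.001,
                    type = c("one.sample", "paired", "two.sample")) {
  type <- match.arg(type)
  k <- if (type == "two.sample") 2 else 1   # number of groups
  v <- (n - 1) * k                          # degrees of freedom
  ncp <- sqrt(n / k) * delta / sd           # noncentrality parameter
  crit <- qt(1 - sig.level / 2, df = v)     # t_{v, 1 - alpha/2}
  # Pr(t_{v,ncp} < t_{v,alpha/2}) + Pr(t_{v,ncp} > t_{v,1-alpha/2})
  pt(-crit, df = v, ncp = ncp) +
    pt(crit, df = v, ncp = ncp, lower.tail = FALSE)
}

# Power with 10 replicates at the data set A settings
p_manual <- t_power(n = 10, delta = 1, sd = 0.5584, type = "one.sample")

# Cross-check against power.t.test with both tails included
p_ref <- power.t.test(n = 10, delta = 1, sd = 0.5584, sig.level = 0.001,
                      type = "one.sample", alternative = "two.sided",
                      strict = TRUE)$power
print(c(manual = p_manual, reference = p_ref))
```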

Microarray experiments usually involve a large number of genes, with variance components varying greatly across the genes. In general, the variance is higher for low expressors, which make up a large percentage of the genes (Figure ^{th} percentile of variance across all genes, and to use this as the value in the power calculations. For example, if we use the variance for the 50^{th} percentile, then the sample size calculations will assure us of having the desired power to detect a chosen n-fold change for all but the 50% most variable genes. In Figure ^{th}, 50^{th}, 75^{th} and 90^{th} percentiles. The intersection of these lines with the "cumulative percentage of genes" provides the value of ^{th}, 75^{th}, 50^{th}, and 25^{th} percentile genes for a given setting of false positive rate and power. As expected, the required sample size increases with increasing variance, increasing power, and decreasing fold-change and false positive rate.
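The percentile approach can be sketched as follows. The per-gene standard deviations here are made-up placeholder values, not taken from any of the data sets, and the variable names are ours; in practice gene_sd would hold the standard deviations estimated gene by gene from pilot data.

```r
# Placeholder per-gene standard deviations of log2 ratios (illustrative only)
gene_sd <- seq(0.2, 1.2, length.out = 1000)

# Standard deviation at the 25th, 50th, 75th and 90th percentiles
probs <- c(0.25, 0.50, 0.75, 0.90)
sd_at <- quantile(gene_sd, probs = probs)

# Replicates needed to detect a 2-fold change (delta = 1 on the log2 scale)
# for all genes at or below each percentile of variability
n_at <- sapply(sd_at, function(s) {
  ceiling(power.t.test(n = NULL, delta = 1, sd = s, sig.level = 0.001,
                       power = 0.9, type = "one.sample",
                       alternative = "two.sided")$n)
})
print(data.frame(percentile = probs * 100, sd = unname(sd_at),
                 n = unname(n_at)))
```

Choosing a higher percentile covers more of the genes at the desired power, at the cost of a larger sample size.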

Standard deviation versus log intensity. Standard deviations are based on one sample t-tests (data set A), paired t-tests (data sets B-D), or two independent sample t-tests (data sets E-G).

Sample size required to detect 1.5-, 2-, and 4-fold changes of expression level for the 90%, 75%, 50%, and 25% least variable genes for given settings of false positive rates (


A significance level (the probability of making a type I error, that is getting a false positive) of 0.05 is often employed in hypothesis testing. Thousands of genes are usually studied in microarray experiments. When more than 10,000 genes are tested independently, we would expect more than 500 genes to appear as false positives when the 0.05 significance level is applied. Hence, a smaller cut-off p-value should be used in order to reduce the number of false positives. Many multiple testing correction methods have been proposed. The simplest one is the Bonferroni correction (family wise error control) where the nominal significance level is divided by the number of tests. The Bonferroni correction is very stringent. False discovery rate (FDR)
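The arithmetic behind these statements, together with two of the corrections mentioned, can be sketched with base R. The p-values below are simulated placeholders (9,500 null genes and 500 hypothetical differentially expressed genes), so the resulting counts carry no biological meaning; p.adjust is the base R function for Bonferroni and Benjamini-Hochberg FDR adjustment.

```r
m <- 10000        # number of genes tested
alpha <- 0.05     # nominal significance level

expected_fp <- m * alpha       # expected false positives if no gene changes
bonf_cutoff <- alpha / m       # per-gene cut-off under Bonferroni

# Simulated p-values: uniform for null genes, skewed toward 0 otherwise
set.seed(1)
p <- c(runif(9500), rbeta(500, shape1 = 0.1, shape2 = 10))

n_raw  <- sum(p <= alpha)
n_bonf <- sum(p.adjust(p, method = "bonferroni") <= alpha)
n_fdr  <- sum(p.adjust(p, method = "BH") <= alpha)

print(c(expected_fp = expected_fp, bonferroni_cutoff = bonf_cutoff,
        raw = n_raw, bonferroni = n_bonf, fdr = n_fdr))
```

By construction the Bonferroni count can never exceed the FDR count, which in turn can never exceed the unadjusted count.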

For reference, Table

Significant genes/ESTs/probes called by methods used in the studies using different criteria (combination of significance level and fold changes)

| Data set | Reference | P <= 0.001 and estimated fold change >= 2 | P <= 0.001 only | P <= 0.01 only |
|---|---|---|---|---|
| A | Smith et al. 2003 | 183 | 1783 | 3590 |
| B | Lapointe et al. 2004 | 609 | 6549 | 10153 |
| C | Chen et al. 2002 | 1253 | 4187 | 6197 |
| D | Pritchard et al. 2001 | 479 | 1557 | 1845 |
| E | Zhao et al. 2004 | 270 | 1050 | 3821 |
| F | NA | 16 | 145 | 723 |
| G | Callow et al. 2000 | 6 | 11 | 77 |

Discussion

Factors that affect sample size calculation include the magnitude of the variability of the population, the magnitude of the desired detectable expression change, the chosen power to detect the expression change, and the cut-off P-value/significance level/false positive rate. For a given study, the variability of the population being studied is fixed, and once researchers have identified the desired detectable expression change, the required sample size depends on the chosen false positive and false negative rates. The variability of human subject data is typically larger than that seen with laboratory animals and cell lines due to genetic influences on gene expression. Hence, more replicates are needed for studies that involve human subjects (or any other outbred population) than for studies with samples from an inbred population. This is readily apparent in the cDNA data in

Multiple levels of replicates are common in two color microarray experiments. Multiple arrays probed with RNA samples isolated from multiple individuals of a population/treatment/group are referred to as biological replicates. Multiple arrays hybridized using the same RNA or multiple replicates of the same gene within an array are referred to as technical replicates. Although technical replicates can improve the precision and the reliability of the measurement and provide information for quality control, biological replicates are most effective in reducing the variance of the estimate of mean difference. Biological replicates therefore increase the power to detect biologically significant gene expression differences. More importantly, when trying to identify differences between a treatment and a control group, accurate estimates of the biological variability within the groups are essential to determine if the between group differences are meaningful (by a t-test, Analysis of Variance (ANOVA) or other method).
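The advantage of biological over technical replication can be made concrete with a simple nested variance model. This model is an assumption introduced here for illustration and is not part of the analysis in this paper: with n biological replicates, each measured on m technical replicates, the variance of the group mean is sigma_b^2/n + sigma_t^2/(n m), where sigma_b and sigma_t are hypothetical biological and technical standard deviations.

```r
# Variance of a group mean under a nested model (assumed for illustration):
# n biological replicates, each measured m times
var_of_mean <- function(n, m, sigma_b, sigma_t) {
  sigma_b^2 / n + sigma_t^2 / (n * m)
}

# Same number of arrays (8), split differently; sigma values are hypothetical
v_technical  <- var_of_mean(n = 2, m = 4, sigma_b = 0.5, sigma_t = 0.3)
v_biological <- var_of_mean(n = 8, m = 1, sigma_b = 0.5, sigma_t = 0.3)
print(c(two_individuals_four_arrays = v_technical,
        eight_individuals_one_array = v_biological))
```

Technical replicates shrink only the second term, so under this model the variance can never fall below sigma_b^2/n however many arrays are run per individual.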

Careful experimental design is necessary to maximize the statistical power of the test

Caveats

This paper is intended to give some guidance to those planning microarray experiments. The sample size calculations we performed provide an approximate number of replicates for a given set of criteria. Our studies were limited to a small number of published microarray studies for which the following criteria were true:

1) A reasonably large number of biological replicates were analyzed.

2) Raw data was readily available so that we could reprocess all data with the same algorithms.

3) Other potentially large sources of variability such as flow sorting, laser micro-dissection and/or multiple rounds of amplification were not present.

We have only analyzed data from a limited number of tissue types – liver, prostate, breast and blood in human, liver and kidney in mouse, and mammary gland in rat. It is entirely possible that different tissue types will have larger or smaller degrees of biological variation and hence will require more or fewer samples to reach a given conclusion. In addition, lab or experiment specific methods of obtaining and processing samples may induce greater degrees of expression variation than seen in our sample data. As more large data sets become available, it will be useful to extend these studies to better define the magnitude of gene expression variation in purebred animals and in outbred humans across a variety of tissues.

However, the data shown in

Methods

Data set selection, pre-processing and normalization

Background adjustment and normalization are needed in microarray data analysis in order to remove non-biological variation. Intensity based normalization methods such as locally weighted least squares polynomial regression (lowess) are commonly used in cDNA microarray experiments. The background subtracted intensities were normalized by the spatial lowess method using the R add-on package MAANOVA written by the Jackson Lab. For the two cDNA experiments with replicate panels within each array, we normalized the two panels separately. All control genes were excluded from data analysis for data sets A-G.
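To illustrate the idea of intensity based normalization, the sketch below fits a local regression curve to simulated log ratios (M) against average log intensities (A) and subtracts the fitted trend. It uses base R's loess and is a simplified, intensity-only stand-in; it is not the spatial lowess procedure in MAANOVA, and all data and variable names are made up for the example.

```r
set.seed(1)
n_spots <- 500
A <- runif(n_spots, 6, 14)                  # average log2 intensity
# Simulated log2 ratio with an intensity-dependent dye bias
M <- 0.5 - 0.05 * (A - 10)^2 + rnorm(n_spots, sd = 0.2)

# Local linear regression of M on A; the residuals are the normalized ratios
fit <- loess(M ~ A, span = 0.4, degree = 1)
M_norm <- residuals(fit)

print(c(mean_before = mean(M), mean_after = mean(M_norm)))
```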

Estimate of variance components

Pair-wise comparisons among conditions/groups/treatments of gene expression levels are common goals of microarray studies. Simultaneous comparison of more than two treatments/conditions using one way ANOVA can be advantageous. However, a significant F for a comparison of several treatments does not provide information about which particular groups differ from each other. In addition, one way ANOVA is not sensitive to treatment effects when only one or two samples out of many are quite different. T-tests are commonly used to compare individual treatments in pairs.

In order to calculate power and plan sample size, one must first estimate the variance. We applied paired or two sample t-tests in this study based on the correlation between the two groups. For data set A, as the pairs of tumor and adjacent non-tumor tissue are highly correlated, we used two tailed one sample t-tests with the normalized log_{2} ratio of tumor/non-tumor as the response variable. Data sets B-D were generated from paired samples using a reference design on cDNA arrays; paired t-tests are appropriate for these three data sets. We performed two tailed, two sample (or independent) t-tests on data sets E-G with the normalized log ratio as the response variable. The two sample t-tests are based on unequal variances for the two groups of samples.
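For a single gene, the three tests correspond to the following base R calls. The log_{2} values are made-up placeholders and the variable names are ours; the sketch also illustrates that a paired t-test is equivalent to a one sample t-test on the within-pair differences.

```r
# Made-up log2(sample/reference) values for one gene in 6 matched pairs
tumor     <- c(1.9, 2.4, 1.1, 2.8, 1.6, 2.2)
non_tumor <- c(0.8, 1.5, 0.9, 1.2, 0.7, 1.6)

# Paired t-test (reference design, as for data sets B-D)
paired <- t.test(tumor, non_tumor, paired = TRUE, alternative = "two.sided")

# One sample t-test on a log2 ratio (direct hybridization, as for data set A);
# here the ratio is formed from the paired differences
one_sample <- t.test(tumor - non_tumor, mu = 0, alternative = "two.sided")

# Two sample t-test with unequal variances (independent groups,
# as for data sets E-G), treating the two vectors as unrelated samples
welch <- t.test(tumor, non_tumor, var.equal = FALSE,
                alternative = "two.sided")

print(c(paired = paired$p.value, one_sample = one_sample$p.value,
        welch = welch$p.value))
```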

The variances of the data sets with paired samples are the variance of the difference. The common variance of the datasets with independent samples was estimated by the following formula:

Where n_{1} and n_{2} are the numbers of observations for group 1 and group 2, respectively, and S_{1} and S_{2} are the standard deviations for group 1 and group 2, respectively.
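As a sketch of the common variance estimate for two independent groups, assuming the standard pooled-variance form S_{p}^{2} = ((n_{1} - 1) S_{1}^{2} + (n_{2} - 1) S_{2}^{2}) / (n_{1} + n_{2} - 2), which matches the symbols defined above (the function name pooled_sd and the example values are ours):

```r
# Pooled (common) standard deviation of two independent groups,
# assuming the standard pooled-variance estimator
pooled_sd <- function(n1, s1, n2, s2) {
  sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
}

# Example with made-up log2 ratios for one gene in two groups
g1 <- c(0.10, 0.35, -0.20, 0.05)
g2 <- c(1.20, 0.90, 1.40, 1.05, 0.75)
sp <- pooled_sd(length(g1), sd(g1), length(g2), sd(g2))
print(sp)
```

The resulting value is what would be passed as sd to power.t.test with type = "two.sample".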

To simplify power and sample size calculation, and to focus our calculation on

Authors' contributions

JL provided the mouse liver microarray data set F prior to publication. CW performed all the analysis in this paper. RB supervised JL and CW and contributed to the design, coordination and writing. All authors read and approved the final manuscript.

Acknowledgements

Roger Bumgarner receives funding from the following grants: NHBLI-