Division of Personalized Nutrition and Medicine, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA

Department of Statistics, National Chengchi University, Taipei, Taiwan

Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan

Abstract

Background

Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays required in order to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods for sample size estimation are formulated as the minimum sample size necessary to achieve a specified sensitivity (proportion of detected truly differentially expressed genes) on average, given a specified expected proportion (π_{1}) of truly differentially expressed genes in the array. Unfortunately, the probability of detecting the specified sensitivity in such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability.

Results

A sample size estimate based on the common formulation, to achieve the desired sensitivity on average, can be calculated using a univariate method without taking the correlation among genes into consideration. This formulation of the sample size problem is inadequate because the probability of detecting the specified sensitivity can be lower than 50%. On the other hand, the sample size calculated by the proposed permutation method ensures detecting at least the desired sensitivity with 95% probability. The method is shown to perform well for a real example dataset using a small pilot dataset with 4-6 samples per group.

Conclusions

We recommend that the sample size problem be formulated to detect a specified proportion of differentially expressed genes with 95% probability. This formulation ensures finding the desired proportion of true positives with high probability. The proposed permutation method takes the correlation structure and effect size heterogeneity into consideration and works well using only a small pilot dataset.

Background

DNA microarray technology provides tools for studying the expression profiles of hundreds or thousands of distinct genes simultaneously. A fundamental goal in microarray studies is to identify a subset of genes that are differentially expressed under experimental conditions of interest. Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays (replicates) required in order to have adequate power to identify differentially expressed genes.

Many sample size estimation methods have been developed for various Type I error specifications, such as family-wise error rate (FWE)

When the sample size problem is formulated to achieve the specified sensitivity

This paper presents an overview of the power and parameter specifications, and proposes a permutation procedure for sample size determination under the probability formulation (

Methods

Let m_{0} and m_{1} be the numbers of non-differentially and differentially expressed genes, respectively. Given the significance level

Four possible outcomes when testing m hypotheses.

| True State of Nature | Declared significant | Declared not significant | Total |
| --- | --- | --- | --- |
| Null | | | m_{0} |
| Alternative | | | m_{1} |
| Total | | | |

_{0 }is the proportion of genes not differentially expressed that are declared significant, its expectation is the per comparison-wise error rate E(_{0 }= _{1 }is the proportion of truly differentially expressed genes that are correctly declared. In a diagnosis problem, this proportion is often referred as the true positive rate, or the sensitivity. By taking expectation, we have the "average sensitivity", E(_{1}, denoted by

Sample Size Estimation

In sample size estimation, _{1}, and the (standardized) effect size **
δ
**= (

If _{
i
}= _{0 }is constant for all _{0}. Given _{0}, and (1 - _{0}, the sample size can be based on the univariate sample size calculation and is given as

where _{
α
}and _{
β
}are the percentiles of a _{
i
}'s are different, then _{
i
}= ^{-1 }(

The needed sample size is _{
i
}) since _{1 }
_{0}, regardless of the correlation structure among genes and hence the desired sensitivity can be achieved on average. Most sample size estimation methods are either based on this approach or extensions
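As a rough illustration of the univariate calculation, the sketch below applies the textbook normal-approximation formula for a two-sided, two-sample comparison with equal group sizes, n = 2(z_{α/2} + z_{β})^{2}/δ_{0}^{2} per group. This is an assumption made for illustration and is not a reconstruction of Equation (1); percentiles taken from a t distribution would generally give a somewhat larger n. The function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def univariate_sample_size(alpha: float, power: float, delta: float) -> int:
    """Per-group n for a two-sided, two-sample z-test (normal approximation).

    alpha : per-gene significance level
    power : desired per-gene power (the target average sensitivity)
    delta : standardized effect size, assumed constant across genes
    """
    if not (0.0 < alpha < 1.0 and 0.0 < power < 1.0 and delta > 0.0):
        raise ValueError("need 0 < alpha < 1, 0 < power < 1 and delta > 0")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = z.inv_cdf(power)               # quantile matching the target power
    # Round up so that the requested power is at least met.
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / delta ** 2)
```

For example, `univariate_sample_size(0.05, 0.80, 1.0)` gives 16 arrays per group.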

Given _{1 }(= _{1}/_{
i
}= _{0}, _{0}, and the calculated sample size _{0 }since _{
λ0 }of identifying at least _{0 }fraction of _{1 }differentially expressed genes can be calculated as the sum of the binomial probabilities
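Under the independent, constant effect size model, the number of true positives among the m_{1} differentially expressed genes can be treated as a binomial count whose success probability is the per-gene power at the chosen n. A minimal sketch of the resulting tail sum follows. The function name is hypothetical, and the per-gene power is taken as a given input rather than computed from Equation (1).

```python
import math


def prob_reach_sensitivity(m1: int, power: float, lam0: float) -> float:
    """P(X >= lam0 * m1) for X ~ Binomial(m1, power).

    m1    : number of truly differentially expressed genes
    power : per-gene probability of detection at the chosen sample size
    lam0  : specified sensitivity (fraction of the m1 genes to be detected)
    """
    # Smallest count that satisfies the sensitivity target; the small
    # tolerance guards against floating-point noise in lam0 * m1.
    k_min = math.ceil(lam0 * m1 - 1e-9)
    return sum(
        math.comb(m1, k) * power ** k * (1.0 - power) ** (m1 - k)
        for k in range(k_min, m1 + 1)
    )
```

When the per-gene power only just equals the target sensitivity, the target count sits at the mean of the binomial distribution, which is why the probability of reaching it can be far below 95%.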

The method of using Equation (1) to estimate sample size is referred to as the univariate method. Column 3-5 of Table _{
λ0 }at _{0 }= 0.6, 0.7, 0.8, 0.9. The parameters used in the calculation are: _{1 }= 5%, 10%, 20%, _{0 }= 2 and _{
λ0 }can be less than 60%. That is, an experiment sized under this formulation may have sensitivity below the specified λ_{0} level with more than 40% probability.

Average formulation versus 95% probability formulation under the independent model.^{a}

| π_{1} | λ_{0} | Average formulation, univariate method: n^{b} | λ | ϕ_{λ0} | 95% probability formulation, binomial method: n^{c} | λ | ϕ_{λ0} |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5% | 60% | 9 | 0.70 | 0.985 | 9 | 0.70 | 0.985 |
| | 70% | 9 | 0.70 | 0.576 | 10 | 0.81 | 0.997 |
| | 80% | 10 | 0.81 | 0.681 | 11 | 0.88 | 0.993 |
| | 90% | 12 | 0.92 | 0.866 | 13 | 0.95 | 0.992 |
| 10% | 60% | 8 | 0.70 | 0.999 | 8 | 0.70 | 0.999 |
| | 70% | 8 | 0.71 | 0.687 | 9 | 0.82 | 1.000 |
| | 80% | 9 | 0.82 | 0.841 | 10 | 0.89 | 1.000 |
| | 90% | 11 | 0.93 | 0.977 | 11 | 0.93 | 0.977 |
| 20% | 60% | 7 | 0.72 | 1.000 | 7 | 0.72 | 1.000 |
| | 70% | 7 | 0.74 | 0.975 | 7 | 0.74 | 0.975 |
| | 80% | 8 | 0.85 | 0.996 | 8 | 0.85 | 0.996 |
| | 90% | 9 | 0.91 | 0.792 | 10 | 0.95 | 1.000 |

a. Estimated sample size n, average sensitivity λ, and probability ϕ_{λ0} for the specified sensitivity λ_{0} = 60%, 70%, 80%, 90%, under the independent model. The parameters used in the calculation were: π_{1} = 5%, 10%, 20%, δ_{0} = 2 and

b. Sample size n needed to achieve the sensitivity λ_{0} on average.

c. Sample size n such that the probability ϕ_{λ0} of detecting at least λ_{0} fraction of differentially expressed genes is at least 95%.

Alternatively, Wang and Chen _{0 }with a probability _{
λ0}. In this formulation both _{0 }and _{
λ0 }need to be specified and not necessarily equal. The _{
λ0 }is set at 95% since it is consistent with the common statistical practice of using the 95% confidence probability. Under this formulation, for specified _{0 }the needed number of arrays is calculated so that the average sensitivity is greater than _{0 }and the 5^{th }percentile, _{5}, of the distribution of the sensitivity _{1 }is greater than _{0}:

In the independent and constant effect size model, Tsai et al. _{
λ0 }for _{0 }= 0.6, 0.7, 0.8, 0.9. The probabilities in Column 8 are all higher than 95% due to

In Table _{0 }= 2, the difference is up to 1. For a given sensitivity, the needed sample size increases as the effect size _{0 }decreasing, and the difference of the two formulations in the estimates is larger. We calculated the sample sizes using the same parameters as Table _{0 }= 1. The sample size differences increase at about four times those of Table
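The 95% probability formulation under independence can be sketched as a search over n: increase n until the per-gene power is at least λ_{0} and the binomial probability of detecting at least a fraction λ_{0} of the m_{1} differentially expressed genes reaches 95%. The sketch assumes a two-sided z-test with known unit variance (a normal approximation, so it need not reproduce the tabulated sample sizes) and treats the per-gene significance level `alpha` as given. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_Z = NormalDist()


def per_gene_power(n: int, alpha: float, delta: float) -> float:
    """Power of a two-sided two-sample z-test with n arrays per group."""
    crit = _Z.inv_cdf(1.0 - alpha / 2.0)
    shift = delta * math.sqrt(n / 2.0)  # mean of the test statistic
    return (1.0 - _Z.cdf(crit - shift)) + _Z.cdf(-crit - shift)


def binom_tail(m1: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(m1, p)."""
    return sum(
        math.comb(m1, k) * p ** k * (1.0 - p) ** (m1 - k)
        for k in range(k_min, m1 + 1)
    )


def sample_size_for_probability(m1: int, alpha: float, delta: float,
                                lam0: float, conf: float = 0.95,
                                n_max: int = 1000) -> int:
    """Smallest n whose sensitivity reaches lam0 with probability >= conf."""
    k_min = math.ceil(lam0 * m1 - 1e-9)  # true positives required
    for n in range(2, n_max + 1):
        p = per_gene_power(n, alpha, delta)
        # Both conditions of the formulation: average sensitivity and
        # the lower tail of the sensitivity distribution.
        if p >= lam0 and binom_tail(m1, p, k_min) >= conf:
            return n
    raise ValueError("no sample size up to n_max meets the target")
```

Dropping the `binom_tail` condition recovers the average formulation, so the gap between the two answers shows how many extra arrays the probability guarantee costs for a given parameter setting.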

Permutation Method for Sample Size Estimation

Tibshirani

For simplicity, assume an equal sample size in each group, denoted as _{0 }= _{1}. Start with some pilot data with at least 4 samples per group, denoted as _{0p }and _{1p }for the control and treatment group, respectively. For specified _{1}, **
δ
**= (

Algorithm: Sample Size Estimation (See additional file

**The software for the algorithm of the proposed method**. It provides software and an example for the algorithm of the proposed method.


1. Set _{1 }
_{0}
_{0}(1 -

2. Compute the adjustment factor _{1 }
_{2 }where _{
df, p
}is the ^{th }percentile of a

3. Generate the

4. Compute the

5. Multiply each _{1 }
**
s
**

6. Store the permutation statistics **
s
**

7. Repeat 3-6 for all possible permutations, _{0p}+_{1p}) **C**
_{0p}

8. Construct the null distribution by pooling all permutation statistics from the set of non-differentially expressed genes **
s
**

9. Compute the number of significances for the true positives _{
b
}for each statistic in **
s
**

10. Order _{1}, _{2}, ..., _{
N
}, and find the 5^{th }percentile, denoted by

11. Compare _{1 }
_{0}. If _{1}
_{0}, stop and report

In the proposed algorithm, the permutation ^{th }percentile _{1}
_{2}. The adjustment factor consists of two scale factors: _{1 }and _{2}. The first factor, _{1}, accounts for differential sample sizes between the pilot study and the planned study and the second scale factor, _{2}, uses the maximum likelihood estimate of the _{1 }and _{2 }converge to 1 and the proposed and the Tibshirani
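The enumeration and pooling steps of a permutation scheme of this kind (steps 3, 7 and 8) can be sketched as below. The sketch is an illustration only: it assumes a pooled-variance two-sample t statistic, omits the adjustment factor and the treatment of differentially expressed genes, and uses hypothetical function names and a hypothetical data layout (one pair of value lists per gene).

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_stat(x, y):
    """Pooled-variance two-sample t statistic (assumed test statistic)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    if sp2 == 0.0:
        return 0.0  # degenerate split: no variation within either group
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))


def permutation_stats(control, treatment):
    """t statistics over every relabelling of one gene's pilot values."""
    pooled = list(control) + list(treatment)
    n0 = len(control)
    indices = range(len(pooled))
    stats = []
    # Every way of choosing which n0 samples are labelled "control".
    for chosen in combinations(indices, n0):
        chosen_set = set(chosen)
        g0 = [pooled[i] for i in chosen]
        g1 = [pooled[i] for i in indices if i not in chosen_set]
        stats.append(t_stat(g0, g1))
    return stats


def pooled_null(pilot):
    """Pool permutation statistics across genes into one null distribution.

    pilot: iterable of (control_values, treatment_values), one pair per gene.
    """
    null = []
    for control, treatment in pilot:
        null.extend(permutation_stats(control, treatment))
    return null
```

With 4 samples per group there are 70 relabellings per gene, so full enumeration is cheap for pilot data of this size.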

Results

Two simulation analyses were conducted to evaluate the two formulations of sample size estimation described above. The first analysis evaluated the two formulations under the independent and constant effect size model. The theoretical results for the two formulations are shown in Table _{0}. However, to minimize the confounding effect brought by the variation in estimating _{0}, we simply used the true _{0 }in our simulation analysis. Sample sizes were calculated for the given parameter values. The empirical estimates of the FDR, average sensitivity _{
λ0 }were then calculated and evaluated. Using the true _{0 }provides a direct validation of the proposed procedure with control of the FDR.

The purpose of the first simulation study was to validate the theoretical results of the sample size, sensitivity, and 95% probability for the two methods shown in Table _{0 }= _{1}) genes were generated from the independent standard normal N(0,1); for the alternative model, _{1 }= _{1 }genes were generated based on independent normal N(_{0}, 1). For each simulation sample set, the _{
λ0 }were then calculated. The estimate of _{
λ0 }was the proportion of the 1,000 simulations in which the number of true positives was not less than m_{1} × λ_{0}.
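The empirical estimate described above can be sketched as a small Monte Carlo loop. For brevity the sketch draws a z statistic directly for each differentially expressed gene instead of simulating expression values, uses a fixed per-gene significance level `alpha`, and ignores the null genes. These are simplifying assumptions, and the function name is hypothetical.

```python
import math
import random
from statistics import NormalDist


def empirical_phi(m1: int, n: int, alpha: float, delta: float,
                  lam0: float, n_sim: int = 1000, seed: int = 0) -> float:
    """Fraction of simulations in which true positives >= m1 * lam0."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    shift = delta * math.sqrt(n / 2.0)  # mean of the z statistic under H1
    target = math.ceil(lam0 * m1 - 1e-9)
    hits = 0
    for _ in range(n_sim):
        # One independent N(shift, 1) statistic per differentially
        # expressed gene; a gene is detected when |z| exceeds crit.
        true_pos = sum(
            1 for _gene in range(m1) if abs(rng.gauss(shift, 1.0)) > crit
        )
        if true_pos >= target:
            hits += 1
    return hits / n_sim
```

Under these assumptions the estimate should approach the binomial tail probability as `n_sim` grows, which gives a simple consistency check.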

Table _{1 }= 0.05 and _{0 }= 70%. The probability _{
λ0 }is less than 50%, for _{1 }= 0.05 and _{0 }= 70%. For the binomial method, the empirical average sensitivities _{
λ0}'s exceed 95% except for _{1 }= 0.05, _{0 }= 60%, _{1 }= 0.10, _{0 }= 90% and _{1 }= 0.20, _{0 }= 70%. The empirical results of Table

The validation of the theoretical results from Table 2.^{a}

| π_{1} | λ_{0} | Average formulation, univariate method: n^{b} | q | λ | ϕ_{λ0} | 95% probability formulation, binomial method: n^{c} | q | λ | ϕ_{λ0} | 95% probability formulation, permutation: n^{d} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5% | 60% | 9 | 0.0505 | 0.69 | 0.937 | 9 | 0.0505 | 0.69 | 0.937 | 11.3(0.453) |
| | 70% | 9 | 0.0505 | 0.69 | 0.497 | 10 | 0.0502 | 0.80 | 0.983 | 12.5(0.507) |
| | 80% | 10 | 0.0502 | 0.80 | 0.506 | 11 | 0.0494 | 0.87 | 0.961 | 14.2(0.485) |
| | 90% | 12 | 0.0492 | 0.91 | 0.730 | 13 | 0.0484 | 0.95 | 0.965 | 17.1(0.568) |
| 10% | 60% | 8 | 0.0490 | 0.71 | 0.997 | 8 | 0.0490 | 0.71 | 0.997 | 9.8(0.361) |
| | 70% | 8 | 0.0490 | 0.71 | 0.589 | 9 | 0.0506 | 0.81 | 1.000 | 10.8(0.368) |
| | 80% | 9 | 0.0506 | 0.81 | 0.688 | 10 | 0.0503 | 0.88 | 0.999 | 12.1(0.291) |
| | 90% | 11 | 0.0497 | 0.93 | 0.921 | 11 | 0.0497 | 0.93 | 0.921 | 14.6(0.491) |
| 20% | 60% | 7 | 0.0498 | 0.73 | 1.000 | 7 | 0.0498 | 0.73 | 1.000 | 8.0(0.089) |
| | 70% | 7 | 0.0498 | 0.73 | 0.901 | 7 | 0.0498 | 0.73 | 0.901 | 9.0(0.045) |
| | 80% | 8 | 0.0491 | 0.84 | 0.966 | 8 | 0.0491 | 0.84 | 0.966 | 10.1(0.224) |
| | 90% | 9 | 0.0501 | 0.90 | 0.627 | 10 | 0.0497 | 0.94 | 0.999 | 12.2(0.384) |

a. Empirical estimates of FDR q, average sensitivity λ, and probability ϕ_{λ0} of the univariate method for the average formulation and of the binomial method for the 95% probability formulation. The parameters used in the calculation were: δ_{0} = 2, and

b. Sample size n needed to achieve the sensitivity λ_{0} on average.

c. Sample size n needed to achieve the sensitivity λ_{0} with 95% probability.

d. Sample size n needed to achieve the sensitivity λ_{0} with 95% probability with pilot study of group size 4 under the independent model.

For comparison purposes, the mean and standard deviation of the sample size estimates from the proposed permutation method using a pilot dataset of group size 4 are also provided in the last column of Table

The second analysis was to evaluate the four methods, the univariate method (Jung

In the first step, 4 samples from the colon dataset were randomly selected without replacement from each group to form a pilot dataset. The algorithm described above was used to estimate the sample size for the proposed method and the Tibshirani _{1 }= 5%, _{0 }= 90%, the initial sample size was _{
i
}= _{0 }= 2 was considered. For the proposed permutation method, the initial adjustment factors for _{1 }= 0.6777 and _{2 }=

The procedure was repeated 1,000 times to select different pilot datasets of size 4 from each group to account for the variation of pilot dataset. The means and standard deviations of the sample size estimates from the Tibshirani _{0 }increases or _{1 }decreases. Note that, under the independent model, the sample size and standard deviation estimates from the proposed method are smaller (Table

Sample size estimates (standard deviations) for the proposed method and the Tibshirani method.^{a}

| π_{1} | λ_{0} | n^{b} | Pilot study of group size 4: n^{c} | Pilot study of group size 4: n^{d} | Pilot study of group size 6: n^{c} | Pilot study of group size 6: n^{d} | Entire data of size 62: n^{e} |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5% | 60% | 9 | 12.2(2.931) | 20.2(6.529) | 12.7(2.193) | 14.9(3.347) | 9.5 |
| | 70% | 9 | 13.1(2.848) | 21.6(6.209) | 13.4(2.330) | 15.9(3.504) | 10.3 |
| | 80% | 10 | 14.3(3.017) | 23.6(6.399) | 14.4(2.335) | 17.2(3.547) | 11.5 |
| | 90% | 12 | 16.3(2.997) | 27.1(6.303) | 16.1(2.365) | 19.5(3.559) | 13.7 |
| 10% | 60% | 8 | 10.9(2.409) | 15.7(4.664) | 11.5(2.015) | 12.5(2.828) | 8.1 |
| | 70% | 8 | 11.8(2.544) | 16.8(4.858) | 12.1(2.096) | 13.4(2.971) | 8.8 |
| | 80% | 9 | 13.0(2.601) | 18.6(4.809) | 13.0(2.033) | 14.4(2.852) | 9.8 |
| | 90% | 11 | 14.7(2.944) | 21.5(5.250) | 14.6(2.275) | 16.4(3.099) | 11.8 |
| 20% | 60% | 7 | 9.8(2.184) | 12.2(3.608) | 10.3(1.832) | 10.4(2.390) | 6.7 |
| | 70% | 7 | 10.4(2.236) | 12.8(3.675) | 10.7(1.899) | 10.9(2.446) | 7.3 |
| | 80% | 8 | 11.4(2.414) | 14.2(3.709) | 11.6(1.995) | 11.9(2.506) | 8.2 |
| | 90% | 9 | 13.1(2.515) | 16.5(3.902) | 13.0(2.074) | 13.6(2.603) | 9.9 |

a. The sample size estimates are based on 1,000 repetitions using the colon tumor data with δ_{0} = 2 and

b. The univariate method.

c. The proposed permutation method.

d. The Tibshirani method.

e. The Shao and Tseng method.

The procedure was repeated with 6 samples for the initial pilot dataset. The estimates are shown in Columns 6 and 7. The proposed procedure gives consistent results from the two pilot sample sizes; however, the results from the Tibshirani

In our simulations, the Algorithm B in Shao and Tseng _{1 }= 20%, _{0 }= 60% and 70%; the mean (standard deviation) of the sample size estimates are 6.4(0.012) and 6.8(0.012), respectively. The estimated values appear too small to be correct. This method does not appear to be applicable for small pilot sample sizes. Using the entire colon cancer dataset

Comparison of the performance of the two methods is similar to that shown in Table _{0 }= 2 was added to a set of randomly selected _{1 }genes in the tumor group. For each re-sampled data set, the permutation test was used to generate a p-value and the numbers of false positives and true positives were computed using _{
λ0 }were computed. The entire procedure was repeated 1,000 times.

Table _{
λ0 }for the two methods. Both methods are shown to control the FDR well and achieve the desired sensitivity. Thus the two methods can be expected to have satisfactory performance in practice. However, for the univariate method, the empirical _{
λ0 }estimates are between 55% and 75%, except for one at 80%. One would therefore have to accept the risk that the sensitivity falls below the specified level.

Empirical estimates of FDR q, average sensitivity λ, and probability ϕ_{λ0} from the univariate method and the proposed method based on the results of Table 4.

| π_{1} | λ_{0} | Average formulation, univariate method: n | q | λ | ϕ_{λ0} | 95% probability formulation, proposed method: n | q | λ | ϕ_{λ0} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5% | 60% | 9 | 0.0412 | 0.65 | 0.661 | 13 | 0.0431 | 0.94 | 0.976 |
| | 70% | 9 | 0.0424 | 0.65 | 0.558 | 14 | 0.0443 | 0.97 | 0.984 |
| | 80% | 10 | 0.0389 | 0.76 | 0.611 | 15 | 0.0458 | 0.99 | 0.993 |
| | 90% | 12 | 0.0460 | 0.91 | 0.743 | 17 | 0.0426 | 1.00 | 0.998 |
| 10% | 60% | 8 | 0.0427 | 0.66 | 0.666 | 11 | 0.0474 | 0.92 | 0.964 |
| | 70% | 8 | 0.0419 | 0.66 | 0.585 | 12 | 0.0478 | 0.96 | 0.973 |
| | 80% | 9 | 0.0431 | 0.78 | 0.666 | 13 | 0.0450 | 0.98 | 0.981 |
| | 90% | 11 | 0.0466 | 0.92 | 0.800 | 15 | 0.0475 | 1.00 | 0.994 |
| 20% | 60% | 7 | 0.0433 | 0.69 | 0.711 | 10 | 0.0447 | 0.94 | 0.975 |
| | 70% | 7 | 0.0448 | 0.69 | 0.634 | 11 | 0.0498 | 0.97 | 0.987 |
| | 80% | 8 | 0.0428 | 0.81 | 0.703 | 12 | 0.0496 | 0.99 | 0.994 |
| | 90% | 9 | 0.0442 | 0.89 | 0.716 | 14 | 0.0488 | 1.00 | 1.000 |

The effect size of _{0 }= 2 (Table _{0 }= 1 for two pilot sample sizes 4 and 6. The sample size estimates are shown in Table _{0 }= 2 in Table

Sample size estimates (standard deviations) for the proposed method and the Tibshirani method.^{a}

| π_{1} | λ_{0} | n^{b} | Pilot study of group size 4: n^{c} | Pilot study of group size 4: n^{d} | Pilot study of group size 6: n^{c} | Pilot study of group size 6: n^{d} | Entire data of size 62: n^{e} |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5% | 60% | 26 | 39.4(11.166) | 77.8(22.283) | 40.5(8.743) | 56.8(12.376) | 29.0 |
| | 70% | 29 | 43.0(11.659) | 84.5(24.442) | 43.5(8.913) | 61.1(13.570) | 31.7 |
| | 80% | 33 | 48.7(13.104) | 92.2(23.398) | 47.8(9.138) | 65.4(13.134) | NaN |
| | 90% | 40 | 56.8(13.846) | 106.3(25.373) | 54.3(9.168) | 74.1(13.846) | NaN |
| 10% | 60% | 23 | 34.9(9.140) | 60.9(18.692) | 36.8(8.074) | 48.5(12.212) | 25.0 |
| | 70% | 26 | 38.8(9.821) | 66.2(18.993) | 40.0(8.408) | 52.0(11.819) | 27.8 |
| | 80% | 29 | 43.3(10.399) | 72.5(18.492) | 43.3(8.475) | 55.8(11.662) | 31.4 |
| | 90% | 35 | 50.2(10.649) | 83.7(20.271) | 49.5(8.593) | 64.0(12.485) | NaN |
| 20% | 60% | 19 | 31.1(9.066) | 47.0(14.301) | 32.6(7.572) | 39.8(9.552) | 20.7 |
| | 70% | 22 | 34.4(8.740) | 50.4(14.156) | 35.7(7.816) | 42.6(9.570) | 23.4 |
| | 80% | 25 | 38.6(9.611) | 55.5(15.393) | 39.0(7.766) | 46.6(10.415) | 27 |
| | 90% | 31 | 44.7(9.655) | 63.6(14.919) | 44.5(7.999) | 52.3(10.313) | 32.3 |

a. The sample size estimates are based on 1,000 repetitions using the colon tumor data with δ_{0} = 1 and

b. The univariate method.

c. The proposed permutation method.

d. The Tibshirani method.

e. The Shao and Tseng method.

Discussion and Conclusions

Determination of the needed sample size before conducting a microarray experiment is an important issue. The sample size problem is commonly formulated as the number of arrays needed to achieve the specified sensitivity _{
λ
}that the specified sensitivity is achieved can be low (less than 50%) due to the variance in sensitivity distributions. Furthermore, under this formulation this paper shows that the sample size can be calculated by a univariate method, regardless of the correlation structure among the gene expression levels; the procedures to account for correlations, such as Li et al.

Under the confidence probability formulation, consideration of the dependency among gene expressions is necessary in estimating the sample size since the percentile of the sensitivity distributions not only depends on the effect size of individual genes but also on their correlations. We propose a permutation method based on the method proposed by Tibshirani _{2 }may not be necessary (data not shown).

The choice of a particular multiple testing procedure used for data analysis can affect the error rate and power in the sample size estimation. Using a conservative procedure in the data analysis may decrease the "power" of the study; sometimes, the calculated sample size may yield sensitivity below the specified level. For example, in this paper the calculation is based on the true number of non-differentially expressed genes m_{0}. However, if the data analysis uses an overestimated m_{0} such as the Benjamini and Hochberg procedure _{0 }to estimate the sample size. This procedure is expected to generate an appropriate sample size to achieve the desired sensitivity with a specified probability, regardless of which multiple testing procedure is used for data analysis.
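For reference, the Benjamini and Hochberg step-up procedure mentioned above can be sketched in its standard textbook form as follows. This is not code from this paper, and the function name is hypothetical. The procedure compares the ordered p-values with thresholds based on the total number of tests m, which is where its conservativeness relative to a procedure using the true m_{0} comes from.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k with
    p_(k) <= k * q / m and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

For instance, with p-values 0.02, 0.03 and 0.04 at q = 0.05, the smallest p-value fails its own threshold (0.05/3) but all three hypotheses are still rejected, because the largest p-value meets its threshold of 0.05.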

Authors' contributions

JJC conceived the study and wrote the manuscript. JJC and WJL developed the methodology and proved theoretical results. WJL implemented the algorithms. HMH improved the concepts of the average and 95% confidence probability formulations. JJC, HMH and WJL performed the analysis. All authors read and approved the final manuscript.

Acknowledgements

Huey-Miin Hsueh's research was done while visiting the NCTR. The authors are very grateful to the reviewers for many helpful comments and suggestions for revising and improving this paper. The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.