Open Access Highly Accessed Methodology article

Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data

Chung-I Li1,3, Pei-Fang Su2,3 and Yu Shyr3*

Author Affiliations

1 Department of Applied Mathematics, National Chiayi University, Chiayi, Taiwan

2 Department of Statistics, National Cheng Kung University, Tainan, Taiwan

3 Center for Quantitative Sciences, Vanderbilt University, 571 Preston Building Nashville, TN, USA

BMC Bioinformatics 2013, 14:357  doi:10.1186/1471-2105-14-357


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/357


Received: 3 June 2013
Accepted: 28 November 2013
Published: 6 December 2013

© 2013 Li et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Sample size calculation is an important issue in the experimental design of biomedical research. For RNA-seq experiments, the sample size calculation method based on the Poisson model has been proposed; however, when there are biological replicates, RNA-seq data could exhibit variation significantly greater than the mean (i.e. over-dispersion). The Poisson model cannot appropriately model the over-dispersion, and in such cases, the negative binomial model has been used as a natural extension of the Poisson model. Because the field currently lacks a sample size calculation method based on the negative binomial model for assessing differential expression analysis of RNA-seq data, we propose a method to calculate the sample size.

Results

We propose a sample size calculation method based on the exact test for assessing differential expression analysis of RNA-seq data.

Conclusions

The proposed sample size calculation method is straightforward and not computationally intensive. Simulation studies to evaluate the performance of the proposed sample size method are presented; the results indicate our method works well, with achievement of desired power.

Background

Next generation sequencing (NGS) technology has revolutionized genetic analysis; RNA-seq is a powerful NGS method that enables researchers to discover, profile, and quantify RNA transcripts across the entire transcriptome. In addition, unlike the microarray chip, which offers only quantification of gene expression level, RNA-seq provides expression level data as well as differentially spliced variants, gene fusion, and mutation profile data. Such advantages have gradually elevated RNA-seq as the technology of choice among researchers. Nevertheless, the advantages of RNA-seq are not without computational cost; as compared to microarray analysis, RNA-seq data analysis is much more complicated and difficult. In the past several years, the published literature has addressed the application of RNA-seq to multiple research questions, including abundance estimation [1-3], detection of alternative splicing [4-6], detection of novel transcripts [6,7], and the biology associated with gene expression profile differences between samples [8-10]. With this rapid growth of RNA-seq applications, discussion of experimental design issues has lagged behind, though more recent literature has begun to address some of the relevant principles (e.g., randomization, replication, and blocking) to guide decisions in the RNA-seq framework [11,12].

One of the principal questions in designing an RNA-seq experiment is: What is the optimal number of biological replicates to achieve desired statistical power? (Note: In this article, the term “sample size” is used to refer to the number of biological replicates or number of subjects.) Because RNA-seq data are counts, the Poisson distribution has been widely used to model the number of reads obtained for each gene to identify differential gene expression [8,13]. Further, [12] used a Poisson distribution to model RNA-seq data and derived a sample size calculation formula based on the Wald test for single-gene differential expression analysis. It is worth noting that a critical assumption of the Poisson model is that the mean and variance are equal. This assumption may not hold, however, as read counts could exhibit variation significantly greater than the mean [14]. That is, the data are over-dispersed relative to the Poisson model. In such cases, one natural alternative to the Poisson model is the negative binomial model. Based on the negative binomial model, [14,15] proposed a quantile-adjusted conditional maximum likelihood procedure to create pseudocounts, which led to the development of an exact test for assessing differential expression in RNA-seq data. Furthermore, [16] provided a Bioconductor package, edgeR, based on the exact test.

Sample size determination based on the exact test has not yet been studied, however. Therefore, the first goal of this paper is to propose a sample size calculation method based on the exact test.

In reality, thousands of genes are examined in an RNA-seq experiment; differential expression among those genes is tested simultaneously, requiring the correction of error rates for multiple comparisons. For the high-dimensional multiple testing problem, several such corrected measures have been proposed, such as family-wise error rate (FWER) and false discovery rate (FDR). In high-dimensional multiple testing circumstances, controlling FDR is preferable [17] because the Bonferroni correction for FWER is often too conservative [18]. Many methods have been proposed to control FDR in the analysis of high-dimensional data [17,19,20]. Those concepts have been extended to calculate sample size for microarray studies [21-25]. To our knowledge, however, the literature does not address determination of sample size while controlling FDR in RNA-seq data. Therefore, the second purpose of this paper is to propose a procedure to calculate sample size while controlling FDR for differential expression analysis of RNA-seq data.

In sum, in this article, we address the following two questions: (i) For a single-gene comparison, what is the minimum number of biological replicates needed to achieve a specified power for identifying differential gene expression between two groups? (ii) For multiple gene comparisons, what is the suitable sample size while controlling FDR? The article is organized as follows. In the Method section, a sample size calculation method is proposed for a single-gene comparison. We then extend the method to address the multiple comparison test issue. Performance comparisons via numerical studies are described in the Results section. Two real RNA-seq data sets are used to illustrate sample size calculation. Finally, discussion follows in the Conclusions section.

Method

Exact test

In an RNA-seq experiment, the total number of reads mapped to the genome (also referred to as the library size) differs among samples. The counts in each group are therefore not identically distributed, which makes it difficult to develop an exact test for assessing differential expression in RNA-seq data. To handle this issue, [14,15] proposed a quantile-adjusted conditional maximum likelihood procedure to create pseudocounts that are approximately identically distributed, which leads to an exact test. The sample size calculation method proposed below is based on this exact test for a single-gene comparison. Let $Y_{ij}$ be the random variable corresponding to the pseudocount, with observed value $y_{ij}$, of the $j$th sample ($j = 1,2,\ldots,n_i$) of the $i$th group ($i = 0,1$), where $n_0$ and $n_1$ are the numbers of samples in the control and treatment groups, respectively. Assume that the pseudocount $Y_{ij}$ follows a negative binomial (NB) distribution, $NB(d_{ij}\gamma_i, \phi)$. Here, $\gamma_i$ represents the normalized gene expression level of group $i$, $d_{ij}$ represents a normalization factor for the total number of reads mapped in the $j$th sample of the $i$th group, and $\phi$ is the dispersion. We use the NB parameterization in which the mean is $\mu_{ij} = d_{ij}\gamma_i$ and the variance is $\mu_{ij}(1 + \mu_{ij}\phi)$. Because the question of interest is to identify differential gene expression between two groups, the corresponding hypotheses are

$$H_0: \gamma_1 = \gamma_0 \quad \text{versus} \quad H_1: \gamma_1 \neq \gamma_0$$

(1)
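As a minimal sketch of the NB parameterization used here (mean μ, dispersion ϕ, variance μ + ϕμ²), the following Python functions evaluate the probability mass function and check the mean and variance numerically. The names `nb_pmf` and `nb_moments` are illustrative helpers introduced for this example and are not part of edgeR; the infinite sum is truncated at an assumed upper limit.

```python
import math


def nb_pmf(y: int, mu: float, phi: float) -> float:
    """P(Y = y) for a negative binomial with mean mu > 0 and dispersion phi > 0."""
    r = 1.0 / phi  # "size" parameter; the variance is mu + phi * mu**2
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)


def nb_moments(mu: float, phi: float, upper: int = 5000) -> tuple[float, float]:
    """Mean and variance obtained by summing the pmf over 0..upper (truncated sum)."""
    probs = [nb_pmf(y, mu, phi) for y in range(upper + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


mean, var = nb_moments(mu=5.0, phi=0.5)
# The mean should be close to 5 and the variance close to 5 + 0.5 * 25 = 17.5.
print(round(mean, 6), round(var, 6))
```

As ϕ approaches zero the variance approaches μ, which is the Poisson case discussed in the Background.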

Because the pseudocounts in each group have an approximately identical negative binomial distribution [14,15], the sum of pseudocounts of each group, $Y_i = \sum_{j=1}^{n_i} Y_{ij}$, has a negative binomial distribution $NB(n_i d_i^* \gamma_i, \phi/n_i)$, where $d_i^* = \left(\prod_{j=1}^{n_i} d_{ij}\right)^{1/n_i}$ is the geometric mean of the normalization factors in group $i$. Under the null hypothesis (1), the total pseudocount, $Y_1 + Y_0$, follows a negative binomial distribution. In analogy with Fisher’s exact test, [14,15] proposed an exact test that replaces the hypergeometric probabilities with negative binomial probabilities. Because [16] developed the Bioconductor package edgeR, which implements the methodology of [14,15], the p-value of the exact test can be calculated easily.

In the following simulation and application sections, we used edgeR version 3.0.6 for estimating model parameters and performing the exact test.
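The conditioning idea behind the exact test can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the edgeR implementation. It assumes a balanced design with equal normalization factors, so that under the null hypothesis both group sums share the same NB size parameter n/ϕ. It also assumes a two-sided p-value defined as the total conditional probability of all splits of the observed total that are no more likely than the observed split. The function name `exact_test_pvalue` is hypothetical.

```python
import math


def exact_test_pvalue(y1: int, y0: int, n: int, phi: float) -> float:
    """Two-sided conditional p-value for group sums y1 and y0 (n samples per group)."""
    r = n / phi          # NB size parameter of each group sum (dispersion phi / n)
    s = y1 + y0          # condition on the total pseudocount

    def log_weight(a: int) -> float:
        # log of the unnormalized conditional probability P(Y1 = a | Y1 + Y0 = s)
        return (math.lgamma(a + r) - math.lgamma(a + 1)
                + math.lgamma(s - a + r) - math.lgamma(s - a + 1))

    logs = [log_weight(a) for a in range(s + 1)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]   # rescaled to avoid underflow
    cutoff = weights[y1] * (1 + 1e-9)             # small tolerance for ties
    return sum(w for w in weights if w <= cutoff) / sum(weights)


print(exact_test_pvalue(12, 30, n=3, phi=0.1))
```

The common mean cancels out of the conditional probabilities, which is why the function needs only the two sums, the group size, and the dispersion.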

Sample size calculation for controlling type I error rate

In this section, we focus on sample size calculation based on the exact test for a single-gene comparison, as described in the Exact test section. For simplicity, we assume the RNA-seq experiment uses a balanced design (i.e., $n_0 = n_1 = n$), which is a special but common case. The method extends easily to the unbalanced case (i.e., $n_0 = n$ and $n_1 = kn$, where $k$ is a predetermined ratio of the sample size of the treatment group to that of the control group). To perform sample size calculations, it is necessary to construct a power function for the test described above. The power of a test is the probability that the null hypothesis is rejected when the alternative hypothesis is true. Because the distribution of the exact test statistic under the alternative hypothesis is unknown, it is difficult to derive a closed-form expression for the power function. Instead of deriving that distribution, [26] proposed a method to calculate the power of an exact test based on a given p-value. Here, we borrow their concept to calculate power. For a given p-value $p(y_1, y_0)$, where $y_1$ and $y_0$ are the observed pseudo-sums described in the previous section, the power can be expressed as

$$\xi(n,\rho,\mu_0,\phi,w,\alpha) = \sum_{y_1=0}^{\infty}\sum_{y_0=0}^{\infty} f(nw\rho\mu_0, \phi/n)\, f(n\mu_0, \phi/n)\, I\big(p(y_1,y_0) < \alpha\big)$$

where $w = d_1^*/d_0^*$ is the ratio of the geometric means of normalization factors between the two groups, $\rho = \gamma_1/\gamma_0$ is the fold change, $\mu_0 = d_0^*\gamma_0$ is the average number of reads in the control group, $f(\mu,\phi)$ is the probability mass function of the negative binomial distribution with mean $\mu$ and dispersion $\phi$ (evaluated at $y_1$ in the first factor and at $y_0$ in the second), $\alpha$ is the level of significance, and $I(\cdot)$ denotes the indicator function. For a given desired power $1-\beta$, the power of the test can be written as a function of the sample size in the form

$$\xi(n,\rho,\mu_0,\phi,w,\alpha) = 1 - \beta$$

(2)

Thus, the required sample size n to attain the given power 1-β at level of significance α can then be calculated by solving (2) through a numerical approach, such as a gradient-search or bisection procedure.
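The calculation described above can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' R code. It assumes w = 1 and uses a two-sided conditional p-value (total conditional probability of splits no more likely than the observed one). The infinite double sum is truncated once all but 1e-10 of each marginal distribution is covered, and a plain linear scan over n stands in for the gradient-search or bisection procedure mentioned in the text. All function names are hypothetical.

```python
import bisect
import math
from itertools import accumulate


def nb_pmf_table(mean: float, disp: float, tail: float = 1e-10) -> list[float]:
    """NB probabilities for y = 0, 1, ... until all but `tail` of the mass is covered."""
    r = 1.0 / disp
    probs, cum, y = [], 0.0, 0
    while cum < 1.0 - tail and y < 100_000:
        log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                 + r * math.log(r / (r + mean)) + y * math.log(mean / (r + mean)))
        probs.append(math.exp(log_p))
        cum += probs[-1]
        y += 1
    return probs


def conditional_pvalues(s: int, r: float) -> list[float]:
    """Two-sided exact p-value for every split (y1, s - y1) of a total s."""
    logs = [math.lgamma(a + r) - math.lgamma(a + 1)
            + math.lgamma(s - a + r) - math.lgamma(s - a + 1) for a in range(s + 1)]
    top = max(logs)
    w = [math.exp(v - top) for v in logs]
    ordered = sorted(w)
    prefix = list(accumulate(ordered))
    # p-value of a split: total weight of the splits that are no more likely than it
    return [prefix[bisect.bisect_right(ordered, x * (1 + 1e-9)) - 1] / prefix[-1]
            for x in w]


def power(n: int, rho: float, mu0: float, phi: float, alpha: float) -> float:
    """Truncated version of the double sum defining the power, assuming w = 1."""
    f0 = nb_pmf_table(n * mu0, phi / n)          # control-group sum
    f1 = nb_pmf_table(n * rho * mu0, phi / n)    # treatment-group sum
    total = 0.0
    for s in range(len(f0) + len(f1) - 1):
        pvals = conditional_pvalues(s, n / phi)
        for y1 in range(max(0, s - len(f0) + 1), min(s, len(f1) - 1) + 1):
            if pvals[y1] < alpha:
                total += f1[y1] * f0[s - y1]
    return total


def required_n(rho: float, mu0: float, phi: float, alpha: float, beta: float,
               n_max: int = 200) -> int:
    """Smallest n in 1..n_max whose power reaches 1 - beta (linear scan)."""
    for n in range(1, n_max + 1):
        if power(n, rho, mu0, phi, alpha) >= 1.0 - beta:
            return n
    raise ValueError("power target not reached for n <= n_max")


# Illustrative inputs, not taken from the article's tables.
n_star = required_n(rho=3.0, mu0=5.0, phi=0.1, alpha=0.05, beta=0.2)
print(n_star, round(power(n_star, 3.0, 5.0, 0.1, 0.05), 3))
```

A linear scan is used because the power of a discrete test need not increase strictly with n; for large n a bisection search over a bracketing interval would be faster.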

Sample size calculation for controlling false discovery rate

In reality, thousands of genes are examined in an RNA-seq experiment, and those genes are tested simultaneously for significance of differential expression. In such cases, the sample size calculation for a single-gene comparison discussed above cannot be applied directly. Jung, 2005 [23] incorporated FDR control into sample size calculation based on a two-sample t-test under the Gaussian distribution assumption. In this section, we borrow that concept to incorporate FDR control based on the exact test described in the Exact test section.

For the multiple testing problem, [19] suggested the use of the false discovery rate (FDR), which is defined as the expected proportion of false discoveries among rejected null hypotheses. Storey, 2002 [17] further proposed an improvement to FDR to achieve higher power, in the form

$$\mathrm{FDR} = E\left(\frac{R_0}{R} \,\middle|\, R > 0\right)$$

where $R_0$ is the number of false discoveries and $R$ is the number of results declared significant (i.e., rejections of the null hypothesis).

To calculate the sample size for microarray data analysis, [23] proposed an FDR-controlled method based on the expression of FDR under independence (or weak dependence) among test statistics,

$$\mathrm{FDR}(\alpha) = \frac{m_0\alpha}{m_0\alpha + E(R_1)}$$

[17,27], where $m_0$ is the number of true null hypotheses and $E(R_1)$ is the expected number of true rejections. By borrowing their concepts, the expected number of true rejections for RNA-seq data can be calculated as

$$E(R_1) = \sum_{g \in M_1} \xi(n,\rho_g,\mu_{0g},\phi_g,w,\alpha)$$

where $\rho_g$ is the fold change, $\phi_g$ is the dispersion, and $\mu_{0g}$ is the average read count in the control group for gene $g \in M_1$ (the set of prognostic genes). Thus, to guarantee an expected number of true rejections, say $r_1$, and control FDR at a specified level $f$, we have

$$f = \frac{m_0\alpha}{m_0\alpha + r_1}$$

(3)

and

$$r_1 = \sum_{g \in M_1} \xi(n,\rho_g,\mu_{0g},\phi_g,w,\alpha)$$

(4)

By solving equation (3) with respect to $\alpha$, we have

$$\alpha^* = \frac{r_1 f}{m_0(1-f)}$$

where $\alpha^*$ is the marginal type I error level for the expected number of true rejections $r_1$ at a given FDR $f$. Replacing $\alpha$ with $\alpha^*$ in (4), we obtain the function of $n$

$$g_1(n) = \sum_{g \in M_1} \xi(n,\rho_g,\mu_{0g},\phi_g,w,\alpha^*) - r_1$$

Then, by solving g1(n) = 0 via a numerical approach, the required sample size for controlling FDR at level f can be obtained.
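The rearrangement that takes the FDR constraint (3) to the marginal level α* is short enough to write out. The steps below assume that (3) has the form f = m0α/(m0α + r1), which is the form implied by the stated solution α* = r1f/(m0(1-f)).

```latex
\begin{aligned}
f &= \frac{m_0\alpha}{m_0\alpha + r_1} \\
f\,(m_0\alpha + r_1) &= m_0\alpha \\
f\,r_1 &= m_0\alpha\,(1 - f) \\
\alpha^* &= \frac{r_1 f}{m_0(1-f)}
\end{aligned}
```

Because α* depends only on r1, f, and m0, it can be fixed before any search over the sample size n begins.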

To calculate the sample size, we have to estimate all of the fold changes $\rho_g$, dispersions $\phi_g$, and average read counts $\mu_{0g}$ for the set of prognostic genes $g \in M_1$ prior to the RNA-seq experiment. In practice, however, we may not have enough information to estimate all of those parameters. To address this issue, we propose the following method to obtain a conservative estimate of the required sample size. Because the power increases as $|\log_2(\rho_g)|$ or $\mu_{0g}$ increases and as $\phi_g$ decreases, we suggest using a common minimum fold change $\rho^* = \arg\min_{\rho_g:\, g \in M_1} |\log_2(\rho_g)|$, minimum average read count $\mu_0^* = \min_{g \in M_1} \mu_{0g}$, and maximum dispersion $\phi^* = \max_{g \in M_1} \phi_g$ in place of each $\rho_g$, $\mu_{0g}$, and $\phi_g$, respectively. This gives a more conservative estimate of the required sample size.

When we use $\rho^*$, $\mu_0^*$, and $\phi^*$ in place of each $\rho_g$, $\mu_{0g}$, and $\phi_g$, $g \in M_1$, in the multiple testing context, $\alpha^*$ and $\beta^*$ can be calculated as $r_1 f/(m_0(1-f))$ and $1 - r_1/m_1$, respectively, where $m_1$ is the number of prognostic genes. (With common parameters, every prognostic gene has the same power $1-\beta^*$, so the expected number of true rejections is $r_1 = m_1(1-\beta^*)$.) In other words, the power function (2) can be applied in the case of multiple gene comparisons, with $\alpha$ and $\beta$ replaced by $\alpha^*$ and $\beta^*$.

The procedures for sample size calculation detailed in this section can be summarized as follows:

1. Specify the following parameters: $m$: total number of genes for testing; $m_1$: number of prognostic genes; $r_1$: number of true rejections; $f$: FDR level; $w$: ratio of normalization factors between the two groups; $\{\mu_{0g}, g \in M_1\}$: average read counts of the prognostic genes in the control group; $\{\rho_g, g \in M_1\}$: fold changes of the prognostic genes; $\{\phi_g, g \in M_1\}$: dispersions of the prognostic genes;

2. Calculate sample size:

(a) If all the parameters $\mu_{0g}$, $\rho_g$, and $\phi_g$ for each prognostic gene $g$ are known, use a numerical approach to solve the following equation with respect to $n$:

$$\sum_{g \in M_1} \xi(n,\rho_g,\mu_{0g},\phi_g,w,\alpha^*) = r_1$$

where $\alpha^* = r_1 f/(m_0(1-f))$ and $m_0 = m - m_1$;

(b) Otherwise,

• specify a desired minimum fold change $\rho^*$, a minimum average read count $\mu_0^*$, and a maximum dispersion $\phi^*$;

• set $\rho = \rho^*$, $\mu_0 = \mu_0^*$, $\phi = \phi^*$, $\alpha = \alpha^* = r_1 f/(m_0(1-f))$, and $\beta = \beta^* = 1 - r_1/m_1$ in equation (2) and solve it with respect to $n$.
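The parameter conversions in step 2 can be sketched as follows. The function names `marginal_levels` and `implied_fdr` are illustrative and not from the article's R code; the example inputs are the settings used later in the simulation study (m = 10000, m1 = 100, r1 = 80) at FDR 5%.

```python
def marginal_levels(m: int, m1: int, r1: float, f: float) -> tuple[float, float]:
    """Return (alpha_star, beta_star) for the common-parameter case."""
    m0 = m - m1                              # number of true null genes
    alpha_star = r1 * f / (m0 * (1.0 - f))   # marginal type I error level
    beta_star = 1.0 - r1 / m1                # one minus the per-gene power
    return alpha_star, beta_star


def implied_fdr(m0: int, alpha: float, r1: float) -> float:
    """FDR implied by testing m0 null genes at level alpha with r1 true rejections."""
    return m0 * alpha / (m0 * alpha + r1)


alpha_star, beta_star = marginal_levels(m=10000, m1=100, r1=80, f=0.05)
print(f"{alpha_star:.3e}", round(beta_star, 3))  # 4.253e-04 0.2
```

Substituting the resulting level back into `implied_fdr` returns the requested FDR, which is a quick consistency check before the more expensive search over n.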

Results

Numerical studies

In this section, we conducted simulation studies to evaluate the accuracy of the proposed sample size formula. The parameter settings in simulation studies are based on empirical data sets.

We set the total number of genes for testing to $m = 10000$ and the number of statistically significant prognostic genes to $m_1 = 100$. We wanted the expected number of true rejections to be $r_1 = 80$, which corresponds to a power of 80% (i.e., $\beta^* = 0.2$). All parameters $\mu_{0g}$, $\rho_g$, and $\phi_g$ ($g = 1,\ldots,10000$) were assumed to be unknown. Thus, we used a minimum fold change $\rho^*$, a minimum average read count $\mu_0^*$, and a maximum dispersion $\phi^*$ in place of each $\rho_g$, $\mu_{0g}$, and $\phi_g$, $g = 1,\ldots,10000$. We varied <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/357/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/357/mathml/M25">View MathML</a> or 5; log2-fold changes $\log_2(\rho^*) = 0.5, 1.0, 1.5, 2.0$, or $2.5$; and $\phi^* = 0.1$ or $0.5$. With these settings, $\alpha^* = 8.162\times10^{-5}$, $4.253\times10^{-4}$, and $8.979\times10^{-4}$, which correspond to controlling FDR at levels 1%, 5%, and 10%, respectively.

Then, we substituted $\alpha^*$ and $\beta^*$ into equation (2) and calculated the sample size by solving it. In addition, for each design setting, we generated 5000 samples from independent negative binomial distributions based on the calculated sample size $n$; for the control group, the count of each gene was generated in R from a negative binomial distribution with mean $\mu_0^*$ and dispersion $\phi^*$; for the treatment group, the count of each gene was generated from a negative binomial distribution with mean <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/357/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/357/mathml/M27">View MathML</a> and dispersion $\phi^*$. Then, edgeR was used to estimate model parameters and perform the exact test. The number of true rejections was counted using the q-value procedure proposed by [20]. The expected number of true rejections was estimated as the sample mean of the number of true rejections over the 5000 simulation samples ($\hat{r}_1$).
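The count-generation step can be illustrated with a small Python sketch. The study itself generated counts in R, so this is a re-expression under assumptions rather than the original code. It draws NB counts as a gamma-Poisson mixture, uses Knuth's multiplication method for the Poisson draw (adequate for the small means considered here, but not for rates in the hundreds), and assumes w = 1 so that a prognostic gene has treatment mean ρμ0. The helper names `rpois` and `rnbinom` and the numeric inputs are chosen for illustration.

```python
import math
import random


def rpois(lam: float, rng: random.Random) -> int:
    """Poisson draw by Knuth's multiplication method (suitable for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rnbinom(mu: float, phi: float, rng: random.Random) -> int:
    """NB draw with mean mu and dispersion phi via a gamma-Poisson mixture."""
    lam = rng.gammavariate(1.0 / phi, mu * phi)  # shape 1/phi, scale mu * phi
    return rpois(lam, rng)


rng = random.Random(2013)
n, mu0, rho, phi = 5, 5.0, 2.0, 0.1   # illustrative values, not taken from Table 1
control = [rnbinom(mu0, phi, rng) for _ in range(n)]
treatment = [rnbinom(rho * mu0, phi, rng) for _ in range(n)]
print(control, treatment)
```

The gamma rate has mean μ and variance ϕμ², so the mixed count has variance μ + ϕμ², matching the NB parameterization used throughout.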

In Table 1, we show the calculated sample sizes, with the corresponding $\hat{r}_1$ in parentheses, for the case $w = 1$. For a fixed log2-fold change, dispersion, and FDR, the sample size increases as $\mu_0^*$ decreases. This result is as expected; a small average read count provides less information, so a larger sample size is required to detect the difference. For a fixed $\mu_0^*$, $\phi^*$, and FDR, the sample size increases as $\log_2(\rho^*)$ decreases (i.e., smaller log2-fold changes require larger sample sizes, all else being equal). This result is as expected; a larger sample size is required to detect a smaller difference. For a fixed $\mu_0^*$, $\log_2(\rho^*)$, and FDR, the sample size increases as $\phi^*$ increases. This result, also, is as expected; the variation increases with the dispersion, so a larger sample size is required to detect the difference. Note that all $\hat{r}_1$ in Table 1 are close to the pre-specified number of true rejections ($r_1 = 80$); thus, the proposed method estimated a sample size that achieves the desired power at the specified FDR level.

Table 1. Sample size calculation for simulation study (and $\hat{r}_1$) with $r_1 = 80$ at FDR = 1%, 5% and 10% when $w = 1$, $m = 10000$, $m_1 = 100$

Applications

Liver and kidney RNA-seq data set

To identify differentially expressed genes between human liver and kidney RNA samples, [8] explored an RNA-seq data set containing 5 human kidney samples and 5 human liver samples. In the following, we used this data set as pilot data for designing a new study with the same objective. For the purpose of demonstration, we treated the human kidney samples as the control group. After filtering out genes with no more than 5 total reads in the liver samples or the kidney samples, 17306 genes were left. We assumed that the top 175 (≈ 1% of 17306) genes are prognostic. From the pilot data, the minimum average read count among the prognostic genes in the control group was estimated as <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/357/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/357/mathml/M36">View MathML</a>, the maximum dispersion was estimated as $\phi^* = 0.0029$, and the ratio of the geometric means of normalization factors between the two groups was estimated as $w = 0.9$ using edgeR. Suppose we want to identify 80% of the prognostic genes (i.e., $r_1 = 0.8\times175 = 140$) while controlling FDR at 1% (i.e., $f = 0.01$). Based on the pilot data, we set $m = 17306$, $m_1 = 175$, $m_0 = 17131$, $r_1 = 140$, and $f = 0.01$. In this case, we have

$$\alpha^* = \frac{r_1 f}{m_0(1-f)} = \frac{140 \times 0.01}{17131 \times 0.99} \approx 8.25\times10^{-5}$$

and

$$\beta^* = 1 - \frac{r_1}{m_1} = 1 - \frac{140}{175} = 0.2$$

After substituting those parameters into equation (2) and solving it with respect to $n$, the required sample size can be obtained. In the second column from the left of Table 2, we report the sample size while controlling FDR at 1% under various desired minimum fold changes $\rho^* = 0.10, 0.25, 0.50, 0.75, 1.25, 1.50, 2.00, 2.50$, and $3.0$. From Table 2, we found that the original RNA-seq experiment described in [8], with a sample size of 5 in each group, can identify 80% of the prognostic genes at FDR = 1% if the desired minimum fold change $\rho^*$ is 3.0.

Table 2. Sample size calculation for liver and kidney RNA-seq data set under various desired minimum fold changes ($\rho^*$) for $r_1 = 140$ at FDR = 1% when $m = 17306$ and $m_1 = 175$

Li, 2013 [28] proposed several sample size calculation methods for RNA-seq data under the Poisson model. To compare the negative binomial and Poisson methods, the six rightmost columns of Table 2 report the sample sizes based on the Poisson model (i.e., the sample size based on the Wald test $n_w$, score test $n_s$, log transformation of the Wald statistic $n_{lw}$, log transformation of the score statistic $n_{ls}$, transformation of Poisson $n_{tp}$, and likelihood ratio test $n_{lr}$) under the same settings as the negative binomial method. As we can see, the sample sizes calculated by the negative binomial and Poisson methods are similar. This result is as expected, since the data set explored by [8] has technical rather than biological replicates (i.e., the maximum dispersion estimated from the liver and kidney RNA-seq data set is close to zero), and the negative binomial model approaches the Poisson model as the dispersion parameter approaches zero. Moreover, in Table 2, the estimated sample sizes are about the same across methods for very small fold changes ($\rho^* = 0.10$) and very large fold changes ($\rho^* = 3.0$). This result is expected, since all of the statistical models lead to the same conclusion when the treatment effect is very large (i.e., the fold change is very large or very small).

Transcript regulation data set

Blekhman, 2010 [29] used RNA-seq to study transcript regulation in humans, chimpanzees, and rhesus macaques using liver RNA samples from three males and three females from each species. For the purpose of demonstration, we assumed that the goal of the study is to identify differential gene expression between males and females in humans and that the female group is the control group. There were 13267 genes in the data set after performing quality control analyses. Suppose that the top 133 (≈ 1% of 13267) genes are prognostic. After filtering out genes with no more than 5 total reads in the male samples or the female samples, 7658 genes were left. Those genes were treated as pilot data, and we assessed differential expression using edgeR. From the pilot data, the minimum average read count among the prognostic genes in the control group was estimated as <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/357/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/357/mathml/M39">View MathML</a>, the maximum dispersion was estimated as $\phi^* = 0.6513$, and the ratio of the geometric means of normalization factors between the two groups was estimated as $w = 1.08$. Suppose we want to identify 80% of the prognostic genes (i.e., $r_1 = 0.8\times133 \approx 107$) while controlling the FDR at 10%. Based on the pilot data, we set $m = 13267$, $m_1 = 133$, $m_0 = 13134$, $r_1 = 107$, and $f = 0.1$. In this case, we have $\alpha^* = 9.0512\times10^{-4}$ and $\beta^* = 0.2$. In the second column from the left of Table 3, we report the required sample sizes under various desired minimum fold changes while controlling the FDR at 10% under the negative binomial distribution. We also report the required sample sizes based on the Poisson model proposed by [28] under the same settings in the six rightmost columns of Table 3. As we can see, the required sample size based on the negative binomial method is greater than that based on the Poisson methods.

In the transcript regulation data set, the maximum dispersion was estimated as $\phi^* = 0.6513 > 0$. This indicates that the read counts in this data set exhibit over-dispersion. In such a situation, it is inappropriate to model the data with the Poisson distribution: the Poisson model underestimates the variance, so the sample size based on it will be underestimated (i.e., a study designed with that sample size will be underpowered).
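To make the size of this underestimation concrete, the following worked example compares the two variances at the estimated dispersion ϕ = 0.6513. The mean μ = 10 is an illustrative value chosen here, not an estimate from the data set.

```latex
\begin{aligned}
\text{Poisson:}\quad & \operatorname{Var}(Y) = \mu = 10 \\
\text{Negative binomial:}\quad & \operatorname{Var}(Y) = \mu + \phi\mu^2 = 10 + 0.6513 \times 10^2 = 75.13
\end{aligned}
```

Under these assumptions the Poisson model would understate the variance by a factor of about 7.5, and the gap widens as the mean grows because the extra term is quadratic in μ.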

Table 3. Sample size calculation for transcript regulation data set under various desired minimum fold changes ($\rho^*$) for $r_1 = 107$ at FDR = 10% when $m = 13267$ and $m_1 = 133$

Discussion

In this research, we assume independent gene expression levels; however, this assumption may not hold in reality. For correlated RNA-seq gene expression data, evaluation of the accuracy of our method is an important future research question; however, generating a negative binomial distribution for correlated high-dimensional data will be a challenge. Moreover, most of the major R packages dedicated to RNA-seq differential analyses (edgeR, DESeq, etc.) are now starting to enable multi-group comparisons. However, the proposed method is developed for comparing two-group means. Thus, the sample size calculation for multi-group comparisons would be an interesting research topic for us in the future. In addition, it has already been noted that typical RNA-seq differential analyses have very low power; see for example the simulation studies in [30], where power for edgeR was always less than 60%, or [31], where power ranged from about 45% to 55% (both with 10 samples per condition). In our simulation and application sections, the minimum sample sizes required to achieve 80% power would be prohibitively large for RNA-seq experiments in practice, given their current cost. In such situations, the findings in [30,31] can provide useful information for specifying achievable power. It is well known that low study power will decrease the reproducibility of scientific research. We hope that this paper can benefit researchers by allowing them to understand their study power.

Conclusions

In recent years, RNA-seq technology has emerged as an attractive alternative to microarray studies, due to its ability to produce digital signals (counts) rather than analog signals (intensities), and to produce more highly reproducible results with relatively little technical variation [32,33]. With a large sample size, RNA-seq can become costly; on the other hand, an insufficient sample size may lead to unreliable answers to the research question of interest. To manage the trade-off between cost and accuracy, sample size determination is a critical issue for RNA-seq experimental design. For comparing the differential expression of a single gene, we have proposed a sample size calculation method based on the exact test proposed in [14,15]. To address multiple testing (i.e., multiple genes), we further extended our proposed method to incorporate FDR control. Given a specified desired minimum fold change, minimum average read count, and maximum dispersion, obtained from pilot data or other relevant data, our methods are not computationally intensive. To facilitate implementation of the sample size calculation, R code is available from the corresponding author.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CIL and PFS were involved in the development of the models. CIL and PFS wrote the manuscript. SY generated the original idea and guided and supervised the research. All authors read and approved the final version of this manuscript.

Acknowledgements

This work was partly supported by NIH grants P30CA068485, P50CA095103, P50CA098131, and U01CA163056. The authors wish to thank Margot Bjoring for editorial work on this manuscript.

References

  1. Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq.

    Bioinformatics 2009, 25(8):1026-1032.

  2. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty.

    Bioinformatics 2010, 26(4):493-500.

  3. Wu Z, Wang X, Zhang X: Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq.

    Bioinformatics 2011, 27(4):502-508.

  4. Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD, Corbett R, Tang MJ, Hou YC, Pugh TJ, Robertson G, Chittaranjan S, Ally A, Asano JK, Chan SY, Li HI, McDonald H, Teague K, Zhao Y, Zeng T, Delaney A, Hirst M, Morin GB, Jones SJM, Tai IT, Marra MA: Alternative expression analysis by RNA sequencing.

    Nat Methods 2010, 7(10):843-847.

  5. Wang L, Xi Y, Yu J, Dong L, Yen L, Li W: A statistical method for the detection of alternative splicing using RNA-seq.

    PLoS One 2010, 5:e8529.

  6. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

    Nat Biotechnol 2010, 28(5):511-515.

  7. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data.

    Nat Methods 2010, 7(11):909-912.

  8. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.

    Genome Res 2008, 18(9):1509-1517.

  9. Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing.

    Nat Methods 2008, 5(7):613-619.

  10. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing.

    Nature 2010, 464(7289):768-772.

  11. Auer PL, Doerge RW: Statistical design and analysis of RNA sequencing data.

    Genetics 2010, 185(2):405-416.

  12. Fang Z, Cui X: Design and validation issues in RNA-seq experiments.

    Brief Bioinform 2011, 12(3):280-287.

  13. Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNA-seq data.

    Bioinformatics 2010, 26:136-138.

  14. Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data.

    Biostat 2008, 9(2):321-332.

  15. Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance.

    Bioinformatics 2007, 23(21):2881-2887.

  16. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

    Bioinformatics 2010, 26:139-140.

  17. Storey JD: A direct approach to false discovery rates.

    J R Stat Soc Ser B 2002, 64(3):479-498.

  18. Hirakawa A, Sato Y, Sozu T, Hamada C, Yoshimura I: Estimating the false discovery rate using mixed normal distribution for identifying differentially expressed genes in microarray data analysis.

    Cancer Inform 2007, 3:140-148.

  19. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

    J R Stat Soc Ser B 1995, 57:289-300.

  20. Storey JD, Tibshirani R: Statistical significance for genomewide studies.

    Proc Natl Acad Sci USA 2003, 100(16):9440-9445.

  21. Pounds S, Cheng C: Sample size determination for the false discovery rate.

    Bioinformatics 2005, 21(23):4263-4271.

  22. Hu J, Zou F, Wright FA: Practical FDR-based sample size calculations in microarray experiment.

    Bioinformatics 2005, 21:3264-3272.

  23. Jung SH: Sample size for FDR-control in microarray data analysis.

    Bioinformatics 2005, 21(14):3097-3104.

  24. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies.

    Bioinformatics 2005, 21:3017-3024.

  25. Liu P, Hwang JTG: Quick calculation for sample size while controlling false discovery rate with application to microarray analysis.

    Bioinformatics 2007, 23(6):739-746.

  26. Krishnamoorthy K, Thomson J: A more powerful test for comparing two Poisson means.

    J Stat Plan Infer 2004, 119:23-35.

  27. Storey JD, Tibshirani R: Estimating false discovery rates under dependence, with applications to DNA microarrays. In Technical Report. CA: Department of Statistics, Stanford University; 2001.

  28. Li CI, Su PF, Guo Y, Shyr Y: Sample size calculation for differential expression analysis of RNA-seq data under Poisson distribution.

    Int J Comput Biol Drug Des 2013, 6(4):358-375.

  29. Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y: Sex-specific and lineage-specific alternative splicing in primates.

    Genome Res 2010, 20(2):180-189.

  30. Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data.

    BMC Bioinformatics 2013, 14:91.

    http://dx.doi.org/10.1186/1471-2105-14-91

  31. Dillies M, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

    Brief Bioinform 2013, 14(6):671-683.

  32. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq.

    Nat Methods 2008, 5(7):621-628.

  33. Hashimoto Si, Qu W, Ahsan B, Ogoshi K, Sasaki A, Nakatani Y, Lee Y, Ogawa M, Ametani A, Suzuki Y, Sugano S, Lee CC, Nutter RC, Morishita S, Matsushima K: High-resolution analysis of the 5’-end transcriptome using a next generation DNA sequencer.

    PLoS One 2009, 4:e4108.