Abstract
Background
When conducting multiple hypothesis tests, it is important to control the number of false positives, or the False Discovery Rate (FDR). However, there is a tradeoff between controlling FDR and maximizing power. Several methods have been proposed, such as the qvalue method, to estimate the proportion of true null hypothesis among the tested hypotheses, and use this estimation in the control of FDR. These methods usually depend on the assumption that the test statistics are independent (or only weakly correlated). However, many types of data, for example microarray data, often contain large scale correlation structures. Our objective was to develop methods to control the FDR while maintaining a greater level of power in highly correlated datasets by improving the estimation of the proportion of null hypotheses.
Results
We showed that when strong correlation exists among the data, which is common in microarray datasets, the estimation of the proportion of null hypotheses could be highly variable resulting in a high level of variation in the FDR. Therefore, we developed a resampling strategy to reduce the variation by breaking the correlations between gene expression values, then using a conservative strategy of selecting the upper quartile of the resampling estimations to obtain a strong control of FDR.
Conclusion
With simulation studies and perturbations on actual microarray datasets, our method, compared to competing methods such as qvalue, generated slightly biased estimates on the proportion of null hypotheses but with lower mean square errors. When selecting genes with controlling the same FDR level, our methods have on average a significantly lower false discovery rate in exchange for a minor reduction in the power.
Background
Microarray technology has become a standard experimental method in biomedical research. In the analysis of microarray data, one of the most fundamental tasks is the identification of differentially expressed genes while controlling false positives and minimizing false negatives. This is a multiple hypothesis test problem which analyzes thousands or tens of thousands of genes simultaneously. In these tests we often need to control the false discovery among the rejected hypotheses under a prespecified level while maintaining maximal power. Thus, there is a trade off in the control of the typeI error between rejecting true null hypotheses (false discovery) versus accepting true alternative hypotheses (false negative).
Traditional Bonferroni correction procedures are designed to control the Family Wise Error Rate (FWER), which guards against making one or more type I errors among a family of hypothesis tests. However, these procedures may be excessively conservative for microarray analysis where the number of hypotheses is very large and a substantial fraction of the genes are differentially expressed [1]. A more appropriate approach is to control the False Discovery Rate (FDR), which is the proportion of type I errors among all rejected hypotheses [2,3]. This approach is particularly useful in exploratory analyses, where the objective is to maximize the discovery of true positives, rather than guarding against one or more false positive results.
A number of methods have been proposed to control the FDR given a population of hypothesis tests. These methods usually assume that the distribution of the test statistics, f, can be modeled by a mixture of two components [4]:
Where f_{0 }is the distribution of the test statistics under H_{0}, which by definition equals to 1 when using pvalues when tests are independent, f_{1 }is the distribution of the test statistics under H_{1}, m_{0 }is the number of true H_{0}, m is the total number of hypotheses under consideration, and π_{0 }is the proportion of true H_{0}. The methods proposed by Benjamini et al [2,3] to control FDR do not estimate π_{0}; therefore, they provide the strongest controls on FDR but have the lowest power compared to other methods that do so.
In many actual applications where a considerable number of genes are differentially expressed, assuming π_{0 }= 1 may be too conservative causing loss of power. Several alternative methods, such as nonparametric empirical Bayesian pFDR criterion and its pvalue equivalent called qvalue method [1,5,6], binwise model [79], local FDR method [10], parametric betauniform mixture models [1114], the Lowest Slope estimator (LSL) [15], the Spacing LOESS Histogram (SPLOSH) method [16], the nonparametric MLE method [17], the moment generating function approach [18], and the Poisson regression approach [1820], have all been proposed to estimate π_{0 }by pooling test statistics and controlling FDR based on the estimated π_{0}.
In these methods, one of the critical steps is estimating the proportion of null hypotheses, π_{0}. When using pvalues, these estimations usually depend on the assumption that f_{0 }follows a uniform distribution. This assumption, which is of critical importance for the methods of statistical inference that employ pooling test statistics across genes [21], is valid when all test hypotheses are independent and identically distributed. Furthermore, when there are only weak correlations, or "clumpy" correlations (a large number of groups that have a small number of genes with high correlation within groups but no correlation between groups [21,22]), the uniform assumption is not strongly violated and the method remains adequate. However, in datasets with large scale strong correlations, the joint distribution of the test statistics will no longer be the product of marginal distributions, and the observed f_{0 }will severely deviate from uniform, causing the current π_{0 }estimation methods to become very unstable. Increased variation and bias of π_{0}, as well as FDR, was also observed by Wu et al [14] in datasets with strong local correlations.
The effect of correlation on simultaneous significance tests was previously discussed theoretically [2325], and a number of permutation based FDR control methods were proposed, such as SAM [26], dChip [27], Ge et al [28], Meinshausen et al [24] and Efron [25]. In these methods, the distribution of f_{0 }was modeled empirically through permutations, which naturally considered the correlation. However, like Benjamini et al [2,3], these methods don't estimate π_{0}; therefore, in datasets with a large number of differentially expressed genes, the FDR control may be overly conservative with a loss of power.
Therefore we proposed 2 resampling schemes, similar to model averaging in bagging methods, to reduce the variation in estimating π_{0 }in datasets with strong correlation between gene expression values. Our methods produced a more stable and conservative estimation of π_{0 }and, therefore, provided stronger control of False Discovery Rate with only a minor sacrifice of power.
Implementation
Creating simulated data set
To test the performance of various algorithms in estimating π_{0}, we generated 2 types of simulated datasets. Both datasets had strong correlation between subsets of genes and a known proportion of true null hypotheses, to represent the log transformed microarray expression data.
DataB
The first simulation method is similar to the method used in Qiu et al [21] and Wu et al [14]. Assume n samples and m genes, with n/2 samples per class. The m genes were divided into K blocks with each block consisting of m/K genes. Assume independence between blocks and constant correlation coefficient between genes within each block. For block l and sample j, we first created a block center vector
where d_{l }was the mean difference between groups, it equals to 0 (for simulating nondifferentially expressed genes) with probability π_{0}, or was generated from beta distribution with parameter (4, 20) otherwise; x_{j }was a group indicator; and δ_{lj }was i.i.d. N(0,1) to represent sample specific noise. Then the expression value of gene i in sample j in that block, Y_{lji}, was generated by
where ρ was the correlation constant which takes value between [0,1] determining the correlation coefficient between genes within the block, and e_{lji }was i.i.d. N(0,1). The K blocks of genes were generated independently of each other and then pooled to form the whole dataset. We call this type of dataset which contains blocks of highly correlated genes DataB in our experiments.
DataM
The above DataB model is oversimplified in many aspects, and is still a "clumpy" structure, although the clumpiness can be pronounced. To mimic more realistic situations, we generated a second type of simulated data based on an actual human breast cancer microarray dataset [29] obtained with Affymetrix U133 plus 2.0 microarrays. The dataset contains 65 estrogen receptor positive (ER+) cases and 46 estrogen receptor negative (ER) cases. The data were normalized by the GCRMA algorithm[30,31], and the gene (probeset) expression levels were log2transformed. According to published literatures [32,33], the ER status is one of the most predominant partitioning factors for molecular classification of breast cancer. We therefore took some of the genes differentially expressed between the two classes as "truly H_{1}" genes. We selected 8778 genes with differences in mean of log transformed expression levels between the two classes greater than 0.58 (equivalent to a 1.5 fold change). From these genes, each time we randomly chose 1000 to form our simulated dataset and then randomly picked π_{0 }proportion of the 1000 genes to establish H_{0 }genes by scrambling these genes together between samples. Thus, we obtained a simulated dataset with a known number of null hypotheses while the correlations among both the differentially and nondifferentially expressed genes were maintained. We call this type of dataset DataM in our experiments [34].
Estimating π_{0 }by resampling strategy
To get a better estimation of π_{0 }in datasets with strong correlations, we proposed 2 resampling methods to replace the original π_{0 }estimation step in qvalue method which estimates π_{0 }directly from the pvalue distribution [6].
The first method, termed SampS, resamples without replacement 2/3 of the samples from each class, calculates pvalues for each gene, and then estimates the π_{0 }from the pvalues. For each dataset, we performed this procedure 100 times and used the upper quartile to replace the π_{0 }estimated by the qvalue method.
The second method, termed SampG, resamples without replacement 2/3 of the values from each class for each gene independently, calculates the pvalues for each gene, and then estimates the π_{0}. We also performed this procedure 100 times for each dataset and used the upper quartile to replace the π_{0 }estimated by the qvalue method.
After each resampling step, we have to feed the pvalues into a π_{0 }estimator. This estimator could be the π_{0 }estimation by qvalue method, the mgf method [18], or any other unbiased π_{0 }estimators. In this paper, we tried to use both qvalue and mgf as the plugin estimator, and called them SampS.q and SampG.q when using qvalue as the plugin estimator, or SampS.m and SampG.m when using mgf as the plugin estimator.
Results
Variation of the π_{0 }estimation when genes are correlated
To evaluate the impact of correlation structure in large datasets on the estimation of the proportion of true null hypotheses (π_{0}), we first evaluated current methods on a published microarray dataset [29]. This dataset consists of 65 estrogen receptor positive (ER+) and 46 estrogen receptor negative (ER) breast cancers. The gene expression levels were normalized by the GCRMA algorithm [30,31] and log2 transformed. Expression values were filtered to eliminate low expressing genes with mean expression below 5 and constant expressing genes with coefficient of variation below 0.1. A total of 9,993 genes passed this filtering criterion. We bootstrapped 200 datasets from this microarray data and used the qvalue [6] and twilight [9] methods to estimate the proportion of null hypotheses. The π_{0 }estimates were similar by the 2 methods on each bootstrapped data set; however, both methods showed a large range of π_{0 }among the bootstrapped datasets that varied from 0.36 to 0.83 (Figure 1).
Figure 1. Estimating the π_{0 }in an actual microarray data set. Box plot of π_{0 }estimated by qvalue and twilight methods on 200 bootstrap breast cancer datasets.
To further investigate the influence of gene correlation on the estimation of π_{0}, we generated simulated dataset DataB which contained blocks of highly correlated genes, but all genes were nondifferentially expressed. In our simulation we set the correlation constant ρ equal to 0.5, and created datasets containing 100, 1,000 or 10,000 genes with 1, 10, 100 or 1,000 genes per block. After calculating the pvalue for each gene, we calculated the coefficient of variation (CV) of the bar heights in the histogram of pvalues by splitting the pvalues into 20 bins between [0, 1] with the width of each bin being 0.05. Figure 2(a) shows the histogram of pvalues in one of the simulations having 10,000 genes all independent with each other, and Figure 2(b) shows the histogram of pvalues in another simulation having 10,000 genes with 1,000 genes per block. Comparing Figure 2(a) and 2(b) we can see that with highly correlated genes, the distribution of pvalue deviated significantly from uniform, although none of the genes were differentially expressed. We created 100 simulations for each type of data, calculated their CV of the bar height in the pvalue histogram, and plotted the box plots of the CVs in Figure 3.
Figure 2. Histogram of pvalues in simulated data sets where all genes were nondifferentially expressed. (a) dataset having 10,000 genes, all independent with each other (b) dataset having 10,000 genes, with 1,000 genes per block.
Figure 3. Box plot of the CV of pvalue histogram under different correlation structures. (a) datasets having 100 genes, with 1 or 10 genes per block. (b) datasets having 1000 genes, with 1, 10 or 100 genes per block. (c) datasets having 10000 genes, with 1, 10, 100 or 1000 genes per block. The histogram was calculated by splitting pvalues into 20 bins between [0, 1] with the width of each bin being 0.05. The horizontal lines represent the expected CV when genes are independent.
In this simulated study, since all genes were nondifferentially expressed, the pvalues should follow a uniform distribution, and the histogram of pvalues should be flat if genes were independent of each other. When the number of correlated genes in each block was small, for example, 1 (independent) or 10 (weak correlation) genes per block, the distribution of pvalues approximated a uniform distribution and the CV of the histogram of pvalues were close to expected under the independent assumption. However, the CV of the histogram of pvalues increased significantly with the growth of correlation structure. In other words, although none of the genes were differentially expressed, the distribution of pvalue deviated increasingly from a uniform distribution when large groups of genes were correlated. We also calculated the CV of the histogram of pvalues in our microarray dataset by randomly permuting the class labels, which makes all genes nondifferentially expressed but still correlated, as well as randomly permuting all expression values across genes and samples which makes genes nondifferentially expressed and also independent. The results of 100 permutations showed a dramatically higher CV for the pvalue histogram of datasets with only class labels permuted but with genegene correlations intact (Figure 4).
Figure 4. Box plot of the CV of pvalue histogram of permuted microarray datasets. PermLab: Breast cancer dataset with class labels randomly permuted. By randomly permuting class labels, all genes become nondifferentially expressed but the genegene correlation intact. PermAll: Randomly permuting all expression values between the 111 samples and 9993 genes. By randomly permuting expression values all genes become nondifferentially expressed and independent with each other. The 2 types of permutations were performed 100 times each, and the pvalues were split into 20 bins between [0, 1] with the width of each bin being 0.05. The horizontal line represents the expected CV when genes are independent.
Comparing Figure 2, Figure 3 and Figure 4, it is apparent that increased correlation among genes greatly increased the deviation of pvalue distribution from uniform. Therefore, strong correlation structures will increase the variation in estimated π_{0}. And inevitably subsequent statistical inferences and false discovery control procedures would be influenced by this unstable π_{0 }estimation.
Improving the π_{0 }estimation by resampling strategies
To improve the estimation of π_{0 }in datasets with strong correlations, we proposed two methods, termed SampS and SampG, to replace the original π_{0 }estimation step in the qvalue method. In the SampS algorithm, we used a model averaging strategy. We repeatedly sampled 2/3 of the data from each class without replacement, calculated the pvalues for genes, estimated π_{0 }from the pvalue distribution and finally used the upper quartile of the resampling estimations in the subsequent statistical inferences. In the SampG algorithm, we further broke down the correlations between genes and stabilized the π_{0 }estimation. In this algorithm, we repeatedly sampled without replacement 2/3 of the expression values from each class independently for each gene, calculated their pvalues, estimate π_{0}, and then used the upper quartile of the resampling estimations in subsequent analysis. For the choice of the plugin π_{0 }estimator, we tried to use both qvalue [6] and moment generating function [18] methods, and called them SampG.q, SampS.q and SampG.m, SampS.m, respectively.
To test the variation in π_{0 }estimation induced by strong correlation structures, and the performance of our proposed SampS and SampG methods, we created simulated datasets DataB and DataM, with true π_{0 }being 0.9, 0.8, 0.7, 0.6, and the correlation constant ρ being 0.3, 0.5 and 0.7, respectively. We created 100 datasets for each combination of these control parameters. In DataB, we created 1000 genes forming 10 blocks with 100 genes per block to simulate large scale correlation between genes. In DataM we randomly selected 1000 genes with differences in the mean of log transformed expression levels greater than 0.58, then randomly scrambled 90%, 80%, 70% or 60% of them to create datasets with known proportion of true null hypothesis and strong genegene correlations. We applied the SampS and SampG strategies to estimate π_{0 }for each of the simulated datasets, and compared our results to a number of other methods.
The π_{0 }estimation methods we used and their corresponding R functions are:
1. The Lowest Slope estimator (LSL) [15] with parameter determined via bootstrap (bootstrap) [5]; function fdr.estimate.eta0
2. qvalue method with tuning parameter chosen by smoother method (Qvalue) [6]; function qvalue
3. The Spacing LOESS Histogram (SPLOSH) [16]; function splosh
4. The betauniform mixture model (BUM) [12]; function bum.mle
5. The nonparametric MLE method (convest)[17]; function convest
6. The Poisson regression approach (PRE) [1820]; function p0.mom
7. The moment generating function (mgf) [18]; function p0.mom
8. The Lowest Slope estimator (LSL) [15] with parameter determined adaptively (adaptive) [35]; function fdr.estimate.eta0
9. SampG method, with qvalue plug in, 2^{nd }quartile (SampG.q Q2)
10. SampG method, with qvalue plug in, 3^{rd }quartile (SampG.q Q3)
11. SampS method, with qvalue plug in, 2^{nd }quartile (SampS.q Q2)
12. SampS method, with qvalue plug in, 3^{rd }quartile (SampS.q Q3)
13. SampG method, with mgf plug in, 2^{nd }quartile (SampG.m Q2)
14. SampG method, with mgf plug in, 3^{rd }quartile (SampG.m Q3)
15. SampS method, with mgf plug in, 2^{nd }quartile (SampS.m Q2)
16. SampS method, with mgf plug in, 3^{rd }quartile (SampS.m Q3)
The π_{0 }estimations were shown as boxplots in Figure 5, and the Mean Square Error (MSE) were listed in Table 1.
Figure 5. Boxplot of π_{0 }estimated by various methods on the simulated data sets. The methods are (from left to right in each figure): LSL, Qvalue, SPLOSH, BUM, convest, PRE, mgf, adaptive, SampG.q Q2, SampG.q Q3, SampS.q Q2, SampS.q Q3, SampG.m Q2, SampG.m Q3, SampS.m Q2, and SampS.m Q3. AD: DataB with ρ = 0.3 and π_{0 }varies from 0.9 to 0.6. EH: DataB with ρ = 0.5 and π_{0 }varies from 0.9 to 0.6. IL: DataB with ρ = 0.7 and π_{0 }varies from 0.9 to 0.6. MP: DataM with π_{0 }varies from 0.9 to 0.6. Horizontal lines represent the true π_{0 }in each simulated case.
Table 1. Mean Square Error of π_{0 }estimation by different methods
Table 1 listed the MSE of π_{0 }estimations by various methods. To better understand the methods tested, we listed both the 2^{nd }and 3^{rd }quartiles of the SampS and SampG methods compared to other methods. Later, we used only the 3^{rd }quartile to provide strong control of FDR. From Figure 5 and Table 1 we can see that the bootstrap, Qvalue and SPLOSH methods are very sensitive to correlation and have higher MSE, especially the SPLOSH methods tend to underestimate higher π_{0 }but overestimate lower π_{0}; the BUM, convest and PRE methods tend to underestimate π_{0 }in most of the simulated data sets with strong correlations, which is not favorable in FDR controls; the adaptive method tends to overestimate π_{0 }when the correlation is not very strong, which is also observed by [18] with independent and weak correlated data, but it worked better on cases with strong correlations when the MSE were the least among all tested methods; mgf method is generally the second best among the above, with relatively small bias and variation among all simulated cases. In terms of SampS and SampG methods, both the 2^{nd }and 3^{rd }quartile outperformed the corresponding plugin estimator, due to smoothed variation by model averaging, and SampG performs better than SampS with smaller MSE because SampG breaks down correlation between genes whereas SampS does not. And since mgf outperforms the qvalue method, the SampG and SampS with the mgf plugin also outperforms SampG and SampS with the qvalue plugin.
We then selected genes with FDR controlled under 0.05 level based on the π_{0 }estimated by our method, calculated the actual false discovery rate and power, and compared our results to that of FDR controlling method proposed by Benjamini et al. with correlations considered (BY; function p.adjust with method "BY") [3], the permutation based FDR controlling method (howmany; function howmany_dependent) [24], and qvalue method [6]. The boxplot of actual false discovery rate and power on the 4 types of DataM simulations were shown in Figure 6. From Figure 6 it can be seen that both the BY and howmany methods provided strong control of FDR, their actual FDR level was much lower than 0.05, and their power to detect true alternatives were much lower than methods where the proportion of true null hypotheses was estimated and used in the FDR control. Comparing the qvalue method and our proposed SampS and SampG method, for majority of the cases the actual FDR level were still controlled below 0.05 level, although some outliers exist with actual FDR up to 0.4~0.6, due to the unstable pi_0 estimations. The SampS and SampG methods, especially when using mgf as the plugin π_{0 }estimator, tend to have lower FDR on those outlier cases compared to qvalue method. The differences in power between qvalue method and our proposed methods were very minor, compared to the difference between qvalue and BY method. We also tested on the DataB simulations, and obtained the same result. Since we used the 3^{rd }quartile of π_{0 }estimated by SampS and SampG, our estimations were biased but conservative. With the same FDR control level our method would make smaller numbers of rejections than the qvalue method, therefore the actual FDR and power of the genes selected were lower than that of the qvalue method. This was shown by the pvalue in pairwise comparison of actual FDR and power between the resampling based methods and qvalue method in these simulations, where all FDR and most power comparisons were significant (Table 2, Table 3). Interestingly, comparing the mean of FDR and power, as shown in Table 4 and 5, we found that the SampS and SampG methods, compared to the qvalue method, can reduce the average FDR up to 40%, with a decrease of average power in most cases less than 1%. In fact, with the highest correlation constant we tested, and in most cases for SampG.m method, the decrease of power was not even significant from the qvalue method. In contrast, the most conservative BY method reduced the average FDR by more than 90% compared to the qvalue method, but also reduced the average power by approximately 10% in cases with strong correlation.
Figure 6. Boxplot of actual FDR and power by various methods on DataM. The methods are (from left to right in each figure): BY, howmany, Qvalue, SampG.q, SampS.q, SampG.m, SampS.m. AD: boxplot of FDR on DataM with π_{0 }varies from 0.9 to 0.6. EH: boxplot of power on DataM with π_{0 }varies from 0.9 to 0.6. Horizontal lines represent FDR at 0.05 level.
Table 2. Pvalues comparing FDR by different methods.
Table 3. Pvalues comparing power by different methods.
Table 4. Percentage in decrease of mean FDR
Table 5. Percentage in decrease of mean power
Comparing the resampling strategy to bootstrap pvalues directly
When there is strong correlation between genes, and thus also between pvalues, bootstrapping pvalues does not change the correlation structure and therefore the estimations are still unstable. In contrast, resampling of samples or gene expression values as we proposed could address the variations induced by the correlation structure and therefore smooth the estimation.
For example, for each of the simulated datasets, we bootstrapped the pvalues 100 times and then estimated π_{0 }by the qvalue method. Figure 7 shows the scatter plots of π_{0 }estimated by the qvalue method versus the 1^{st}, 2^{nd }and 3^{rd }quartiles of the SampS and SampG methods, as well as the estimations by bootstrapping pvalues on the 100 DataM simulations with true π_{0 }equals to 0.7. For the bootstrap method, as expected, the scatter plot showed that the median estimation of π_{0 }was close to the estimation using original pvalues for all cases. Whereas for the SampS and SampG methods, since they smoothed the variation induced by correlation between pvalues, the variation was much smaller than the qvalue method. Especially for cases where the qvalue method severely under or overestimated the π_{0}, the estimation by the SampS and SampG methods were closer to the true value. This was also observed in all simulated datasets that we have tested (data not shown).
Figure 7. Scatter plot of π_{0 }estimated by qvalue method versus the 3 quartiles by SampG, SampS and bootstrapping pvalues. Xaxis represents the π_{0 }estimated by qvalue method on the 100 DataM simulations, and yaxis represents the 1^{st}, 2^{nd }and 3^{rd }quartiles of π_{0 }estimated by SampG, SampS method and the bootstrapped pvalues, in the 3 plots respectively. Circles are the 3^{rd }quartile of estimation, triangles are the median and diamonds are the 1^{st }quartile. The lines in the images are the 45 degree diagonal line and the horizontal line corresponds to the true π_{0 }which equals to 0.7. We used qvalue plugin for SampG and SampS in this figure.
Discussion
In actual microarray datasets, genes expression is often correlated due to coregulation, sharing of transcription factor binding motifs, or technical reasons such as sequence similarity, crosshybridization or signal leak during hybridization. This is of critical importance for statistical inferences that rely on pooling of test statistics across genes [21]. The distribution of pvalues of these correlated genes can be viewed under a mixture model where groups of highly correlated genes share similar pvalues and the whole distribution is actually a mixture of components corresponding to the groups of highly correlated genes. The effect of strong and large scale correlations is equivalent to reducing the total number of independent components in this mixture model. Storey [22] argued that subsets of genes fall into small but highly correlated groups due to coregulation or crosshybridization, but these groups are small in size and nearly independent with each other ("clumpy dependency"), therefore the uniformity of pvalue distribution of true null genes would not be strongly affected. However, other researchers, such as Qiu et al [21], have found that the impact of correlation may be quite strong. It is also true that in our permutation study on a breast cancer microarray dataset, the distribution of pvalues could be extremely far from uniform due to genegene correlation.
Systems biology research has shown that biological gene networks have a scale free [36], hierarchical structure [37,38], where most of the genes are connected to a small number of other genes forming small groups of complexes, while some "hub" genes may be connected to large number of peripheral genes. The distribution of connectivity degree (the number of genes being connected to a given gene) decreases with a powerlaw, which is much slower than the exponential decay expected in a random network [36]. These genegene interactions may be reflected by coregulation or correlation in expression under certain conditions, and the possible scale of gene interaction is unlimited given the scale free structure. Therefore, large scale correlation of gene expression levels is not surprising in microarray studies.
We have shown in our simulated study that with the growth of correlation structures, the pvalue distribution of H_{0 }genes increasingly deviates from a typical uniform distribution. This may influence the estimation of π_{0 }and the following statistical inferences. The effect of strong correlation was also discussed by other researchers [14]. To solve this problem, we proposed the SampS and SampG methods. These algorithms replaced the unbiased but unstable π_{0 }estimation step in the qvalue method with a model averaging procedure of resampling samples or furthermore resampling independently for each gene to partially break the correlations between genes. Strong correlation between genes will inevitably increase the variation of the π_{0 }estimation, even though the variation could be partially smoothed by the resampling strategies proposed in this paper. Therefore, it is necessary to compromise between safety and efficiency; in this case, we would like to shrink the estimation toward 1 from an unbiased estimation to guarantee a strong control of FDR. That is why we used the upper quartile instead of the median of the resampling estimations, although medians had a smaller MSE in estimating π_{0 }in our simulated studies. We showed in our simulations that these plugin revisions, compared to the qvalue method, can greatly reduce the variation of the π_{0 }estimation under strong genegene correlations, and enhance the performance of FDR control by reducing false discovery rate up to 40% with a reduction of power less than 1% compared to qvalue method.
In our study, to create datasets with a known proportion of true null hypotheses while still having a similar correlation structure to that in actual microarray datasets, we developed 2 strategies to generate simulated datasets. The first one, DataB, is simply a blockwise structure with arbitrary block size and intrablock correlation, but independent between blocks. When the block size was small, this was similar to the "clumpy" hypothesis [22]. When the block size became bigger, we showed that the correlations influenced the π_{0 }estimation, and the resampling strategies we proposed improved the performance of gene selection by significantly reducing the FDR with a minor reduction of power. To mimic a more realistic scenario, we also developed the DataM strategy to generate simulated data from actual microarray datasets by arbitrarily permuting a given proportion of the genes. This permutation breaks the correlation between arbitrarily assigned differentially and nondifferentially expressed genes, but maintains the correlation within the 2 groups of genes.
Conclusion
The SampS and SampG methods we proposed and tested in this paper are simple revisions, but they greatly improved the π_{0 }estimations. The same approach using independent resamples of expression values to estimate the π_{0 }and then using the upper quartile of the resampling estimations in FDR control, could be applied to other FDR algorithms in data where strong correlation between hypotheses exists. In this paper we considered the typical 2sample comparison problem with a reasonable number of independent replicates. For more complex problems, such as timecourse studies, the SampG method (resample gene expression values independently for each gene) may not be able to be applied directly without a reasonable number of replicates in each time point, but the SampS method (resample samples) may be applicable if there is sound reason to assume the existence of stationary time patterns in the biological system under investigation.
Availability and requirements
The R code of SampG and SampS methods, and R code to generate simulated data sets DataB and DataM, is included as Additional file 1.
Additional file 1. The file contains the R source code of functions to create DataB, DataM simulated data sets, and SampG, SampS methods to estimate the π_{0}.
Format: R Size: 7KB Download file
Requirements: R environment.
Operating systems: Windows XP or Linux.
License: free.
Authors' contributions
XL developed the basic strategies of SampG and SampS methods, developed the methods to generate DataB and DataM simulation data, and did all simulation studies. DPL helped XL on analyzing the experiments, comparing the methods, and improving the algorithms. XL and DPL executed the writing together.
Acknowledgements
The authors want to thank Dr. Ping Ma, Charles Berry and Anthony Gamst for valuable advice and proofreading of the manuscript. The comments by an anonymous referee and the assistant editor are also gratefully acknowledged. The work was partially supported by NCI grant RO1 AI055056, ACSIRG #70002, and NCI Special Cancer Center Support Grant CA2310022.
References

Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach.

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Benjamini Y, Yekutieli D: The Control of the False Discovery Rate in Multiple Testing Under Dependency.

Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes Analysis of a Microarray Experiment.
Journal of the American Statistical Association 2001, 96:11511160.

Storey JD, Tibshirani R: Statistical significance for genomewide studies.
PNAS 2003, 100:94409445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Scheid S, Spang R: A false discovery rate approach to separate the score distributions of induced and noninduced genes . In Proceedings of the 3rd International workshop on distributed statistical computating (DSC2003). Vienna, Austria; 2003:2022.

Scheid S, Spang R: A Stochastic Downhill Search Algorithm for Estimating the Local False Discovery Rate.
IEEE Transactions on Computational Biology and Bioinformatics 2004, 1:98108.

Scheid S, Spang R: twilight; a Bioconductor package for estimating the local false discovery rate.
Bioinformatics 2005, 21(12):29212922. PubMed Abstract  Publisher Full Text

Aubert J, BarHen A, Daudin JJ, Robin S: Determination of the differentially expressed genes in microarray experiments using local FDR.
BMC Bioinformatics 2004, 5:125. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Allison DB, Gadbury GL, Heoc M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data.

Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of pvalues.
Bioinformatics 2003, 19(10):12361242. PubMed Abstract  Publisher Full Text

Liao JG, Lin Y, Selvanayagam ZE, Shih WJ: A mixture model for estimating the local false discovery rate in DNA microarray analysis.
Bioinformatics 2004, 20(16):26942701. PubMed Abstract  Publisher Full Text

Wu B, Guan Z, Zhao H: Parametric and nonparametric FDR estimation revisited.
Biometrics 2006, 62(3):735744. PubMed Abstract  Publisher Full Text

Hsueh HM, Chen JJ, Kodell RL: Comparison of methods for estimating the number of true null hypotheses in multiplicity testing.
J Biopharm Stat 2003, 13(4):675689. PubMed Abstract

Pounds S, Cheng C: Improving false discovery rate estimation.
Bioinformatics 2004, 20(11):17371745. PubMed Abstract  Publisher Full Text

Langaas M, Lindqvist BH: Estimating the proportion of true null hypotheses, with application to DNA microarray data.

Broberg P: A comparative review of estimates of the proportion unchanged genes and the false discovery rate.
BMC Bioinformatics 2005, 6:199. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B: Largescale simultaneous hypothesis testing : the choice of a null hypothesis.
Journal of the American Statistical Association 2004, 99:96104.

Efron B, Tibshirani R: Using specially designed exponential families for density estimation.

Qiu X, Klebanov L, Yakovlev A: Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes.
Stat Appl Genet Mol Biol 2005, 4:Article34. PubMed Abstract

Storey JD: comment on “Resamplingbased multiple testing for DNA microarray data analysis” by Ge, Dudoit and Speed.

Westfall PH, Young SS: Resamplingbased Multiple Testing. In Wiley Series in Probability and Mathematical Statistics. Edited by Barnett V, Bradley RA, Fisher NI, Hunter JS, Kadane JB, Kendall DG, Smith AFM, Stigler SM, Teugels JL, Watson GS. New York , John Wiley & Sons; 1993.

Meinshausen N: False Discovery Control for Multiple Tests of Association Under General Dependence.

Efron B: Correlation and LargeScale Simultaneous Significance Testing.
Journal of the American Statistical Association 2007, 102:93103.

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci U S A 2001, 98(9):51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li C, Wong WH: DNAChip Analyzer (dChip). In The analysis of gene expression data: methods and software. Edited by Parmigiani G, Garrett ES, Irizarry R, Zeger SL. New York , Springer; 2003:120141.

Ge Y, Dudoit S, Speed TP: Resamplingbased Multiple Testing for Micorarray Data Analysis.

Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, Liao X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities in basallike human breast cancer.
Cancer Cell 2006, 9(2):121132. PubMed Abstract  Publisher Full Text

Barash Y, Dehan E, Krupsky M, Franklin W, Geraci M, Friedman N, Kaminski N: Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays.
Bioinformatics 2004, 20(6):839846. PubMed Abstract  Publisher Full Text

Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data.
Nat Biotechnol 2004, 22(6):6568; author reply 658. PubMed Abstract  Publisher Full Text

Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, BorresenDale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours.
Nature 2000, 406(6797):747752. PubMed Abstract  Publisher Full Text

Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Eystein Lonning P, BorresenDale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications.
Proc Natl Acad Sci U S A 2001, 98(19):1086910874. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang X, Lu X, Shi Q, Xu XQ, Leung HC, Harris LN, Iglehart JD, Miron A, Liu JS, Wong WH: Recursive SVM feature selection and sample classification for massspectrometry and microarray data.
BMC Bioinformatics 2006, 7:197. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Benjamini Y, Hochberg Y: On the adaptive control of the false discovery rate in multiple testing with independent statistics.
Journal of Educational and Behavioural Statistics 2000, 25:6083.

Barabasi AL, Albert R: Emergence of scaling in random networks.
Science 1999, 286(5439):509512. PubMed Abstract  Publisher Full Text

Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization.
Nat Rev Genet 2004, 5(2):101113. PubMed Abstract  Publisher Full Text

Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks.
Science 2002, 297(5586):15511555. PubMed Abstract  Publisher Full Text