Abstract
Background
Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) have introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (optimal discovery procedure, ODP), but uses standard FDR assessment; Ploner et al. (2006) have introduced a multivariate local FDR that allows incorporation of standard error information, but uses the standard t-statistic (fdr2d). The relationship and relative performance of these methods in two-sample comparisons is currently unknown.
Methods
Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d.
Results
For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real data sets.
Conclusion
The ODP can be improved by including the standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.
Background
High-throughput methods in molecular biology have challenged existing data analysis methods and stimulated the development of new methods. A key example is the gene expression microarray and its use as a screening tool for detecting genes that are differentially expressed (DE) between different biological states. The need to identify a possibly very small number of regulated genes among the 10,000s of sequences found on modern microarray chips, based on tens to hundreds of biological samples, has led to a plethora of different methods. The emerging consensus in the field [1] suggests that a) despite ongoing research on p-value adjustments [2], false discovery rates (FDR, [3]) are more practical for dealing with the multiplicity problem, and b) classical test statistics require modification to limit the influence of unrealistically small variance estimates. Nonetheless, many competing methods for detecting DE exist, and even attempts at validation on data sets with known mRNA composition [4] cannot offer definitive guidelines.
In this context, the introduction of the so-called optimal discovery procedure (ODP, [5]) constitutes a major conceptual achievement. Building on the Neyman-Pearson lemma for testing an individual hypothesis, the author shows that an extension of the likelihood ratio test statistic for multiple parallel hypotheses (or genes) is the optimal procedure for deciding whether any specific gene is in fact DE: for any fixed number of false positive results, ODP will identify the maximum number of true positives. The ODP therefore establishes a theoretical optimum for detecting DE against which any other method can be measured.
Unfortunately, the optimality of ODP is a strictly theoretical result that requires, for all genes, a full parametric specification of the densities under null and alternative hypothesis. In practice, even assuming normality, the genewise means and variances are unknown, and they become nuisance parameters in the hypothesis testing. Consequently, the authors of [6] have suggested an estimated version EODP, which can be implemented in practice. It is, however, not clear how EODP performs compared to the theoretical optimum, or other existing methods, except under the most benign circumstances (no correlation and equal variances between genes).
The main questions of this paper are therefore a) whether the optimality of ODP is retained by EODP, and b) whether we can improve on EODP's performance in practice. Previously, we have introduced a multidimensional extension of the FDR procedure (fdr2d) that combines standard error information with the classical t-statistic. We demonstrated that the fdr2d performs as well as or better than the usual modified t-statistics, without requiring extra modeling or model assumptions [7]. In this paper, we show that fdr2d also outperforms EODP on simulated and real data sets. We also demonstrate how a synthesis of the EODP and fdr2d procedures can further improve the power to detect DE.
The two-sample problem
We demonstrate the application of EODP and fdr2d in the common situation where we want to detect genes that are DE between two biological states. We assume n_{1} and n_{2} arrays for each group, each containing probes for m genes. For gene i, we observe a vector of expression values x_{i} of length n_{1} + n_{2}, which consists of the observations x_{i1} in the first group, and x_{i2} in the second group. We define the groupwise means and standard deviations as usual, and refer to the pooled standard deviation as
s_{i} = \sqrt{ [(n_{1} - 1) s_{i1}^{2} + (n_{2} - 1) s_{i2}^{2}] / (n_{1} + n_{2} - 2) },
where s_{i1} and s_{i2} denote the standard deviations of gene i in the first and second group.
Furthermore, we assume that we are dealing with a random mixture of DE and non-DE genes, with a proportion π_{0} of genes being non-DE.
ODP statistics
The theoretical ODP statistic assumes that for all i = 1, ... m genes, the density functions of the expression values under the null hypothesis of no DE, f_{i}, and under the alternative hypothesis of DE, g_{i}, are fully known in advance. For the random mixture of DE and non-DE genes outlined above, the ODP statistic for the observed expression values x_{i} of the ith gene can then be written as
The procedure then rejects the null hypothesis for all genes i with S_{i} ≡ S(x_{i}) ≥ λ, i.e. all genes with large S_{i} are declared to be DE. Using the Neyman-Pearson lemma, it can be shown that this procedure is optimal in the sense that for any pre-specified false positive rate (which will determine λ), the ODP will have the maximum true positive rate. This optimality property can also be expressed in terms of FDR [5].
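As an illustration of the idea only, the following sketch evaluates an ODP-type statistic on the log scale for a toy example with fully known normal densities. It assumes the fixed-status form from [5], i.e. the sum of alternative densities over the truly DE genes divided by the sum of null densities over the truly non-DE genes; this may differ in detail from the mixture form used in the text. All function and argument names (`joint_log_density`, `log_odp_statistic`, `mu0`, `mu1`, `mu2`) are hypothetical.

```python
import numpy as np

def joint_log_density(x, means, sigma2):
    """Log joint normal density of one data vector under each of m genes.

    x: (n,) expression values; means: (m, n) mean of every array under
    each gene's model; sigma2: (m,) gene-wise variances. Returns (m,).
    """
    n = x.shape[0]
    resid2 = ((x[None, :] - means) ** 2).sum(axis=1)
    return -0.5 * (n * np.log(2.0 * np.pi * sigma2) + resid2 / sigma2)

def log_odp_statistic(x, group, mu0, mu1, mu2, sigma2, is_de):
    """log S(x) = log sum_{DE j} g_j(x) - log sum_{non-DE j} f_j(x).

    Assumption: densities are known; under the null all arrays of gene j
    have mean mu0[j], under the alternative group 0 has mean mu1[j] and
    group 1 has mean mu2[j].
    """
    n = x.shape[0]
    null_means = np.repeat(mu0[:, None], n, axis=1)
    alt_means = np.where(group[None, :] == 0, mu1[:, None], mu2[:, None])
    log_f = joint_log_density(x, null_means, sigma2)
    log_g = joint_log_density(x, alt_means, sigma2)
    # Sum densities across genes in a numerically stable way.
    return np.logaddexp.reduce(log_g[is_de]) - np.logaddexp.reduce(log_f[~is_de])

# Toy example: three genes, the third one is DE with a shift of 3 in group 1.
group = np.array([0, 0, 0, 1, 1, 1])
mu0 = np.zeros(3)
mu1 = np.zeros(3)
mu2 = np.array([0.0, 0.0, 3.0])
sigma2 = np.ones(3)
is_de = np.array([False, False, True])
log_s_shifted = log_odp_statistic(np.array([0, 0, 0, 3, 3, 3.0]), group, mu0, mu1, mu2, sigma2, is_de)
log_s_flat = log_odp_statistic(np.zeros(6), group, mu0, mu1, mu2, sigma2, is_de)
```

Genes would then be declared DE when the statistic exceeds a threshold λ; the sketch deliberately leaves the choice of λ open.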
Requiring full specification of all null and alternative distributions, however, is impractical. In any realistic application, only an estimated ODP statistic
is feasible, where the densities
where φ(· ; μ, σ^{2}) is the joint density for the normal distribution with mean μ and variance σ^{2}.
Conceptually, under the null hypothesis, we have the usual estimates
The second step in applying the ODP to data is the calibration of the procedure. There is no distribution theory for the statistic, so it is not clear how to choose the threshold λ to achieve a desired FDR level. [6] suggest a conventional algorithm that computes the estimated ODP statistic
Multidimensional local false discovery rate
FDR approaches focus on the distribution of the specific statistic Z used to test the gene-wise null hypotheses, in contrast to ODP, which is based on the distribution of the data. Given a mixture of DE and non-DE genes as described above, the density f of Z can be written as
f(z) = π_{0}f_{0}(z) + (1 - π_{0})f_{1}(z), (2)
where f_{0} and f_{1} are the densities of the test statistic Z for non-DE and DE genes, respectively, and π_{0} the proportion of truly non-DE genes. The local fdr for any observed value z of the test statistic is then
fdr(z) = π_{0}f_{0}(z) / f(z), (3)
and can be interpreted as the expected rate of false positives among genes with test statistic z, see [9]. Practically, the density f can be estimated from the histogram of the test statistics computed from the real data, and f_{0} is estimated similarly from the test statistics computed from permuted data.
Formulated as a decision procedure like ODP, we specify a test statistic Z and a desired threshold α for the local fdr; we then compute for each gene the value of the test statistic z_{i }= Z(x_{i}) and the decision criterion fdr_{i }= fdr(z_{i}) and declare genes with fdr_{i }<α to be DE.
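The decision procedure above can be sketched with plain histograms. This is a minimal illustration, not the implementation used for the results: it omits the smoothing step, takes π_{0} as given, and the names `local_fdr_hist` and `declare_de` are hypothetical.

```python
import numpy as np

def local_fdr_hist(z_obs, z_null, pi0=1.0, bins=20):
    """Histogram estimate of fdr(z) = pi0 * f0(z) / f(z) for each observed z.

    z_obs: observed test statistics; z_null: statistics from permuted data.
    Assumption: no smoothing is applied, so the estimate is noisy in
    sparsely populated bins.
    """
    edges = np.histogram_bin_edges(np.concatenate([z_obs, z_null]), bins=bins)
    f, _ = np.histogram(z_obs, bins=edges, density=True)
    f0, _ = np.histogram(z_null, bins=edges, density=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        fdr_bin = np.where(f > 0, pi0 * f0 / f, 1.0)
    fdr_bin = np.clip(fdr_bin, 0.0, 1.0)
    # Map every observed statistic back to the fdr of its bin.
    idx = np.clip(np.digitize(z_obs, edges) - 1, 0, len(fdr_bin) - 1)
    return fdr_bin[idx]

def declare_de(fdr, alpha=0.05):
    """Declare genes with local fdr below the threshold alpha to be DE."""
    return np.asarray(fdr) < alpha
```

With a larger number of permutations, `z_null` simply contains more values; the `density=True` normalisation keeps the two histograms comparable.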
As the more usual global FDR of a set of test statistics is just the average of their local fdr [9], little seems to be gained by using the local fdr. Note, however, that Equations (2) and (3) still hold if we replace the univariate test statistic Z by a vector Z of test statistics. We have recently shown that for the two-sample problem, using a bivariate test statistic and the associated two-dimensional fdr is more powerful than conventional FDR for univariate test statistics [7]. Specifically, the test statistic Z = (Z_{1}, Z_{2}) with
Z_{1 }= t and Z_{2 }= log se, (4)
where t is the usual t-statistic, and se the standard error of the mean,
se = s \sqrt{1/n_{1} + 1/n_{2}}, with s the pooled standard deviation,
yields smaller fdr not only compared to the conventional t-statistic on its own, but also compared to a number of popular modified t-statistics [10-12].
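For concreteness, the bivariate statistic in (4) could be computed as follows. The sketch assumes the equal-variance (pooled) two-sample t-statistic and takes the difference as second group minus first group; the sign convention and the function name `t_and_log_se` are assumptions.

```python
import numpy as np

def t_and_log_se(x1, x2):
    """Return (t, log se) per gene for two groups of arrays.

    x1: (m, n1) expression values in group 1; x2: (m, n2) in group 2.
    Assumption: pooled-variance t-statistic, difference = group 2 - group 1.
    """
    n1, n2 = x1.shape[1], x2.shape[1]
    diff = x2.mean(axis=1) - x1.mean(axis=1)
    pooled_var = ((n1 - 1) * x1.var(axis=1, ddof=1)
                  + (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return diff / se, np.log(se)

# One gene, three arrays per group.
t_demo, log_se_demo = t_and_log_se(np.array([[1.0, 2.0, 3.0]]),
                                   np.array([[2.0, 3.0, 4.0]]))
```

The pair `(t, log se)` is what a two-dimensional histogram would be built on when estimating fdr2d.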
In the following, we will use the abbreviations fdr1d and fdr2d for local fdr computed based on univariate and bivariate test statistics, respectively. Note that in practice, the fdr2d is estimated in a similar manner as the fdr1d, using two-dimensional histograms instead of one-dimensional histograms, together with a somewhat more sophisticated binomial smoothing procedure, see [7] for details.
Procedures to be evaluated
The central aim of this paper is to compare the operating characteristics of four different procedures for detecting DE on a number of real and simulated data sets:
1. t1d uses the standard t-statistic with conventional fdr1d and serves as a reference.
2. S1d uses the logarithm of
3. t2d uses the test statistic in (4) for calculating fdr2d; this is the same procedure as described in [7].
4. S2d is a novel procedure that combines the logarithm of
Results
Feasibility of S2d
We first evaluate the S2d procedure, based on the bivariate test statistic
Z_{1 }= log
with
Figures 1(a) and 1(b) show the scatter plot of the bivariate test statistics for two real data sets described in Methods, with the estimated fdr2d overlaid as isolines. We exploit the useful fact that we can always average the fdr2d over one of the component statistics to get the fdr1d for the other component statistic:
fdr(z_{1}) = ∫ fdr2d(z_{1}, z_{2}) f(z_{2} | z_{1}) dz_{2},
see [7]. Figures 1(c) and 1(d) show S1d (black) overlaid with the averaged S2d (red) for both data sets, with excellent agreement. This indicates that the smoothing required for computing S2d has been successful.
Figure 1. S2d and S1d for the BRCA and Lymphoma data sets. (a) A scatter plot of the BRCA data.
This is consistent with the relationship between t-statistics and log
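The averaging of fdr2d over one component statistic, as used for the check above, can be sketched on a grid. The sketch assumes that fdr2d is available per grid cell together with the number of genes in each cell, and that the conditional density is approximated by the cell counts; `average_fdr2d` is a hypothetical name.

```python
import numpy as np

def average_fdr2d(fdr2d, counts):
    """Average a gridded fdr2d over the second statistic.

    fdr2d: (k1, k2) local fdr per grid cell; counts: (k1, k2) number of
    genes per cell. Returns a (k1,) array with the count-weighted mean
    over the second axis, i.e. an fdr1d for the first statistic.
    Rows without any genes yield NaN.
    """
    fdr2d = np.asarray(fdr2d, dtype=float)
    counts = np.asarray(counts, dtype=float)
    row_total = counts.sum(axis=1)
    weighted = (fdr2d * counts).sum(axis=1)
    return np.divide(weighted, row_total,
                     out=np.full(row_total.shape, np.nan),
                     where=row_total > 0)

fdr1d_demo = average_fdr2d([[0.1, 0.5], [1.0, 1.0]], [[1, 3], [0, 0]])
```

Plotting this averaged curve against a directly estimated fdr1d gives the kind of agreement check shown in Figures 1(c) and 1(d).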
Performance on simulated data sets
We perform simulations with 10,000 genes per array, a proportion of truly non-DE genes π_{0} = 0.8, and two independent groups with n = 7 arrays per group. We combine three different levels of variance heterogeneity between genes with two different settings for the balance between up- and down-regulation, for a total of six different simulation scenarios:
1. Variances can be 'similar' (effectively the same) across genes, 'balanced', which allows for moderate differences in variance between genes, and 'variable', which allows large differences.
2. In the 'symmetric' case, roughly half of the DE genes are up-regulated and half are down-regulated; in the 'asymmetric' case, only about 20% of the DE genes are down-regulated, and the rest are up-regulated.
We have included the asymmetric scenario, because this is where ODP is expected to perform better than standard methods in a theoretical setting [5]. All expression values are assumed to follow a normal distribution; see Methods for further details of the simulation procedure.
For each scenario, we generate 100 data sets, for a total of 10^{6 }genes. For each procedure, the fdr values are computed by keeping track of the DE status of each gene, grouping the genes in intervals (1d) or grid cells (2d) based on their test statistic, and computing the percentage of false positives in each interval or cell.
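Because the DE status is known in the simulations, the fdr per interval is a simple proportion. The following sketch shows the one-dimensional case under the assumption of equal-width bins; the bin count and the name `true_fdr_by_bin` are illustrative choices, not taken from the text.

```python
import numpy as np

def true_fdr_by_bin(stat, is_de, bins=50):
    """Proportion of non-DE genes (false positives) in each histogram bin.

    stat: (m,) test statistics; is_de: (m,) boolean true DE status.
    Returns (fdr per bin, number of genes per bin); empty bins get NaN.
    """
    stat = np.asarray(stat, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    edges = np.histogram_bin_edges(stat, bins=bins)
    n_bins = len(edges) - 1
    idx = np.clip(np.digitize(stat, edges) - 1, 0, n_bins - 1)
    n_all = np.bincount(idx, minlength=n_bins)
    n_null = np.bincount(idx, weights=(~is_de).astype(float), minlength=n_bins)
    fdr = np.divide(n_null, n_all,
                    out=np.full(n_bins, np.nan), where=n_all > 0)
    return fdr, n_all

fdr_demo, n_demo = true_fdr_by_bin([0.0, 0.1, 0.9, 1.0],
                                   [False, False, True, False], bins=2)
```

The two-dimensional case works the same way with grid cells in place of intervals.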
In order to compare different fdr procedures, we summarize their results via operating characteristics (OC) curves: for each procedure, we sort the groups of genes as described above by their local fdr, and compute the corresponding global FDR as cumulative mean of the local fdrs from the smallest to the largest. This global FDR is then plotted against the cumulative percentage of genes in these intervals or grid cells. The resulting curve shows the true global FDR for a set of top-ranked genes as a function of the size of that set (as a percentage of the number of genes under study). The results for the different simulation scenarios and all four procedures are shown in Figure 2.
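The construction of an OC curve from per-interval (or per-cell) local fdr values can be sketched as follows. The sketch assumes that the cumulative mean is weighted by the number of genes in each interval or cell; the name `oc_curve` is hypothetical.

```python
import numpy as np

def oc_curve(local_fdr, counts):
    """Operating characteristics curve from binned local fdr values.

    local_fdr: local fdr per interval or grid cell; counts: number of
    genes in each. Returns (fraction of genes declared DE, global FDR),
    both ordered from the smallest to the largest local fdr.
    """
    local_fdr = np.asarray(local_fdr, dtype=float).ravel()
    counts = np.asarray(counts, dtype=float).ravel()
    keep = counts > 0                      # drop empty intervals or cells
    order = np.argsort(local_fdr[keep], kind="stable")
    f = local_fdr[keep][order]
    n = counts[keep][order]
    cum_n = np.cumsum(n)
    global_fdr = np.cumsum(f * n) / cum_n  # gene-weighted cumulative mean
    return cum_n / counts.sum(), global_fdr

frac_demo, fdr_curve_demo = oc_curve([0.5, 0.0, 1.0], [2, 2, 4])
```

A lower curve at a given fraction of genes means fewer false positives among the same number of top-ranked genes.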
Figure 2. Operating characteristics of the four procedures for six simulated data sets. Each curve shows the true global FDR among the top-ranked genes for a procedure on the vertical axis as a function of the percentage of genes declared DE by this procedure on the horizontal axis. See text for description of the simulation scenarios.
There is little or no difference in relative performance between the procedures under the symmetric and asymmetric scenarios in Figure 2. It is also clear that the differences in performance are most pronounced when the variances are similar, less so when the variances are balanced, and minor when the variances are highly variable. The ranking of the different procedures is consistent through all six scenarios: as expected, t1d has the worst performance; equally as expected, S1d does clearly better than t1d. Novel findings of this paper are that a) t2d does still better than S1d, and b) S2d improves over t2d, although only marginally.
Performance on real datasets
We evaluate the performance of the different procedures on two real data sets:
• The BRCA data [13] contains 3,170 genes and was collected from 15 patients with hereditary breast cancer, who had mutations either of the BRCA1 (n = 7) or the BRCA2 gene (n = 8).
• The Lymphoma data [14] contains 7,399 genes and was collected from 240 patients with diffuse large B-cell lymphoma, comprising n_{1} = 102 survivors and n_{2} = 138 non-survivors.
Here, the local fdr estimates are computed based on the mixture model (2). The estimate of f is computed by smoothing the histograms of the observed statistics, and similarly f_{0} from permuted test statistics. The permuted statistics are obtained from permutations of the group labels to generate the null distribution. Technically, we also need an estimate
For each procedure, we rank the genes by their estimated fdr, and compute their estimated global FDR among the top-ranked genes as the cumulative mean of their local fdrs. The global FDR is then plotted as a function of the percentage of genes declared DE. For comparison purposes, we also include the FDR as computed by the EDGE software.
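The permutation of group labels used to generate null statistics for the real data could look as follows for the t-statistic. This is a sketch under stated assumptions: a pooled-variance t-statistic, random rather than exhaustive permutations, and hypothetical names `pooled_t` and `permuted_t`.

```python
import numpy as np

def pooled_t(x, group):
    """Pooled-variance t-statistic per gene; x: (m, n), group: (n,) of 0/1."""
    x1, x2 = x[:, group == 0], x[:, group == 1]
    n1, n2 = x1.shape[1], x2.shape[1]
    pooled_var = ((n1 - 1) * x1.var(axis=1, ddof=1)
                  + (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (x2.mean(axis=1) - x1.mean(axis=1)) / se

def permuted_t(x, group, n_perm=100, seed=0):
    """Null statistics from random permutations of the group labels.

    Permuting labels (not individual values) keeps the correlation
    between genes on the same array intact. Returns m * n_perm values.
    """
    rng = np.random.default_rng(seed)
    return np.concatenate([pooled_t(x, rng.permutation(group))
                           for _ in range(n_perm)])
```

The pooled values returned by `permuted_t` would then be binned and smoothed in the same way as the observed statistics to obtain an estimate of f_{0}.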
The resulting OC curves are shown in Figure 3. We get the same ranking as for the simulated data: t1d performs worst and is easily bettered by S1d; t2d performs better than S1d for the 2% most highly regulated genes, and is equivalent otherwise; S2d has a slight advantage over t2d on the BRCA data. Additionally, as a check that our implementation of ODP is correct, we are happy to see that EDGE and S1d yield virtually identical FDR curves.
Figure 3. Operating characteristics of the four procedures and EDGE for the BRCA and Lymphoma data. Each curve shows the estimated global FDR among the top-ranked genes for a procedure on the vertical axis as a function of the percentage of genes declared DE by this procedure on the horizontal axis.
We [7] have previously compared t2d with other procedures such as SAM [11], Efron's modified t [10], and an empirical Bayes modification of the t-statistic [12]. To add more comparisons, we have run two procedures by Pounds and Cheng (Splosh [15] and robust FDR [16]) for the two real data sets. We use their own software, with a little modification so that we can specify the
Figure 4. Operating characteristics of different procedures for the BRCA and Lymphoma data: t1d and t2d combine standard t-statistics with one- and two-dimensional local fdr as shown in Figure 3; 'Splosh' and 'robust' are the FDR procedures described in Pounds and Cheng (2004) and Pounds and Cheng (2006). The 'standard' method is described in Storey and Tibshirani (2003).
Discussion
The main motivation for using the FDR has been that it offers a way of dealing with multiplicity that is less restrictive and more powerful than traditional p-value adjustments. The challenge is how to explicitly exploit the multiplicity by pooling information across genes in order to make the FDR even more powerful.
In the case of t1d, the test statistic is computed gene-by-gene and does not use information shared with other genes. Moderated t-statistics [10-12], which borrow strength across genes for estimating standard errors, are more powerful than simple t-statistics. The ODP appears to be the ultimate in combining information, where to some extent all genes contribute to the statistic for each other gene. The fdr2d approach on the other hand augments the grouping of genes based on individual test statistics by subgrouping them based on their variability. In all cases we find that when there are few instances of genes with similar variability, the performance of the different methods tends to converge towards the simple t1d (Figures 2(e) and 2(f)).
From a practical point of view, the smoothing procedure underlying our implementation of fdr2d seems to work as well for the statistic log
At first glance, the empirical ODP statistic seems to rely on the assumption that the expression values for all genes are normally distributed. From a practical point of view, however, the empirical ODP procedure works even if the normal assumption does not hold, because it relies on the permutation algorithm. In this sense, the normal densities in (1) only represent a scoring function that exponentially downweights contributions from genes with different mean structure and/or large variability. However, the performance of the empirical ODP will depend on how precisely the normal assumption holds for the data at hand. Some loss of the optimality property in the real data applications is probably due to non-normality. But even in the simulations, the empirical ODP is not better than t2d. This can only mean that the presence of a large number of nuisance parameters degrades the performance of ODP.
Conclusion
The estimation of the nuisance parameters required to apply the ODP in practice makes the procedure described in [6] no longer optimal. We have shown in this paper that the combination of a conventional t-statistic with the standard error of the mean as described in [7] can outperform the empirical ODP. Further improvements can be made by combining the ODP test statistic with standard error information, but the gains are comparatively small.
The ODP procedure exploits similarities in the distribution for a collection of genes, for example similarity in variance. When variances between genes are dissimilar, there is little gain by the ODP compared to the standard t-statistic. One advantage of the ODP over the modified t-statistics is that the adaptation is done automatically, without calculating a model-based or heuristic fudge factor for the denominator.
The computational demand of calculating the ODP statistic is a serious practical disadvantage: each density term f(x) or g(x) requires computation across the whole dataset, so a single ODP statistic already involves substantial computations. Doing this for the whole collection of genes and for repeated permutations of the group labels is an order of magnitude more laborious than the computation required for the standard statistics.
Methods
Simulation scenarios
Our model for simulating microarray data is based on the model described in [12]. We assume that the expression values for all m genes are normally distributed (possibly after suitable transformation), and that
their variances
In detail we proceed as follows for our simulations:
1. Initialize the design with m = 10,000 genes, proportion of non-DE genes π_{0} = 0.8, and two groups with n_{1} = n_{2} = 7.
2. For each gene i = 1, ... m, draw a gene-specific variance from
where
3. For each gene i = 1, ... m, determine randomly whether it is to be non-DE (with probability π_{0}) or DE (with probability 1 - π_{0}).
(a) In case of non-DE, set μ_{1} = μ_{2} = 0.
(b) In case of DE, set μ_{1} = 0 and draw μ_{2} randomly from
D_{i} ~ N(0, v_{0}σ_{i}^{2}),
where σ_{i}^{2} is the gene-specific variance from step 2 and v_{0} is another tuning parameter.
i. In case of an asymmetric scenario, set the sign of μ_{2} to positive with probability 0.8, and to negative otherwise.
4. Simulate n_{1} and n_{2} values in the first and second group, respectively, following normal distributions
X_{.i1} ~ N(μ_{1}, σ_{i}^{2}),
X_{.i2} ~ N(μ_{2}, σ_{i}^{2}).
Following [12], we set the constants to
For each scenario, we then generate 100 data sets, for a total of 10^{6 }genes. For each procedure, the true local fdr of the genes is estimated from the known DE status of each simulated gene, simply as the proportion of false positives in each histogram interval or grid cell. This means specifically that no permutation, smoothing, or estimation of π_{0 }is required.
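The simulation steps above can be sketched as follows. Two points are assumptions rather than facts from the text: the gene-specific variances are drawn from a scaled inverse chi-square distribution in the spirit of the model in [12], and the default values of the tuning parameters `d0`, `s0_sq` and `v0` are arbitrary placeholders, since the constants actually used are not reproduced here. The function name `simulate_dataset` is hypothetical.

```python
import numpy as np

def simulate_dataset(m=10_000, pi0=0.8, n1=7, n2=7,
                     d0=4.0, s0_sq=0.04, v0=2.0, p_up=0.5, seed=0):
    """Simulate one two-group data set; returns (x1, x2, mu2, is_de).

    d0, s0_sq, v0: placeholder tuning parameters (NOT the paper's values).
    p_up: probability that a DE gene is up-regulated (0.5 symmetric,
    0.8 for the asymmetric scenario).
    """
    rng = np.random.default_rng(seed)
    # Step 2 (assumed form): sigma_i^2 = d0 * s0^2 / chi-square(d0).
    sigma2 = d0 * s0_sq / rng.chisquare(d0, size=m)
    sd = np.sqrt(sigma2)
    # Step 3: non-DE with probability pi0, DE otherwise.
    is_de = rng.random(m) >= pi0
    # Step 3(b): effect size ~ N(0, v0 * sigma_i^2); sign set by p_up.
    size = np.abs(rng.normal(0.0, np.sqrt(v0) * sd))
    sign = np.where(rng.random(m) < p_up, 1.0, -1.0)
    mu2 = np.where(is_de, sign * size, 0.0)
    # Step 4: group 1 has mean 0, group 2 has mean mu2, common variance.
    x1 = rng.normal(0.0, sd[:, None], size=(m, n1))
    x2 = rng.normal(mu2[:, None], sd[:, None], size=(m, n2))
    return x1, x2, mu2, is_de
```

Calling the function 100 times with different seeds and pooling the genes would mirror the 10^{6} genes per scenario described above.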
Real data sets
The permutation and smoothing approach used for estimating the fdr values for real data has been described in detail in [9] and [7]. The estimates
The BRCA data set [13] was collected from patients with hereditary breast cancer who had mutations either of the BRCA1 (n = 7) or the BRCA2 gene (n = 8). Expression was originally reported for 3,226 genes, but following [8], we removed 56 extremely variable genes and analysed only the remaining 3,170 genes. For all four procedures, we used
The Lymphoma data set [14] was collected from 240 patients with diffuse large B-cell lymphoma, n_{1} = 102 of whom survived the study period, and n_{2} = 138 of whom did not. We used all 7,399 genes reported in the original article. For all four procedures, we used
All expression values were logged prior to analysis.
Software
Methods t1d and t2d are implemented in the R package OCplus, which is freely available at the Bioconductor website [18]. R code implementing S1d and S2d is available from the authors on request. EDGE, the official implementation of EODP described in [19], is available at [20].
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
EP wrote computer programs, ran simulations and drafted the manuscript. AP wrote computer programs, ran data analysis and cowrote the manuscript. SC cowrote the manuscript. YP conceived the study and drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was partially supported by a research grant from the Swedish Cancer Foundation.
References

1. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7:55-65.
2. Datta S, Datta S: Empirical Bayes screening of many p-values with applications to microarray studies. Bioinformatics 2005, 21(9):1987-94.
3. Benjamini Y, Hochberg Y: Controlling the false discovery rate – A practical and powerful approach to multiple testing.
4. Choe S, Boutros M, Michelson A, Church G, Halfon M: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology 2005, 6(2):R16.
5. Storey JD: The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing. [http://www.bepress.com/uwbiostat/paper259] UW Biostatistics Working Paper Series Working Paper 259 2005.
6. Storey JD, Dai JY, Leek JT: The Optimal Discovery Procedure for Large-Scale Significance Testing, with Applications to Comparative Microarray Experiments. [http://www.bepress.com/uwbiostat/paper260] UW Biostatistics Working Paper Series Working Paper 260 2005.
7. Ploner A, Calza S, Gusnanto A, Pawitan Y: Multidimensional local false discovery rate for microarray studies. Bioinformatics 2006, 22(5):556-565.
8. Storey JD, Tibshirani R: Statistical significance for genome-wide studies. Proc Natl Acad Sci USA 2003, 100(16):9440-5.
9. Efron B, Tibshirani R, Storey J, Tusher V: Empirical Bayes Analysis of a Microarray Experiment.
10. Efron B, Tibshirani R, Goss V, Chu G: Microarrays and their use in a comparative experiment. [http://www-stat.stanford.edu/~tibs/research.html]
11. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98(9):5116-5121.
12. Smyth G: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. [http://www.bepress.com/sagmb/vol3/issl/art3] Statistical Applications in Genetics and Molecular Biology 2004, 3:Article 3.
13. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J: Gene-expression profiles in hereditary breast cancer. N Engl J Med 2001, 344(8):539-48.
14. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Müller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, López-Guillermo A, Grogan TM, Miller TP, LeBlanc M, Ott G, Kvaloy S, Delabie J, Holte H, Krajci P, Stokke T, Staudt LM, Project LMP: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002, 346(25):1937-47.
15. Pounds S, Cheng C: Improving false discovery rate estimation. Bioinformatics 2004, 20(11):1737-45.
16. Pounds S, Cheng C: Robust estimation of the false discovery rate. Bioinformatics 2006, 22(16):1979-1987.
17. Pawitan Y, Murthy KRK, Michiels S, Ploner A: Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 2005, 21(20):3865-3872.
18. Bioconductor [http://www.bioconductor.org]
19. Leek JT, Monsen E, Dabney AR, Storey JD: EDGE: extraction and analysis of differential gene expression. Bioinformatics 2006, 22(4):507-508.
20. EDGE [http://www.biostat.washington.edu/software/jstorey/edge]