Detecting differential expression in microarray data: comparison of optimal procedures

1 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, 17177 Stockholm, Sweden
2 Department of Biomedical Sciences and Biotechnologies, Brescia, Italy

BMC Bioinformatics 2007, 8:28 doi:10.1186/1471-2105-8-28

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/8/28

© 2007 Perelman et al; licensee BioMed Central Ltd.

Abstract

Background

Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) have introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (optimal discovery procedure, ODP), but uses standard FDR assessment; Ploner et al. (2006) have introduced a multivariate local FDR that allows incorporation of standard error information, but uses the standard t-statistic (fdr2d). The relationship and relative performance of these methods in two-sample comparisons are currently unknown.

Methods

Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d.

Results

For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real data sets.

Conclusion

The ODP can be improved by including the standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.

Background

High-throughput methods in molecular biology have challenged existing data analysis methods and stimulated the development of new methods.
A key example is the gene expression microarray and its use as a screening tool for detecting genes that are differentially expressed (DE) between different biological states. The need to identify a possibly very small number of regulated genes among the tens of thousands of sequences found on modern microarray chips, based on tens to hundreds of biological samples, has led to a plethora of different methods. The emerging consensus in the field [1] suggests that a) despite ongoing research on p-value adjustments [2], false discovery rates (FDR, [3]) are more practical for dealing with the multiplicity problem, and b) classical test statistics require modification to limit the influence of unrealistically small variance estimates. Nonetheless, many competing methods for detecting DE exist, and even attempts at validation on data sets with known mRNA composition [4] cannot offer definitive guidelines.

In this context, the introduction of the so-called optimal discovery procedure (ODP, [5]) constitutes a major conceptual achievement. Building on the Neyman-Pearson lemma for testing an individual hypothesis, the author shows that an extension of the likelihood ratio test statistic for multiple parallel hypotheses (or genes) is the optimal procedure for deciding whether any specific gene is in fact DE: for any fixed number of false positive results, ODP will identify the maximum number of true positives. The ODP therefore establishes a theoretical optimum for detecting DE against which any other method can be measured.

Unfortunately, the optimality of ODP is a strictly theoretical result that requires, for all genes, a full parametric specification of the densities under the null and alternative hypotheses. In practice, even assuming normality, the gene-wise means and variances are unknown, and they become nuisance parameters in the hypothesis testing. Consequently, the authors of [6] have suggested an estimated version, EODP, which can be implemented in practice.
It is, however, not clear how EODP performs compared to the theoretical optimum, or to other existing methods, except under the most benign circumstances (no correlation and equal variances between genes). The main questions of this paper are therefore a) whether the optimality of ODP is retained by EODP, and b) whether we can improve on EODP's performance in practice.

Previously, we have introduced a multidimensional extension of the FDR procedure (fdr2d) that combines standard error information with the classical t-statistic. We demonstrated that fdr2d performs as well as or better than the usual modified t-statistics, without requiring extra modeling or model assumptions [7]. In this paper, we show that fdr2d also outperforms EODP on simulated and real data sets. We also demonstrate how a synthesis of the EODP and fdr2d procedures can further improve the power to detect DE.

The two-sample problem

We demonstrate the application of EODP and fdr2d in the common situation where we want to detect genes that are DE between two biological states. We assume n1 and n2 arrays in the first and second group, respectively, each containing probes for m genes. For gene i, we observe a vector of expression values xi of length n1 + n2, which consists of the observations xi1 in the first group and xi2 in the second group. We define the groupwise means, the groupwise standard deviations, and the pooled standard deviation as usual. Furthermore, we assume that we are dealing with a random mixture of DE and nonDE genes, with a proportion π0 of genes being nonDE.

ODP statistics

The theoretical ODP statistic assumes that for all i = 1, ..., m genes, the density functions of the expression values under the null hypothesis of no DE, fi, and under the alternative hypothesis of DE, gi, are fully known in advance.
For the random mixture of DE and nonDE genes outlined above, the ODP statistic for the observed expression values xi of the i-th gene can then be written as

$$S(x_i) = \frac{\sum_{j \in \mathrm{DE}} g_j(x_i)}{\sum_{j \in \mathrm{nonDE}} f_j(x_i)}.$$

The procedure then rejects the null hypothesis for all genes i with Si ≡ S(xi) ≥ λ, i.e. all genes with large Si are declared to be DE. Using the Neyman-Pearson lemma, it can be shown that this procedure is optimal in the sense that for any pre-specified false positive rate (which will determine λ), the ODP will have the maximum true positive rate. This optimality property can also be expressed in terms of FDR [5].

Requiring full specification of all null and alternative distributions, however, is impractical. In any realistic application, only an estimated ODP statistic is feasible, where the densities fi and gi are replaced by estimated normal densities. Here φ(·|μ, σ2) denotes the joint density of the normal distribution with mean μ and variance σ2. Conceptually, under the null hypothesis, we have the usual estimates of μ and σ2.

The second step in applying the ODP to data is the calibration of the procedure. There is no distribution theory for the statistic, so it is not clear how to choose the threshold λ to achieve a desired FDR level. For this purpose, [6] suggest a conventional algorithm based on the estimated ODP statistic.

Multidimensional local false discovery rate

FDR approaches focus on the distribution of the specific statistic Z used to test the gene-wise null hypotheses, in contrast to ODP, which is based on the distribution of the data. Given a mixture of DE and nonDE genes as described above, the density f of Z can be written as

$$f(z) = \pi_0 f_0(z) + (1 - \pi_0) f_1(z), \tag{2}$$

where f0 and f1 are the densities of the test statistic Z for nonDE and DE genes, respectively, and π0 is the proportion of truly nonDE genes. The local fdr for any observed value z of the test statistic is then

$$\mathrm{fdr}(z) = \frac{\pi_0 f_0(z)}{f(z)}, \tag{3}$$

and can be interpreted as the expected rate of false positives among genes with test statistic z, see [9].
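As a purely numerical illustration of Equations (2) and (3), the following minimal sketch evaluates the local fdr for a toy mixture in which both component densities are known. The standard normal null density, the alternative density shifted to mean 3, and the function names are assumptions made only for this example; π0 = 0.8 is borrowed from the simulation design.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, f0, f1):
    """Local fdr from Equation (3), with the mixture density of Equation (2)."""
    null_part = pi0 * f0(z)
    mixture = null_part + (1.0 - pi0) * f1(z)
    return null_part / mixture


# Hypothetical toy setting (not taken from the paper):
# nonDE statistics ~ N(0, 1), DE statistics ~ N(3, 1).
pi0 = 0.8
f0 = lambda z: normal_pdf(z, mean=0.0)
f1 = lambda z: normal_pdf(z, mean=3.0)

fdr_values = {z: local_fdr(z, pi0, f0, f1) for z in (0.0, 1.5, 3.0)}
for z, value in fdr_values.items():
    print(f"z = {z:3.1f}  fdr = {value:.3f}")
```

At z = 1.5 the two toy densities coincide, so the local fdr equals π0 there; it rises towards 1 near z = 0 and falls as z approaches the mean of the alternative.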
Practically, the density f can be estimated from the histogram of the test statistics computed from the real data, and f0 is estimated similarly from the test statistics computed from permuted data. Formulated as a decision procedure like ODP, we specify a test statistic Z and a desired threshold α for the local fdr; we then compute for each gene the value of the test statistic zi = Z(xi) and the decision criterion fdri = fdr(zi), and declare genes with fdri < α to be DE.

As the more usual global FDR of a set of test statistics is just the average of their local fdr [9], little seems to be gained by using the local fdr. Note, however, that Equations (2) and (3) still hold if we replace the univariate test statistic Z by a vector Z of test statistics. We have recently shown that for the two-sample problem, using a bivariate test statistic and the associated two-dimensional fdr is more powerful than conventional FDR for univariate test statistics [7]. Specifically, the test statistic Z = (Z1, Z2) with

$$Z_1 = t \quad \text{and} \quad Z_2 = \log se, \tag{4}$$

where t is the usual t-statistic and se the standard error of the mean, yields smaller fdr not only compared to the conventional t-statistic on its own, but also compared to a number of popular modified t-statistics [10-12].

In the following, we will use the abbreviations fdr1d and fdr2d for local fdr computed based on univariate and bivariate test statistics, respectively. Note that in practice, the fdr2d is estimated in a similar manner as the fdr1d, using two-dimensional histograms instead of one-dimensional histograms, together with a somewhat more sophisticated binomial smoothing procedure; see [7] for details.

Procedures to be evaluated

The central aim of this paper is to compare the operating characteristics of four different procedures for detecting DE on a number of real and simulated data sets:

1. t1d uses the standard t-statistic with conventional fdr1d and serves as a reference.
2. S1d uses the logarithm of the estimated ODP statistic with conventional fdr1d.
3. t2d uses the test statistic in (4) for calculating fdr2d; this is the same procedure as described in [7].
4. S2d is a novel procedure that combines the logarithm of the estimated ODP statistic with log se for calculating fdr2d.

Results

Feasibility of S2d

We first evaluate the S2d procedure, based on the bivariate test statistic that pairs the logarithm of the estimated ODP statistic with log se. Figures 1(a) and 1(b) show the scatter plot of the bivariate test statistics for two real data sets described in Methods, with the estimated fdr2d overlaid as isolines. We exploit the useful fact that we can always average the fdr2d over one of the component statistics to get the fdr1d for the other component statistic:

$$\mathrm{fdr1d}(z_1) = \int \mathrm{fdr2d}(z_1, z_2)\, f(z_2 \mid z_1)\, dz_2,$$
see [7]. Figures 1(c) and 1(d) show S1d (black) overlaid with the averaged S2d (red) for both data sets, with excellent agreement. This indicates that the smoothing required for computing S2d has been successful. This is consistent with the relationship between t-statistics and log

Performance on simulated data sets

We perform simulations with 10,000 genes per array, a proportion of truly nonDE genes π0 = 0.8, and two independent groups with n = 7 arrays per group. We combine three different levels of variance heterogeneity between genes with two different settings for the balance between up- and down-regulation, for a total of six different simulation scenarios:

1. Variances can be 'similar' (effectively the same across genes), 'balanced' (moderate differences in variance between genes), or 'variable' (large differences in variance between genes).
2. In the 'symmetric' case, roughly half of the DE genes are up-regulated and half are down-regulated; in the 'asymmetric' case, only about 20% of the DE genes are down-regulated, and the rest are up-regulated.

We have included the asymmetric scenario because this is where ODP is expected to perform better than standard methods in a theoretical setting [5]. All expression values are assumed to follow a normal distribution; see Methods for further details of the simulation procedure.

For each scenario, we generate 100 data sets, for a total of 10^6 genes. For each procedure, the fdr values are computed by keeping track of the DE status of each gene, grouping the genes in intervals (1d) or grid cells (2d) based on their test statistic, and computing the percentage of false positives in each interval or cell. In order to compare different fdr procedures, we summarize their results via operating characteristics (OC) curves: for each procedure, we sort the groups of genes as described above by their local fdr, and compute the corresponding global FDR as the cumulative mean of the local fdrs from the smallest to the largest.
This global FDR is then plotted against the cumulative percentage of genes in these intervals or grid cells. The resulting curve shows the true global FDR for a set of top-ranked genes as a function of the size of that set (as a percentage of the number of genes under study). The results for the different simulation scenarios and all four procedures are shown in Figure 2.
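The construction of an OC curve from grouped local fdr values can be sketched as follows. This is a minimal illustration, not the OCplus implementation; the function name `oc_curve` is hypothetical, and weighting each interval or grid cell by the number of genes it contains is our reading of "cumulative mean of the local fdrs".

```python
import numpy as np


def oc_curve(local_fdr, counts):
    """Return (percentage of genes declared DE, global FDR) for an OC curve.

    local_fdr : local fdr of each interval (1d) or grid cell (2d)
    counts    : number of genes falling into each interval or cell
    """
    local_fdr = np.asarray(local_fdr, dtype=float).ravel()
    counts = np.asarray(counts, dtype=float).ravel()

    # Empty cells contribute no genes and would only cause 0/0 below.
    keep = counts > 0
    local_fdr, counts = local_fdr[keep], counts[keep]

    # Rank cells from the smallest to the largest local fdr.
    order = np.argsort(local_fdr, kind="stable")
    fdr_sorted, n_sorted = local_fdr[order], counts[order]

    # Global FDR of the top-ranked cells: gene-weighted cumulative mean.
    cum_genes = np.cumsum(n_sorted)
    cum_false = np.cumsum(fdr_sorted * n_sorted)
    global_fdr = cum_false / cum_genes

    pct_declared = 100.0 * cum_genes / counts.sum()
    return pct_declared, global_fdr


# Toy input: three cells with 50, 10 and 40 genes.
pct, gfdr = oc_curve([0.5, 0.0, 0.2], [50, 10, 40])
```

Plotting `gfdr` against `pct` gives one curve per procedure; a lower curve means a smaller global FDR for the same percentage of genes declared DE.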
There is little or no difference in relative performance between the procedures under the symmetric and asymmetric scenarios in Figure 2. It is also clear that the differences in performance are most pronounced when the variances are similar, less so when the variances are balanced, and minor when the variances are highly variable. The ranking of the different procedures is consistent through all six scenarios: as expected, t1d has the worst performance; equally as expected, S1d does clearly better than t1d. Novel findings of this paper are that a) t2d does still better than S1d, and b) S2d improves over t2d, although only marginally.

Performance on real datasets

We evaluate the performance of the different procedures on two real data sets:

• The BRCA data [13] contain 3,170 genes and were collected from 15 patients with hereditary breast cancer, who had mutations either of the BRCA1 (n = 7) or the BRCA2 gene (n = 8).
• The Lymphoma data [14] contain 7,399 genes and were collected from 240 patients with diffuse large B-cell lymphoma, comprising n1 = 102 survivors and n2 = 138 non-survivors.

Here, the local fdr estimates are computed based on the mixture model (2). The estimate of f is computed by smoothing the histograms of the observed statistics, and similarly f0 from permuted test statistics. The permuted statistics are obtained from permutations of the group labels to generate the null distribution. Technically, we also need an estimate of the proportion π0 of nonDE genes.

For each procedure, we rank the genes by their estimated fdr, and compute their estimated global FDR among the top-ranked genes as the cumulative mean of their local fdrs. The global FDR is then plotted as a function of the percentage of genes declared DE. For comparison purposes, we also include the FDR as computed by the EDGE software. The resulting OC curves are shown in Figure 3.
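The permutation-based estimate of the local fdr for a univariate statistic (the t1d setting) can be sketched along the following lines. This is a simplified illustration under stated assumptions, not the OCplus or EDGE code: raw histograms are used without the smoothing step described in [7] and [9], π0 is supplied by the user rather than estimated, and the function names, the number of permutations and the number of bins are hypothetical choices.

```python
import numpy as np


def t_statistics(x, labels):
    """Equal-variance two-sample t-statistic for each gene (row of x)."""
    g1, g2 = x[:, labels == 0], x[:, labels == 1]
    n1, n2 = g1.shape[1], g2.shape[1]
    pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
                  + (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (g2.mean(axis=1) - g1.mean(axis=1)) / se


def fdr1d_permutation(x, labels, pi0, n_perm=50, n_bins=41, seed=0):
    """Unsmoothed histogram estimate of fdr(z) = pi0 * f0(z) / f(z)."""
    rng = np.random.default_rng(seed)
    t_obs = t_statistics(x, labels)

    # Null statistics: recompute t after permuting the group labels.
    t_null = np.concatenate(
        [t_statistics(x, rng.permutation(labels)) for _ in range(n_perm)]
    )

    # Common bins for observed and permuted statistics.
    lim = max(np.abs(t_obs).max(), np.abs(t_null).max())
    edges = np.linspace(-lim, lim, n_bins + 1)
    f, _ = np.histogram(t_obs, bins=edges, density=True)
    f0, _ = np.histogram(t_null, bins=edges, density=True)

    # Ratio of densities per bin, truncated to the interval [0, 1].
    with np.errstate(divide="ignore", invalid="ignore"):
        fdr_bins = np.where(f > 0, pi0 * f0 / f, 1.0)
    fdr_bins = np.clip(fdr_bins, 0.0, 1.0)

    # Look up the bin of each gene's observed statistic.
    idx = np.clip(np.digitize(t_obs, edges) - 1, 0, n_bins - 1)
    return t_obs, fdr_bins[idx]
```

Here `x` is a genes-by-arrays matrix of log expression values and `labels` is a 0/1 vector of group membership. Replacing the one-dimensional histograms by two-dimensional ones over (t, log se) gives the corresponding unsmoothed fdr2d; with real data the raw histogram ratio is noisy, which is why the smoothing in [7] matters.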
We get the same ranking as for the simulated data: t1d performs worst and is easily bettered by S1d; t2d performs better than S1d for the 2% most highly regulated genes, and is equivalent otherwise; S2d has a slight advantage over t2d on the BRCA data. Additionally, as a check that our implementation of ODP is correct, we are happy to see that EDGE and S1d yield virtually identical FDR curves.
We [7] have previously compared t2d with other procedures such as SAM [11], Efron's modified t [10], and an empirical Bayes modification of the t-statistic [12]. To add more comparisons, we have run two procedures by Pounds and Cheng (Splosh [15] and robust FDR [16]) for the two real data sets. We use their own software, with a little modification so that we can specify the
Discussion

The main motivation for using the FDR has been that it offers a way of dealing with multiplicity that is less restrictive and more powerful than traditional p-value adjustments. The challenge is how to explicitly exploit the multiplicity by pooling information across genes in order to make the FDR even more powerful. In the case of t1d, the test statistic is computed gene-by-gene and does not use information shared with other genes. Moderated t-statistics [10-12], which borrow strength across genes for estimating standard errors, are more powerful than simple t-statistics. The ODP appears to be the ultimate in combining information, where to some extent all genes contribute to the statistic for each other gene. The fdr2d approach, on the other hand, augments the grouping of genes based on individual test statistics by sub-grouping them based on their variability. In all cases we find that when there are few instances of genes with similar variability, the performance of the different methods tends to converge towards the simple t1d (Figures 2(e) and 2(f)). From a practical point of view, the smoothing procedure underlying our implementation of fdr2d also seems to work for the log ODP statistic.

At first glance, the empirical ODP statistic seems to rely on the assumption that the expression values for all genes are normally distributed. From a practical point of view, however, the empirical ODP procedure works even if the normal assumption does not hold, because it relies on the permutation algorithm. In this sense, the normal densities in (1) only represent a scoring function that exponentially downweights contributions from genes with different mean structure and/or large variability. However, the performance of the empirical ODP will depend on how precisely the normal assumption holds for the data at hand. Some loss of the optimality property in the real data applications is probably due to non-normality.
But even in the simulations, the empirical ODP is not better than t2d. This can only mean that the presence of a large number of nuisance parameters degrades the performance of ODP.

Conclusion

The estimation of the nuisance parameters required to apply the ODP in practice makes the procedure described in [6] no longer optimal. We have shown in this paper that the combination of a conventional t-statistic with the standard error of the mean as described in [7] can outperform the empirical ODP. Further improvements can be made by combining the ODP test statistic with standard error information, but the gains are comparatively small.

The ODP procedure exploits similarities in the distribution for a collection of genes, for example similarity in variance. When variances between genes are dissimilar, there is little gain by the ODP compared to the standard t-statistic. One advantage of the ODP over the modified t-statistics is that the adaptation is done automatically, without calculating a model-based or heuristic fudge factor for the denominator.

The computational demand of calculating the ODP statistic is a serious practical disadvantage: each density term f(x) or g(x) requires computation across the whole dataset, so a single ODP statistic already involves substantial computations. Doing this for the whole collection of genes and for repeated permutations of the group labels is an order of magnitude more laborious than the computation required for the standard statistics.

Methods

Simulation scenarios

Our model for simulating microarray data is based on the model described in [12]. We assume that the expression values for all m genes are normally distributed (possibly after suitable transformation), and that their variances differ between genes as specified by that model. In detail, we proceed as follows for our simulations:

1. Initialize the design with m = 10,000 genes, proportion of nonDE genes π0 = 0.8, and two groups with n1 = n2 = 7.
2. For each gene i = 1, ..., m, draw a gene-specific variance from the variance distribution of the model in [12], which is governed by tuning parameters.
3. For each gene i = 1, ..., m, determine randomly whether it is nonDE (with probability π0) or DE.
(a) In case of nonDE, set μ1 = μ2 = 0.
(b) In case of DE, set μ1 = 0 and draw μ2 randomly from the effect distribution of the model in [12], where v0 is another tuning parameter.
i. In case of an asymmetric scenario, set the sign of μ2 to positive with probability 0.8, and to negative otherwise.
4. Simulate n1 and n2 values in the first and second group, respectively, following normal distributions with means μ1 and μ2 and the gene-specific variance.

The constants were set following [12]. For each scenario, we then generate 100 data sets, for a total of 10^6 genes. For each procedure, the true local fdr of the genes is estimated from the known DE status of each simulated gene, simply as the proportion of false positives in each histogram interval or grid cell. This means specifically that no permutation, smoothing, or estimation of π0 is required.

Real data sets

The permutation and smoothing approach used for estimating the fdr values for real data has been described in detail in [9] and [7]. The estimates

The BRCA data set [13] was collected from patients with hereditary breast cancer who had mutations either of the BRCA1 (n = 7) or the BRCA2 gene (n = 8). Expression was originally reported for 3,226 genes, but following [8], we removed 56 extremely variable genes and analysed only the remaining 3,170 genes. For all four procedures, we used

The Lymphoma data set [14] was collected from 240 patients with diffuse large B-cell lymphoma, n1 = 102 of whom survived the study period, and n2 = 138 of whom did not. We used all 7,399 genes reported in the original article. For all four procedures, we used

All expression values were logged prior to analysis.

Software

Methods t1d and t2d are implemented in the R package OCplus, which is freely available at the Bioconductor website [18]. R code implementing S1d and S2d is available from the authors on request. EDGE, the official implementation of EODP described in [19], is available at [20].
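For readers who want to experiment with the simulation design described under Simulation scenarios, the following is a rough sketch of one simulated data set. It rests on several assumptions that are not confirmed by the text above: the variance is drawn as a scaled inverse chi-square variable and the DE mean as a normal variable with variance v0 times the gene-specific variance, in the spirit of the hierarchical model in [12], and the default values of `d0`, `s0` and `v0` are arbitrary placeholders rather than the constants used in the paper. The function name is likewise hypothetical.

```python
import numpy as np


def simulate_two_sample(m=10_000, pi0=0.8, n1=7, n2=7,
                        d0=4.0, s0=1.0, v0=2.0, p_up=0.5, seed=0):
    """Simulate one two-group data set; returns (x, is_de, mu2).

    d0, s0, v0 are placeholder tuning constants (assumed, not the paper's).
    p_up is the probability that a DE gene is up-regulated:
    0.5 for the symmetric and 0.8 for the asymmetric scenario.
    """
    rng = np.random.default_rng(seed)

    # Step 2 (assumed form): sigma^2 = d0 * s0^2 / chi-square(d0).
    sigma2 = d0 * s0 ** 2 / rng.chisquare(d0, size=m)

    # Step 3: each gene is nonDE with probability pi0, DE otherwise.
    is_de = rng.random(m) >= pi0

    # Step 3(b) (assumed form): |mu2| from N(0, v0 * sigma^2),
    # with the sign set to positive with probability p_up.
    magnitude = np.abs(rng.normal(0.0, np.sqrt(v0 * sigma2)))
    sign = np.where(rng.random(m) < p_up, 1.0, -1.0)
    mu2 = np.where(is_de, sign * magnitude, 0.0)

    # Step 4: normal expression values; the first group has mean 0.
    sd = np.sqrt(sigma2)[:, None]
    x1 = rng.normal(size=(m, n1)) * sd
    x2 = mu2[:, None] + rng.normal(size=(m, n2)) * sd
    return np.hstack([x1, x2]), is_de, mu2
```

Under these assumptions, a larger `d0` concentrates the variances around `s0**2` (closer to the 'similar' scenario) and a small `d0` spreads them out (closer to 'variable'). Because the DE status is returned alongside the data, the true local fdr in any interval or grid cell can be computed directly as the proportion of nonDE genes in it.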
Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

EP wrote computer programs, ran simulations and drafted the manuscript. AP wrote computer programs, ran data analysis and co-wrote the manuscript. SC co-wrote the manuscript. YP conceived the study and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partially supported by a research grant from the Swedish Cancer Foundation.
Figure 1.
Figure 2.
Figure 3.
Figure 4.


