Abstract
Background
Analysis of microarray and other highthroughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear.
Results
We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a metatheory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods.
Conclusion
We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.
Background
The analysis of "enrichment" of gene sets is a natural extension of the study of differential expression of individual genes. Focusing on sets of genes rather than on individual genes has several benefits. From a statistical point of view the analysis of groups instead of individual genes is advantageous as this typically increases power and reduces the dimensionality of the underlying statistical problem. From the biological perspective gene set enrichment analysis allows one to ask (and answer!) questions that are of direct interest to the understanding of the functional mechanism in a cell: is a certain pathway activated in a given tissue under some treatment X? Is the pathway more active than other pathways? These questions directly relate to various null models for gene sets.
Therefore, for good reason a substantial number of statistical procedures to assess gene set enrichment have been introduced in the last few years – see Table 1 and references [136] for an overview. Given this extensive literature, biologists are now confronted with the difficult choice of a gene set method that is best suited to analyze their data at hand. So far, a "taxonomy" of enrichment analysis approaches and a systematic comparison is lacking.
Table 1. Overview over statistical algorithms for the analysis of gene set enrichment.
The objective of the present work is threefold. First, we present a metatheory of gene set enrichment analysis in the form of a general modular framework (Fig. 1). This modular scheme encompasses most if not all published approaches for gene set analysis, and hence allows to study the interplay among the various methods. Second, based on this framework we conduct a principled comparison investigating 261 variants of gene set enrichment procedures. From extensive computer simulations and the analysis of two experimental data sets, we offer specific recommendations for conducting an effective gene set analysis. Third, we present an extensive survey of existing statistical methods for detecting enriched gene sets. This is found in the Appendix and forms the basis of the proposed modular framework.
Figure 1. Schematic overview of the modular structure underlying procedures for gene set enrichment analysis.
Methods
In this section we describe the key result from our studies: a modular framework that provides a taxonomy for gene set analysis. In the Appendix we have compiled a very comprehensive overview of the respective approaches that underlie our proposed metatheory of gene set enrichment.
Modular structure underlying procedures for gene set enrichment analysis
Our suggestion for a common modular framework for gene set enrichment analysis methods is depicted in Fig. 1. The scheme consists of five distinct modules: the calculation of a genelevel statistic, an optional transformation of these statistics, the choice of a null hypothesis, the computation of a gene set statistic and the significance assessment. The first two modules correspond to an analysis of differential expression on the individual gene level. The remaining three modules process these results further to obtain an assessment of gene set enrichment. A separate track in this diagram is provided for a module for conducting global tests.
We will now describe each module in more detail. Note that the scheme presented in Fig. 1 is both descriptive as well as generative: by selecting a method in each box one arrives at some specific approach for detecting gene set enrichment. To our knowledge, all enrichment approaches published so far fit into this scheme. In addition, there are also many combinations of modules that will lead to a "novel" method (i.e. one that has not yet been explicitly described in the literature). However, note that some modules are tightly interconnected, e.g., the choice of null hypothesis and of significance assessment. These dependencies need to be kept in mind when setting up a gene set analysis.
Description of individual modules
Genelevel statistics
In all the univariate approaches (cf. Appendix), the first step in conducting a gene set enrichment analysis is to assess the amount of differential expression of the individual genes. Choosing a suitable test statistic for this purpose has been discussed extensively in the literature, for an overview we refer, e.g., to [37] and references therein.
Typically, the genelevel statistic will be selected from the following list:
• fold change,
• signaltonoise ratio,
• (regularized) tstatistics,
• (shrinkage) correlation coefficient,
• coefficient of ANOVA or linear/logistic regression, and
• (penalized) loglikelihood ratio.
Due to the small sample size found in genomic data in most instances the regularized versions of these test statistics (via penalization of likelihood, suitable Bayes priors, Stein shrinkage, etc.) are preferable.
Transformation of genelevel statistics
The second step of univariate procedures for detecting gene set enrichment consists of the transformation of the local gene statistic. Suitable options suggested in the literature include:
• no transformation,
• absolute values,
• squared values,
• binary transformation,
• ranks,
• pvalues or Bayesian posterior probabilities, and
• (local) false discovery rates.
There are many instances when a transformation is beneficial, for example in order to improve robustness, or to account for both up and downregulation. The binary transformation is an extreme case of ranking considering only two classes (e.g., "differentially expressed" versus "nondifferentially expressed"). Many of the adhoc procedures used by biologists to declare differential expression (such as based on fold change and some kind of arbitrary rules) fall into this category. Also note that binary transformation combined with testing gene set membership is equivalent to the contingency table approach (cf. Appendix).
The influence of the transformation of the genelevel statistics on the results of gene set enrichment procedures is shown by Newton et al. [20]. These authors compare the power properties of averaging and selection test statistics, i.e. methods based on the mean of local statistics like the tstatistic and methods where the mean over the binary indicator of differential expression is used. Newton et al. [20] find that selection methods work better when the gene set enrichment is characterized through a small number of highly differentially expressed genes, whereas averaging is able to detect enriched sets consisting of a large number of genes with subtle increase or decrease of the expression levels.
Null hypotheses
A factor that strongly influences the general setup of an enrichment procedure is the choice of the null hypothesis. The two most common choices are termed Q_{1 }and Q_{2 }[14].
Q_{1 }is called the "competitive null hypothesis" and corresponds to the case where the association between the genes in the set and the phenotype is compared with the association of the remainder of the genes and the phenotype. Note that this implies a model where the genes are the sampling units and the association between the samples and the phenotypes is fixed. Thus, testing this hypothesis is conditional on the genelevel scores. The "complete null hypothesis" [16] that all the gene sets under consideration have the same differential expression can be seen as a special case of Q_{1}.
In contrast, Q_{2 }is the "selfcontained null hypothesis" that focuses on the given gene set without considering genes outside of this group. It compares the association of the gene set and the phenotype with that of random phenotypes. Under Q_{2 }the sampling units are the phenotypes while the gene set membership is fixed. A special case of the selfcontained null hypothesis is the "global hypothesis" that there are no differentially expressed genes at all.
A third possible null model, here called Q_{3}, is the "nested null hypothesis", where differential expression of the genes in the gene set is compared to the differential expression of all genes under consideration (both inside and outside the gene set). This model is motivated by the FDRbased approach described in Efron [21], where FDR and conditional FDR values are compared.
For understanding gene set enrichment procedures it is absolutely essential to understand the implications of the various null models [5]. For instance, if there are many differentially expressed genes in all gene sets, testing under the selfcontained model Q_{2 }is quite likely to indicate enrichment, even though the particular gene set under investigation might not show any special "enrichment" compared to other gene sets.
Furthermore, as Goeman and Bühlmann [5] point out, the aim of a gene expression experiment is to assess whether the transcript levels of the genes differ between the different phenotypes. Thus, it is based on a model where the phenotypes are the sampling units. Consequently, one would repeat the experiment with different samples and not with different genes to verify the results. In contrast, the competitive hypothesis Q_{1 }assumes that the genes are the sampling units, turning the experimental design upside down. Therefore, some authors [38,33] consider Q_{1 }only to be "of limited utility" for statistical analysis.
Gene set statistics
A further step in an enrichment analysis is the computation of a gene set statistic. Examples include:
• the sum or the mean of the (transformed) single gene statistics,
• the median of the (transformed) single gene statistics,
• the (modified) KolmogorovSmirnov statistic,
• the maxmean statistic,
• the Wilcoxon rank sum test statistic,
• the sign test statistic, and
• the conditional local FDR.
The question of which gene set statistic is optimal in a given setting is subject to ongoing discussion. For example, Efron and Tibshirani [11] show that their maxmean statistic is more powerful than a KolmogorovSmirnov test or an average of local statistics. Jiang and Gentleman [16] call attention to the problem of robustness against outliers and advocate summaries such as the median or the sign test statistic.
Significance assessment
The last step of both the univariate and the global and multivariate procedures (cf. Appendix) for enrichment analysis is the assessment of significance of the observed gene set statistics. The calculation of the pvalue can be done in three different ways:
1. gene sampling: A large number of random gene sets of the same size as the set under investigation is drawn from all the genes and the global statistic is recomputed for every random set. The pvalue is calculated as the fraction of resampled gene set statistics that exceed (or fall below) the observed value.
2. sample label permutation: The phenotypes of the subjects are permuted a large number of times and the local and global statistics are recomputed. The pvalue is the fraction of permutation gene set statistics that exceed (or fall below) the observed value.
3. restandardization: Both gene sampling and sample label permutation are conducted. Then each permutation statistic is standardized with the mean and standard deviation of all permutation statistics. To obtain the restandardized value, the standardized permutation statistic is multiplied with the standard deviation of the gene sampling statistics and the mean of the gene sampling statistics is added. The pvalue is the fraction of restandardized gene set statistics that exceed (or fall below) the observed value.
Note that the choice of the null hypothesis and the pursued sampling strategy are interconnected: Q_{1 }implies gene sampling, and Q_{2 }sample label permutation. Furthermore, gene resampling implicitly assumes independent genes in the group, a prerequisite that is unlikely to hold, e.g., for genes within a pathway. The restandardization strategy put forward in [11] combines the two very different sampling schemes into a single procedure. Therefore, this approach mixes the two null hypotheses Q_{1 }and Q_{2 }which may lead to difficulties in the interpretation of the resulting pvalues. Interestingly, note that the GSEA algorithm [7,8] implicitly also considers both permutation approaches. While the GSEA null distribution itself is found by permuting sample labels, the KolmorogovSmirnovtype test statistic equally carries information on the whole set of genes (at least if the set S is small compared to the number of all genes). Thus, GSEA implements something in effect related to restandardization, albeit not in the extent of the method by [11]. Finally, a computational consideration is whether the chosen null distribution may be approximated by a parametric distribution. This is fundamentally tied to the choices made earlier, and is possible only in very rare circumstances, e.g., in the method of Kim and Volsky [18].
Global and multivariate approaches
Enrichment approaches that do not follow the structure introduced above are the global and multivariate procedures. These directly define one model for the whole gene set. Nevertheless, the globaltest of [22] and the GlobalAncova of [23] also allow to obtain a score for every gene as well, since the gene set models can be written as a combination of the models for each gene. For the T^{2}based tests this is only possible when the covariance matrix is diagonal.
In a simulation study, [23] showed that the globaltest and their GlobalAncova perform equally well for independent genes but that the GlobalAncova outperforms the approach of [22] when there are strong genegene dependencies. Kong et al. [25] show that their combination of PCA and Hotelling's T^{2}test detects more gene sets than the globaltest. They explain this finding with the multivariate nature of their test. However, the assumption is only made on the basis of the evaluation of one data set and further analyzes are needed to show the superiority of this procedure.
Results and discussion
Employing the developed taxonomy and modular framework we now compare the various gene set enrichment approaches by a principled simulation study and exemplary data analysis.
Simulation study
General setup
All necessary programming and calculations were implemented using the free software environment R [39]. We simulated data in a fashion to represent stylized yet typical constellations of gene set structures found in real gene expression data. We considered gene sets with different levels of differential expression (Δμ = 0, 0.75, 1, 1) and with varying levels of intragroup correlation (ρ = 0, 0.6, 0.6). Furthermore, we also constructed mixed sets, i.e. gene sets that included both differentially expressed and nondifferentially expressed genes, and also sets containing both up and downregulated genes.
Each generated data set consisted of p = 600 genes with n = 20 samples (10 in each of two "treatment" groups). The data were generated from a 600dimensional multivariate normal distribution, with variances set to 1 and means and correlations specified as follows:
• background (noninformative genes): 520 genes were simulated with μ = 0 and ρ = 0.
• set 1 (differential expression + correlation): 20 genes with Δμ = 0.75 and pairwise correlation ρ = 0.6 between the genes in the set.
• set 2 (differential expression + no correlation): same as set 1 but with zero correlation among genes.
• set 3 (no differential expression, no correlation): 20 genes selected randomly from the 520 background genes.
• set 4 (differential expression + correlation, mixed with background): 10 genes with Δμ = 0.75 and ρ = 0.6 from set 1, the other 10 randomly selected from the background set.
• set 5 (differential expression + no correlation, mixed with background): same as set 4, but with zero correlation among genes (i.e. 10 genes from set 2 and 10 genes from the background set).
• set 6: (differential expression, correlation, both up and down regulation): 10 genes with Δμ = 1 and 10 genes with Δμ = 1, the pairwise correlation between the genes with the same direction of differential expression is ρ = 0.6, between the up and downregulated genes it is ρ = 0.6.
• set 7 (differential expression, no correlation, both up and down regulation): same as set 6, but without any correlation among genes.
• set 8 (differential expression, correlation, both up and down regulation, mixed with background): 10 genes from set 6 (5 upregulated and 5 downregulated), mixed with 10 genes from the background set.
• set 9 (differential expression, no correlation, mixed with background, both up and down regulation): same as set 8, but without any correlation among genes (i.e. 10 genes from set 7 and 10 genes from background set).
Thus, the 600 simulated genes are constituted by the background set, set 1, set 2, set 6 and set 7. Any well working procedure for gene set enrichment analysis should be able to detect at least the pure sets 1, 2, 6 and 7, but ideally also the sets 4, 5, 8, 9 where only half of the genes are differentially expressed. Gene set 3 serves as a negative control.
Using this data we conducted gene set enrichment analysis by considering combinations of the following module methods:
• genelevel statistics: twosample tstatistic, moderated tstatistic, Pearson correlation coefficient.
• transformation of genelevel statistics: none, squared values, ranks, binary transformations, local FDR.
• gene set statistics: mean, median, maxmean statistic, ES (weighted by absolute value of gene level statistics), conditional FDR (linear logistic regression of probability of gene set membership depending on the value of the genelevel statistic), Wilcoxon rank sum test.
• significance assessment: resampling, permutation, restandardization (1000 iterations each); parametric (only in combination with mean and Wilcoxon rank sum test).
The gene set statistic ES was not combined with a binary transformation since the latter does not allow a sensible ranking of the genes. Hence, in total 3 × 5 × 6 × 3 – 9 = 261 variants of gene set analysis were considered. For each combination of modules we simulated 100 data sets, and determined the false positive rate for set 3 and the true positive rates for the remaining sets.
Genelevel statistics and their transformations
First, we studied the effect of the choice of a specific gene level statistic and the corresponding transformation. Specifically, we varied the gene level statistic and the transformation and fixed the gene set statistic to the mean and computed pvalues by resampling. The results are summarized in Table 2.
Table 2. The effect of choice of gene level statistic and of a corresponding transformation on the detection rate.
In general, the choice of the genelevel statistic does not seem to have a great impact on the results of the enrichment analysis. In our simulation setting we obtained roughly the same results, regardless whether the conventional or the moderated tstatistic or the correlation was used. We note that the small differences might be explained with the rather large number of samples per group (10) which is already sufficient to properly estimate the variance of each gene without using a regularization approach. For smaller sample sizes, the merits of regularization or borrowing information across genes is likely to be more pronounced. In contrast, we find that the choice of a transformation has quite a substantial effect on the detection rates. It is striking that all three investigated gene level statistics fail to find sets containing both up and downregulated genes. In this situation applying an appropriate transformation instead of using the original scores is indispensable. In this regard, a very useful transformation is the calculation of squared values. Although in Table 2 it can be seen that this results in slightly lower true positive rates for sets 1, 2, 4 and 5 compared to untransformed genelevel scores, it enables the detection of the sets 6, 7, 8, and 9, which are not recognized using the untransformed gene level statistics. Interestingly, if the squared transformation is combined with the rank transformation, this further improves the true positive rates and offers the best overall performance of all transformations under study (see Table 2). Furthermore, the (rank) quadratic transformations outperform the binary and the FDR transformations, which are two approaches that also detect sets containing both up and downregulated genes.
Gene set statistics
Next, we considered in our simulations the impact of the choice of the gene set statistic. To facilitate comparison, we used as underlying gene level statistic the moderated tstatistic and employed a quadratic transformation. For significance assessment we applied the resampling. Table 3 shows the results.
Table 3. Impact of choice of gene set statistics on detecting gene set enrichment.
Overall, the gene set statistics all behave as expected for the nine sets. They assign high pvalues to the uninteresting set 3 and relatively low pvalues to the other sets. All approaches have some difficulty to reliably detect gene sets 4 and 5, where there is relatively weak differential expression and half of the genes belong to the uninformative background.
Performancewise, the test statistics may be grouped as follows. The mean and the maxmean statistic produce rather similar and overall very good results. The median and the Wilcoxon rank sum test, which are coarsened and more robust location estimators for the change in transcript level, perform markedly better than the mean or maxmean for the correlated sets. However, if phenotype permutation instead of resampling is used, this effect vanishes and the performance of the median and the Wilcoxon test deteriorates (data not shown, cf. [40] for more information). Thus, the usage of the median and the Wilcoxon test is primarily advantageous if the competitive null hypothesis is tested, or if there are many outliers in the data. The conditional FDR procedure compares well to the mean and maxmean procedures in the simulations. However, we find that the results of this approach vary strongly with the choice of the genelevel statistic, transformation and permutation approach. Since it is not markedly better than the other approaches for any gene set and it is computationally expensive, it might not be the best overall choice for an enrichment analysis. Perhaps surprising is the comparatively weak performance of the popular ES score, which yields results that are worse than that of the sum or the mean.
Significance assessment
In a further simulation experiment, we compared four approaches for obtaining significance values (see Table 4). In this study we used moderated t as the gene level statistic, employed a quadratic transformation, and used the mean as the gene set statistic.
Table 4. Comparison of methods for assigning significance.
First, the parametric approximation of the null distribution yields the best results. However, the assumption of independence of the individual genes is clearly violated for sets 1, 4, 6 and 8. Thus, the corresponding pvalues are overoptimistic.
The resampling procedure that investigates the competitive null hypothesis Q_{1 }also performs quite well. Sets 3 and 7 are correctly categorized for almost all the 100 repetitions of the simulation. For sets 2 and 9 this is achieved in more than 90% of the cases. The correlated sets 1, 6 and 8 achieve lower true positive rates. Resampling performs worst for the difficult sets 4 and 5, where only half of the genes are differentially expressed.
Testing the selfcontained null hypothesis Q_{2 }using sample label permutation appears to be slightly easier than testing Q_{1 }(see third column of Table 4), especially for the uncorrelated gene sets. Note that permutation preserves the correlation structure among genes. Since the effective gene set size is smaller in a correlated set compared to a set where the genes are independent, it is more difficult to detect enrichment in these situations. For this reason it is surprising that in most of the cases the permutation works still better for the correlated sets than resampling.
The restandardization procedure performs very similar to resampling. This was the case in almost all of the combinations of enrichment methods that were investigated in this work, so the benefits of restandardization are likely to be small in general. However, we note that [11] showed in their paper that there might be some special situations where restandardization is advantageous.
Comparison with global and multivariate approaches
In order to compare the modular univariate approaches with the global and multivariate procedures, we additionally analyzed the simulated data using the globaltest and Hotelling's T^{2}test with a shrinkage covariance matrix [41].
The results for the globaltest under different significance assessment approaches are summarized in Table 5. Clearly, the performance of this procedure is not better than that of the less sophisticated univariate methods, especially the sum or mean of the squared genelevel statistics. Indeed, the results are very close to these two approaches. Hence, using the globaltest does not improve the results, but it is computationally a little bit faster, especially when the parametric approximation of the null distribution is used. Since it leads to comparable results, it might be a useful alternative when a large number of sets is to be investigated.
Table 5. Performance of the globaltest.
The corresponding results for the Hotelling T^{2}test can be found in Table 6. Using this approach the uncorrelated sets are found with the same reliability as with univariate approaches. However, in stark contrast, the sets with correlation (sets 1, 4, 6, and 8) are hardly detected. The reason is clear from a statistical perspective: the Hotelling test statistic penalizes correlation (i.e. the test statistic is largest for zero correlation). Specifically, it is much more likely that genes in a set are all simultaneously differentially expressed if there is correlation. While the Hotelling statistic corrects for this bias accordingly, this implicit penalization is clearly not desired from a biological point of view. A clear advantage of the Hotelling T^{2}test is its improved performance with sample label permutation as opposed to gene sampling. Since the approach does by definition test the selfcontained null hypothesis, this is a positive result. Nevertheless, the overall very poor performance of this multivariate approach raises the question whether taking into account the dependence structure between the genes is at all useful in the context of enrichment analysis.
Table 6. Performance of the Hotelling approach using a shrinkage correlation matrix.
Analysis of p53 cancer data
In order to verify our conclusions from the simulations we reanalyzed the p53 cancer data from Subramanian et al. [8]. The protein p53 is a tumor suppressor preventing the development of cancer cells. It regulates genes involved in the cell cycle and the induction of apoptosis after DNA damage. Gene set enrichment analysis allows to study which pathways are involved in these cellular mechanisms.
The p53 data is used as a benchmark data set in a number of papers [e.g., [19,11]] and contains expression profiles of 50 cancer cell lines that can be found in the data base of the International Agency for Research in Cancer (IARC). The gene expressions were measured with the Affymetrix HGU95Av2 chip which contains 12, 625 probe sets. For each cell line, the mutational status of the p53 protein was reported. 17 cell lines were found to have normal p53 status while the remaining 33 samples showed a mutation of p53. The analyzes in [8,19] and [11] recovered several gene sets that are differentially expressed between the mutation and wildtype cell lines. Five functional groups were mentioned in all three publications:
• the p53 pathway,
• the downstream targets of p53,
• genes induced by radiation,
• genes induced by hypoxia, and
• the heatshock protein signaling pathway.
Dinu and coworkers presented 31 additional gene sets found with the SAMGS method [19].
In our preprocessing of the data we filtered out genes that showed maximum signal intensities of less than 5 on log_{2 }scale and a variation of less than 2fold over all samples to reduce the noise in the data. The resulting data set contained 8,768 genes. From the biologically chosen gene sets of [8] we focused on sets with sizes between 15 and 500 genes. This left 290 gene sets of varying size, each containing between 15 and 321 genes.
In the following we report only our main findings. A much extended analysis of the p53 data set including detailed tables of significant gene sets and pathways can be found in [40].
Genelevel scores and transformations
As in the simulation study, in order to compare the various gene level statistics and their transformations we employed as gene set statistic the mean and use resampling for computing pvalues.
Concerning the choice of gene statistics, there was little difference in the overall ranking of the gene sets. The 25 top ranked gene sets are nearly the same for all three statistics in combination with a specific transformation. The significant findings are in broad accordance with the results published previously [8,19,11]. Specifically, we also recover the additional sets reported by Dinu et al. [19] if a squared gene transformation is used (recall that their SAMGS approach is based on a squared regularized tstatistic). For this data set we noted a distinct difference when positive and negative scores were analyzed separately, instead of combined. In particular, taking squared values of the individual gene scores appears to give more weight to gene sets that are not high scored in a separate analysis for up and downregulated genes. This is probably due to the fact that summing up the squared values leads to the detection of gene sets with many subtle expression changes in both directions instead of sets where some constituents are either up or downregulated.
When ranks or the binary transformation are used, fewer of the Dinu and Subramanian gene sets are detected. Furthermore, the number of significant gene sets is decreased compared to using the original or squared scores. Thus, for this data the results based on the rank and binary transformations appear to suffer substantially due to the loss of information. In contrast, the transformation to local FDR values (empirical posterior probabilities) performs much better than in the simulation study.
Gene set statistics and significance assessment
Next, we investigated the impact of the gene set statistic and the choice of the significance procedure. In this analysis we used the squared moderated tstatistic as the gene level statistic.
The results are summarized in Fig. 2. For this data, both phenotype permutation as well as resampling lead to the same number of significant gene sets regardless of the pvalue cutoff, the only exception being the Wilcoxon statistic. The Hotelling statistic and the conditional FDR approach consistently found the largest number of significant gene sets for any cutoff, followed by the global test and the mean and maxmean. The ES, median, and Wilcoxon statistics typically resulted in the least number of gene sets. Note, however, that we do not know the ground truth for the p53 data. Therefore, one needs to keep in mind that the larger power of the mean or the global test compared, e.g., to the ES score might be due to a higher specificity of the latter. This is in concordance with our simulation study (cf. Table 3) where ES and the median were the only two statistics that never declared the negative control (set 3) significant. The outcome of the Hotelling T^{2}test is in very good accordance with the findings of [19]. Thus, in contrast to the simulated data the Hotelling approach gives meaningful results for real data. We investigated this discrepancy more closely by inspecting the correlations among the genes in a gene set. As can be seen from Fig. 3 the correlations within a gene set are comparatively weak. This explains the good result of the Hotelling approach, which for vanishing correlation essentially reduces to the SAMGS [19].
Figure 2. Analysis of p53 data set. Left: Number of significant gene sets in dependence of pvalue cutoff and choice of gene set statistic. On the gene level, the squared moderated tstatistic was employed. Right: Bar plot for pvalue cutoff 0.01.
Figure 3. Distribution of correlation across the 290 gene sets investigated for the p53 data. Top: histogram of averaged pairwise correlations. Bottom: histogram of averaged absolute values.
Analysis of Hedenfalk breast cancer data
To complement the study of gene expression data we analyzed the breast cancer gene expression data from Hedenfalk et al. [42]. This wellstudied data set includes 7 cancer patients carrying the BRCA1 genetic mutation and 8 other patients carrying the BRCA2 mutation. Microarray measurements were taken on 3,226 genes.
A peculiar feature of this data set is that on the gene level there is virtually no signal to distinguish the two groups (BRCA1 versus BRCA2) – see for example [43].
In our preprocessing for gene set analysis we considered a subset of 2,200 genes (all that could be uniquely mapped from cloneIDs to a unigene symbol) and a catalog of 89 gene sets from [8] relevant for these genes.
Gene set statistics and significance assessment
The p53 and the Hedenfalk cancer data behave strikingly different, as can be seen in Fig. 4. For the Hedenfalk data employing resampling resulted in very few gene sets detected as significant, regardless of gene set statistic. In contrast, when phenotype permutation was applied a substantial number of significant gene sets were found. As for the p53 data set, the gene set statistics can be divided into two groups. The Hotelling, mean, maxmean and global statistics resulted in larger numbers of significant set, whereas median, ES, Wilcoxon and conditional FDR detected fewer gene sets. Thus, for this data the conditional FDR approach is as conservative as in the simulations.
Figure 4. Analysis of Hedenfalk data set. Left: Number of significant gene sets in dependence of pvalue cutoff and choice of gene set statistic. On the gene level, the squared moderated tstatistic was employed. Right: Bar plot for pvalue cutoff 0.01.
The difference between results for permutation and resampling may be explained as follows. If most of the active genes show only small expression changes, then the interesting sets do not differ markedly from random sets. Consequently, they cannot be found by the resampling approach. But still, in the Hedenfalk data the subtle changes of the interesting genes between the phenotypes are large enough when they are accumulated over the active gene set to be detected with sample label permutation.
Enriched gene sets
Finally, it is also instructive to inspect the list of detected gene sets. For gene set statistic mean with genelevel statistic moderated t, transformation squared values and sample label permutation all significant (pvalue cutoff 0.01) gene sets are listed in Table 7. Note the enrichment of many cancerrelated sets as a result of the gene set analysis.
It is noteworthy here that using the ES as gene set statistic does not yield the set Breast_Cancer_Estrogen_Signalling among the 20 top ranked gene sets, whereas in our Table 7 it is ranked number 1. We remark that this is a biologically important finding, in concordance with Hedenfalk et al. [42] who state that tumors with BRCA1 mutations are generally negative for estrogen receptors, whereas BRCA2 mutations are positive, implying that BRCA1 and BRCA2 genes induce the formation of breast tumors through separate pathways.
Table 7. Top scoring gene sets resulting from the analysis of the Hedenfalk data.
Overall, it appears that the Hedenfalk data is an exemplary case where merits of gene set enrichment over analyzing individual genes can be most prominently seen.
Conclusion
In this work we have investigated statistical procedures for detecting enrichment of gene sets. By conducting an extensive survey of corresponding approaches we found that all procedures in current use may be fit into a simple modular framework as depicted in Fig. 1.
The identification of this modular structure greatly facilitates the systematic comparison of the diverse methods. Based on simulations and data analysis we arrive at a number of guidelines for conducting a gene set enrichment analysis. We believe that these will contribute to the establishment of best practices for gene set enrichment analysis.
Perhaps the most surprising result from our study is that the simple univariate procedures appear to work best under a large variety of conditions. Specifically, using the mean of the squared (regularized) tstatistics and assessing the significance with one of the two permutation approaches reliably detects gene sets of diverse types of expression and correlation structures. This procedure was proposed, e.g., in [15,14,19,17]. Intriguingly, this simple approach typically outperforms (by a fairly large margin) the in biological circles very popular GSEA method [8]. Thus, we strongly disagree with the recommendation put forward in [34] which was primarily given on grounds of software availability.
With regard to the choice of the best approach for each module in Fig. 1, our conclusions can be summarized as follows:
• genelevel score: The choice of the genelevel score is not crucial. All different kinds of scores lead to very similar results. However, it is recommended to use a statistic that is standardized by the variation of the gene expression, as this makes the measure of differential expression comparable across genes. Scores with this characteristic are tstatistics, correlation coefficients or standardized regression coefficients. If the sample size is small, using one of the regularized tstatistics is the best choice since they are designed to stabilize small variances.
• transformation: The choice of transformation has a substantial impact on the overall results. In our study the (rank) squared transformation was most accurate. The application of absolute values instead of squared values gives comparable results. Note that using the squared values essentially corresponds to using a modified version of Hotelling's T^{2}test with a diagonal covariance matrix. Binary transformations often lead to satisfying results but also imply a loss of information. If the focus is put on gene sets that are concertedly regulated in one direction the untransformed scores should be used.
• gene set statistics: Overall, the simple approaches such as the mean or the median work very well. Using the median instead of the mean was proposed by [16] as an approach that is less sensitive with regard to outliers. However, using the median may also lead to a smaller number of significant findings. Similar arguments apply to the rankbased Wilcoxon test. The GSEA, or more precisely the enrichment score, is not as reliable as the other gene set test statistics. Furthermore, the construction of the GSEA is rather complicated since it mixes the competitive and the selfcontained null hypotheses.
• significance assessment: The choice of the permutation approach for significance assessment is mainly determined by the null hypothesis that is tested. Hence, it has to be decided carefully which procedure is used. If a selfcontained test is conducted, the phenotypes have to be permuted. In this way, the null distribution of the gene set statistic under the assumption of no difference between the treatment groups is obtained. If the gene set is to be compared to other sets, gene sampling is the correct approach. Then the null model is based on the assumption that the gene set membership is random. Depending on the data structure, both approaches can yield quite different results.
• multivariate procedures: Using a multivariate procedure such as the globaltest or the Hotelling approach not necessarily yields better results than the best univariate approaches. However, for the globaltest a parametric approximation of the null distribution is available, which might be beneficial if a large number of gene sets are analyzed. The Hotellingtype approach is not appropriate for the detection of gene sets containing genes that are highly correlated. From a statistical perspective this is straightforward to understand, as it is more unexpected to find a group of independent genes that are coregulated than a group of correlated genes which behave similarly.
In conclusion, we find that despite the large variety of available gene set enrichment procedures there is a common core among all procedures. In order to reliably detect enriched gene sets it is in general sufficient to use a univariate approach, in combination with either resampling or sample label permutation, depending on the desired underlying null model.
In the present study, we have focused on methods for detecting enrichment of a single prespecified gene set. A topic not investigated here is the issue of multiple testing of gene sets. In this context, standard corrections for multiplicity need to be refined in order to account for dependencies among gene sets, which may be due to the underlying gene ontology graph structure or because genes may belong to several gene sets at once [44,28].
Authors' contributions
Both authors contributed equally to the development of the modular framework, the design of the study, and to writing the paper. MA conducted all simulations and data analysis.
Appendix
In this section we provide an overview of various currently employed statistical methodologies for detecting enrichment of gene sets. We distinguish between univariate and multivariate approaches. The univariate procedures first calculate a score for differential expression for each single gene. Subsequently, a gene set summary statistic is computed and its significance is assessed by different parametric and permutation approaches. In contrast to the univariate approaches, the global and multivariate procedures directly infer the enrichment of the gene set without referring back to the genelevel. The multivariate approaches are distinguished from the global procedures by explicitly taking the correlation matrix of the genes into account when calculating the test statistic
Univariate procedures
Overrepresentation analysis
One of the simplest methods to investigate enrichment of predefined gene sets is based on testing a 2 × 2 contingency table. Correspondingly, this approach has been employed by many authors, e.g. [1,4]. In this setting, each gene is classified according to both membership in a gene set S and whether it is differentially expressed (cf. Table 8). Subsequently, one investigates overrepresentation of genes belonging to S in the group of the differentially expressed genes. This is done by testing for the independence of the two criteria "differential expression" and "gene set membership". For instance, one can use the test statistic
Table 8. Contingency table for testing gene set enrichment.
which for sufficiently large N follows a χ^{2}distribution with one degree of freedom. Alternatively, one may compute Fisher's exact test [45] for the hypergeometric distribution [2,3]. It is also possible to determine the distribution of the test statistic nonparametrically using sample label permutation [5], which has the benefit of respecting the correlation between the genes. Vêncio and Shmulevich [6] introduce an approach to account for categorization uncertainty, employing the GoodmanKruskal [46] measure of statistical association between the two margins of the contingency table.
GSEA algorithm and related methods
One of the most popular methods for detecting enrichment is the "Gene Set Enrichment Analysis" (GSEA) introduced in [7] and improved by the same group only a short time later [8].
This method starts with a list L of N genes ranked by their correlation with the phenotype of interest, e.g., using the tscore or the signaltonoise ratio. In order to test the null hypothesis that the m genes in a set S are randomly spread in the list L a socalled enrichment score (ES) is computed. This equals the maximum deviation from zero of a running sum (running down the list of genes L) that increases every time a gene in the list L is in S and decreases every time a gene is not in S (j = 1, ..., N). Additionally, every gene set member is weighted by its absolute value of correlation with the phenotype. As a result, genes with very low correlations do not contribute to the enrichment score of the set. The running sums for the genes in the set S and the complementary set (including all the genes from the list which are not in S) are calculated separately as follows:
r_{j }is a measure for the correlation of the individual gene g_{j }with the phenotype, for example a tstatistic. The ES is defined as the maximum deviation of ES_{S }(l)  from zero (l = 1, ..., N). This test statistic is similar to a KolmogorovSmirnov statistic in that it compares the deviation between two cumulative distribution functions. But contrary to the KolmogorovSmirnov statistic, the ES is signed and uses weights.
The choice of the exponent p in the above formula determines the influence of the single genes on the score. For p = 0 an unweighted statistic is obtained. When p = 1 the genes in S are weighted by their absolute value of correlation with the phenotype according to the chosen correlation measure. Other values of p can also be used (see [8], supplementary material). The authors suggest p <1 if one wishes to penalize a lack of coherence in a set or p > 1 if correlation of a small subset of S with the phenotype is sufficient to call a set enriched. Additionally, the summands are normalized by the sum of all test statistics in S. The summands for the genes in are constant.
Other authors presented similar statistics, see for example [12] who use X_{j }= N  m and X_{j }= m as the contribution of the genes in S and to the running sum respectively. All the choices result in a score that assumes the value 0 when the running sum reaches the end of the list L.
Significance of the test statistics is assessed by permuting the class labels and recalculating the ES for every gene set in every permutation. This necessitates correction for multiplicity when a larger number of sets is tested.
An approach related to GSEA but more general is "Significance Analysis of Function and Expression" (SAFE) as described by [9]. This is a twostep procedure, where in the first step a local test statistic is computed for each gene (e.g. a tscore) and in the second the statistics of the genes in the gene set S are combined to give a global statistic. This is done in order to compare the distribution of the local statistics in the gene set with that of the local statistics of the rest of the genes. However, [9] leave it to the user to decide which global statistic is to be applied. They propose rankbased statistics as the Wilcoxon rank sum test or the KolmogorovSmirnov test since they do not require any assumptions about the joint distribution of the local statistics. Significance in the SAFE algorithm is evaluated by permuting the sample labels and recalculating the local and global statistics. For taking account of multiplicity when testing for several gene sets at a time [9] suggest to control the false discovery rate and related criteria.
Yet another modification of the GSEA method is introduced in [10]. These authors employ regression coefficients of the interesting outcome (in their case the age of the patient) to rank the genes. Instead of taking a KolmogorovSmirnov statistic as the gene set statistic, [10] use a van der Waerden statistic
where Φ^{1 }is the quantile function of the standard normal distribution N (0, 1) and r_{j }is the rank of gene g_{j }in the list of all genes. Under the null hypothesis the test statistic Y is approximately normally distributed with mean zero. [10] employ a bootstrap sampling approach to estimate the variance of the null distribution.
Averaging methods
Another natural approach of combining information of a group of genes pursued by many authors is to simply average over the test statistics of the individual genes. Specifically, let t_{j}, j = 1, ..., N, be the association measure of gene g_{j }with the phenotype. Furthermore define G_{kj }as the binary indicator of whether gene g_{j }belongs to gene set S_{k }(G_{kj }= 1) or not (G_{kj }= 0). Then the test statistic is given by
where is the number of genes in set S_{k}.
A crucial point noted by [14] is that one needs to distinguish between two types of null hypotheses, which in turn necessitates different ways of assigning significance. The two hypotheses are:
1.Q_{1}: "The genes in a gene set show the same pattern of association with the phenotype compared with the rest of the genes."
2. Q_{2}: "The gene set does not contain any genes whose expression levels are associated with the phenotype of interest."
These correspond to the "competitive" and "selfcontained" hypotheses discussed in [5]. Note that the first hypothesis compares the genes in the set with the remaining genes, thus implying that in a permutation test the genes are the sampling units. Q_{2 }tests the association between the genes in the set and the phenotype regardless of the genes that are not in the set. Therefore, here the phenotype labels are the sampling units. When testing Q_{2 }[14] employ a permutation approach whereas for testing Q_{1 }they apply a normal approximation to compute pvalues.
An averaging approach based on the moderated tstatistic is described in the documentation of [15]. Further discussion regarding the averaging of genelevel statistics can be found in [20,33] and [18]. The "category" software of [17] also implements a general form of the averaging approach.
The maxmean statistic and restandardization
[11] propose several improvements of the averaging approach and GSEA. First, they introduce a new test statistic called the "maxmean" statistic which leads to more effcient tests. It is defined as
where and are the averages of the positive and negative parts of the scores in the set S containing m genes. The separation of positive and negative scores facilitates the detection of gene sets containing both up and downregulated genes. Furthermore, the form of the test statistic prevents that sets with only very few unusually large zvalues are called significant, yet it allows to detect sets with both moderately large positive and negative zvalues.
Second, [11] show the necessity not only to incorporate sample label permutation as in the GSEA method but also gene sampling into the analysis. This idea is implemented in a procedure called "restandardization" which combines gene sampling and phenotype permutation.
Let μ^{† }and σ^{† }be the mean and standard deviation of the distribution obtained by gene label permutation and μ* and σ* the corresponding parameters of the sample label permutation distribution. For a general test statistic T, for example the ES used in GSEA, the restandardization can be written as
It corrects the phenotype permutation statistics for the estimated distribution of the individual genes in the data set. The corresponding pvalue can then be calculated in the usual way as the fraction of a large number of restandardized test statistics exceeding the observed test statistic.
When the gene set score is simply the average of the individual gene scores in the set, the restandardization formula can be simplified to
Here T_{ave }denotes the gene set statistic obtained by an averaging approach, μ_{s }and σ_{s }are the mean and standard deviation of the scores s_{j }of all N genes and μ* and σ* are the mean and standard deviation of the gene set test statistics obtained by sample label permutation. In this special case, it is not necessary to conduct gene sampling before doing the restandardization since the mean of the gene sampling distribution is the same as the overall mean of the single genes and the standard deviations only differ by the factor .
Extensions to GSEA and averaging methods
Some further variations to the GSEA and methods using the average of genelevel statistics are reported in Jiang and Gentleman [16]. They propose using the median or a sign test for robustly averaging over the scores in a gene set. [16] also suggest an alternative test statistic for the individual genes when there are other covariates such as gender or age which might influence the expression levels. In essence, this test statistic is based on fitting a linear model where the gene expression is the response and the phenotype and other covariates are the independent variables.
A similar approach was proposed by [13] who additionally incorporate a measure for coexpression in the gene sets, the mean of the pairwise correlation coefficients, and a measure for the distinctiveness of the sets based on a nearest neighbors classifier.
In their paper, [16] also propose a Bayes approach for the investigation of gene set enrichment. The procedure is somewhat similar to the two groups empirical Bayes model described in [21]. Another issue addressed by [16] is the potential overlap of gene sets. They also investigate dimension reduction (i.e. principle components analysis) to find gene sets with coregulated genes.
Parametric method
Another very simple approach to detecting gene set enrichment is PAGE (Parametric Analysis of Gene Set Enrichment) suggested by [18]. Essentially, this approach proceeds by averaging over fold changes or other genelevel statistics and comparing this test statistic with the standard normal distribution. Since the approach is fully parametric and because no permutation is required this procedure is very time saving and computationally efficient. Using the normal approximation is justified by [18] via asymptotic arguments. In a similar fashion, the "category" averaging approach [17] also employs (among other options) a normal approximation for significance assessment.
Randomset method
Newton et al. [20] aggregate some ideas from the averaging methods, the parametric models, and the 2 × 2 table methods. Instead of only counting the number of differentially expressed genes present in the gene set, as done when applying 2 × 2 table methods, they allow for more general test statistics similar to the ES introduced by [7]. However, they do not apply sample label permutation to assess significance but consider the genes to be the sampling units.
Starting with the measures of differential expression s_{j}, j = 1, ..., N for each gene, an unstandardized score for a gene set S consisting of m genes is calculated as the average of the gene scores of these genes
As the method aims to compare the enrichment in a gene set S with the enrichment of all other distinct randomly drawn gene sets of size m, a so called randomset model is introduced. That means that the gene set S is now considered as a random collection of m genes whose scores s_{j }are fixed. For most choices of the genelevel statistic, the exact distribution of becomes intractable. [20] propose to approximate it with the normal distribution with mean and variance as follows:
The score can then be standardized with these two quantities, resulting in the test statistic
Under the null hypothesis of no enrichment, Z is approximately normally distributed with mean zero and unit variance.
A false discovery ratebased method
Another approach to enrichment analysis is presented by [21]. This procedure is based on estimating the false discovery rate (FDR) of a gene conditional on belonging to a given gene set. In this analysis two subgroups of genes are considered – the genes in the gene set and the rest of the genes – and a mixture model for computing false discovery rates is fitted separately to these two subgroups. In addition, the false discovery rate is also computed without consideration of subgroups.
Efron [21] showed that the conditional local FDR and the unconditional local FDR are linked by the simple relation
where π_{S0}(z) is the probability for a null case of being in S given that its zvalue is equal to z. π_{S }(z) is defined analogously.
In this framework the null hypothesis of no enrichment corresponds to the statement that fdr_{S }(z) is equal to the overall fdr(z), implying that the genes in S do not behave differently compared to all the N genes together. If the set is enriched, then fdr_{S }(z) will be smaller than fdr(z) for very large (or small) zvalues indicating differential expression in the set. Note that this assumes that the densities for the nonexpressed genes are the same for the genes in S and those in (i.e. f_{S0}(z) = (z)). [21] describes a corresponding test based on logistic regression.
Global and multivariate procedures
Globaltest
Goeman et al. [22] approach the problem of enrichment analysis from the perspective of class prediction. They point out that if the gene set is associated with the phenotype, then its constituents will have different expression patterns in the two types of the response and thus will serve as reliable predictors for the clinical outcome. Testing for enrichment can therefore be done in the framework of generalized linear models. Let Y be the vector of the phenotypes of the n samples. For simplicity it is assumed that Y only takes the two values 0 and 1 for the two phenotypes although in this setting Y could also be continuous, for example the survival time of a patient. Furthermore, let X = (x_{ij}) be the n × m matrix of expression patterns of the genes, where m is the number of genes in the gene set under investigation and n is the number of samples. In a generalized linear model, the influence of the gene expression on the phenotype can be modeled as
where β_{j }is the regression coefficient of gene g_{j}, α is the intercept and h is the link function. This can for example be the identity or the logit function, resulting in a simple linear model or a logistic regression model respectively. If the set is not enriched, then the genes do not have a predictive potential. Hence, a null hypothesis for the enrichment problem is
However, the size of the gene set may be quite large, so that m is greater than n. In this situation, the hypothesis above cannot be tested in the usual way. The solution proposed by [22] is to assume that the regression coefficients all follow the same distribution with mean zero and a common but unknown variance τ^{2}. Then the test problem can equivalently be expressed as
Goeman et al. [22] develop a corresponding score test called "globaltest". It is particularly powerful for alternative hypotheses where all the β_{j }are not equal to zero but do not deviate too much from the null hypothesis either. Thus, the globaltest seems very useful with regard to the underlying biological question about whether the set of genes as a whole influences the phenotype, i.e. whether or not a large proportion of the genes in the set have a moderate influence on the response.
We note that the globaltest is implemented assuming a diagonal covariance matrix. This means that the genes are considered to be uncorrelated and the test statistic can therefore be expressed as the average of the genewise test statistics for the individual genes. Thus, the method is comparable to the averaging procedures that first calculate a score for each gene and then summarize these scores for each set.
GlobalAncova
The linear model can also be used for enrichment analysis when the roles of gene expression and phenotype are exchanged. This was shown by Mansmann and Meister [23]. The authors call their method GlobalAncova, referring to methods for the analysis of covariance. As the globaltest, the procedure of [23] does not explicitly model the correlation structure in the gene set when constructing the test statistic as a combination of genewise models. However, the correlation structure is still implicitly maintained because of the use of the sample permutation procedure. Furthermore, a refined followup model allows for the explicit inclusion of an estimated correlation matrix [24].
Hotelling's T^{2}test and related methods
Other procedures for gene set enrichment analysis which are also directly based on the gene set information can be derived from the wellknown multivariate extension of the ttest, the Hotelling's T^{2}test. The corresponding test statistic is
where and are the mean vectors of the gene expressions in the set for the two phenotypes, and n_{1 }and n_{2 }are the sample sizes in both groups. S is the pooled empirical covariance matrix of the genes in the set, and R is the corresponding correlation matrix. With t we denote the vector containing all local twosample equal variance tstatistics. Note that as a weighted sum of the genewise tstatistics Hotelling's T^{2 }explicitly incorporates the dependence structure among the genes – unlike most other test statistics for gene set enrichment.
Under the null hypothesis that the mean vectors in both groups are equal, follows an F distribution with m and n  m  1 degrees of freedom, where m is the gene set size. However, in the analysis of gene expression data it is rarely the case that the number of samples is greater than the number of genes in the set, so the above procedure needs to be modified appropriately.
Kong et al. [25] propose to use dimension reduction by means of singular value decomposition. In effect, their approach amounts to using the MoorePenrose pseudoinverse for inverting the singular empirical covariance matrix in the definition of Hotelling's T^{2}.
Another Hotellinglike statistic can be found in [19]. These authors substitute the usual tstatistic with the SAM statistic [47]
where s_{0 }is a regularizing constant, and assume a diagonal correlation matrix. This leads to the SAM Gene Set (SAMGS) statistic
for a gene set S. Significance is assessed by permuting sample labels.
A further possibility is to substitute the empirical covariance matrix in the Hotelling T^{2 }statistic with a more efficient estimator. This is important if the sample size is small [48].
Further procedures
Rahnenführer et al. [26] present procedures to investigate enrichment whilst taking into account known pathway structures and coregulations of gene sets. Adewale et al. [30] extend the approach of [19] to account for diverse phenotypes including survival and count data. Lewin and Grieve [28] developed special methods considering the connections between gene sets derived from the graph structure of the Gene Ontology [49]. Procedures to infer the enrichment of genegene and geneprotein networks have been presented by Nacu et al. [29].
Läuter et al. [31] describe a procedure for detecting gene set enrichment and the simultaneous selection of suitable gene sets. This utilizes an exact highdimensional multivariate spherical test [50]. Note that in this procedure the gene sets are not fixed a priori, but are chosen according to the correlation among genes. As gene set statistic a weighted average of univariate beta statistics is employed.
Edelman et al. [27] present an approach similar to the GSEA of [8]. A key feature of their method is that it is constructed to measure the enrichment of gene sets in individual samples. This aims at investigating the variation of the activity of functional groups of genes in the population so that the expression profile of relevant gene sets can be used for predicting the phenotype of an individual. As local test statistic the loglikelihood ratio is used.
Acknowledgements
We thank Hans Binder and Jürgen Läuter for many valuable discussions concerning multivariate testing and the analysis of gene sets. This paper is based on the diploma thesis of the first author [40].
References

Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE.
Genome Biol 2003, 4:R70. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a webbased platform for interpreting sets of interesting genes using gene ontology hierarchies.
BMC Bioinformatics 2004, 5:16. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 2005, 21:35873595. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues.
Bioinformatics 2007, 23:980987. PubMed Abstract  Publisher Full Text

Vêncio RZN, Shmulevich I: ProbCD: enrichment analysis accounting for categorization uncertainty.
BMC Bioinformatics 2007, 8:383. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1αresponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci USA 2005, 102:1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barry WT, Nobel AB, Wright FA: Significance analysis of functional categories in gene expression studies: a structured permutation approach.
Bioinformatics 2005, 21:19431949. PubMed Abstract  Publisher Full Text

Zahn JM, Sonu R, Vogel H, Crane E, MazanMamczarz K, Rabkin R, Davis RW, Becker KG, Owen AB, Kim SK: Transcriptional profiling of aging in human muscle reveals a common aging structure.

Efron B, Tibshirani R: On testing the significance of sets of genes.

Keller A, Backes C, Lenhof HP: Computation of significance scores of unweighted gene set enrichment analyses.
BMC Bioinformatics 2007, 8:290. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pavlidis P, Lewis DP, Noble WS: Exploring gene expression data with class scores.
Pac Symp Biocomput 2002, 474485. PubMed Abstract

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.
Proc Natl Acad Sci USA 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth GK: Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. Springer; 2005:397420.

Jiang Z, Gentleman R: Extensions to gene set enrichment.
Bioinformatics 2007, 23:306313. PubMed Abstract  Publisher Full Text

Gentleman R: Category: using categories to model genomic data. [http:/ / www.bioconductor.org/ packages/ 2.3/ bioc/ vignettes/ Category/ inst/ doc/ Category.pdf] webcite

Kim SY, Volsky DJ: PAGE: Parametric analysis of gene set enrichment.
BMC Bioinformatics 2005, 6:144. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dinu I, Potter JD, Mueller T, Liu Q, Adewale A, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAMGS.
BMC Bioinformatics 2007, 8:242. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Newton MA, Quintana FA, Boon JAD, Sengupta S, Ahlquist P: Randomset methods identify distinct aspects of the enrichment signal in geneset analysis.

Efron B: Simultaneous inference: when should hypothesis testing problems be combined?

Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome.
Bioinformatics 2004, 20:9399. PubMed Abstract  Publisher Full Text

Mansmann U, Meister R: Testing differential gene expression in functional groups.
Methods Inf Med 2005, 44(3):449453. PubMed Abstract  Publisher Full Text

Hummel M, Meister R, Mansmann U: GlobalANCOVA: exploration and assessment of gene group effects.
Bioinformatics 2008, 24:7885. PubMed Abstract  Publisher Full Text

Kong SW, Pu WT, Park PJ: A multivariate approach for integrating genomewide expression data and biological knowledge.
Bioinformatics 2006, 22:23732380. PubMed Abstract  Publisher Full Text

Rahnenführer J, Domingues FS, Maydt J, Lengauer T: Calculating the statistical significance of changes in pathway activity from gene expression data.
Stat Appl Genet Mol Biol 2004, 3:Article16. PubMed Abstract

Edelman E, Porrello A, Guinney J, Balakumaran B, Bild A, Febbo PG, Mukherjee S: Analysis of sample set enrichment scores: assaying the enrichment of sets of genes for individual samples in genomewide expression profiles.
Bioinformatics 2006, 22(14):e108e116. PubMed Abstract  Publisher Full Text

Lewin A, Grieve IC: Grouping gene ontology terms to improve the assessment of gene set enrichment in microarray data.
BMC Bioinformatics 2006, 7:426. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nacu S, CritchleyThorne R, Lee P, Holmes S: Gene expression network analysis and applications to immunology.
Bioinformatics 2007, 23(7):850858. PubMed Abstract  Publisher Full Text

Adewale AJ, Dinu I, Potter JD, Liu Q, Yasui Y: Pathway analysis of microarray data via regression.
J Comput Biol 2008, 15(3):269277. PubMed Abstract  Publisher Full Text

Läuter J, Horn F, Rosolowski M, Glimm E: Highdimensional data analysis: selection of variables, data compression, and graphics – application to gene expression.

Liu Q, Dinu I, Adewale A, Potter J, Yasui Y: Comparative evaluation of geneset analysis methods.
BMC Bioinformatics 2007, 8:431. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen JJ, Lee T, Delongchamp RR, Chen T, Tsai CA: Significance analysis of groups of genes in expression profiling studies.
Bioinformatics 2007, 23:21042112. PubMed Abstract  Publisher Full Text

Nam D, Kim S: Geneset approach for expression pattern analysis.
Brief Bioinform 2008, 9:189197. PubMed Abstract  Publisher Full Text

Song S, Black MA: Microarraybased gene set analysis: a comparison of current methods.
BMC Bioinformatics 2008, 9:502. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dopazo J: Formulating and testing hypotheses in functional genomics.
Artif Intell Med 2008. PubMed Abstract  Publisher Full Text

OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distributionfree shrinkage approach.
Statist Appl Genet Mol Biol 2007, 6:Article9. PubMed Abstract

Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y: A multivariate extension of the gene set enrichment analysis.
J Bioinform Comput Biol 2007, 5(5):11391153. PubMed Abstract  Publisher Full Text

R Development Core Team: R: a language and environment for statistical computing. [http://www.Rproject.org] webcite
R Foundation for Statistical Computing, Vienna, Austria; 2008.
[ISBN 3900051070].

Ackermann M: A comparison of statistical methods for gene set enrichment analysis. In Diploma thesis. Department of Statistics, Technische Universität Dortmund; 2008.

Schäfer J, Strimmer K: A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics.
Stat Appl Genet Mol Biol 2005, 4:Article32. PubMed Abstract

Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J: Geneexpression profiles in hereditary breast cancer.
N Engl J Med 2001, 344:539548. PubMed Abstract  Publisher Full Text

Efron B: Largescale simultaneous hypothesis testing: the choice of a null hypothesis.

Alexa A, Rahnenführer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure.
Bioinformatics 2006, 22:16001607. PubMed Abstract  Publisher Full Text

Fisher RA: On the interpretation of χ^{2 }from contingency tables, and the calculation of P.

Goodman LA, Kruskal WH: Measures of association for crossclassification.

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Strimmer K: Connections between the augmented bootstrap and the shrinkage covariance estimator.

The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology.
Nature Genetics 2000, 25:2529. PubMed Abstract  Publisher Full Text

Läuter J: Exact t and F tests for analyzing studies with multiple endpoints.