Abstract
Background
Multiple dataanalytic methods have been proposed for evaluating geneexpression levels in specific biological pathways, assessing differential expression associated with a binary phenotype. Following Goeman and Bühlmann's recent review, we compared statistical performance of three methods, namely Global Test, ANCOVA Global Test, and SAMGS, that test "selfcontained null hypotheses" Via. subject sampling. The three methods were compared based on a simulation experiment and analyses of three realworld microarray datasets.
Results
In the simulation experiment, we found that the use of the asymptotic distribution in the two Global Tests leads to a statistical test with an incorrect size. Specifically, pvalues calculated by the scaled χ^{2 }distribution of Global Test and the asymptotic distribution of ANCOVA Global Test are too liberal, while the asymptotic distribution with a quadratic form of the Global Test results in pvalues that are too conservative. The two Global Tests with permutationbased inference, however, gave a correct size. While the three methods showed similar power using permutation inference after a proper standardization of gene expression data, SAMGS showed slightly higher power than the Global Tests. In the analysis of a realworld microarray dataset, the two Global Tests gave markedly different results, compared to SAMGS, in identifying pathways whose gene expressions are associated with p53 mutation in cancer cell lines. A proper standardization of gene expression variances is necessary for the two Global Tests in order to produce biologically sensible results. After the standardization, the three methods gave very similar biologicallysensible results, with slightly higher statistical significance given by SAMGS. The three methods gave similar patterns of results in the analysis of the other two microarray datasets.
Conclusion
An appropriate standardization makes the performance of all three methods similar, given the use of permutationbased inference. SAMGS tends to have slightly higher power in the lower αlevel region (i.e. gene sets that are of the greatest interest). Global Test and ANCOVA Global Test have the important advantage of being able to analyze continuous and survival phenotypes and to adjust for covariates. A free Microsoft Excel AddIn to perform SAMGS is available from http://www.ualberta.ca/~yyasui/homepage.html webcite.
Background
Some microarraybased gene expression analyses such as Significance Analysis of Microarray (SAM) [1] aim to discover individual genes whose expression levels are associated with a phenotype of interest. Such individualgene analyses can be enhanced by utilizing existing knowledge of biological pathways, or sets of individual genes (hereafter referred to as "gene sets"), that are linked via. related biological functions. Geneset analyses aim to discover gene sets the expression of which is associated with a phenotype of interest.
Many geneset analysis methods have been proposed previously. For example, Mootha et al. [2] proposed Gene Set Enrichment Analysis (GSEA), which uses the KolmogorovSmirnov statistic to measure the degree of differential gene expression in a gene set by a binary phenotype (see also [3]). Goeman et al. [4] presented Global Test, modeling differential gene expression by means of randomeffects logistic regression models. Goeman et al. [5] also extended their methods to continuous and survival outcomes. Mansmann and Meister [6] proposed ANCOVA Global Test, which is similar to Global Test but having the roles of phenotype and genes exchanged in regression models. Mansmann and Meister [6] pointed out that their ANCOVA Global Test outperformed Global Test, especially in cases where the asymptotic distribution of Global Test cannot be used. Dinu et al. [7] discussed some critical problems of GSEA as a method for geneset analysis and proposed an alternative method called SAMGS, an extension of SAM to geneset analysis. Goeman and Bühlmann [8] provided an excellent review of the methods, discussing important methodological questions of geneset analysis, and summarized the methodological principles behind the existing methods. An important contribution of their review was the distinction between testing "selfcontained null hypotheses" via. subject sampling and testing "competitive null hypotheses" via. gene sampling. They argue, and we agree, that the framework of the competitive hypothesis testing via. gene sampling is subject to serious errors in calculating and interpreting statistical significance of gene sets, because of its implicit or explicit untenable assumption of probabilistic independence across genes.
Although Global Test, ANCOVA Global Test, and SAMGS each test a selfcontained hypothesis on the association of expression patterns across a gene set with a phenotype of interest in a statistically appropriate manner, it is unclear how the three methods compare on performance in detecting underlying associations. In this paper, we compare the performance of the three methods via. simulation and realworld microarray data analyses, both statistically and biologically.
Results
Simulation experiment
Our first evaluation of the three methods used a simulation study, similar to that of Mansmann and Meister [6] with some modifications that make the simulated data more realistic, and evaluated the size and power of the three hypothesis tests. Geneset analysis was performed for both the original and "zscore standardized" simulated datasets so that the effects of standardization on the three tests' performance can be assessed. The zscore standardization was motivated by that fact that geneexpression variances can vary greatly across genes, even after a normalization, which could influence geneset analysis. In the zscore standardized datasets, gene expression was standardized using the following equation:
where x_{jk }is the gene expression for gene j in sample k, and s_{j }are the sample mean and standard deviation of gene j expression using all samples, respectively. All simulation analyses compared the mean expression of a geneset of interest between two groups, each with a sample of 10 observations.
First, we checked the size of the three tests, before and after the standardization, according to the following three scenarios of no differential expression between two groups: (1) randomly generate expression of 100 genes for the two groups from a multivariate normal distribution (MVN) with a mean vector μ and a diagonal variancecovariance matrix Σ, where the 100 elements of μ and the 100 diagonal elements of Σ were randomly generated as 100 independentlyandidenticallydistributed (i.i.d.) uniform random variables in (0,10) and 100 i.i.d. uniform random variables in (0.1, 10), respectively (i.e., no gene was differentially expressed between the two groups and expression was uncorrelated among the 100 genes); (2) exactly same as (1) except the variancecovariance matrix Σ of the MVN being changed to have a correlation of 0.5 between all pairs of the first 20 genes and also between all pairs of the second 20 genes; (3) exactly same as (2) with the correlation value changed from 0.5 to 0.9.
Second, we estimated the power of the three tests, before and after the standardization, by randomly generating a gene set of size 100, using the exactly same simulation setup of the sizeevaluation (2) above, but allowing the first 40 genes being differentially expressed. The mean expression of the 40 differentially expressed genes was randomly generated from Uniform(0,10) as in the sizeevaluation (2), but was subsequently modified by an addition and a subtraction of a constant γ, as in Mansmann and Meister [6], such that mean vectors μ_{i}'s for the two groups (i = 1, 2) differ by 2γ, , for j = 1,..., 40. We considered a range of γ from 0 to 2 with an increment of 0.1. The 40 differentially expressed genes were set to have a correlation of 0.5, as in the sizeevaluation (2), but no correlation and a correlation of 0.9 were also considered.
In the comparison of size across the three tests, the size was estimated by the observed proportion of replications with a pvalue smaller than the correct size α. By definition, under the null hypothesis, a proportion α of the replications of an experiment is expected to yield a pvalue smaller than α. In order to assess the size, we ran 5000 replications and used α = 0.05. For each permutationbased pvalue, 1000 random permutations were carried out.
In the comparison of power across the three tests, the power was estimated by the observed proportion of the replications of an experiment in which the null hypothesis was correctly rejected. Given the fixed numbers of samples and genes with the fixed correlation structure in the simulation experiment, a larger effect size γ leads to higher power for a given αlevel. In estimating the power, we ran 1000 replications of an experiment for each γ value. We considered α at 0.05, 0.01, 0.005, 0.0025, and 0.001. For obtaining a permutationbased pvalue, 1000 random permutations were carried out.
The empirical Type I error rates of SAMGS and the two Global Tests with permutations were almost right on the target of the nominal value of 0.05, before and after the standardization, for all three scenarios considered for the evaluation of size (Table 1). Type I error rates of Global Test with the scaled χ^{2 }null distribution and Global Test with the asymptotic null distribution with a quadratic form deviated noticeably from the nominal size, being too liberal with the scaled χ^{2 }and too conservative with the asymptotic distribution (non χ^{2 }distributed quadratic form) as shown in Table 1. As the correlation among the 40 genes increased, the Type I error rates of Global Test with the scaled χ^{2 }null distribution and the Global Test with the asymptotic null distribution with a quadratic form generally moved towards the nominal size of 0.05. Type I error rates of ANCOVA Global Test with the asymptotic distribution also deviated noticeably from the nominal size: 0.0692, 0.1034 and 0.0898 before the standardization and 0.037, 0.0848 and 0.0792 after the standardization, for r = 0, 0.5, and 0.9, respectively. Hereafter, therefore, the pvalues for Global Test and ANCOVA Global Test are calculated based on permutations. We also estimated the size of the three tests using 25 samples, instead of 10 samples, in each group, and observed similar patterns. As the sample size increased, the Type I error rates of the two Global Tests by using the asymptotic distributions moved towards to the nominal level of 0.05.
Table 1. Assessment of type I error probabilities
The second step of the simulation was to assess power, the results of which are shown in Figure 1, 2, 3, 4, 5, 6. Before the standardization, SAMGS showed higher power than the Global Tests at α = 0.05, with increasing power differences with decreasing α levels. This pattern was observed regardless of the correlation level in the 40 differentiallyexpressed genes (correlation of 0, 0.5, or 0.9). After the standardization, the performances of these three methods became closer: SAMGS showed slightly higher power than the Global tests with increasing power difference with decreasing α levels.
Figure 1. The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0 among 40 genes.
Figure 2. The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0 among 40 genes.
Figure 3. The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.5 among 40 genes.
Figure 4. The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.5 among 40 genes.
Figure 5. The results of the simulation experiment, evaluating power of the three tests before the standardization, for correlation of 0.9 among 40 genes.
Figure 6. The results of the simulation experiment, evaluating power of the three tests after the standardization, for correlation of 0.9 among 40 genes.
Realworld data analyses
Our next evaluation of the performance of the three methods used biologically, a priori defined gene sets and three microarray datasets considered in Subramanian et al. [3], download from GSEA webpage, [9]: 17 p53 wildtype vs. 33 p53 mutant cancer cell lines; 15 male vs. 17 female lymphoblastoid cells; 24 acute lymphoid leukemia (ALL) vs. 24 acute myeloid leukemia (AML) cells. For pathways/gene sets, we used Subramanian et al.'s geneset subcatalogs C1 and C2 from the same webaddress above on "Molecular Signature Database." The C1 catalog includes gene sets corresponding to human chromosomes and cytogenetic bands, while the C2 catalog includes gene sets that are involved in specific metabolic signaling pathways [3]. In Subramanian et al., Catalog C1 included 24 sets, one for each of the 24 human chromosomes, and 295 sets corresponding to the cytogenetic bands; Catalog C2 consisted of 472 sets containing gene sets reported in manually curated databases and 50 sets containing genes reported in various experimental papers. Following Subramanian et al. [3], we restricted the set size to be between 15 and 500, resulting in 308 pathways to be examined.
We compared the performance of the three methods before and after the standardization by listing the gene sets which had a pvalue ≤ 0.001 by any of the three methods.
Table 2 shows the associations of biologicallydefined gene sets with the phenotype, assessed by Global Test, ANCOVA Global Test, and SAMGS, in the analysis of gene expression differences between p53 wildtype vs. mutant cancer cell lines. Gene sets with a pvalue ≤ 0.001 by any of the three methods are listed in Table 2. Before the standardization, SAMGS identified 16 gene sets with a pvalue ≤ 0.001, while Global Test and ANCOVA Global Test identified three and one gene sets, respectively, with a pvalue ≤ 0.001 (Table 2). Two of these three sets were among the 16 sets identified by SAMGS. The third set was CR_DEATH which had a pvalue = 0.008 from SAMGS. Among the 17 gene sets listed in Table 2, seven involve p53 directly as a geneset member. Furthermore, five gene sets directly involve the extrinsic and intrinsic apoptosis pathways, and three involve the cellcycle machinery; each of these eight gene sets, then, is in a direct, wellestablished relationship with aspects of p53 signaling. The remaining two gene sets have plausible, if less well established, links with p53 [7]. The disagreement between results of SAMGS and the two Global tests was considerable before standardization. Although 16 of the 17 gene sets in Table 2 had a SAMGS pvalue ≤ 0.001, 7 had pvalues larger than 0.1 by the two Global tests. For example, SAMGS identified the gene set p53hypoxia pathway as a significant set with a pvalue < 0.001, which seems biologically appropriate, yet the Global Test and the ANCOVA Global Test gave pvalues greater than 0.6.
Table 2. Gene sets in the p53 dataset with Pvalue ≤ 0.001 by any of the three methods
We then compared the three methods incorporating the zscore standardization. For SAMGS, the pvalues before and after the standardization were highly consistent, and, therefore, we used the results of SAMGS before the standardization for the comparisons with the other two methods. For Global Test and ANCOVA Global Test, pvalues changed appreciably. Notably, pvalues of Global Test and ANCOVA Global Test after the standardization agreed closely with those of SAMGS (Table 2, Figure 7). For example, the pvalues of p53hypoxia pathway changed from 0.626 to <0.001 for Global Test and from 0.622 to <0.001 for ANCOVA Global Test. Although the pvalues of the three methods agreed with each other after the standardization, the pvalues from SAMGS tended to be smaller than those from Global Test and ANCOVA Global Test, in the lower range of pvalues (gene sets that are of the greatest interest) (Table 2, Figures 7 and 8): this is consistent with the powercomparison simulation in which SAMGS showed slightly higher power than the Global tests at small α levels, even after the standardization.
Figure 7. Pvalues of 308 gene sets in the p53 data analysis: pvalues of Global Test and ANCOVA Global Test after standardization vs. SAMGS pvalues before the standardization. The line indicates equal pvalues between SAMGS and Global Tests.
Figure 8. Lowest Pvalues in the p53 data analysis: pvalues of Global Test and ANCOVA Global Test after standardization vs. SAMGS pvalues before the standardization. The line indicates equal pvalues between SAMGS and Global Tests.
The same pattern was found in the analyses of the malevs.female lymphoblastoid dataset and the ALLvs.AML dataset (See Figures S1, S2, S3 and S4 in Additional file 1, comparing the results from the three methods). Before the standardization, pvalues from Global Test and ANCOVA Global Test differed greatly from pvalues from SAMGS. The pvalues of Global Test and ANCOVA Global Test changed markedly after the standardization and were very close to those of SAMGS. After the standardization, in the malevs.female analysis, 21 gene sets had a pvalue < 0.15 by one or more of the methods; 17 of these had a SAMGS pvalue smaller than, or equal to, those of Global Test and ANCOVA Global Test. In the ALLvs.AML analysis, all sets were statistically significant with pvalues < 0.001 by all three tests: which is unlikely to be of any biological significance.
Additional file 1. The analysis results of the two realworld microarray datasets (gender and leukemia) by the three methods. These three methods were applied and compared on two realworld microarray datasets: the male vs. female lymphoblastoid cell microarray dataset and the ALL and AMLcell microarray dataset.
Format: PDF Size: 61KB Download file
This file can be viewed with: Adobe Acrobat Reader
In the User Guides for Global Test and ANCOVA Global Test, Variance Stabilization (VSN) was used to normalize the data [10,11]. We also assessed the performance of the three methods on the p53 dataset, male vs. female dataset, and the ALL/AML dataset using VSN. The results for the p53 dataset are shown in Table 2 and Figure 9. When VSN was used for the normalization of the data, we observed: (1) pvalues of Global Test and ANCOVA Global Test became similar to those of SAMGS, but not as close as the pvalues after the zscore standardization; and (2) in the lower range of pvalues, the pvalues for SAMGS tended to be smaller than those of Global Test and ANCOVA Global Test, (Table 2, Figure 9).
Figure 9. Lowest Pvalues in the p53 data analysis: pvalues of Global Test and ANCOVA Global Test after the VSN normalization vs. SAMGS pvalues after the VSN normalization. The line indicates equal pvalues between SAMGS and Global Tests.
Discussion
From the simulation results, we suggest that, when Global Test and ANCOVA Global Test are used for the analysis of microarray data, permutations should always be used for the calculation of statistical significance. In the documentation included with the Global Test R package, Goeman et al. noted that the asymptotic distribution with a quadratic form is the recommended method for large sample sizes and it can be slightly conservative for small samples. In our simulation study, we used 10 and 25 samples for each of the two groups. In each situation, the asymptotic method with a quadratic form gave conservative pvalues, although the difference between asymptotic and permutationbased methods did decrease when the sample size increased. Goeman et al. also noted that the scaled χ^{2 }method can be slightly anticonservative, especially for large gene sets. Our simulation study showed that the scaled χ^{2 }method can be markedly anticonservative. This is in accord with the manual of Global Test, which recommends against using the scaled χ^{2 }approximation.
We found that performance of the two Global Tests changed greatly before and after standardization, but SAMGS performance remained unchanged. This can be explained by: (1) the invariance of ttest statistics under shifting and rescaling of data, that is relevant to SAMGS; (2) ANCOVA's explicit assumption that all genes in the set to have an equal variance, a violation of which would clearly affect the performance of ANCOVA Global Test; and (3) Global test's assumption that the regression coefficients come from the same normal distribution, an assumption that is met by the standardization of gene expression. Therefore, some sort of standardization that makes the variances of gene expression similar across genes is needed before using Global Test and ANCOVA Global Test. SAMGS employs a constant in the denominator of its tlike test statistic to address the small variability in some of the gene expression measurements and, thus, effectively standardizes expression across genes; neither Global Test nor the ANCOVA Global Test addresses this characteristic of microarray data. Both Goeman et al. [4] and Mansmann and Meister [6] have stated that an appropriate normalization is important. Note that not many normalization methods would standardize the expression across genes. It is only after applying zscore standardization (1) or the VSN normalization, that the results of the three methods became congruent. The similarity between Global Test and Global ANCOVA Test has already been commented upon in [6]. The similarity between SAMGS and Global Test may be inferred from the construction of the latter as a weighted sum of squared transformed tstatistics [12], which is similar to the SAMGS test statistic.
It should be noted that Global Test allows four different types of phenotype variables: binary; multiclass; continuous; and survival. ANCOVA Global Test allows binary, multiclass, and continuous phenotypes. The ability to handle different classes of phenotypes is a very important advantage of Global Test and ANCOVA Global Test over SAMGS. It is also possible to use Global Test and ANCOVA Global Test while adjusting for covariates (e.g., potential confounders). If covariates are incorporated, the two tests assess whether the geneexpression profile has an independent association with the phenotype that is above and beyond what is explained by the covariates. The ability to adjust for covariates is another important advantage of Global Test and ANCOVA Global Test over SAMGS.
We focused on pvalues in this paper because we were comparing the three methods that test "selfcontained null hypotheses" via. subject sampling. To account for multiple comparisons when multiple gene sets are tested, one might consider False Discovery Rate (FDR) instead of Type I error probability. For example, SAM uses a qvalue, an upper limit of the FDR, for each gene, which could be extended here to each geneset using the method of Storey [13]. The qvalues of the 17 gene sets listed in Table 2 are displayed in Additional file 2.
Additional file 2. FDR values for the 17 gene sets listed in Table 2. FDR values of the 17 gene sets listed in Table 2 are presented.
Format: PDF Size: 33KB Download file
This file can be viewed with: Adobe Acrobat Reader
We have considered, but did not report detailed comparison results of two other methods, Tian et al. [14] and Tomfohr et al. [15], that test selfcontained hypotheses via. subject sampling, in addition to the three methods we highlighted above. Tian et al. [14] tests the significance of a gene set by taking the mean of tvalues of genes in the gene set as a test statistic and evaluating its significance by a permutation test. Tomfohr et al. [15] reduces the gene set's expression into a single summary value by taking the first principal component of expressions of genes in the gene set and performs a permutationbased ttest of the single summary. The two methods gave appreciably different results when compared to Global Test and ANCOVA Global Test, and SAMGS. Of the 17 gene sets in Table 2 for the p53 analysis, for instance, Tian et al. and Tomfohr et al. identified only eight and one gene sets, respectively, with pvalue < 0.10: the ATM pathway, for example, was identified by Global Test, ANCOVA Global Test, and SAMGS with pvalue ≤ 0.001, while the methods of Tian et al. and Tomfohr et al. gave pvalue = 0.61 and 0.99, respectively. The main reasons for their large discrepancies from the results of the three highlighted methods are as follows. Tian et al. sums up the tvalues for all the genes in a gene set, which will result in cancellation of large positive tvalues and large negative tvalues. Among the 11 upregulated and 8 downregulated genes in the ATM pathway, for example, two upregulated genes had large positive tvalues (about 2 or greater) and three downregulated genes had large negative tvalues (about – 2 or smaller): these large positive and negative tvalues cancel each other when summing up all tvalues in the Tian et al. test statistic, leading to reduced power for detecting gene sets that contain both significantly upregulated genes and significantly downregulated genes. The method of Tomfohr et al. summarizes the Sdimension geneexpression vector of genes in the gene set S by the first principal component without considering the phenotype: if the direction of the first principal component does not correspond to the direction that separates the two phenotypes, their method does not capture the differential expressions even when they exist, leading to markedly reduced power.
Although we focused on the comparison of the "selfcontained null hypothesis" approaches, it is also of methodological interest to see how "competitive null hypothesis" approaches compare. We, therefore, applied three "competitive null hypothesis" approaches to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function and Expression (SAFE) [16]; and Fisher's exact test [17]. The results are shown in Additional file 3. The results from the three "competitive null hypothesis" approaches were greatly different from those of SAMGS and the Global Tests. Most of the gene sets identified as being significantly associated with the p53 mutation by SAMGS and Global Tests were not identified as such by the three "competitive null hypothesis" approaches. The only gene set additionally identified as being significantly associated with the p53 mutation (with p < 0.001) was HUMAN_CD34_ENRICHED_TF_JP: for this gene set, the Fisher's exact test pvalue was < 0.001, but all the other five methods gave pvalues > 0.37. Known biological functions of p53 are clearly more consistent with the results of the "selfcontained null hypothesis" approaches. The differences observed between "selfcontained null hypothesis" and "competitive null hypothesis" approaches can be attributable, at least partly, to the fact that the significance of a gene set depends only on the genes in the set under the "selfcontained null hypothesis" testing, while, under the "competitive null hypothesis" testing, the significance of a gene set depends not only on the genes in the set but also on all the other genes in the array.
Additional file 3. Pvalues and FDR values for the three "selfcontained null hypothesis" and three "competitive null hypothesis" approaches. The three "selfcontained null hypothesis" and three "competitive null hypothesis" approaches were applied to the p53 dataset. The pvalues and FDR values for the 17 gene sets listed in Table 2 are presented.
Format: PDF Size: 46KB Download file
This file can be viewed with: Adobe Acrobat Reader
In summary, the primary advantage of SAMGS may be the slightly higher power in the low αlevel region that is of highest scientific interest, whereas, despite the need for appropriate standardization, Global Test and the ANCOVA Global Test can be used for a variety of phenotypes and incorporate covariates in the analysis.
Conclusion
In conclusion, Global Test and ANCOVA Global Test require appropriate standardization of gene expression measurements across genes for proper performance. Standardization of these two methods and the use of permutation inference make the performance of all three methods similar, with a slight power advantage in SAMGS. Global Test and the ANCOVA Global Test can be used for a variety of phenotypes and incorporate covariates in the analysis.
Methods
In this section, we describe the three geneset analysis methods. The phenotype of interest is assumed to be binary.
1) Global Test
The Global Test is based on a regression model that predicts response from the gene expression measurements of a gene set [4]. Generalized linear models are used to model the dependency of response Y (an n × 1 vector) on gene expression measurements X (an n × m matrix) of a set of m genes on n samples:
where h denotes the link function and α and β's are parameters. If the genes are not differentially expressed, the regression coefficients (β's) should be zero. Under an assumption that all regression coefficients are sampled from a common distribution with mean zero and variance τ^{2}, the null hypothesis of no differential geneexpression is reduced to τ^{2 }= 0. Using the notation r_{i }= Σ_{j}x_{ij}β_{j}, the model simplifies to a randomeffects model: E(Y_{i}r_{i}) = h^{1 }(α + r_{i}). The null hypothesis can then be tested, based on a score test statistic discussed in Le Cessie and Van Houwelingen[18] and HouwingDuistermaat et al. [19]:
where R = (1/m)XX', μ = h^{1}(α), and μ_{2 }is the second central moment of Y under the null hypothesis. It can be shown that Q is asymptotically normally distributed (a quadratic form which is nonnegative). However, when the sample size is small, a better approximation to the distribution of Q is a scaled χ^{2 }distribution. The pvalue can, therefore, be calculated based on an approximate distribution of the test statistic, i.e., the asymptotic distribution with a nonchisquared distributed quadratic form or the scaled χ^{2 }distribution, or permutations of samples.
2) ANCOVA Global Test
The null hypothesis of the Global Test is in the form of P(YX) = P(Y). The ANCOVA Global Test changes the roles of gene expression pattern X and phenotype Y, and the null hypothesis becomes P(XY = 1) = P(XY = 2), or, for each gene j in a gene set of interest, μ_{1j }= μ_{2j}, where μ_{ij }is the mean expression of gene j in phenotype group i, i = 1,2. A linear model of the form, μ_{ij }= μ + α_{i }+ β_{j }+ γ_{ij}, with group effects α, gene effects β, and the genegroup interaction γ, is then used to test the null hypothesis. The conditions Σα_{i }= Σβ_{j }= Σ_{i}γ_{ij }= Σ_{j}γ_{ij }= 0 ensure identifiability of the parameters. The null hypothesis under the parameterization of the linear model is H_{0}: α_{i }= γ_{ij }= 0. The test statistic is the Ftest statistic for linear models: , where SSR_{H }and df_{H }denote the sum of squares and degrees of freedom, respectively, under the hypothesis H. The pvalue can be calculated by a permutation distribution of the F statistic or an asymptotic distribution of the test statistic.
3) SAMGS
SAMGS extends SAM to geneset analysis. SAMGS tests a null hypothesis that the mean vectors of expression of genes in a gene set does not differ by the phenotype of interest. The SAMGS method is based on individual tlike statistics from SAM, addressing the small variability problem encountered in microarray data, i.e., reducing the statistical significance associated with genes with very little variation in their expression. For each gene j, the d statistic is calculated as in SAM:
where the 'genespecific scatter' s(j) is a pooled standard deviation over the two groups of the phenotype, and s_{0 }is a small positive constant that adjusts for the small variability [1]. SAMGS then summarizes these standardized differences in all genes in the gene set S by:
A permutation distribution of the SAMGS statistic is used to calculate the pvalue. We note that even though the recalculation of s_{0 }is needed for each permutation, practically the implication is small, and both SAM and SAMGS excel addins do not recalculate s_{0}.
Each of the three methods provides a statistically valid test of the null hypothesis of no differential gene expression across a binary phenotype.
For the purpose of methodological comparisons, we also applied three "competitive null hypothesis" approaches to the analysis of the p53 dataset: Gene Set Enrichment Analysis (GSEA) [2]; the Significance Analysis of Function and Expression (SAFE) [16]; and Fisher's exact test [17]. Both GSEA and SAFE employ a twostage approach to access the significance of a gene set. First, genespecific measures are calculated that capture the association between expression and the phenotype of interest. Then a test statistic is constructed as a function of the genespecific measures used in the first step. The significance of the test statistics is assessed by permutation of the response values. For GSEA, the Pearson correlation is used in the first step, according to Mootha et al. [2] and the Enriched Score is used in the second step. For SAFE, the student tstatistic is used in the first step and the Wilcoxon ranksum test is used in the second step, both of these being the default options. For the Fisher's exact test, the list of significant genes is obtained from SAM [1]. An FDR cutoff of 0.3 assigned significance to 5% of the genes in the entire gene list.
Availability and requirements
Project name: Comparison of statistical methods for gene set analysis based on testing selfcontained hypotheses via. subject sampling.
Project home page: http://www.ualberta.ca/~yyasui/homepage.html webcite
Operating system(s): Microsoft Windows XP
Programming language: R 2.4.x and Microsoft Excel 2003 or 2007
Abbreviations
Significance Analysis of Microarray for Gene Sets (SAMGS)
Authors' contributions
JDP provided biological interpretations of the analysis results of the realworld dataset. QL and ID contributed significantly to data analysis, refinement of SAMGS, and programming. The manuscript was written primarily by QL, ID, and YY, and critically reviewed and revised by all authors. All authors read and approved the final manuscript.
Acknowledgements
ID, AJA, and YY are supported by the Alberta Heritage Foundation for Medical Research and YY is supported by the Canada Research Chair Program and the Canadian Institutes of Health Research.
References

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci USA 2005, 102:1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome.
Bioinformatics 2004, 20:9399. PubMed Abstract  Publisher Full Text

Goeman JJ, Oosting J, CletonJansen AM, Anninga JK, van Houwelingen HC: Testing association of a pathway with survival using gene expression data.
Bioinformatics 2005, 21:19501957. PubMed Abstract  Publisher Full Text

Mansmann U, Meister R: Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach.
Methods Inf Med 2005, 44:449453. PubMed Abstract  Publisher Full Text

Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving GSEA for Analysis of Biologic Pathways for Differential Gene Expression across a Binary Phenotype.
BMC Bioinformatics 2007, 8:242. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues.
Bioinformatics 2007, 23:980987. PubMed Abstract  Publisher Full Text

Gene Set Enrichment Analysis [http://www.broad.mit.edu/gsea] webcite

Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression.
Bioinformatics 2002, 18(Suppl 1):S96104. PubMed Abstract  Publisher Full Text

Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M: Parameter estimation for the calibration and variance stabilization of microarray data.
Stat Appl Genet Mol Biol 2003, 2:Article3. PubMed Abstract

Goeman JJ, Van de Geer SA, van Houwelingen HC: Testing against a high dimensional alternative.
J R Statist Soc B 2006, 68:477493. Publisher Full Text

Storey JD: A direct approach to false discovery rates.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64:479498. Publisher Full Text

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.
Proc Natl Acad Sci USA 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition.
BMC Bioinformatics 2005, 6:225. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Barry WT, Nobel AB, Wright FA: Significance analysis of functional categories in gene expression studies: a structured permutation approach.
Bioinformatics 2005, 21:19431949. PubMed Abstract  Publisher Full Text

Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

le Cessie S, van Houwelingen HC: Testing the fit of a regression model via score tests in random effects models.
Biometrics 1995, 51:600614. PubMed Abstract  Publisher Full Text

HouwingDuistermaat JJ, Derkx BH, Rosendaal FR, van Houwelingen HC: Testing familial aggregation.
Biometrics 1995, 51:12921301. PubMed Abstract  Publisher Full Text