Abstract
Background
Gene set analyses have become increasingly important in genomic research, as many complex diseases are contributed jointly by alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses.
Results
We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the pvalues using permutation and a scaled chisquare approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA).
Conclusion
We develop a gene set analyses method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and global test in both simulation and a diabetes microarray data.
Background
Genomewide analysis using microarray data, including RNA expression, DNA copy number and epigenetic DNA methylation, has become a popular tool in genomic research. Single gene/marker analysis provides a quick and convenient tool to identify top genes that might be associated with phenotypic trait. However, it is subject to a large number of false positives due to a large number of comparisons, and does not fully take into account that some genes have similar biological functions and work together.
Microarray gene expressions or genetic markers usually have natural groupings based on biological knowledge. For example, multiple genes belong to the same biological pathway or network; or contiguous copy numberdetecting probes belong to the same gene or cytoband. Incorporating the prior knowledge or annotation about the grouping underlying the genomewide data can make the results more interpretable. Note that the grouping may not necessarily come from biology. It can also be a cluster of genes identified using clustering methods. In this paper, these natural or statistical groupings are loosely called a gene set, which refers to a set of genes, or a set of markers or simply a set of probes.
Numerous approaches for gene set analyses have been proposed [1], including the overrepresentation analysis [2], the univariate tests [3], the multivariate tests [4,5], the global test [6], and gene set enrichment analysis (GSEA) [7,8] and its variant [9]. The overrepresentationtype analysis has been found to suffer from methodological problems, which may lead to confusing results [10]. The global test and GSEA improve over the overrepresentationtype analysis. The global test regresses the phenotype on gene expressions in a gene set and tests for regression coefficients. GSEA performs a modified KolmogorovSmirnov test by comparing a gene set with the rest of the genes in the genome. However, the test statistics used in both methods ignore the correlation among the genes in a gene set and hence are subject to loss of statistical power, as genes in a gene set are often correlated and function together. The univariate test does not account for the correlation and loses power when the interdependence within the gene set is high, compared with the multivariate tests [11].
We propose in this paper to test for the effect of a gene set using a variance component test in multivariate regression model, where the correlation among genes in a gene set is explicitly taken into account. We term this test TEGS (Test for the Effect of a Gene Set). Specifically, we regress the gene expressions in a gene set on an independent variable, such an exposure or biological state variable, e.g., smoking (yes/no) or lung cancer status (yes/no), using multivariate regression, where the correlation among genes in a gene set is modeled using a working covariance matrix. As the number of genes might be large in a gene set, we develop a variance component score test for testing the effects of the exposure/biological state on the overall gene set profile by assuming regression coefficients follow a common distribution.
We show that TEGS includes the global test of Goeman, et al (2004) as a special case when correlation among the genes in a gene set is ignored. We conduct simulation studies to evaluate the finite sample performance of TEGS and compare it with the global test and GSEA. We apply the proposed method to analysis of the Type II Diabetes data set [7].
Methods
The model
Suppose that there are n subjects and subject i has p continuous outcomes Y_{i1},Y_{i2},…,Y_{ip}. In gene set analysis, the p outcomes indicate the expression values of p genes in a gene set, and x_{i} is an independent variable, e.g., exposure/biological state variable, such as mutation status: 1 if mutant and 0 if wildtype; or disease status (yes/no) for subject i. We consider the multivariate linear model
where the errors, ε_{i} = (ε_{i1},ε_{i2},...,ε_{ip})^{T} are assumed to be independent across different subjects and follow a multivariate normal distribution with mean 0 and true covariance Σ, which is often unknown, and α_{j} is the average expression value of gene j for those with x = 0. Covariates can be incorporated in the model (1) by expanding α_{j} to be where K is the number of covariates plus one (i.e., the intercept), z_{ik} is the covariate k of subject i, z_{i1} is 1, and α_{jk} is the regression coefficient of the covariate k for the gene j. However, because the data we are dealing with has small n and large p, we would need the ridge regression to estimate α_{jk}. If x_{i} is binary, e.g., disease status, β_{j} is the mean difference of the expression levels of gene j between the two disease groups. Model (1) can be written in matrix notation by stacking data of n subjects and p gene expressions as
where is an np × 1 vector, Y_{i} = (Y_{i1}, Y_{i2},…,Y_{ip})^{T}, , J = (I_{p}, ⋯,I_{p})^{T},X = (x_{1}I_{p}, ⋯,x_{n}I_{p})^{T}, α = (α_{1},α_{2}, ⋯,α_{p})^{T}, β = (β_{1},β_{2}, ⋯,β_{p})^{T}.
Gene set analysis using TEGS: A variance component score test
The null hypothesis H_{0} : β = 0 indicates that x_{i} has no effect on the mean of gene expression profile Y_{i} in a gene set. A traditional multivariate test for H_{0}[4] is based on a pdegree of freedom test and hence has limited power when the size of the gene set p is large, especially in the presence of a large number of null genes. To overcome this problem and improve test power, we assume the regression coefficients β_{j} follows an arbitrary common distribution with mean 0 and variance τ. The model (2) becomes a linear mixed model [12]. The null hypothesis H_{0} : β = 0 is equivalent to the null hypothesis for the variance component H_{0} : τ = 0. To test for H_{0} : τ = 0, one can perform a variance component score test [13].
Specifically, following Lin (1997), simple calculations show that the score for the variance component τ under the induced linear mixed model is
where Σ_{n} = diag(Σ,⋯,Σ) is an np × np block diagonal matrix. As the second term does not depend on data, we use the first term to construct the test statistic
where is the maximum likelihood estimator of α under H_{0}. One can easily show that under H_{0}, , where is simply the sample mean. Hence Equation (4) can be written as
where H = n^{−1}JJ^{T}. As Q_{T} is quadratic in Y. Some calculations show that Q_{T} follows a mixture of chisquare distribution , where the weights λ_{j} are the eigenvalues of the matrix .
The test statistic Q_{T} depends on the true covariance matrix Σ of Y_{i}, which is often unknown in practice and requires estimation of a large number of parameters. Although sample covariance can be used to estimate Σ, it is not stable when the size of the gene set p is large or moderate and sample size is small. We hence propose the use of a working covariance V for ε_{i} in (1) [14], which has a simpler structure and depends on a small number of parameters. We derive a variance component test for H_{0} : τ = 0 assuming ε_{i} has a covariance V, which might misspecify the true covariance Σ. Under this working model, similar calculations show that the variance component score statistic for H_{0} : τ = 0 is
where V_{n} = diag(V,⋯,V). We term the variance component test using Q TEGS (Test for the Effect of a Gene Set).
Examples of working covariance V include working independence (Indpt), which assumes the genes are independent in a gene set; factor analysis covariances assuming two factors (F2); adaptive factor analysis covariance with the estimated number of factors explaining up to 80% variability (Fadpt), compound symmetry (CpSym), which assumes the same pairwise correlation among genes; and unstructured sample covariance (Unstr).
The unstructured sample covariance is estimated using the residuals obtained by performing separate simple linear regression of individual gene expressions Y_{ij} on x_{i} in (1). When x_{i} is binary, e.g., disease=yes/no, is simply the jth centered outcome using the group specific means. When the number of genes in a gene set is large and the sample size is small, the standard empirical unstructured sample covariance estimator is unstable. We hence stabilize it using a ridge estimator by adding the 5th percentiles of sample variances to the diagonal of the empirical covariance estimator. Estimation for the compound symmetry covariance and the factor analysis covariance was based on the ridge unstructured covariance estimator. Specifically, for the compound symmetry covariance estimator, the pairwise covariance is estimated as the sample mean of the offdiagonal elements of the ridge unstructured covariance estimator. The twofactor and adaptivefactor covariances are estimated by singular value decomposition of the ridge unstructured covariance estimator.
We discuss in Section Null distribution of TEGS estimation of the pvalue using the TEGS test statistic Q. We perfromed simulation studies to investigate the performance of size and power using different working covariances in a wide range of scenarios and compare TEGS with that using Q_{T}, which is based on the true covariance matrix of Y_{i} and is the optimal test statistic within the TEGS statistic family, but cannot be calculated in practice as the true covariance of Y_{i} is unknown.
Special case of two group comparison and relationship of TEGS with the global test
Consider the setting of testing for the effect of a binary exposure/disease status on expressions in a gene set, i.e., x_{i} is binary (0/1), some calculations show that the TEGS statistic Q in (5) can be simplified as
where and are the sample mean of the outcome vector for group 1 and 2, and and (k = 1,⋯,p) are their components, v^{jk} is the (j,k)th element of V^{−1}. This suggests that the TEGS statistic Q compares the weighted average of the outcomespecific mean differences of the gene expression profiles between the two groups.
If one assumes working independence V = I, the TEGS test statistic Q in (6) becomes
It can be easily shown that the TEGS statistic that uses the working independence covariance among gene expressions in a gene set (Q_{ind}) is the same as the global test of Goeman, et al (2004). Although the global test is equivalent to the TEGS with working independence, it is still derived under the model (1) where the true covariance Σ is not necessarily independent.
Specifically, the global test is derived as the variance component test under the logistic regression model of the binary disease status x_{i} on the p gene expressions
where π_{i} = Pr(x_{i} = 1Y_{i1},⋯,Y_{ip}) is the probability of disease given the gene expression profiles in a gene set. Under the logistic model (8), to test for the null hypothesis of no gene set effect on disease status H_{0} : γ_{1} = ⋯ = γ_{p} = 0, Goeman, et al (2004) assumed the coefficients γ_{j} are independent and follow an arbitrary distribution with mean 0 and variance τ. The logistic model (8) hence becomes a logistic mixed model [15]. It follows that the null hypothesis H_{0} : γ_{1} = ⋯ = γ_{p} = 0 is identical to H_{0} : τ = 0. Goeman, et al (2004) derived the variance component score test for H_{0} : τ = 0 and termed it as the global test. One can easily show that the global test is identical to Q_{ind} in (7), apart from a term that does not depend on Y.
A comparison of (6) and (7) shows that TEGS has the flexibility to account for different correlations among gene expressions in a gene set by comparing the weighted differences of the means of gene expressions between two groups, while the global test, which is the same as the TEGS assuming working independence among gene expressions, ignores correlation among gene expressions. One hence would expect that TEGS that accounts for within gene set correlation is likely to be more powerful than the global test.
Another testing procedure that is closely related to TEGS is the Sequence Kernel Association Test (SKAT), a method developed to analyze SNP (single nucleotide polymorphism) or sequence data in genomewide association studies [16]. It has been shown that the global test is equivalent to the SKAT with linear kernel [16,17]. Thus, the TEGS with working independence is equivalent to the SKAT with linear kernel. However, TEGS with other working correlations and SKAT with other kernels do not have an obvious correspondence.
Null distribution of TEGS
As the TEGS statistic Q in (5) is a quadratic function of Y, we have shown that it follows a mixture of chisquare distributions, where the weights depend on the true covariance Σ and the working covariance V. We propose two methods to estimate the pvalue of TEGS.
Permutation
One approach to calculate the pvalue for the TEGS statistic Q, is based on permutation, where we permute the x_{i}’s, and calculate Q for each permuted dataset and compare the observed value of Q with those calculated based on the permuted samples. Note that for each permutation, V need to be reestimated given an assumed structure, e.g., under independence, unstructured, exchangeable, as V is the covariance conditional on the x. If the sample size is large (i.e. >100), one may use the Monte Carlo approach proposed by Lin [18].
Scaled χ^{2} approximation
The second approach is to compute the pvalue for the TEGS statistic Q is to use the Satterthwaite method [19] to approximate the null distribution of Q, which is a mixture of χ^{2} distributions. The Satterthwaite method approximates the null distribution of Q by a scaled χ^{2} distribution , where κ is the scale parameter and ν is the degree of freedom. The values of κ and ν can be estimated by matching the first two moments of Q under H_{0} with those of the the scaled χ^{2} distribution as
We estimate the mean and variance of Q under the null using permutation and denote the pvalue estimated using this approach as . Using the Satterthwaite approximation, we are able to calculate small pvalues based on a smaller number of permutations than the first method.
Normal mixture approximation
In order to achieve better precision of smaller pvalues, we further propose a method using the normal mixture approximation [20]. Specifically, we fit a twopopulation normal mixture π_{1}N(μ_{1},σ12) + π_{2}N(μ_{2},σ22) for the where is the scaled χ^{2} approximated pvalue for the statistic Q^{(b)} obtained at permutation b, b = 1,...,B (B is the number of permutation), Φ is the cumulative distribution function of the standard normal, and π_{a}, μ_{a} and are proportion, mean and variance of the normal distribution a (a = 1,2), respectively. pvalue can then be estimated as the tail probability by comparing and where and , respectively are maximum likelihood estimates of μ_{a}, and π_{a}.
Power calculations
To design a new study using a gene set analysis, one needs to perform power calculations. We discuss in this section power calculations using TEGS. The distribution of Q under the alternative hypothesis follows a mixture of noncentral chisquare distributions. We approximate this distribution using a scaled noncentral chisquare distribution . Specifically, we first estimate κ and ν under H_{0} as and , where and are the mean and the variance estimated theoretically as under the null as
For power calculations, to estimate the noncentrality parameter δ, the theoretical mean ) under the alternative is
Setting , which is the mean of , one can solve for δ, and calculate the power of the test using where α is the size of the test. The true covariance Σ and the working covariance V can be estimated using the pilot data. One can perform calculations by varying and the effects β.
Simulation study
Single gene set
We simulated the data using model (1). Two different combinations of n and p were considered: n = 50 and p = 10 and n = 20 and p = 40. Four different true covariances of ε_{i}, Σ, were investigated: (1) compound symmetry (CS), where we assumed the diagonal elements equal to 1 and the offdiagonal elements equal to 0.1 or 0.5; (2) firstorder autoregressive (AR1), where we assumed the diagonal elements equal to 1 and offdiagonal elements decay by a factor of 0.5 or 0.8; (3) two factor covariance (F2): , where the p elements of the two factors, P_{1} and P_{2} were generated independently from two Gaussian distributions, and u was chosen to make the diagonal elements of the Σ equal to 1’s; (4) the unstructured covariance (UNS), which was assumed to be the sample covariance of MAP00240_Pyrimidine_metabolism (p = 40) using the Type II Diabetes data in Mootha et al. (2003). The sample covariance of MAP00240_Pyrimidine_metabolism was calculated based on 17 subjects with normal glucose tolerance and 17 Type II Diabetes patients by conditioning on the disease status. To avoid singularity of the sample covariance, the 5th percentile of the diagonal elements was added to the diagonal to construct the true covariance matrix used in simulations.
The regression coefficients β was set by varying the proportion of nonzero β’s and their magnitudes. For n = 50 and p = 10, 0%, 40% and 80% of β’s were set to nonzero. The nonzero βs were set to be β = ±0.25 or ±0.5. For n = 20 and p = 40, 0%, 25%, 50% and 60% of β’s were set to be nonzero. The nonzero βs were set to be ±0.5 or ±1.0. The numbers of (−0.5,0.5) (or (−1.0,1.0)) are (2,8), (5,5), (5,15), (10,10), (5,25), (10,20) and (15,15). The effect size is summarized by an index, where is the average variance of the p gene expression in the same gene set, in the power plots given in Figures 1, 2, 3 and 4.
Figure 1. Power comparison using simulations with the true covariance to be compound symmetry. We set n = 20 and p = 40; covariance (cov) = 0.1 (top panel) and 0.5 (bottom panel). For each panel, the left plot is based on permutation and the right plot is based on the Satterthwaite scaled chisquare approximation. For each plot, the powers of TEGS assuming six different covariances were compared. TEGS assuming working independence (Indpt) corresponds to the global test of Goeman et al. (2004). The horizontal dotdash line indicates the size of the test (i.e., 5%).
Figure 2. Power comparison using simulations with the true covariance to be AR1. R is the value of the lag 1 autocorrelation. We set n = 20 and p = 40; covariance (cov)=0.1 (top panel) and 0.5 (bottom panel). Other notations and symbols are similar to Figure 1.
For each simulation and each true covariance configuration, we compared the performance of TEGS assuming six different covariance matrices: true covariance, unstructured covariance, independence, two factor analysis covariance, adaptive factor analysis covariance, and compound symmetry. Note that the TEGS assuming working independence corresponds to the global test (Goeman et al. 2004). The pvalues were calculated as the tail probability of Q under the null distribution. The null distribution was approximated by the methods described in Section Null distribution of TEGS. A total of 1000 permutations were performed to nonparametrically approximate the null distribution of Q. A total of 5000 and 1000 simulations, respectively were run for the setting under the null hypothesis (i.e., β = 0) and the alternative hypothesis to calculate sizes and powers. Type I error was calculated at the α = 0.05 level. Statistical power was calculated as the percentage of pvalues less than 0.05 among 1000 simulations.
Multiple gene sets
Gene set enrichment analysis (GSEA) is a widely used approach to study the enrichment of a gene set in a large number of genes, which often consists of multiple gene sets. The null hypothesis hence corresponds to the competitive null hypothesis [10]. To compare the performance of our proposed method TEGS with GSEA, we set up a simulation study involving multiple gene sets. The configuration is as follows:
• Setting 1: We set n = 20 and the number of gene sets be 20. Ten gene sets have p = 10 genes (gene sets #110). Among them, six gene sets are under the null and four gene sets are under the alternative. The other ten gene sets have p = 40 genes per gene set (gene sets #1120). Among them, six gene sets are under the null and four gene sets are under the alternative. Among the gene sets under the alternatives, we allowed some genes to have no effects (i.e., those with β_{j} = 0), and varied the number of signal genes (i.e., those with β_{j} ≠ 0). The number and magnitude of nonzero β’s or each of the gene sets under the alternative hypothesis are given in the top of Table 1. This setting has a total of 500 genes, with the total number of signal genes equal to 104 spreading across 8 gene sets and the total number of null genes equal to 396. We assumed in this setting the 20 gene sets were uncorrelated. Within each gene set, we assumed the genes were correlated with the two factor covariance matrix: .
• Setting 2: This setting is identical to Setting 1 except that we allowed correlation among the gene sets: gene sets #13, #46, #79, #1113, #1416 and #1719 are correlated. The correlation structures are estimated by two factor covariance from the sample covariances of the gene sets with p = 30 and 120 in the diabetes dataset. We marked correlated gene sets in Table 1.
• Setting 3: This setting is identical to Setting 2 except that we added additional 4500 null genes to the 500 genes in the 20 gene sets. This setting mimics more practical gene expression studies. This gives a total of 5000 genes, with 104 signal genes spreading across 8 gene sets and 4896 null genes. Among the 20 gene sets, same as before, there are 8 gene sets under the alternative and 12 null gene sets.
Table 1. The simulation results comparing size and power of TEGS and GSEA
For each setting, we applied TEGS and GSEA to each of the 20 gene sets to compare size and power.
Application: reanalysis of Type II Diabetes data
We applied the proposed method to analyze the Type II Diabetes gene expression data, which were previously analyzed by Mootha et al. (2003) using GSEA to study for the pathway effects. The original data have three patient groups: normal glucose tolerance, impaired glucose tolerance, and Type II Diabetes. To illustrate our method and compare it with GSEA, we restricted our analysis to two groups: 17 patients with normal glucose tolerance and 17 patients with Type II Diabetes. A total of 124 out of 149 gene sets were analyzed here. We excluded 25 small gene sets, which have less than four probes.
We performed TEGS assuming five different working covariances, including independence, unstructured covariance, factor analysis covariance using two factors and the number of factors that explain up to 80% variability, and compound symmetry covariance. We calculated the pvalues using permutation and the Satterthwaite method described in Section Null distribution of TEGS. The number of permutations for each gene set was 2000. The working independence TEGS corresponds to the global test [4]. We compared the performance of TEGS with GSEA. The q value, an index measuring the false discovery rate (FDR) [21,22], was used to adjust for multiple comparisons.
Results
Simulation study
Single gene set
Four true covariances were considered in the simulations: compound symmetry, AR1, two factor, and unstructured covariance. The results are presented in Figures 1, 2, 3 and 4. For each true covariance, we compared the powers of TEGS assuming the true covariance and five different working covariances.
Type I error rate is well protected at the size of 5% in all the settings with different approximation methods. For lower levels of the size (0.5% and 0.05%), different approximation methods perform well when using the true covariance (see Additional file 1: Table S1). For different working correlations, the permutation method protects the type I error at 0.5% and 0.05% where the type I error rate using the Satterthwaite approximation is inflated at 0.5% and 0.05% and that using the normal mixture approximation is wellprotected at 0.5% but inflated at 0.05%.
Additional file 1. Supplementary results.
Format: PDF Size: 493KB Download file
This file can be viewed with: Adobe Acrobat Reader
In all settings, TEGS assuming the true covariance (the black solid line) has the best power, while TEGS assuming independence among the genes (the black dash line), has the lowest power. TEGS calculated by accounting for withingene set correlation using an estimated working covariance is less powerful than that assuming the true covariance, but more powerful than TEGS assuming independence among the genes. As TEGS assuming independence among the genes is the same as the global test the results indicate that TEGS accounting for correlation among genes in a gene set is more powerful than the global test. A comparison of the top panel and the bottom panel in Figures 1 and 2 shows that the power gain of TEGS accounting for correlation among genes over the global test is more pronounced when the correlation is stronger.
When the working covariance structure is correctly specified, TEGS using the estimated covariance has the power closest to that using the true covariance. For example, when the true covariance is compound symmetry, TEGS assuming the compound symmetry structure with the constant pairwise covariance estimated from the data has the power curve closest to that assuming the true compound symmetry covariance with pairwise covariance equal to 0.1 or 0.5.
When the true covariance is the sample covariance of MAP00240_Pyrimidine_metabolism estimated from the diabetes data (Figure 4), TEGS obtained by estimating the covariance matrix using any of the working covariance matrices gives similar results, all outperforming TEGS assuming working independence among the genes (i.e., the global test). In all settings, TEGS assuming two factor analysis (F2) and adaptive factor analysis (Fadpt) have most robust performance, and give powers that are closest to TEGS assuming the true covariance structure. The simulation results for the setting with n = 50 and p = 10 show similar patterns to those with n = 20 and p = 40, and are provided in the Additional file 1.
Multiple gene sets
Table 1 compares the performance of TEGS and GSEA in settings 2 and 3. The results in Table 1 show that the number of signal genes not in the gene set affects the type I error rate and power of GSEA, but does not affect TEGS. For example, when the total number of genes is 500, the size of GSEA for testing a null gene set is somewhat conservative, and less than the nominal size 0.05. As the total number of genes increased to 5000 with much more null genes, the size of GSEA for testing a null gene set became closer to 0.05.
A comparison of the powers of the 8 gene sets under the alternative show that TEGS has better power than GSEA. When the number of genes increases from 500 to 5000 with the number of signal genes remaining the same, i.e., increasing the number of null genes, the power of GSEA for testing the effect of a gene set does not improve much. The reason can be explained as below. GSEA tests for competitive null hypothesis. For a given gene set, say gene set 1, when the 4500 null genes are added to the set of 500 genes, the proportion of signal genes in gene set 1 remains the same (4/10 = 40%), while the proportion of signal genes not in the gene set decreases from 100/490 = 20% to 100/4990 = 2%. Although the difference of the proportions of signal genes in gene set 1 and not in gene set 1 becomes bigger, as the size of gene set 1 remains the same as 10, the pvalue using the KolmogorovSmirnov test does not change much. Note that as TEGS tests for a selfcontained null hypothesis [10], its power remains the same as the total number of genes increases from 500 to 5000.
The power of TEGS increases quickly with the average effect size of a gene set, , while the power using GSEA improves slightly. This is because GSEA assesses whether the genes in a gene set are enriched towards the top in a list of all the genes, where individual genes are ranked by their pvalues. Hence the difference between the proportion of signal genes in a given gene set and the proportion of signal genes not in the gene set affect the pvalues calculated using GSEA, while the magnitudes of the signal genes have limited impact.
A comparison of the results of gene sets of size 10 with those of size 40 in Table 1 shows that the size of a gene set, p, affects TEGS and GSEA differently. A smaller gene set, e.g., p = 10, is less likely to be identified as significant using GSEA. However, the effect of the size of a gene set on TEGS is smaller. We report in Additional file 1: Table S2. The results when gene sets are independent.
To run one simulation data generated in the setting 2 (500 genes) on a desktop with 2.53 GHz CPU, the computation time of TEGS and GSEA (both with 200 permutations) is about 5.6 and 4 seconds, respectively. For the setting 3 (5000 genes), the computation time of TEGS and GSEA is about 60 and 14 seconds, respectively. The most computation burden in TEGS is to invert the working covariance in each permutation ( in (5)). Thus, the analysis with larger gene sets can cost much more computation.
Application: reanalysis of Type II Diabetes data
TEGS assuming independence among the genes identified 5 and 8 gene sets with p values less than 0.05 using permutation and Satterthwaite methods, respectively. The corresponding numbers of gene sets were 13 and 14; 15 and 14; and 9 and 10 using TEGS by estimating the covariance assuming the two factor analysis covariance, the adaptive factor analysis covariance, and the unstructured sample covariance. GSEA identified 4 differentially expressed gene sets. The overlapping numbers of differentially expressed gene sets between TEGS using the four working covariances and GSEA are shown in Table 2. TEGS assuming adaptive factor analysis covariance identified 10 gene sets with FDR less than 0.1 and 20 gene sets with FDR less than 0.15.
Table 2. Results of reanalyses of 124 gene sets in Type II Diabetes data
The gene set MAP00252_Alanine_and_aspartate_metabolism was identified as differentially expressed between Type II Diabetes patients and those with normal glucose tolerance: pvalue=0.006 using TEGS assuming adaptivefactor covariance, p = 0.003 using TEGS with exchangeable covariance, pvalue = 0.005 with TEGS with independence covariance and 0.054 using TEGS with unstructured covariance; p = 0.028 using GSEA. Figure 5a shows that five genes in this gene set were differentially expressed based on singlegene analysis with the ttest. The heatmap in Figure 5a also show that diabetes patients have higher expression in the upper two third of the genes but lower expression in the lower one third. Another interesting gene set we identified is MAP00531_Glycosaminoglycan_degradation (Figure 5b), which was found statistically significant using TEGS with different working correlations: pvalues are 0.00034 (adaptivefactor covariance), 0.0032 (unstructured), 0.021 (independence) and 0.022 (exchangeable), but was not significant using GSEA: pvalue=0.39.
Figure 5. Gene expression values in two gene sets, (a) MAP00252_Alanine_and_aspartate_metabolism and (b) MAP00531_Glycosaminoglycan_degradation. The main plot is the expression values of genes standardized by the means and standard deviations. Each column represents a patient labeled by either “N” (normal glucose tolerance) or “D” (Type II Diabetes). The right panel represents the p values from ttest of single gene analysis where the dashed line indicates p = 0.05 and the dotted line indicates p = 0.2.
Discussion
The power of TEGS is improved by accounting for the correlation among genes within the gene set, especially when the working covariance gets closer to the true covariance, and outperforms the TEGS assuming working independence. We have also shown that the TEGS with working independence among genes in a gene set corresponds to the global test proposed by Goeman, et al (2004). Our numerical studies show that the TEGS assuming two factors or adaptive factor covariance matrix overall works well in practice for difference true covariance structures, especially when the number of genes p is larger than sample size n.
We have compared the performance of TEGS with GSEA. Both tests borrow information across multiple genes in a gene set and are hence beneficial when a gene set has multiple signal genes with modest effects. TEGS and GSEA differ in several aspects. The TEGS statistic is constructed by accounting for correlation among genes in a gene set, while GSEA uses individual gene pvalues to calculate the KolmogorovSmirnov test, which ignores the within gene set correlation. TEGS considers the selfcontained null hypothesis, while GSEA considers the competitive null hypothesis [10]. GSEA studies the enrichment of genes in a gene set by testing the relative rankings of the genes in a gene set among all the genes under investigation. GSEA hence is influenced by the size of a gene set, the proportion of signal genes in the gene set, and the proportion of signal genes not in the gene set. When the proportion of signal genes in a gene set is much larger relative to that not in the gene set, GSEA performs well. We note that GSEA has difficulties in capturing a differentially expressed gene set when the number of genes containing true effects is small in a gene set even if the effects of these signal genes are strong. When the size of a gene set is small/modest, the power of GSEA does not increase much when the number of null genes not in the gene set increases or when the sample size increases. As TEGS considers a selfcontained null hypothesis, i.e., testing whether a gene set is differentially expressed, it is not affected by the behavior of the genes not in the gene set. TEGS improves power when sample size increases or the magnitudes of signal genes in a gene set increase. However, TEGS does not directly compare a gene set with other gene sets, although one can rank gene sets using pvalues calculated using TEGS. Our numerical results show that TEGS outperforms GSEA in terms of size and power, although the powers are not directly comparable as they test for different null hypotheses.
Due to the nature of the null hypothesis we specified, it is possible that a significant gene set from our proposed testing procedure is driven by one or two very significant genes, which is less likely to occur in GSEA. There are several possible ways to guard against this. For example, after a gene set is identified to be significant, one can perform single gene analyses to further characterize how the signals are distributed within the gene set. Or, one can use the same multivariate model as (1) to estimate and test the association of each gene with the phenotype using the ridge regression.
TEGS is not limited to testing mRNA expression in biological pathways/networks. It can also be applied for testing the effects of other genomic markers, such as DNA copy numbers, RNA or protein expressions, and DNA methylations in a genomic region or a functional set.
Conclusions
We have proposed in this paper a new method for the gene set analysis, TEGS. By introducing a working covariance, TEGS directly models the interdependence of the expression values in a gene set, the most important feature of biological pathways or gene sets that is often overlooked in existing methods. TEGS incorporates information from multiple genes in a gene set through the working covariance and thus outperforms two widely used approach, GSEA and global test in simulation studies and a diabetes microarray data.
Competing interests
Both authors declare that they have no competing interests.
Authors’ contributions
YTH developed the statistical methodology, designed and implemented the methods in analyzing microarray data and simulation data, and wrote the manuscript. XL developed the methodology and helped to draft and revise the masnucript. Both authors read and approved the final manuscript.
Acknowledgements
Xihong Lin is supported by grants from the National Cancer Institute (R37CA076404 and P01CA134294).
References

EmmertStreib F, Glazko GV: Pathway analysis of expression data: deciphering functional building blocks of complex diseases.

Draghici S: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discoverying statistically significant pathways in expression profiling studies.
Proc Natl Acad Sci 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lu Y, Liu PY, Xiao P, Deng HW: Hotelling’s T2 multivariate profiling for detecting differential expression in microarrays.
Bioinformatics 2005, 21:31053113. PubMed Abstract  Publisher Full Text

Klebanov L: A multivariate approach for integrating genomewide expression data and biological knowledge.

Goeman J, van der Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome.
Bioinformatics 2004, 20:9399. PubMed Abstract  Publisher Full Text

Mootha V, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly M, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci 2005, 102:1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R: On testing for the significance of sets of genes.
Ann Appl Stati 2007, 1:107129. Publisher Full Text

Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues.
Bioinformatics 2007, 23:980. PubMed Abstract  Publisher Full Text

Glazko GV, EmmertStreib F: Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets.
Bioinformatics 2009, 25:23482354. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Laird NM, Ware JH: Randomeffects models for longitudinal data.
Biometrics 1982, 38:963974. PubMed Abstract  Publisher Full Text

Lin X: Variance component testing in generalised linear models with random effects.

Liang K, Zeger S: Longitudinal data analysis using generalized linear models.

Breslow NE, Clayton DG: Approximate inference in generalized linear mixed models.

Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare variant association testing for sequencing data using the Sequence Kernel Association Test (SKAT).
Am J Hum Genet 2011, 89:8293. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu D, Ghosh D, Lin X: Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed model.
BMC Bioinformatics 2008, 9:292. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lin D: An efficient Monte Carlo approach to assessing statistical significance in genomic studies.

Satterthwaite F: An Approximate Distribution of Estimates of Variance Components.
Biometrics Bull 1946, 2:110114. Publisher Full Text

Cai T, Lin X, Carroll R: Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test.
Biostatistics 2012.
13:776–790.

Storey J: A direct approach to false discovery rates.
J R Stat Society, Series B 2002, 64:479498. Publisher Full Text

Storey J: The positive false discovery rate: a Bayesian interpretation and the qvalue.
Ann Stat 2003, 31:20132035. Publisher Full Text