Department of Environmental Sciences & Engineering, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA

Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Centers for Environmental Bioinformatics and Computational Toxicology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Abstract

Background

Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon.

Results

We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data.

Conclusions

These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling-based gene set testing criteria in the peer-reviewed biomedical literature.

Background

Methods for statistical analysis of gene expression microarrays are maturing rapidly, and there are a variety of approaches to normalization, detection of differential expression, clustering, and class prediction

The simplest approach to gene set testing relies on 2 × 2 tables of gene set membership (in gene set or not) vs. significance (significant or not). Gene set testing is often performed using a χ² or Fisher's exact test.
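To make the 2 × 2 table approach concrete, here is a minimal sketch of both tests using only the Python standard library. The function names and the toy counts are illustrative assumptions, not taken from the study. The table is laid out as [[a, b], [c, d]], with rows for set membership and columns for significance.

```python
from math import comb, erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction) and its
    p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2.0))  # upper tail of a chi-square with 1 df
    return stat, p

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value, P(X >= a), under the hypergeometric null."""
    n = a + b + c + d
    in_set, n_sig = a + b, a + c
    upper = min(in_set, n_sig)
    tail = sum(comb(in_set, k) * comb(n - in_set, n_sig - k)
               for k in range(a, upper + 1))
    return tail / comb(n, n_sig)

# Hypothetical counts: 1000 genes, 100 significant, a 50-gene set with 15 significant.
stat, chi2_p = chi2_2x2(15, 35, 85, 865)
fisher_p = fisher_enrichment_p(15, 35, 85, 865)
print(f"chi-square = {stat:.2f}, p = {chi2_p:.2e}; Fisher p = {fisher_p:.2e}")
```

Both p-values here are computed as if the genes were independent draws, which is exactly the assumption examined in the rest of the article.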

**Publications that assume independence between genes (light grey) greatly outnumber publications that use array resampling methods (dark grey)**. Panel (a) shows the cumulative number of publications and panel (b) shows the number of publications using each method per year. Year of publication is displayed on the horizontal axis.

As the independence assumption is clearly violated for gene expression microarrays

Here, we clearly demonstrate that the correlation between genes inflates the false positive rate, that the magnitude of this inflation is quite high, and that a simple resampling method for gene set testing produces the correct false positive rate. We use a large number (over 200) of real experimental datasets, not simulated data, to show that 1) inter-gene correlation is a pervasive feature of gene sets, regardless of experimental condition, 2) inter-gene correlation inflates the apparent significance of gene set statistics, leading to an increase in false positives, and 3) array resampling approaches can correctly address the problem of inter-gene correlation. As there are several existing tools that use resampling approaches to determine functional enrichment among significant gene lists, we argue that the naïve approaches should no longer be used and should be replaced by tools employing a resampling approach.

Results

We investigated the degree to which correlation within gene sets is preserved across multiple experiments by analyzing Gene Expression Omnibus

**The inter-gene correlation within GO categories is consistent across experiments and platforms**. Mean correlation among genes in GO categories (a,b) and KEGG pathways (c,d) on two human (a,c) and two mouse (b,d) microarray platforms. The correlation of all transcripts with all transcripts on each platform is shown in red. Spearman correlations of the mean correlations are shown in the upper right of each panel. Crosses represent ± 1 standard error on each axis.

One simple way to describe the effect of correlation is through a standardized enrichment test statistic, taken here as the signed square root of the χ² statistic (see Methods). Under the null hypothesis (no enrichment),

To empirically demonstrate that variance inflation results in the false significance of gene sets, we randomly permuted the sample labels associated with each of the 202 Gene Expression Omnibus data sets 10,000 times, and then performed a gene set analysis on each permutation using a common independence assumption method. For each permutation, a list of "significant" genes was selected at a nominal significance threshold, and the χ² statistic was used to assess the significance of each gene set. A gene set was called significant if it had a Bonferroni-corrected (for the number of sets) χ² p-value that met the 0.05 threshold.

To determine the overall experiment-wise increase in false positives, we counted the number of permutations in which at least one gene set was declared significant, using the Bonferroni-corrected 0.05 threshold. If the independence assumptions were true, no more than 5% of the permutations would give rise to significant gene sets. However, the observed proportion is much higher (Figure
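The experiment-wise count described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `p_matrix` is a hypothetical nested list of independence-assumption p-values with one row per gene set and one column per permutation, and the comparison uses "less than or equal to", which is an assumption about how ties at the threshold are treated.

```python
def familywise_false_positive_rate(p_matrix, alpha=0.05):
    """Fraction of permutations in which at least one gene set passes the
    Bonferroni-corrected threshold alpha / (number of gene sets)."""
    n_sets = len(p_matrix)
    n_perm = len(p_matrix[0])
    threshold = alpha / n_sets
    hits = sum(
        1 for j in range(n_perm)
        if any(p_matrix[i][j] <= threshold for i in range(n_sets))
    )
    return hits / n_perm

# Hypothetical toy input: 2 gene sets, 4 permutations.
toy = [[0.01, 0.50, 0.20, 0.03],
       [0.90, 0.02, 0.60, 0.40]]
print(familywise_false_positive_rate(toy))
```

For a valid test, this fraction should stay at or below `alpha` when the labels are permuted.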

**False positive rates are greatly increased using independence assumption methods, GO Biological Process categories**. The proportion of permutations in which at least one GO Biological Process category is called significant using an independence assumption method with a Bonferroni correction (α = 0.05), the Benjamini & Hochberg FDR (α = 0.05, 0.10), and the resampling approach described in this manuscript. Red lines mark the 5% and 10% levels.

**Additional file 1, Figure S1**. False positive rates are greatly increased using independence assumption methods, KEGG pathways. The proportion of permutations in which at least one KEGG pathway is called significant using an independence assumption method with a Bonferroni correction (α = 0.05), the Benjamini & Hochberg FDR (α = 0.05, 0.10), and the resampling approach described in this manuscript.

**Additional file 1, Figure S2**. Variance inflation due to gene expression correlation increases the false positive rate, even when using a Bonferroni correction. The percentage of permutations in which at least one KEGG pathway was called significant is plotted versus the variance of the standardized gene set statistic (signed square root of the χ² statistic). Results are shown for two human (a,b) and two mouse (c,d) arrays.

**Additional file 1, Figure S3**. KEGG pathways that are called significant by chance under permutation are likely to be called significant in the observed data. The proportion of times that a KEGG pathway is declared significant under permutation is plotted versus the proportion of times it is called significant in the observed data.

**Additional file 1, Figure S4**. The variance of the gene set statistic (signed square root of χ² statistic) increases in proportion to the variance inflation factor (VIF = 1 + (

In addition, as predicted, we found that variance inflation is directly related to the false-positive rate for a gene set. For each GO Biological Process category, we calculated the variance of the gene set statistic (the signed square root of the χ² statistic) across all permutations, and plotted it against the number of permutations in which that gene set was called "significant" using a Bonferroni correction.
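The link between correlation and variance inflation can be illustrated with a small simulation. This is a simplified sketch and not the analysis performed in the study: it replaces the χ² statistic with the standardized sum of m equicorrelated, unit-variance gene scores, for which the variance is 1 + (m − 1)ρ, where ρ is the common pairwise correlation. The function names, the set size of 20, and ρ = 0.1 are all illustrative assumptions.

```python
import random
from statistics import pvariance

def vif(m, mean_rho):
    """Variance of the standardized sum of m unit-variance scores whose
    mean pairwise correlation is mean_rho."""
    return 1.0 + (m - 1) * mean_rho

def simulate_set_statistic(m, rho, n_sim, seed=0):
    """Draw n_sim null set statistics z = sum(x_i) / sqrt(m), corr(x_i, x_j) = rho."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    stats = []
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)  # common factor that induces the correlation
        total = sum(a * shared + b * rng.gauss(0.0, 1.0) for _ in range(m))
        stats.append(total / m ** 0.5)
    return stats

def tail_fraction(stats, cutoff=1.96):
    """Fraction of null statistics that a nominal two-sided 5% test would reject."""
    return sum(1 for z in stats if abs(z) > cutoff) / len(stats)

z_indep = simulate_set_statistic(m=20, rho=0.0, n_sim=20000)
z_corr = simulate_set_statistic(m=20, rho=0.1, n_sim=20000)
print("independent:", round(pvariance(z_indep), 2), round(tail_fraction(z_indep), 3))
print("correlated: ", round(pvariance(z_corr), 2), round(tail_fraction(z_corr), 3))
```

Even a modest mean correlation roughly triples the variance of the set statistic in this toy setting, so a cutoff calibrated for independent genes rejects far more often than its nominal rate.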

**Variance inflation due to correlation of gene expression increases the false positive rate, even when using a Bonferroni correction**. The percentage of permutations in which at least one GO Biological Process category was called significant is shown versus the variance of the gene set statistic, for two human (a,b) and two mouse (c,d) arrays. Spearman correlations are shown in the upper left of each panel.

Fortunately, the shortcomings of independence assumption methods can be addressed through the use of resampling-based methods. These methods use the resampled data to construct an empirical null distribution for the gene set test statistics, taking into account the correlation structure between genes and providing a more accurate assessment of statistical significance. Several existing tools use resampling-based approaches. We applied a resampling approach with the same χ² statistic as above. When the empirical distribution of test statistics under resampling was used to generate the gene set p-value, proper control of the false positive rate was obtained (Figure

While the analyses above clearly show that gene correlation increases the false positive rate for independence assumption methods, it may be tempting to argue that correlation does not affect false positive rates in real datasets, which presumably include true enrichment. To investigate this, we computed the proportion of observed datasets in which each gene set was declared significant by the independence assumption method, and compared these values to the previously generated proportions observed under random permutation. The results for GO Biological Process categories (Figure

**GO Biological Process categories that are called significant by chance under permutation are likely to be called significant in the observed data**. The proportion of times that a category is declared significant under permutation is plotted versus the proportion of times it is called significant in the observed data. Spearman correlations are shown in the upper left corner.

Discussion

The most straightforward strategy for gene expression analysis is to focus on individual genes for which expression differs among samples of interest. Such approaches often consider the genes selected based on significance thresholds, with the goal of predicting or finding associations with disease prognosis

Gene set testing, in which enrichment of significant differentially expressed genes is sought among gene sets, is by now a standard method to provide biological interpretation for gene expression data. Groups of genes with a common biological function, cellular localization, regulation, or chromosomal location may hold additional clues regarding underlying biology, or potentially improve prediction or classification. While many easy-to-use tools have been developed to facilitate pathway- or gene set-based analysis, inter-gene correlation within gene sets violates the independence assumption that underlies many gene set analysis methods. Indeed, our meta-analysis demonstrates that the correlation patterns persist across a wide variety of mouse and human experiments. Thus, we argue that correlation patterns should largely be viewed in terms of their effects on false positives, as we find no evidence that correlation within the gene set confers additional plausibility to a gene set finding.

Though it may be viewed as a technical matter, violation of the independence assumption is not a mere statistical detail. It is a tangible phenomenon that increases the chance of falsely declaring a gene set significant. The false positive rates established by our study are very high, sometimes an order of magnitude beyond the intended false positive rate. Array resampling methods correctly handle inter-gene correlation and are not subject to these high false positive rates. Barry et al.

Several methods have been proposed which use multivariate modeling of array data to perform gene set testing ^{2 }distribution. However, we found that 94% of permutations called at least one GO BP category significant, with a median of 10 categories called significant in each permutation, suggesting difficulties in proper control of the false positive rate. It is worth noting that use of empirical covariance estimates is extremely challenging, especially when gene sets are larger than the sample size. GlobalANCOVA

As the use of independence assumption methods in gene array studies is widespread, we conclude that published false gene set findings may be common. The precise impact of the overly optimistic statistical support for many gene set findings is difficult to assess, because we do not know the underlying truth. However, a basic requirement of a valid statistical test is that it controls false positives under the null hypothesis, and we have clearly demonstrated here that methods that rely on independence assumptions fail this requirement and are therefore invalid in this sense.

It should be acknowledged that very small microarray studies may have too few samples to support permutation. For such studies, a strong statistical result from an independence assumption-based method should be supported by corroborating biological evidence, and even then be interpreted with caution.

Conclusions

Here, we have demonstrated, using over 200 real experimental data sets, that independence assumption methods for gene set enrichment suffer from such a high false positive rate that they should not generally be used. Array resampling methods, which correctly control the false positive rate, should be used by investigators, incorporated into existing routines and workflows, and insisted upon by reviewers of scientific manuscripts. We strongly encourage the use of tools such as GSEA

Methods

Compilation and Categorization of Pathway Tools

Using the list of pathway tools compiled by

Gene Expression Omnibus Datasets

Using the GEOmetadb

**Additional file 2, Table S1**. HG-U95A array datasets. PubMed, GEO and other IDs and descriptions of the datasets used.

**Additional file 2, Table S2**. HG-U133A array datasets. PubMed, GEO and other IDs and descriptions of the datasets used.

**Additional file 2, Table S3**. mgu74a array datasets. PubMed, GEO and other IDs and descriptions of the datasets used.

**Additional file 2, Table S4**. moe430a array datasets. PubMed, GEO and other IDs and descriptions of the datasets used.

Correlation Analysis and Variance Inflation

For each array platform, the genes in each GO category from all ontologies and each KEGG pathway with between 5 and 5000 genes were collected. The mean Pearson correlation was taken as the mean pairwise correlation between all genes in the gene set, excluding the unit correlation of each gene with itself. For each gene set, the mean and standard error of the correlations in all datasets for a single platform were calculated. The variance inflation factor, used here only to provide motivation for our empirical results, is derived as follows. We construct the 2 × 2 table of gene significance (using a nominal threshold) vs. gene set membership. A standardized statistic, the signed square root of the χ² statistic, can be constructed from the single table entry for "significant and in the gene set," assuming that no substantial enrichment is expected in the remaining genes. Using

where

Independence Assumption Analysis

GO Biological Process categories and KEGG pathways with between 25 and 5000 genes were used to satisfy the assumptions of the χ² test. Based on the experimental annotation provided in Gene Expression Omnibus for each dataset, either a Student's t-test or analysis of variance (ANOVA) was carried out on each gene to select differentially expressed genes between the experimental classes provided in the annotation. Significant genes were selected using a nominal significance threshold. A one-sided χ² test (i.e. to test for enrichment only) was performed on each gene set, and significant gene sets were selected at a Bonferroni-corrected threshold of 0.05.

Permutation Analysis

For each dataset, either a Student's t-test or ANOVA was carried out on each gene to select differentially expressed genes between experimental classes at a nominal significance threshold, and the χ² test statistic was used for gene set testing. The sample labels on the arrays were then permuted, and the entire analysis was repeated 10,000 times, resulting in a matrix of gene set statistics with one row for each gene set and one column for each permutation.
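The permutation procedure can be sketched in a few functions. This is a minimal two-group illustration under several assumptions, not the authors' pipeline: genes are flagged when |t| exceeds a caller-supplied critical value `t_crit` (a stand-in for the nominal p-value threshold), ANOVA for more than two classes is omitted, and the gene set statistic is the plain two-sided Pearson χ² rather than an enrichment-only version. All function and parameter names are hypothetical.

```python
import random
from math import sqrt

def t_statistic(x, y):
    """Two-sample Student's t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled = (ssx + ssy) / (nx + ny - 2)
    if pooled == 0.0:
        return 0.0
    return (mx - my) / sqrt(pooled * (1.0 / nx + 1.0 / ny))

def significant_genes(expr, labels, t_crit):
    """Flag each gene (row of expr) whose |t| between the two classes exceeds t_crit."""
    flags = []
    for row in expr:
        g0 = [v for v, lab in zip(row, labels) if lab == 0]
        g1 = [v for v, lab in zip(row, labels) if lab == 1]
        flags.append(abs(t_statistic(g0, g1)) > t_crit)
    return flags

def chi2_set_statistic(flags, members):
    """Pearson chi-square statistic of the 2x2 table: set membership vs. significance."""
    n = len(flags)
    member_set = set(members)
    a = sum(1 for i in member_set if flags[i])  # in set, significant
    b = len(member_set) - a                     # in set, not significant
    c = sum(flags) - a                          # outside set, significant
    d = n - a - b - c                           # outside set, not significant
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def permutation_matrix(expr, labels, gene_sets, n_perm, t_crit, seed=0):
    """Return one row per gene set and one column per permutation of the labels."""
    rng = random.Random(seed)
    matrix = [[] for _ in gene_sets]
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # permute whole arrays, keeping gene-gene correlation intact
        flags = significant_genes(expr, perm, t_crit)
        for row, members in zip(matrix, gene_sets):
            row.append(chi2_set_statistic(flags, members))
    return matrix
```

Because the labels are shuffled across arrays rather than across genes, every permuted dataset retains the original correlation structure between genes, which is what lets the resulting matrix serve as a null reference.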

Resampling Analysis

The data were permuted as above and the permutation matrices containing gene set statistics were retained. Empirical p-values for each gene set were calculated as the proportion of permutations for which the χ² statistic was greater than or equal to the observed χ² statistic. A gene set was called significant if its Benjamini & Hochberg FDR adjusted
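The empirical p-value and the Benjamini & Hochberg adjustment described in this subsection can be sketched as follows. The function names are illustrative assumptions. The empirical p-value follows the definition given above, the proportion of permutation statistics greater than or equal to the observed one, and the adjustment is the standard step-up procedure.

```python
def empirical_p_value(observed, null_stats):
    """Proportion of permutation statistics greater than or equal to the observed one."""
    return sum(1 for s in null_stats if s >= observed) / len(null_stats)

def benjamini_hochberg(p_values):
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy input: observed statistics and permutation rows for 3 gene sets.
observed = [9.0, 2.0, 0.5]
null_rows = [[1.0, 0.2, 3.5, 0.1], [2.5, 0.3, 4.0, 1.0], [0.1, 0.9, 2.2, 0.6]]
p_emp = [empirical_p_value(o, row) for o, row in zip(observed, null_rows)]
print(p_emp, benjamini_hochberg(p_emp))
```

With a finite number of permutations the smallest attainable nonzero empirical p-value is 1 divided by the number of permutations, which is one reason a large permutation count is used.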

Abbreviations

GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; FDR: False Discovery Rate; ANOVA: Analysis of Variance

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DG, WB and FW participated in the design of the study; DG carried out the statistical analyses and drafted the manuscript; AN, FW, and IR were involved in the analysis of the data, drafting the manuscript and revising it critically for important intellectual content. All authors have given final approval of the version to be published.

Acknowledgements

Financial support for these studies was provided, in part, by grants from the National Institutes of Health (R01 ES015241) and the United States Environmental Protection Agency (RD833825; RD832720; and F08D20579). However, the research described in this article has not been subjected to each Agency's peer review and policy review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred. DG was also supported by the UNC Environmental Sciences & Engineering Interdisciplinary Fellowship.