Abstract
Background
In studies that use DNA arrays to assess changes in gene expression, our goal is to evaluate the statistical significance of treatments on sets of genes. Genes can be grouped by a molecular function, a biological process, or a cellular component, e.g., gene ontology (GO) terms. The meaning of an affected GO group is often clearer than interpretations arising from a list of the statistically significant genes.
Results
Computer simulations demonstrated that correlations among genes invalidate many statistical methods that are commonly used to assign significance to GO terms. Ignoring these correlations overstates the statistical significance. Metaanalysis methods for combining pvalues were modified to adjust for correlation. One of these methods is elaborated in the context of a comparison between two treatments. The form of the correlation adjustment depends upon the alternative hypothesis.
Conclusion
Reliable corrections for the effect of correlations among genes on the significance level of a GO term can be constructed for an alternative hypothesis where all transcripts in the GO term increase (decrease) in response to treatment. For general alternatives, which allow some transcripts to increase and others to decrease, the bias of naïve significance calculations can be greatly decreased although not eliminated.
Introduction
The purpose of this work is to evaluate the statistical significance of treatments on the expressions in subsets of the genes on an array; for example, sets defined by gene ontology terms (GO terms, http://www.geneontology.org webcite). GO terms group genes according to a biological process, molecular function, or cellular component. Inferences about the impact of a treatment are usually more straightforward when based on GO terms or equivalent groupings as opposed to lists of significant genes. Hence, we want to assess the statistical significance of the treatments on the group.
mRNA levels measured among genes within a GO group will be correlated. Correlations among genes involved in a common biological task are likely, and correlations are also expected simply because the set of expressions are measured within an animal and array, i.e., under shared conditions. The pvalues computed in many packages[1,2] assume independence and therefore could be misleading.
The statistical significance of a GO group is commonly assessed by counting the number of statistically significant genes in the group. The null hypothesis that this count is a random sample of the significant genes on the array is tested versus an alternative hypothesis that the count is enriched [13]. The test is Fisher's exact test (probabilities computed using a hypergeometric distribution) or one of its many approximations (e.g., chisquared test which approximates the hypergeometric distribution with a binomial distribution) [4].
We do not test for enrichment primarily because the null distribution depends upon effects among interrogated genes that are unrelated to genes in the evaluated GO term. Put another way, the significance of a GO term should not depend upon whether or not other genes on the array (not in GO term) were affected by the treatments. This approach is not amenable to corrections for correlations among the pvalues, since the test inherently assumes exchangeability among genes; an assumption which is not met under arbitrary correlation structures. Furthermore, the usual implementation simply counts 'significant' genes which precludes extracting supporting evidence from the 'not significant' genes.
The distribution of pvalues for the individual genes on the array allows one to estimate the number of affected genes, and this estimate typically is larger than any list of 'significant' genes, which can be compiled with an acceptably low rate of misclassification [57]. Conceptually, methods that collate individual pvalues within biologically meaningful groups can extract any supporting evidence for treatment effects from group members that individually cannot be identified as affected by the treatments.
We prefer an approach that is based on a pvalue having a uniform distribution under the null hypothesis [8]. For any continuous probability distribution, x = F^{1}(1p) transforms p to the distribution specified by F. Two choices for F, which are in common use for combing pvalues, are the standard normal distribution, Φ(x), and the chisquare distribution with 2 degrees of freedom, F(x) = 1  exp(x/2). When pvalues are independent, the distribution of the sum is straightforward for either normal or chisquared deviates, and serves as a basis for testing the significance of a GO term. This is elaborated for the normal distribution in the next section, where we also develop a correction for correlations. The test based on the chisquared deviate can also be corrected for correlation at least in a 'onesided' case [912].
Randomization tests can deal with correlations among the genes in a GO term, and, when applicable, they should generate uniformly distributed null pvalues [1315]. Our studies usually process samples in batches and the estimation of treatment differences is essentially done within batches [6,1618]. Such analyses usually cannot be implemented as randomization tests and our motivation to develop the presented methods is largely in this context. We will present adjustments for a onesample ttest because the results are transferable to pairwise contrasts in a fixedeffects linear model. While the pairwise contrasts are our real concern, their presentation carries algebraic baggage that is largely irrelevant to correcting for correlation and we have opted for streamlined notation, which focused on the fundamental issues in correcting for correlation. A onesample ttest can also be implemented as a randomization test, which suggests presenting it as a competitor. The randomization test is constructed to have a uniform distribution under the null although for sample sizes as small as n = 5 the number of possible permutations may be a complicating factor. When both methods are applicable, we would not choose between them based upon their abilities under the null distribution but rather on a consideration of their power and/or robustness, which is well beyond the scope of this paper.
We will assume that the data are n observations coming from a population with mean vector, Δ and covariance matrix, Σ. Although simple, this model is directly applicable to some of our studies, and it serves here to focus on basic issues with relatively simple mathematics. For an example where this model can be used directly, see the test for gender differences in gene expression as estimated in Delongchamp et al. (2005)[16].
Methods
Metaanalyses
Several methods are used in metaanalyses to combine a set of pvalues into an overall significance level. Under a null hypothesis, the pvalue for a corresponding statistic is a random variable with a uniform distribution and it can be transformed to a convenient probability distribution [8]. Here, we use the inverse of the standard normal distribution. Then
z_{i }= Φ^{1}(1  p_{i})
is a random variable from the standard normal distribution, and when the set of pvalues, {p_{i }: i = 1,...,m}, are also independent, the statistic,
where _{m}1_{1 }= (1,1,...,1)', also has a standard normal distribution. So, the pvalue,
gives an overall significance level for the set. We refer to this as the naïve estimate because it naively assumes that covariance of z is the identity matrix, cov(z) = I.
Adjusting for correlation
In studies which use DNA arrays, the expressions measured on an array will be correlated and these correlations imply that the set of pvalues, {p_{i }: i = 1,...,m}, are not independent. If the covariance of z is known, it is straightforward to modify the statistic so that it accommodates correlations. Suppose that cov(z) = R, then the variance of 1'z is 1'R1 and the appropriate pvalue is
Note that the naïve estimator takes R = I implying no correlations among the m pvalues.
When R is unknown, it must be estimated. In this context, it is useful to 'adjust' the variance of the naïve estimator. Let be the average value of the offdiagonal elements of R, i.e., = (1'R1  m)/(m(m  1)), then the implied adjustment is
That is,
The correction for Ponly depends on the average correlation, , and not on the individual correlations. We believe that this allows one to generate an acceptable estimate in small data sets even though the individual correlations will be poorly estimated.
A couple of insights can be drawn from Equation (1). Since the adjustment is less than 1 when > 0; a significance level based upon the naïve version is too small. Further, the need for adjustment increases with the size of the subset, i.e., m.
Estimating a correction for correlation
Continuing with the data model described in the Introduction, the mean of n observations is assumed to have a multivariate normal distribution with mean vector, Δ, and covariance matrix, ; that is,
Let
• S estimate Σ;
• D be a diagonal matrix; ; estimated by
The elementwise tstatistic for the null hypothesis, Δ = 0, can be written in vectorial form as
Since is a consistent estimate of D, the ttest approximates the ztest for large n, i.e., as n increases. The cov(z) = DΣD, which is estimated by S.
Then for a onesided pvalue with an increasing alternative, t → z implies that the appropriate correlation matrix in Equation 1 is R = DΣD, which is approximated by = Σ. Note that for a decreasing alternative, the same correlation applies for z = Φ^{1 }(p).
ttest
The onesided pvalue for a null hypothesis on Δ is based on the distribution of the statistic, . Let and u_{j }= ay_{j}, and note that
Further, the variance of satisfies
So
This provides an alternative way to compute Equation (1), which has some advantages. First, one only needs to know , which is computed in the bygene analyses. Second, the bygene analysis applied to {u_{j }: j = 1,...,n} computes the statistic and its pvalue directly. So algorithms are simpler. Finally, the pvalue is based on a tdistribution, which better reflects the effect of small samples.
Morrison [19] derives Hotelling's T^{2 }test in the context of tstatistics of linear combinations, and presumably such statistics have a history nearly as long as multivariate statistics. O'Brien [20] examined this specific statistic as a method for comparing multiple endpoints in clinical trials. Lauter [21] noted that the null distribution is not the tdistribution with small sample sizes and explained a modified statistic, which corrects this. While the modified statistic controls the Type 1 error, the modification seems to reduce the power relative to O'Brien's statistic (simulations not reported).
One versus twosided tests
So far, we have been discussing the case where p_{i }is a onesided pvalue. Essentially, Equation 1 applies whether p_{i }is a pvalue from a onesided test or a twosided test. However, the covariance of z is not the same as the covariance of its absolute value, z. Consequently a twosided test in which , needs a different adjustment than a onesided test in which p = 1  Φ(z).
For a twosided test, the null distribution of 1'z can be generated through Monte Carlo samples from the null distribution of z, MVN(0, cov(z)). That is, let z_{1},...,z_{k }be pseudorandom samples from the multivariate normal distribution, MVN(0, cov(z)), and directly compute the pvalue for the observed value, ψ = 1'z, as
where I(A) is an indicator function which gives 1 if A is true, or 0 otherwise. In simulations where Σ is specified, the adjustment can be implemented based upon R, or I. In the twosided case, it is also of interest to generate samples from MVN(0, cov(z)) where is computed from and . In theory, adjustments based upon R correct the pvalue for correlations, and for large enough n so will and . In practice, or must be useful when n is quite small. Their utility in this regard can be illustrated with simulated data.
Results
Simulation
The theory outlined in the previous sections provides adjustments which will work well for large sample sizes. It does not guarantee much when sample sizes are as small as is seen in most studies that use DNA arrays. We simulated a few 'representative' cases to illustrate that P computed from the naïve statistic can be very inaccurate and to demonstrate that the adjustments proposed herein give useful corrections with small sample sizes.
The 'representative' simulations assumed a covariance matrix, Σ, for m = 20 correlated genes. We constructed Σ = D^{1}RD^{1 }by specifying R and D. The correlation matrix, R, is given in Table 1, ≈ 0.45. These correlations were randomly selected to be between 0.35 and 0.55. Correlations in this range are commonly observed in our studies. Likewise, we randomly selected 20 variances in the range typical of our studies (Table 2). For a given sample size, n, pseudorandom samples y_{1},y_{2},...,y_{n }were generated from a multivariate normal distribution, MVN(0,Σ), and P was computed using Equations 1 or 2. Since Σ is known, these computations can be based upon R, , or I. This procedure was iterated at least 10,000 times to observe enough Pvalues to adequately estimate the empirical distribution.
Table 1. Correlation matrix, R, used in the simulations. The representative correlations were randomly selected to be between 0.35 and 0.55. These correlations in this range are commonly observed in our studies.
Table 2. Variances used in the simulations. 20 variances are randomly selected in the range typical of our studies.
Figure 1 plots the cumulative distributions of Pvalues from a oneside test with n = 5. Ideally the null pvalues follow a uniform distribution, the diagonal line in this figure. The naïve Pvalue (cyan line) grossly overstates the significance of the test statistic with roughly 30% of these Pvalues being less than 0.05. The Pvalues (red line) computed using Equation 1 with the true correlation, R, have the expected distribution; the observed departures from the diagonal are within the variation associated with estimating the uniform distribution by an empirical distribution (KolmogorovSmirnov test, p = 0.21). The corrected pvalues based on the ttest method (blue line) and the estimated correlation, (green line), are near the diagonal albeit not perfect. However, they offer a big improvement over naïve values.
Figure 1. Cumulative distribution of pvalues for onesided test case with sample size n = 5. The corrected pvalues based on a tdistribution (blue line) and the (green line) are near diagonal, while the naïve pvalue (cyan line) overstates the significance of the test. We can correct the problem when we know the true correlation, R between the genes in a GO term (red line).
The corrections are expected to improve with larger sample sizes, and this is illustrated for n = 15 in Figure 2 where all of the correction methods are fairly accurate. Note also that the distribution for the naïve Pvalue does not respond to increasing sample size and remains very inaccurate. Figures 1 and 2 were generated from 10,000 iterations. Essentially, the empirical distributions for the corrections plotted in Figure 2 are not statistically different, and discrepancies with our expectations most likely reflect the variability among the plotted distributions. For example, we would expect the ttest version to be at least as good as the correction based on , and this is not apparent in Figure 2.
Figure 2. Cumulative distribution of pvalues for onesided test case with sample size n = 15. The corrections become better with a larger sample size. All of the correction methods give right adjustment as the prediction of the correlation becomes fairly accurate.
In practice we use the ttest correction because it is easiest to compute. Figure 3 plots the interval, 0.0001 to 0.1, for the ttest correction where the empirical distributions were estimated from 1 million iterations. This figure illustrates the convergence toward the diagonal as sample sizes increase. Even at n = 5, the accuracy is acceptable for our purposes and represents a substantial improvement over the naïve calculation. For example, the nominal level of 0.05 would actually be 0.08; much closer to the nominal level than the naïve test at approximately 0.3 (Figure 1).
Figure 3. Cumulative distribution of pvalues for ttest correction with sample size n = 5, 10, 15, 20. The convergence toward the diagonal as sample sizes increase is illustrated by the pvalues in the interval, 0.0001 to 0.1, for the ttest correction where the empirical distributions were estimated from 1 million iterations.
Figure 4 and Figure 5 show the results for simulations of the twosided case, n = 5, 15 respectively. Clearly, the naïve Pvalues should not be used because they are inaccurate. The empirical distributions for the other methods in these figures involve considerable computation since each Pvalue is generated by Monte Carlo sampling. To speed up computations, we terminated Monte Carlo sampling when or k = 10^{6}. Consequently, small values of P are more accurately estimated than large values. In practice, we do not need to estimate large values with as much accuracy as small values so this is not a bad strategy. However, it makes it more difficult to evaluate whether the correction follows the diagonal. For small values, say < 0.1, the corrections are comparable in accuracy to those in the onesided simulations. The correction based upon (blue line) seems preferable since it appears to be more consistent with the correction based upon the true correlation, R (red line).
Figure 4. Cumulative distribution of pvalues for twosided test case with sample size n = 5. Pvalues calculated from random samples based on (blue line) gives a reliable correction, which is comparable with the pvalues from the true correlation R (red line).
Figure 5. Cumulative distribution of pvalues for twosided test case with sample size n = 15. A better correction is also seen with larger sample sizes in twosided case. In this case we have more accurate estimate of the true correlation R.
Discussion
Correlations among gene expressions within a GO term invalidate the computed pvalue when it is based upon assumed independence. Such estimates overstate the significance when the correlations are positive. This behavior was demonstrated analytically through a specific statistic, Equation (1), as well as through simulations. The simulated data show that the bias of the naïve pvalue can be substantial with moderate correlation.
For didactic reasons, we used a statistic and a scenario, which is mathematically tractable. However, it should be understood that overstating statistical significance is a problem for any statistic where the computed pvalue for the GO term assumes independence among gene expressions. This is true for widely implemented tests which evaluate if significant genes are 'overrepresented' within a GO term. In these tests, pvalues are based upon the hypergeometric distribution (Fisher's Exact Test) or its binomial or chisquared approximations, and an assumption of independence is essential to the construction of the null distribution. In addition to the presented statistic, there are other metaanalysis tests based upon a uniform distribution of null pvalues. A theorybased adjustment for correlation is difficult to construct for these tests. Naïve versions are biased, but not always as severe as the estimate presented here. So, we have been pursuing corrections for them.
For the illustrated statistic, the naïve pvalue is easy to correct when the correlation is known. In practice the correlation must be estimated from limited data. Under the onesample ttest scenario, we can estimate the applicable correlation. The simulation of a 'representative' group of 20 genes shows that estimating the correlation improves upon the naïve pvalue with as few as 5 samples.
In the onesided case, a tstatistic can be computed which implicitly adjusts for the presence of correlations. This statistic is easy to implement in existing computer programs; essentially the program that computed bygene pvalues can be used. This procedure can be extended to any statistical test that can be applied to the individual genes. As presented here, the onesided alternative specifies that all genes change in the same direction. It is trivial to apply the procedure for any prespecified direction of change for each gene. As our knowledge of expression profiles from responses to toxicity grows, this approach might become a standard test in screening chemicals for toxicity.
In an exploratory context, it is not possible to prespecify how individual genes will respond to treatment, and pvalues must reflect the twosided alternative. At least with the simulated data, the method worked well to control the size of the test. Because misrepresents the true correlation structure, we are cautious in recommending its general use and plan to simulate a broader set of scenarios.
Conclusion
Reliable corrections for the effect of correlations among genes on the significance level of a GO term can be constructed for a onesided alternative hypothesis. For general twosided alternatives the bias of naïve significance calculations can be greatly decreased although not eliminated.
Authors' contributions
RD conceived of the ideas presented herein, developed the main methodology and wrote the manuscript. TL conducted the simulations. CV reviewed literatures. TL and CV have been involved in development of the algorithm and drafting the manuscript. All authors read and approved the final manuscript.
Acknowledgements
TL was supported by an Oak Ridge Institute of Science and Education (ORISE) fellowship at NCTR.
References

Tong W, Harris S, Cao X, Fang H, Shi L, Sun H, Fuscoe J, Harris A, Hong H, Xie Q, Perkins R, Casciano D: Development of public toxicogenomics software for microarray data management and analysis.
Mutat Res 2004, 549:241253. PubMed Abstract  Publisher Full Text

Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 2005, 21:35873595. PubMed Abstract  Publisher Full Text

Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

Johnson NL, Kotz S: Discrete Distributions. New York: John Wiley & Sons; 1969.

Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data.
Computational Statistics and Data Analysis 2002, 39:120. Publisher Full Text

Delongchamp RR, Bowyer JF, Chen J, Kodell RL: Multipletesting strategy for analyzing cDNA array data on gene expression.
Biometrics 2004, 60:774782. PubMed Abstract  Publisher Full Text

Schweder T, Spjotvoll E: Plots of pvalues to evaluate many tests simultaneously.
Biometrika 1982, 69:493502. Publisher Full Text

Hedges LV, Olkin I: Statistical Method for MetaAnalysis. Academic Press; 1985.

Brown MB: A method for combining nonindependent, onesided tests of significance.
Biometrics 1975, 31:987992. Publisher Full Text

Kost JT, McDermott MP: Combining dependent pvalues.
Statistics & Probability Letters 2002, 60:183190. Publisher Full Text

Xu X, Tian L, Wei LJ: Combining dependent tests for linkage or association across multiple phenotypic traits.
Biostatistics 2003, 4:223229. PubMed Abstract  Publisher Full Text

Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining p values.
Genetic Epidemiology 2002, 22:170185. PubMed Abstract  Publisher Full Text

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.
PNAS 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

BRB ArrayTools User's Guide, Version 3.3: (National Cancer Institute Biometric Research Branch, Bethesda, MD) Technical Report. 2005., 28

Delongchamp RR, Velasco C, Dial S, Harris AJ: Genomewide estimation of gender differences in the gene expression of human livers: statistical design and analysis.
BMC Bioinformatics 2005, 6(Suppl 2):S13. PubMed Abstract  BioMed Central Full Text

Desai VG, Moland CL, Branham WS, Delongchamp RR, Fang H, Duffy PH, Peterson CA, Beggs ML, Fuscoe JC: Changes in expression level of genes as a function of time of day in the liver of rats.
Mutation Research – Fundamental & Molecular Mechanisms of Mutagenesis 2004, 549:115129. Publisher Full Text

Parrish RS, Delongchamp RR: Normalization. In DNA Microarrays and Statistical Genomic Techniques: Design, Analysis, and Interpretation of Experiments. Edited by Allison DB, Page GP, Beasley TM, Edwards JW. Boca Raton, FL: Chapman & Hall/CRC; 2005:928.

Morrison DF: Multivariate Statistical Methods. New York: McGrawHill Book Company; 1967.

O'Brien PC: Procedures for comparing samples with multiple endpoints.
Biometrics 1984, 40:10791087. PubMed Abstract  Publisher Full Text

Lauter J: Exact t and F tests for analyzing studies with multiple endpoints.
Biometrics 1996, 52:964970. Publisher Full Text