Abstract
In studies that use DNA arrays to assess changes in gene expression, it is preferable to measure the significance of treatment effects on a group of genes from a pathway or functional category such as gene ontology terms (GO terms, http://www.geneontology.org webcite) because this facilitates the interpretation of effects and may markedly increase significance. A modified metaanalysis method to combine pvalues was developed to measure the significance of an overall treatment effect on such functionallydefined groups of genes, taking into account the correlation structure among genes. For hypothesis testing that allows gene expression to change in both directions, pvalues are calculated under the null distribution generated by a Monte Carlo method.
As a test of this procedure, we attempted to distinguish altered pathways in microarray studies performed with Mitochips, oligonucleotide microarrays specific to mitochondrial DNAencoded transcripts. We found that our analytic method improves the specificity of selection for altered pathways, due to incorporation of the intergene correlation structure in each pathway. It is thus a practical method to measure treatment effects on GO groups. In many actual applications, microarray experiments measure treatment effects under complicated design structures and with small sample sizes. For such applications to real data of limited statistical power, and also in computer simulations, we demonstrate that our method gives reasonable test results.
Introduction
The advent of DNA microarray technology has revolutionized genomic research and medicine because of its ability to simultaneously determine the expression levels of thousands of genes. However, the interpretation of large amounts of microarray gene expression data, and the ability to derive biologically meaningful conclusions from such data, have always been daunting tasks for statisticians. Because of the high volume and complex characteristics of microarray data, much of the initial work on their analysis has focused on development of data mining or data reduction methods to identify differentially expressed genes. Typically, the pvalue of a test statistic is calculated for each gene, the genes are ranked according to these pvalues, and a prespecified significance criterion, such as the false discovery rate, is used to determine a cutoff which creates a category of differentially expressed genes [13].
Attempts to interpret individual genes in a list of significant genes are demanding and laborious. Therefore, recent efforts have focused on discovery of biological pathways rather than on individual gene function. Gene ontology terms (GO terms, http://www.geneontology.org webcite) reflect gene groupings based on molecular function, biological process, or cellular structure/organelle. The interpretation of differentially expressed GO groups is generally simpler than the presentation of a list of statistically significant genes, and more resistant to erroneous conclusions that can arise from microarray artefacts.
Several statistical methods that combine the analysis of differential gene expression with biological databases have been proposed and implemented in computer packages for a more rapid interpretation of genomewide expression data [4]. However, most such methods are based on a series of univariate statistical tests and do not properly account for the complex structure of gene interactions. The statistical significance of a GO group is commonly assessed by comparing the number of statistically significant genes in the group to the number expected by chance using Fisher's exact test, which is based on the hypergeometric distribution [5]. Fisher's exact test is used to compare these proportions to assess overrepresentation of significant genes in functional categories. This approach is not amenable to correction for correlations among pvalues, since the test inherently assumes exchangeability among genes, an assumption which is not valid under arbitrary or actual correlation structures [6,7].
Hotelling's T^{2 }Statistic and permutation methods address the correlation structure among genes. Hotelling's T^{2 }statistic is not applicable when the sample size is smaller than the number of genes in a GO term [8]. Permutation methods, although quite valuable under appropriate conditions, are severely compromised by limited numbers of permutable sample pairs. In many cases, the design of microarray studies has a rather complicated structure intended to manage technical variation associated with differences among probes, dyes, and reagent batches by creating treatment blocks within these sources of variation [9]. Such cases are not suited to permutation methods.
A modified metaanalysis method was developed by Delongchamp et al. [10] to combine pvalues, and thus to measure the significance of an overall treatment effect on a group of genes, while taking into account the intergenic correlation structure. The method is based on the fact that pvalues follow a uniform distribution under the null hypothesis. Inversenormal transformed pvalues have a normal distribution and their sum over a set of genes also would follow a normal distribution, provided that the component pvalues are independent. The test we have developed to measure the significance of overall treatment effect on genes within a GO category is based on a modification of this statistic, to reflect the actual correlation structure among genes sharing a GO term.
In this paper, we extend the method from a simple oneclass ttest with the null hypothesis H_{0 }: μ = 0 to a test for pairwise contrast in a fixedeffects linear model. In the following sections, we describe in detail the extension of the methodology, with validation through computer simulations and application to two toxicogenomics studies designed to evaluate treatment effects on the levels of mRNA transcripts involved in mitochondrial function. We thus demonstrate that this methodology provides a practical approach to testing the significance of the treatment effects on gene classes defined by GO terms, and by extension on any other prior categorization of genes into functional subsets. Because many microarray experiments measure treatment effects under complicated design structures and with small sample sizes [9], we used a simulation study to determine whether the method gives reasonable results under these conditions.
Specific applications to toxicogenomics studies showed that the methodology has improved specificity in choosing significantly altered pathways or functional categories, and may thus assist in the understanding of molecular mechanisms of mitochondrial toxicity in the liver induced by antiHIV drugs [11,12] and in assessing effects on mitochondrial function of weightreducing dietary supplements, such as usnic acid [13].
Methods
Measurement of a treatment effect for each gene
Under a fixedeffects linear model, gene expression data for an arbitrary gene can be written as y = Xβ + ε, where y and ε are n × 1 random vectors, X is a known n × p design matrix of rank r, and β is a p × 1 vector representing unknown parameters. The vector y denotes an observed measurement of expression, suitably transformed, for n biological samples, and ε is an error vector, distributed as N_{n }(0, σ^{2 }I_{n}), where σ^{2 }denotes the unknown withintreatment variance among samples. The parameters β and σ^{2 }are assumed to be genespecific. Statistical analyses are applied to one gene at a time, with a common design matrix, X. The unbiased estimators of β and σ^{2 }are
In many toxicogenomic studies, the significance of a treatment effect is tested under the null hypothesis H_{0 }: cβ = 0, where cβ is a pairwise contrast among treatments. Under the null hypothesis, has a tdistribution with n  r degrees of freedom, and the pvalue to assess the significance of a treatment is calculated from this statistic.
Test for a gene group
A modified metaanalysis method of combining pvalues was developed to measure the significance of an overall treatment effect on any group of genes by a oneclass ttest [10]. The pvalue calculated from the null hypothesis is a random variable with uniform distribution, which can be transformed to a suitable probability distribution. Inversenormal transformed pvalues, z_{k }= Φ^{1 }(1  p_{k}) ~N(0, 1), k = 1, ⋯, m have a normal distribution and their sum, , is also normally distributed when pvalues are independent. Here, p_{k }represents a pvalue for a gene in a GO group comprising m genes. The pvalue for the sum of inversetransformed pvalues, , gives the overall significance of the treatment effect on the GO group. We refer to this as the naïve estimate because it assumes independence among pvalues.
In reality, genes in a GO group are likely to be functionally related and thus not independent. When the correlation structure among genes is known, we can make a simple adjustment of the naïve estimate. In this case, the test statistics T still has a standard normal distribution and we denote it as , for the kth gene in a GO group. A common contrast vector, c, is used through all genes since we are measuring same contrast for each gene. The summary statistic for a GO group, is also normally distributed and its variance is , where 1 is m vector of 1s and R is the correlation matrix of (y_{1}, y_{2}, ⋯, y_{m}). Note that
where r_{s,t }is the sth row and tth column element of R. Therefore, is the appropriate pvalue which corrects the naïve pvalue, .
It follows that , where is the average of offdiagonal elements of R. The correction depends only on the average correlation, , and the correction tends to give a reduced significance when > 0. When R is unknown, we estimated the covariance, to provide an average correlation coefficient for the correction.
The correlation structure of pvalues is different for a twosided ttest, which allows gene expression changes in both directions, than for a onesided situation. A twotailed test, in which p = 2(1  Φ(z)), requires a different correction method, since the correlation among z_{k}, k = 1, ⋯, m differs from a onesided test in which p = 1  Φ(z). The null distribution of the summary statistics 1'z can be generated through Monte Carlo sampling from the null distribution of z, MVN(0, cov(z)). When z_{1}, ⋯, z_{n }are random samples from MVN(0, cov(z)), the pvalue for the observed value, Ψ = 1'z, is computed as , where I(A) is an indicator function which gives 1 if A is true, or 0 otherwise. Here, cov(z) has to be estimated from the data. The estimated correlation, and its variation, , which has for offdiagonal elements, are used to estimate cov(z).
Results
Simulation
The derivation of the method presented in the previous section is based on the known correlation matrix of the vector of dependent variable Y. When the correlation matrix is not known, we use an estimate of the correlation matrix. In reality, the correlation matrix is always unknown. The proposed method produces an approximately correct pvalue for a group of genes. To demonstrate that the method gives not perfect but acceptably correct pvalues, we present simulation results in this section. The validation is done by checking the cumulative distribution of pvalues from the proposed methods under the null distribution. The pvalues must have a uniform distribution, which should form a diagonal line in the following figures if pvalues are correctly calculated.
The simulation is conducted under a fairly common set of conditions for microarray studies, comprising three treatments with three samples (arrays) per treatment. Samples are generated from N(μ_{i}, Σ) for each treatment, i = 1, 2 3,. In this simulation, samples for two treatment have the same average values, μ_{1 }= μ_{2 }= 1, and samples for the other treatment have twice that average value μ_{3 }= 2. Pvalues for the pairwise contrast are calculated under the null hypothesis, H_{0 }: μ_{1 }= μ_{2}. A GO term is composed of m = 20 genes which have correlation structure generated randomly between 0.36 and 0.55. The standard deviations, σ_{i}, i = 1, ⋯,m for the genes, which are the diagonal elements of Σ, are generated randomly between 0.01 and 0.25. We iterated this procedure at least 10,000 times to observe the distribution of calculated pvalues.
Figure 1 plots the cumulative distribution of pvalues from a onesided test when the number of samples is n = 9, i.e., 3 groups with 3 samples for each group. The naïve pvalues, shown by the red line, clearly deviate from the diagonal line. Almost 30% of pvalues are estimated to be less than 0.05, indicating that the naïve pvalues lead to a very high falsediscovery rate. The corrected pvalues (dashed blue line) fall very near the diagonal line. The corrected pvalues thus have more specificity in choosing altered functional gene groups than the naïve pvalues.
Figure 1. Cumulative distribution of pvalues for onesided test case with sample size n = 9. The naïve pvalues (dashed red line) deviate from the diagonal line. Almost 30% of pvalues are estimated to be less than 0.05. The corrected pvalues (dashed blue line) fall very near the diagonal line.
Figure 2 shows the simulation result for a twosided case. Pvalues are calculated from the null distribution generated from Monte Carlo samples from MVN(0, cov(z)). Two estimates of cov(z) are used for the sampling. When is used, the empirical distribution of pvalues is closer to a uniform distribution than when is used. The estimate of the average correlation, , is more robust than that of each element of when sample size is small (n = 3 for each group).
Figure 2. Cumulative distribution of pvalues for twosided test case with sample size n = 9. Pvalues calculated from random samples based on (dashed blue line) and (dashed green line) give reliable corrections, while the naïve pvalue (dashed red line) overstates the significance of the test.
Figure 3 shows the distribution of pvalues for a case with larger samples. The simulation for oneclass ttest with sample size n = 25 was conducted as above, to compare several methods for two sided tests. The empirical distributions of pvalues were generated from 500,000 iterations. We looked at small pvalues between 0.1 and 0.001 on a log scale. Figure 3 shows that the Hotelling T^{2 }test gives the smallest difference from the uniform distribution. The Hotelling T^{2 }test is applicable when the number of sample is larger than the number of genes in a group. When we have a reasonably correct estimate of R, the method is a little better than the method which uses an approximation of R. Both the method and the method give quite accurate pvalues with reference to the pvalues from the true correlation matrix, R.
Figure 3. Comparison of twosided tests with sample size n = 25. Hotelling T^{2 }test (dashed blue line) gives the smallest difference from the uniform distribution. When we have enough number of samples to have a reasonably correct estimate of R, the method (dashed red line) is a little better than the method (dashed green line). Both the method and the method give quite accurate pvalues compared to the pvalues from the true correlation matrix, R (dashed cyan line).
Examples
We present two realworld examples based on data from Mitochip, a mitochondriaspecific mouse oligonucleotide microarray which was developed by Dr. Varsha Desai at the National Center for Toxicological Research [11]. Mitochip measures the levels of mRNA for 542 mitochondrial and nuclear genes associated with mitochondrial structure and function. Each Mitochip includes 9 housekeeping genes and 9 Arabidopsis plant genes to serve as positive and negative control genes, respectively. We considered 317 relevant GO groups related to mitochondrial functions, based on a database from Mouse Genome Informatics (MGI, http://www.informatics.jax.org webcite).
Table 1 shows the design of an experiment to test the effects of zidovudine (AZT) and lamivudine (3TC) on mouseliver gene expression. AZT is an antiHIV drug used to reduce mothertochild transmission of the virus. AZT is reported to produce severe adverse effects, and shows more toxicity when AZT is applied in combination with 3TC. Adverse effects are believed to be due to druginduced mitochondrial disfunction [14].
Table 1. Experimental design for the AZT and 3TC effects on mouseliver gene expression.
Oxidative phosphorylation is a key mitochondrial function that requires the electron transport assembly of four protein complexes (I, II, III, IV) to catalyze sequential oxidation/reduction reactions, and complex V to generate ATP. Several clinical and animal studies have investigated the effect of nucleoside reverse transcriptase inhibitors (NRTIs), analogs such as AZT, on mitochondrial respiratory chain complexes. These studies suggest that there is a deficit in one of the components of complexes I and IV in skeletal muscle of children perinatally exposed to antiretroviral nucleoside analogues [15].
Table 2 shows pvalues indicating the significance of treatment effects on the GO groups related to oxidative phosphorylation and apoptosis. The twosided correction method detects significant effects on genes encoding components of complexes III and IV, whereas the naïve method finds that genes in all 5 complexes are significantly affected. This demonstrates that the twosided correction method is more specific in finding significantly affected gene groups, although of course the "true" answer is not known a priori. Although Fisher's exact test also detects significant alteration in genes of complex IV, this test appears to be detecting a gene group that is different from the other groups, rather than registering treatment effects directly. The onesided test is not applicable since it seems that gene expression changes in both directions after AZT and 3TC treatment.
Table 2. Effects of AZT and 3TC on oxidative phosphorylation and apoptosis.
Usnic acid is a lichen metabolite used as a weightloss dietary supplement due to its uncoupling action on mitochondria. However, its use has been associated with severe liver disorders in many individuals. Animal studies conducted thus far have evaluated effect of usnic acid on mitochondria, primarily by measuring the rate of oxygen consumption and/or ATP generation. Generation of ATP requires tight coupling of electron transport with oxidative phosphorylation, maintained through a proton gradient across the inner mitochondrial membrane. An important finding of the study is a lack of usnic acid effect on complex V, despite a significant upregulation of all four complexes of the electron transport chain. Usnic acid is a known uncoupler that is highly lipophilic in both neutral and anionic forms due to its numerous carbonyl groups that absorb the negative charge of the anion by resonance stabilization. This lipophilicity of usnic acid and the usniate anion allows easy passage of both entities through the mitochondrial membranes by passive diffusion into the matrix where it is ionized, releasing a proton into the matrix. The resulting usniate anion can then diffuse back into the intermembrane space where it binds to the proton on the acidic side of the inner membrane to reform usnic acid which can then diffuse back into the matrix. The resulting cycle causes proton leakage that eventually can dissipate the proton gradient across the inner membrane, disrupting the tight coupling between electron transport and ATP synthesis. This model would explain the absence of geneexpression changes associated with complex V in usnic acidtreated mice, despite the increased electron transport by complexes I – IV. It may also explain the decline in ATP level in spite of increased oxygen consumption.
In Table 3, only the twosided correction method enables us to explain the function of usnic acid as described above. The onesided correction method gives pvalues similar to those in the twosided correction method, but this is likely to be due to most of the gene expression changes entailing upregulation. When the direction of gene expression change due to a treatment is known, then the onesided correction method is the appropriate choice; it also needs less computation time than the twosided correction method which employs Monte Carlo sample generation.
Table 3. Effects of usnic acid on phosphorylation and apoptosis.
Discussion
In many studies that use microarray data, the number of samples is small as in the first example shown above. While the number of samples in the simulations is only 3 for each group, the distribution of corrected pvalues approximates a uniform distribution. The estimation for the correlation, , might not be very close to the true correlation, R. However, the correction methods that depend only on the average correlation, , are more robust because the estimation of is more robust.
For the onesided test, the correction for the correlation depends only on , the average of offdiagonal elements of the correlation matrix. The corrections using or are equivalent for the onesided test method. Important points in choosing a correlation estimate for the twosided test are the following; 1) The correlation estimate should be robust for small sample sizes, and 2) The correlation estimate should preserve . satisfies these two conditions.
When the direction of gene expression change is prespecified, the onesided test is a good choice since it is easy and fast to calculate pvalues. However, the twosided test is the one we have to use in most cases, because it is usually not possible to prespecify how individual genes will respond to treatment in the exploratory context. When we have a small number of samples to estimate the correlation, the method gives a robust result. Since misrepresents the true correlation, and gives biased pvalues, the method works better for larger sample sizes. This is seen in Figure 3, where the method is better than the method. We hesitate to present a specific threshold sample size as sufficient for a converged correlation estimate, since it varies with respect to several conditions, such as the number of genes in a group, the variation of the elements of the correlation matrix, etc. The simulation result in Figure 3 shows that both methods give quite accurate pvalues compared to the pvalues from the true correlation matrix, R. The Hotelling T^{2 }test is the best choice whenever it is applicable.
In Table 1, the distribution of animals from different treatment groups (AE) in three batches (1–3) gives no permutable pairs. In this case, randomization methods are not applicable. Even though randomization methods inherently take into account the correlation structure among genes, they may not be practical when the design of the experiment is complicated and the number of samples per group is small, reducing the numbers of permutable pairs.
Randomization methods that permute class labels can adjust pvalues for the correlation structure among genes. Randomization methods choose a summary statistic (e.g. enrichment score (ES) in [16], average zscore in [17]), which reflects the degree to which a set of genes is enriched. When the significance of the summary statistic is measured by permuting class labels, the method preserves genegene correlations and when applicable, would give similar result to the presented method. Randomization methods that permute gene labels, such as Fisher's exact test, do not preserve the correlation structure and misrepresent the group's significance.
Conclusion
We have presented a method to test the significance of expression changes within a group of genes, while considering the correlation structure among genes in each group. This method will enable the rapid detection of microarray evidence indicating altered cell functions or pathways, and will facilitate the interpretation of microarray outcomes. Application of the method to real data shows that it is an improved, practical method to evaluate the effects of treatments on functional classes of genes such as those based on Gene Ontology descriptors.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
TL conducted the simulations, analyzed the data and wrote the manuscript. VGD conducted the microarray experiments using Mitochip and made the biological interpretations. CV reviewed the literature. RJSR gave valuable suggestions on the preparation of the manuscript. RRD directed the methodology development, data analysis, and manuscript preparation. All authors read and approved the final manuscript.
Acknowledgements
TL was supported by an Oak Ridge Institute of Science and Education (ORISE) fellowship at NCTR.
Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the FDA.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S9
References

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society Series B 1995, 57:289300.

Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach.
Journal of the Royal Statistical Society, Series B 2004, 66:187205. Publisher Full Text

Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data.
Computational Statistics and Data Analysis 2002, 39:120. Publisher Full Text

Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 2005, 21:35873595. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Draghci S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.
PNAS 2005, 102:1354413549. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex.
Neurochemical Research 2004, 29:12131222. PubMed Abstract  Publisher Full Text

Kong SW, Pu WT, Park PJ: A multivariate approach for integrating genomewide expression data and biological knowledge.
Bioinformatics 2006, 22:237380. PubMed Abstract  Publisher Full Text

Delongchamp RR, Velasco C, Desai VG, Lee T, Fuscoe JC: Designing toxicogenomics studies that use DNA array technology.

Delongchamp RR, Lee T, Velasco C: A method for computing the overall statistical significance of a treatment effect among a group of genes.
BMC Bioinformatics 7(Suppl 2):S11.
2006 Sep 6
PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text 
Desai VG, Lee T, Delongchamp RR, Moland CL, Branham WS, Fuscoe JC, Leakey JEA: Development of mitochondriaspecific mouse oligonucleotide microarray and validation of data by realtime PCR.
Mitochondrion 2007, 7(5):3229. PubMed Abstract  Publisher Full Text

Desai VG, Lee T, Delongchamp RR, Leakey JEA, Lewis SM, Lee F, Moland CL, Branham WS, Fuscoe JC: NRTIinduced expression profile of mitochondrial genes in the mouse liver.
Mitochondrion 2008, 8(2):181195. PubMed Abstract  Publisher Full Text

Joseph A, Lee T, Moland CL, Branham WS, Fuscoe JC, Leakey JEA, Allaben W, Lewis SM, Ali AA, Desai VG: Effect of usnic acid on mitochondrial functions as measured by mitochondriaspecific oligonucleotide microarray in liver of B6C3F_{1 }mice.
Biochemical Pharmacology 2008.
submitted

Cherry CL, Wesselingh SL: Nucleoside analogues and HIV: the combined cost to mitochondria.
J Antimicrob Chemother 2003, 51:10911093. PubMed Abstract  Publisher Full Text

Barret B, Tardieu M, Rustin P, Lacroix C, Chabrol B, et al.: For the French perinatal cohort study group. Persistent mitochondrial dysfunction in HIV1exposed but uninfected infants: clinical screening in a large prospective cohort.
AIDS 2003, 17:17691785. PubMed Abstract  Publisher Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: A knowledgebased approach for interpreting genomewide expression profiles.
PNAS 2005, 102:1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R: On testing the significance of sets of genes. [http://wwwstat.stanford.edu/~tibs/GSA/] webcite