School of Medicine & Health Sciences, University of North Dakota, Grand Forks, ND 58202, USA

Department of Statistics, Kansas State University, Manhattan, KS 66506, USA

Department of Statistics, University of Kentucky, Lexington, KY 40506, USA

Department of Mathematical Sciences, University of Montana, Missoula, MT 59812, USA

Institut für Kulturpflanzenzüchtung, Universität Hohenheim, D70599 Stuttgart, Germany

Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612, USA

Abstract

Background

Gene set analysis (GSA) has become a successful tool to interpret gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. Nowadays, an increasing number of microarray studies are conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time such that a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations.

Results

We provide a robust nonparametric approach to compare the expressions of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived when the number of genes goes to infinity while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown between-gene and within-gene correlations. Simulation results demonstrate that the proposed method has greater power than other methods for various data distributions and heteroscedastic correlation structures. The method was applied to an IL-2 stimulation study, and significantly altered gene sets were identified.

Conclusions

The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website

Background

Molecular biology, which aims to study biological systems at the molecular level, has provided rich information on individual cellular components and their contributions to biological functions over the last 50 years. Our understanding of genes and their functions has been accelerated in the last decade by microarray experiments, which identify genes that are induced or repressed in a specific biomedical condition

Instead of looking at individual genes, researchers started to interpret biological phenomena in terms of groups of genes, or gene sets. For example, Segal

GSEA relies on permutation tests to identify the significant gene sets that have distinct gene expression between treatment groups. It works in three steps. First, all genes are ranked according to their statistics for the treatment effect. For example, a t-statistic can be used to compare two classes of samples. A score is then assigned to each gene set using a weighted Kolmogorov-Smirnov-like statistic that sums up the ranks of the genes. Second, the class labels of the samples are permuted a number of times, and gene set scores are calculated for each new label assignment. The permutation of sample labels preserves the inherent correlation between genes. Because the permutation is conducted under the null hypothesis of no treatment differences, the P value of each observed score can be determined empirically from the null score distribution. Third, if more than one gene set is tested, the P values should be adjusted for multiple testing. GSEA is often applied to hundreds of gene sets, for which control of the false discovery rate (FDR) is recommended.
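The third step, adjusting P values across many gene sets, can be illustrated with a short sketch. The example below is written in Python and uses the Benjamini-Hochberg step-up adjustment as a stand-in FDR procedure; this is an assumption made for illustration, since the text does not specify how GSEA computes its FDR, and the function name `benjamini_hochberg` and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical empirical P values for five gene sets.
gene_set_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(gene_set_p)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted)
print(significant)
```

In this toy input only the first two gene sets stay below an FDR threshold of 0.05, although four of the five raw P values are below 0.05.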

Ever since GSEA was introduced, it has drawn wide attention from the biomedical and biostatistical communities. A number of alternative and extended gene set analysis (GSA) methods have been proposed in the last few years that use a variety of scoring systems and randomization procedures to resample data

Despite their enormous success, all of the aforementioned GSA methods have limited applicability to microarray samples with dependence. A permutation test has to rely on the assumption of sample independence. This assumption presents a barrier to extending GSA to the fast-growing area of longitudinal microarray experiments, which repeatedly profile the gene expression of the same subject over time. Longitudinal microarray experiments allow researchers to investigate the dynamic behavior of biological processes, such as cell cycles, cell proliferation, oncogenesis, and apoptosis. The temporal component is an inherent part of the study. Such time course experiments pose novel challenges for statistical analyses because effective methods have to take into account both a large number of genes and within-gene correlations. Most of the analyses in the literature carry out repeated measures analysis of individual genes followed by FDR control

It is desirable to apply repeated measures analysis methods, such as a linear mixed effects model (LME) or generalized estimating equations (GEE), to gene sets. Tsai and Qu (2008) assessed subsets of genes by applying a non-parametric time-varying coefficient model

In this paper we propose a GSA method for assessing the expression patterns of gene sets from longitudinal microarray data. The method employs two novel nonparametric statistics that work for small sample sizes as long as the gene set contains a relatively large number of genes (large p, small n). The method is robust with respect to non-normality and heteroscedastic correlation structures. To ensure wide applicability, unbalanced designs are allowed in our model. For example, unbalanced data may occur when the data are pooled from different versions or manufacturers of arrays.

The genes in a signal transduction pathway are often highly correlated because the expression of one gene is regulated by another gene in the pathway. To ensure an unbiased analysis, we need to take into account the correlation among genes. Permutation methods have been widely used in GSA to provide a robust test that preserves between-gene correlations. For example, Tsai and Chen (2009) used a permutation test with the Wilks' Λ statistic for their multivariate analysis of GSA

The outline of this paper is as follows. Our main results are presented in the section Results and Discussion. In the subsection Model and hypotheses, we describe the model and assumptions. In the subsection Simulation study, we present the simulation results of type I error estimates and power analysis for our proposed methods. In the subsection Results on real data, we describe an application of our method to a recent longitudinal microarray study in which the gene expression profiles of murine T cells in the presence or absence of interleukin-2 (IL-2) were repeatedly collected. A number of functional gene sets were tested to investigate IL-2 signaling over time. The test statistics and their asymptotic results for a large number of genes but few replications are provided in the subsection Test statistics of the section Methods. The subsection Permutation tests describes the permutation-based test with our proposed nonparametric statistics. Finally, we provide mathematical proofs of the asymptotic results for our test statistics in the Appendix.

Results and Discussion

Model and hypotheses

In a longitudinal design for microarray studies, global transcriptional levels of each subject are repeatedly measured at multiple time points under various conditions, such as different drug doses, genotypes, and chemical environments. Our goal is to find whether the transcription levels of a set of genes show a dynamic pattern that differs between conditions. We enumerate the conditions, referred to below as treatments, using the index $i = 1, \ldots, I$.

For a gene set, let $\mathbf{X}_{ikl} = (X_{i1kl}, \ldots, X_{iJkl})'$ be the transcriptional levels, at the $J$ time points, of the $k$th gene measured on the $l$th replicate ($l = 1, \ldots, n_{ik}$) under the $i$th treatment. Define $\boldsymbol{\mu}_{ik} = E(\mathbf{X}_{ikl})$ and $\Sigma_{ik} = \mathrm{Var}(\mathbf{X}_{ikl}) = (\sigma_{i,k,jj'})_{J \times J}$ to be the gene-specific mean and covariance matrix. Each individual gene has its own transcriptional activity; therefore, each gene has its own correlation structure. The heteroscedastic covariances for different treatments and different genes allow us to take into account the different mechanisms by which genes respond to a treatment. This is more realistic than assuming a common covariance matrix, because many of the genes are not responsive to a specific stimulus, while the responsive genes can exhibit different temporal dependence. For example, a stimulus-specific regulator gene or transcription factor tends to be activated at the early stage of the stimulus, and the downstream genes of the regulator respond at a later stage. We leave the joint distribution of $\mathbf{X}_{ikl}$ unspecified.

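Let $\boldsymbol{\alpha}$ denote the vector of treatment effects, with one component for the $i$th treatment. In the hypothesis of no contrast effect among treatments, $H_0(\text{treatment})$, referred to as (0.1), $\mathbf{L}_1$ is a contrast matrix, $\mathbf{1}_J$ denotes a $J$-dimensional column vector of ones, and $\mathbf{0}_p$ a $p$-dimensional vector of zeros. An example is $\mathbf{L}_1 = (\mathbf{1}_{I-1} \mid -\operatorname{diag}(\mathbf{1}_{I-1}))$: the first column is $\mathbf{1}_{I-1}$, a column vector of ones, and the remaining columns form the negative of the $(I-1) \times (I-1)$ identity matrix, so that $\mathbf{L}_1$ is an $(I-1) \times I$ matrix.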

With this particular contrast matrix, the hypothesis states that the treatment means, averaged over the whole time period and over all genes, are identical. Differences can arise if the mRNA transcription of some genes is activated or inhibited by the treatment. Genes can have distinct expression trends over time.
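To make the role of this contrast matrix concrete, the following Python sketch builds the example matrix with a first column of ones followed by the negative identity, and applies it to a vector of treatment means. The function names `contrast_matrix` and `apply_matrix` and the numerical means are hypothetical choices for illustration only.

```python
def contrast_matrix(num_treatments):
    """Build the (I-1) x I matrix (1_{I-1} | -identity) as a list of rows."""
    rows = []
    for r in range(num_treatments - 1):
        row = [1.0] + [0.0] * (num_treatments - 1)
        row[r + 1] = -1.0  # compare treatment 1 with treatment r + 2
        rows.append(row)
    return rows


def apply_matrix(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


L1 = contrast_matrix(4)
equal_means = [2.0, 2.0, 2.0, 2.0]      # null hypothesis holds
shifted_means = [2.0, 2.5, 3.0, 3.5]    # successive shift of 0.5

print(apply_matrix(L1, equal_means))    # every contrast is zero
print(apply_matrix(L1, shifted_means))  # nonzero contrasts
```

Each row compares the first treatment with one of the others, so the product is the zero vector exactly when all treatment means coincide.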

The hypothesis of no effect for a contrast among the treatment by time interactions can be expressed as

where **P**
_{I }
**L**
_{2 }is a _{I }
_{J }
_{I }
**1**
_{
I -1}| - diag(

We present a summary of notations that are used in the rest of the manuscript. Denote

We consider two novel nonparametric statistics for hypothesis testing. A linear mixed effects model (LME) and generalized estimating equations (GEE) are often used for testing hypotheses (0.1) and (0.2) by assuming an appropriate correlation structure. The statistics for both LME and GEE achieve their asymptotic distributions when the number of samples goes to infinity. Thus, theoretically, LME and GEE are not suited to large p, small n problems such as microarray data. This motivated us to propose new statistics that converge to their limiting distributions when the number of genes goes to infinity. The statistics should be robust to non-normal distributions, heteroscedastic correlation structures, and unbalanced experimental designs. Two novel Wald-type statistics are proposed for the null hypotheses (0.1) and (0.2) in the Methods section. Their asymptotic distributions are derived in the Appendix.

Simulation study

This section presents our simulation study, which evaluates the proposed nonparametric test statistics (NP) in various settings. First, we calculate the estimated type I error rate at level 0.05 for our nonparametric statistics. The type I error is examined for samples generated from normal, exponential, Poisson and Cauchy distributions after introducing within-subject correlations. Second, we compare the power of the NP statistics with that of the linear mixed-effects model (LME) and generalized estimating equations (GEE). The type I error and the power analysis are used to validate our NP statistics. Third, we calculate the estimated type I error and power of the permutation test with our statistics for correlated genes and compare the results with GEE on data from normal, exponential, and Poisson distributions. All calculations and simulations were carried out in R, and the results were based on 1000 iterations. The LME and GEE methods were implemented by using

(a) Type I error rate analysis based on asymptotic distribution with simulated data

In this section, we evaluate the specificity of our proposed test (NP) based on type I error rates for simulated data from various distributions. The number of time points simulated per gene is either 2 or 5. Because a balanced design is only a special case of an unbalanced design, we consider only an unbalanced design in which four fifths of the genes have 4 replications and the remaining one fifth have 6 replications. First, we examined the proposed test statistic when gene expression does not vary across treatments. A data matrix **X **of

Identical unit variances were used for data under the null hypotheses. We used the Cholesky decomposition, computed in R, to obtain a factor **h** of the covariance matrix Σ. The data matrix **Xh** thus has the desired covariance structure and was used for subsequent data analysis. The matrix Y had equal means across rows; at different time points (across columns), however, the values from the same gene could vary.
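The Cholesky step can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used in the study; the function names, the 3 x 3 correlation matrix, and the normal draws are assumptions chosen for the example. Right-multiplying rows of independent unit-variance data by the transpose of the lower Cholesky factor yields rows whose covariance is the target matrix.

```python
import math
import random


def cholesky_lower(sigma):
    """Return lower-triangular L with L L' = sigma (sigma positive definite)."""
    n = len(sigma)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                lower[i][j] = (sigma[i][j] - partial) / lower[j][j]
    return lower


def correlate_rows(x_rows, sigma):
    """Map rows of independent unit-variance data to rows with covariance sigma."""
    lower = cholesky_lower(sigma)
    n = len(sigma)
    # Row x times h = L' gives y_j = sum_k x_k * L[j][k].
    return [[sum(row[k] * lower[j][k] for k in range(j + 1)) for j in range(n)]
            for row in x_rows]


# Illustrative unstructured correlation matrix for three time points.
sigma = [[1.0, 0.6, 0.3],
         [0.6, 1.0, 0.5],
         [0.3, 0.5, 1.0]]

rng = random.Random(1)
x = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20000)]
y = correlate_rows(x, sigma)
```

The same transformation applies to non-normal draws with unit variance, although only the covariance, not the marginal distribution, is then controlled.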

Table

Estimated type I errors for the test of no treatment effect based on the asymptotic distribution

| **#time.points** | **#genes** | **normal** | **exponential** | **Poisson** | **Cauchy** |
|---|---|---|---|---|---|
| 2 | 5 | 0.060 | 0.053 | 0.063 | 0.021 |
| | 10 | 0.047 | 0.052 | 0.048 | 0.024 |
| | 20 | 0.053 | 0.046 | 0.054 | 0.026 |
| | 30 | 0.048 | 0.063 | 0.058 | 0.019 |
| | 40 | 0.044 | 0.052 | 0.053 | 0.021 |
| | 50 | 0.043 | 0.052 | 0.057 | 0.020 |
| | 100 | 0.040 | 0.050 | 0.042 | 0.020 |
| 5 | 5 | 0.056 | 0.052 | 0.059 | 0.032 |
| | 10 | 0.053 | 0.055 | 0.057 | 0.025 |
| | 20 | 0.047 | 0.045 | 0.066 | 0.020 |
| | 30 | 0.060 | 0.058 | 0.050 | 0.014 |
| | 40 | 0.050 | 0.049 | 0.047 | 0.018 |
| | 50 | 0.044 | 0.041 | 0.041 | 0.016 |
| | 100 | 0.062 | 0.047 | 0.050 | 0.023 |

The data from the same gene have unstructured correlation.

The next test concerned the interaction of treatment and time. Under the null hypothesis of no interaction, we generated random data as follows. Given the value $X_{ij}$ at the $j$th time point, the value $X_{i,j+1}$ at the next time point was generated by the iterative algorithm (0.3), in which the mean of the random term $\epsilon_{ij}$ is chosen so that the mean of $X_{i,j+1}$ is 2, the same as that of $X_{ij}$. For the Poisson distribution, we first generated the mean values with the iterative algorithm (0.3) and then used these means to generate random integers. An unstructured correlation was introduced into the repeated measures for each gene in the same way as for the test of no treatment effect. The type I error rates at

Estimated type I error of the test of no treatment by time interaction at the 0.05 level

| **#genes** | **normal** | **exponential** | **Poisson** | **Cauchy** |
|---|---|---|---|---|
| 5 | 0.087 | 0.103 | 0.099 | 0.046 |
| 10 | 0.074 | 0.082 | 0.064 | 0.035 |
| 20 | 0.061 | 0.063 | 0.050 | 0.024 |
| 30 | 0.070 | 0.071 | 0.063 | 0.019 |
| 40 | 0.071 | 0.060 | 0.065 | 0.019 |
| 50 | 0.064 | 0.052 | 0.056 | 0.011 |
| 100 | 0.037 | 0.051 | 0.048 | 0.012 |
| 200 | 0.043 | 0.050 | 0.052 | 0.018 |
| 500 | 0.048 | 0.040 | 0.051 | 0.022 |
| 1000 | 0.057 | 0.046 | 0.048 | 0.013 |

The data from the same gene followed unstructured correlation. For each simulation, there are two time points.

(b) Power analysis based on asymptotic distribution with simulated data

To evaluate the proposed NP statistics, we calculated the estimated power curves for three methods, NP, LME and GEE. Data were simulated for 4 treatment groups and 3 replicates. As shown in Tables

For LME and GEE, gene expression levels were modeled as the response variable with treatment group and time as fixed effects. The variable subject, which provides measurements for all genes at all time points, was modeled as a random effect. An unstructured correlation structure cannot be estimated in LME and GEE model fitting because the number of replications is small. In this part of the simulation, a compound symmetry correlation structure was assumed for LME and a working independence correlation structure was used for GEE.

First, we conducted a power analysis for the treatment effect. The means of the normal distributions are different between the treatment groups under alternative hypothesis, and the standard deviation of the normal distribution for each gene is randomly generated by a uniform distribution in (0, 3). The mean differences Δ between groups range from 0 to 2.5 to generate the power curves. Thus in each experiment, the logarithm of the mean of treatment group 2 is Δ higher than that of group 1, and that of group 3 is Δ higher than group 2, and so on. The three power curves for NP, LME, and GEE were shown in Figure


**The power curve of NP statistic based on the asymptotic distribution compared to LME and GEE**. The empirical powers of the NP statistics for testing of no treatment effect based on the asymptotic distribution compared to LME and GEE are given here. The powers were estimated at level 0.05. Δ is the log-scale mean difference between successive treatment groups.

Next, we conducted a power simulation for the test of no treatment by time interaction. The results were similar to those for the treatment effect, so we do not present them here.

(c) Type I error and Power analyses for the permutation test

We further conducted a simulation study for the permutation test with our NP statistics by generating random data that had both within-gene correlation over time and between-gene correlation within a gene set.

Random data were generated for two treatment groups with three time points. To show the effect of sample size on power, the number of replicates per group varied from 5 to 50. Random data were generated in the same way as for the power analysis of the NP statistics described earlier, except that an AR(1) correlation structure with correlation coefficient 0.5 was introduced between genes. Gene sets with 20, 50 and 100 genes were generated following normal, exponential and Poisson distributions. Since the linear mixed effects model is not valid for exponential or Poisson distributions, we compare the permutation-based NP statistics with GEE. For this part of the simulation, gene expression levels were modeled as the response variable, and fixed effects of treatment, time, treatment by time interaction, and gene index were included in the GEE model. The variable subject was modeled as a random effect and an AR(1) correlation structure was assumed for GEE. The type I error estimates are reported in Table
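An AR(1) correlation matrix has entry $\rho^{|i-j|}$ in row $i$ and column $j$, so the correlation decays geometrically with the distance between positions. A minimal Python sketch follows, with the hypothetical helper name `ar1_correlation`; the size 3 is chosen only to keep the printed matrix small.

```python
def ar1_correlation(size, rho):
    """Return the size x size AR(1) correlation matrix with parameter rho."""
    return [[rho ** abs(i - j) for j in range(size)] for i in range(size)]


# Correlation coefficient 0.5, as in the simulation setting described above.
for row in ar1_correlation(3, 0.5):
    print(row)
```

Such a matrix can serve as the target covariance when correlated data are generated from independent unit-variance draws.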

Estimated type I errors for the permutation test of no treatment effect compared to GEE

| **distribution** | **n1** | **n2** | **G** | **permutation NP** | **GEE** |
|---|---|---|---|---|---|
| Normal | 5 | 6 | 20 | 0.041 | 0.097 |
| | 25 | 25 | 20 | 0.055 | 0.058 |
| | 45 | 50 | 20 | 0.047 | 0.056 |
| | 5 | 6 | 50 | 0.045 | 0.105 |
| | 25 | 25 | 50 | 0.058 | 0.053 |
| | 45 | 50 | 50 | 0.046 | 0.044 |
| | 5 | 6 | 100 | 0.033 | 0.087 |
| | 25 | 25 | 100 | 0.053 | 0.058 |
| | 45 | 50 | 100 | 0.047 | 0.045 |
| Poisson | 5 | 6 | 20 | 0.040 | 0.096 |
| | 25 | 25 | 20 | 0.058 | 0.055 |
| | 45 | 50 | 20 | 0.058 | 0.049 |
| | 5 | 6 | 50 | 0.041 | 0.109 |
| | 25 | 25 | 50 | 0.058 | 0.061 |
| | 45 | 50 | 50 | 0.052 | 0.062 |
| | 5 | 6 | 100 | 0.028 | 0.075 |
| | 25 | 25 | 100 | 0.053 | 0.063 |
| | 45 | 50 | 100 | 0.050 | 0.048 |
| Exponential | 5 | 6 | 20 | 0.040 | 0.101 |
| | 25 | 25 | 20 | 0.046 | 0.070 |
| | 45 | 50 | 20 | 0.047 | 0.062 |
| | 5 | 6 | 50 | 0.041 | 0.083 |
| | 25 | 25 | 50 | 0.048 | 0.056 |
| | 45 | 50 | 50 | 0.052 | 0.051 |
| | 5 | 6 | 100 | 0.041 | 0.087 |
| | 25 | 25 | 100 | 0.046 | 0.053 |
| | 45 | 50 | 100 | 0.044 | 0.059 |

The data from different genes and repeated measurements from the same gene have AR(1) correlation with correlation coefficient 0.5. Here n1 and n2 are the sample sizes for treatment groups 1 and 2, respectively, and G is the number of genes in the gene set. The estimates are at the 0.05 level.


**Power comparisons for the permutation test of no treatment effect compared with GEE**. The power curves for using permutation tests for treatment effect are given here. The powers were estimated at level 0.05.

Results on real data

We apply the proposed method to a recent time course microarray study of the mouse immune response. Cytotoxic T lymphocytes (T cells) play a key role in the cell-mediated immune response. They destroy virally infected cells, tumor cells, and other diseased cells. The fast immune response to a foreign antigen relies on rapid activation and proliferation of T cells that are stimulated by a cytokine molecule, Interleukin-2 (IL-2)

We used the C2 collection of gene sets from the Molecular Signature Database (MSigDB) of the Broad Institute. The C2 collection is curated from various sources such as online pathway databases, the biomedical literature, and the knowledge of domain experts


**The distribution of the gene set sizes**. The histogram showed the distribution of the size of the 548 gene sets used for data analysis.

The IL-2 regulated gene sets.

| **Gene Set** | **FDR** |
|---|---|
| Ross cbf | 0.020 |
| Peart histone up | 0.047 |
| Rome insulin 2f up | 0.038 |
| Hivnefpathway | 0.025 |
| Cell adhesion | 0.041 |
| Haddad hsc cd7 up | 0.010 |
| Flechner kidney transplant rejection pbl up | 0.009 |
| Shepard pos reg of cell proliferation | 0.029 |
| Haddad cd45cd7 plus vs minus up | 0.010 |
| Hsiao liver specific genes | 0.031 |
| Takeda nup8 hoxa9 3d up | 0.030 |
| Cromer hypopharyngeal met vs non dn | 0.028 |
| Vanasse bcl2 targets | 0.006 |
| Gamma unique fibro dn | 0.018 |
| Tnfalpha adip dn | 0.026 |
| Gn camp granulosa dn | 0.041 |
| Aged mouse neocortex up | 0.026 |
| Adip diff up | 0.006 |
| Hsa04370 vegf signaling pathway | 0.016 |
| Hsa04520 adherens junction | 0.008 |

T lymphocyte activation by IL-2 culminates in many cellular processes, including blastogenesis, cell cycle progression, DNA replication and mitosis

Conclusions

With the fast advancement of high throughput genomics technology and the increased complexity of array experimental designs, researchers need robust statistical tools to decipher the sophisticated gene-gene interactions and networks that operate during biological processes. Gene set analysis has served as a useful tool to identify functional gene sets in recent years. To apply GSA to correlated microarray samples such as those from longitudinal studies, we developed two novel nonparametric statistics for testing gene set variation. The proposed GSA methods assess the treatment main effect and the treatment by time interaction for a set of genes measured in longitudinal microarrays. Heteroscedastic covariance structures are assumed for realistic modeling of complicated microarray data. The limiting distributions of the proposed test statistics were derived under the asymptotic setting of a large number of genes and a small number of replications. When a gene set contains only a small number of genes, the permutation test based on the proposed NP statistics has excellent power compared to GEE in our simulation study. The proposed tests were applied to a collection of gene sets from the Molecular Signature Database (MSigDB) of the Broad Institute and identified a number of gene sets that are responsive to IL-2 stimulation.

Methods

Test statistics

(a) Heteroscedastic test of no treatment effect

To test $H_0(\text{treatment})$, we consider a Wald-type test statistic:

where

_{A }
_{1}.

(b) Heteroscedastic test of no treatment and time interaction effect

The test statistic for no contrast effect among the interactions of treatment and time is given by

where **D**
_{AB}
^{th }row and ((_{1 }- 1)_{1})^{th }column of _{1}, the values is zero. If _{1}, the value is given by

_{AB }
_{2}.

Permutation tests

The nonparametric statistics given in (0.4) and (0.5) take into account the within-gene correlations among multiple time points. The correlations among genes within a gene set are unknown. We are not able to incorporate them into our statistics unless the genes are ordered in such a manner that the correlations between genes diminish at a certain rate as their distance increases. It is unrealistic to make such an assumption for a gene set whose member genes have no known ordering. Furthermore, it is possible that all genes in a gene set are highly correlated. For example, if gene A is a transcription factor and the other genes in the gene set are downstream genes regulated by A in a pathway, all genes will have high correlations. Failure to incorporate between-gene correlations would bias our statistics.

We use a permutation-based test with the proposed nonparametric statistics to avoid bias. Specifically, we performed 400 permutations of the treatment group labels of the subjects. For each permutation, we randomly assign _{i }
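The label-permutation idea can be sketched in Python as follows. Whole subjects are reassigned to groups, so each subject's gene-by-time data block stays intact and the within-gene and between-gene correlations are preserved. The statistic used here, an absolute difference of grand means, is only a hypothetical stand-in for the NP statistics in (0.4) and (0.5), and the function names, toy data, and the add-one form of the empirical P value are assumptions made for this example.

```python
import random


def grand_mean(subjects):
    """Average over all subjects, genes, and time points."""
    values = [v for subject in subjects for gene in subject for v in gene]
    return sum(values) / len(values)


def stand_in_statistic(group_a, group_b):
    """Hypothetical placeholder for the NP statistic."""
    return abs(grand_mean(group_a) - grand_mean(group_b))


def permutation_p_value(group_a, group_b, statistic, n_permutations=400, seed=0):
    """Empirical P value from permuting subject-level group labels."""
    rng = random.Random(seed)
    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # moves whole subjects, never single values
        if statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy data: 5 subjects per group, each with 2 genes x 3 time points.
control = [[[s, s + 1, s + 2], [s, s, s]] for s in range(5)]
treated = [[[s + 10, s + 11, s + 12], [s + 10, s + 10, s + 10]] for s in range(5)]
p_value = permutation_p_value(control, treated, stand_in_statistic)
print(p_value)
```

Replacing `stand_in_statistic` with a function that evaluates the actual test statistic on the two groups leaves the permutation loop unchanged.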

Appendix

Asymptotic distribution of the NP test statistics

**Theorem 0.1 **
_{0}(_{A }be the statistic given in (0.4). If X_{ijkl }has a finite fourth central moment, then under H_{0}(

**Proof of Theorem 0.1**: Under _{0}(treatment), _{A}
**0**. Hence, we have **L**
_{1}
**D**
_{A }
**L**
_{1}(**D**
_{A }
**- **
**D**
_{A}

Let V_{A }
**D**
_{
A
}] = _{
A1},..., _{
AI
}), where

Because of the independence of

It is easily seen that (0.6) and (0.7) are true, since the Lyapunov condition is satisfied under the finite fourth central moment condition:

The convergence of (0.7) can be shown by Markov's weak law of large numbers. Note that

Then

It is sufficient to show that

for fixed J and _{ik}
_{ijkl }

**Theorem 0.2 **
_{0}(_{AB }be the statistic given in (0.5). If X_{ijkl }has a finite fourth central moment, then under H_{0}(

**Proof of Theorem 0.2: **Under _{0}(interaction), **L**
_{2}
**D**
_{AB}
**0**, then **L**
_{2}
**D**
_{AB }
**L**
_{2}(**D**
_{AB }
**D**
_{AB}
_{AB }
_{AB}
**a **= (_{11}, _{12},..., _{ij}
_{IJ }

where the limit of _{ik}

Note that

where the inequalities follow from Hölder's inequality, and the last equality holds due to the finite moment condition. This completes the proof.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KZ and HW developed the methods, wrote the code, performed the simulation and analysis and drafted the manuscript. AB, SH, HP and YD contributed ideas and wrote the manuscript with valuable discussions. All authors have read and approved the final manuscript.

Acknowledgements

Zhang's research was supported in part by NIH grant 2P20RR016471-09, North Dakota INBRE.