Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ 07102, USA

Department of Mathematics, Central Michigan University, Mt. Pleasant, MI 48858, USA

Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA

Abstract

Background

Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Moreover, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in a loss of power and/or an inflated false positive rate.

Results

We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are themselves significant.

Conclusions

The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although the methodology is described for microarray gene expression data for simplicity of exposition, it is also applicable to any high dimensional data, such as mRNA-seq data, CpG methylation data, etc.

Background

With the advent of high dimensional genomic data, researchers are able to study changes in the expression of several hundreds and thousands of variables such as genes or CpG’s under various experimental conditions (or phenotypes) in a given cell culture, tissue or an organism etc. Although identification of differentially expressed individual variables across experimental conditions is of general interest, in recent years there is considerable interest in analyzing sets of variables that belong to some pre-specified biological categories such as signaling pathways and biological functions. Numerous statistical and computational methods have therefore been developed for such analyses. Although the methods described in this paper are broadly applicable to any high dimensional data where the sets and subsets are pre-defined, for simplicity of exposition, we shall describe the methodology in the context of gene expression data. The available gene set analysis (GSA) methods can be broadly classified into two categories. Loosely speaking, the first category of methods, often referred to as competitive gene set methods, tries to answer questions such as “Given the collection of differentially expressed genes identified by a statistical/bioinformatics methodology, how enriched is a pre-specified set?” For example, suppose _{1} and _{2} are two pre-specified sets consisting of _{1}and _{2}genes respectively. Suppose an investigator identified a total of _{i} of which belong to set _{i}_{1}(_{2}) or more genes from the set _{1}(_{2}). Several variations and innovations to Fisher’s exact test, Kolmogorov-Smirnov test, etc, have been proposed in the literature for obtaining the corresponding

Most earlier methods (belonging to either of the two categories described above) are based on univariate statistical tests and thus ignore the underlying dependence in the gene expression data (c.f.

A natural multivariate extension of the classical t-test is Hotelling’s $T^2$ test, which can be used for comparing a set of genes between two experimental conditions. Consequently, several GSA methods using Hotelling’s $T^2$ test have been proposed in the literature such as $T^2$ statistic requires the sample size to be larger than the number of variables. However, for GSA, it is common for the sample size to be much smaller than the number of genes in a set. As a consequence, Hotelling’s $T^2$ statistic is not uniquely defined. To deal with the singularity problem, several approaches have been proposed in the literature. For instance, Kong $T^2$ statistic by replacing the inverse of the sample covariance matrix by its Moore-Penrose inverse based on the first few eigenvalues. Although this procedure is appealing, the choice of the number of eigenvalues to use is arbitrary. Recently $T^2$ statistic by replacing the sample covariance matrix by a shrinkage estimator of the covariance matrix derived in $T^2$, for large gene sets (i.e. sets with a large number of genes), they still pose computational challenges. This is because the test statistic involves the inversion of a high dimensional covariance matrix, even though it may be non-singular. Lastly, all multivariate methodologies described above implicitly assume that the gene expression data in the two experimental conditions are homoscedastic across all genes. That is, for a given set of genes, the covariance matrix of gene expression is identical in the two groups. This, in our opinion, is a very restrictive assumption and may be hard to verify in practice when dealing with microarray data consisting of several thousands of genes.

To gain deeper understanding of the differences between the two experimental/test groups (e.g. cancer and normal patients), there is considerable interest in identifying not only sets of genes involved in a pathway or a biological process, but also in identifying subsets of genes belonging to a particular biological process within each significant set. For example, genes in the Vascular Endothelial Growth Factor (VEGF) pathway are important for angiogenesis. There are about 31 genes in this pathway that are involved in various biological processes. These 31 genes can be further partitioned into different subsets of biological functions and the biologist may be interested in discovering not only the VEGF pathway but also various subsets of genes within this pathway. For example, MAP2K3, MAP2K6, p38, MAPKAPK2, MAPKAPK3, and HSP27 are involved in Actin reorganization, FAK and Paxillin are involved in Focal Adhesion Turnover, whereas GRB2, SHC, SOS, Ras, Raf1, MEK1, MEK2, ERK1, and ERK2 are involved in gene expression and cell proliferation. Similarly, other genes in VEGF pathway are involved in various other biological processes, such as cell survival, vascular cell permeability, prostaglandin production, and nitric oxide production.

In examples such as the above, we may (i) be interested in using the additional information about the subsets to improve the power of detecting gene sets (such as the VEGF pathway), and (ii) be interested not only in knowing whether genes in the VEGF pathway are differentially expressed between the control and treatment groups, but also in identifying subsets of genes in biological processes within the VEGF pathway that are differentially expressed between the two groups. Methods described above and other multivariate statistical methods, such as the methods based on principal component analysis

In this paper we introduce a novel methodology that (a) is computationally simple and does not require inversion of any matrix, (b) exploits the underlying dependence structure, (c) is useful for identifying significant gene sets and subsets within each significant set, (d) controls the overall familywise error rate (FWER) at the desired nominal level, and (e) is robust to potential heteroscedasticity in the data.

The basic idea of the proposed method is rather simple. Using the available biological knowledge, we partition the sets of genes into various subsets within sets. Within each gene subset so obtained, we perform a variation of Hotelling’s $T^2$ test and calculate the corresponding

Methods

Notations

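Suppose we are interested in comparing two experimental conditions on the basis of mean expression levels of genes belonging to gene sets $S_1, S_2, \ldots, S_K$. For instance, these gene sets may represent different pathways or biological functions, derived from databases such as GO, KEGG, IPA, etc. Furthermore, suppose each gene set $S_k$, $k = 1, \ldots, K$, has $m_k$ pre-specified subsets $S_{k,1}, \ldots, S_{k,m_k}$ such that $S_k = \bigcup_{j=1}^{m_k} S_{k,j}$. Let $\mathbf{X}_{ij}$ be the $G \times 1$ vector of gene expression for the $j$th sample, $j = 1, \ldots, n_i$, in the $i$th group, $i = 1, 2$, with mean vector $E(\mathbf{X}_{ij}) = \boldsymbol{\mu}_i$ and covariance matrix $\mathrm{Cov}(\mathbf{X}_{ij}) = \boldsymbol{\Sigma}_i$, where $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{iG})'$,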

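For set $S_k$, we are interested in testing the following null and alternative hypotheses: $H_k: \boldsymbol{\mu}_{1,k} = \boldsymbol{\mu}_{2,k}$ versus $H'_k: \boldsymbol{\mu}_{1,k} \neq \boldsymbol{\mu}_{2,k}$, where $\boldsymbol{\mu}_{i,k} = (\mu_{i,j} : j \in S_k)$ denotes the mean vector of genes in the set $S_k$ for samples from the $i$th group. For subset $S_{k,j}$, the corresponding hypotheses are $H_{k,j}: \boldsymbol{\mu}_{1,kj} = \boldsymbol{\mu}_{2,kj}$ versus $H'_{k,j}: \boldsymbol{\mu}_{1,kj} \neq \boldsymbol{\mu}_{2,kj}$.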

The test statistic and its null distribution

We shall now describe the test statistic using a generic notation. Suppose, for **X**_{i1}**X**_{i2},…, _{i}and covariance matrix _{i}. Let ^{ith}population, **S**denote the usual pooled sample covariance matrix. Samples randomly drawn from these two populations are independent. Then under the assumption of _{1}=_{2}, the Hotelling’s ^{2} statistic is proportional to ^{2} and Fisher’s linear discriminant function can be unstable since they involve the inversion of a high dimensional covariance matrix **S**. In the context of discriminant analysis **S**performed better than Fisher’s linear discriminant function that used the entire matrix **S**. In addition, in practice it may not be suitable to assume that _{1}=_{2}. Motivated by these reasons, we use the following test statistic for testing the null hypotheses described in the above subsection:

where diag(**S**_{i}) is a diagonal matrix containing the diagonal elements of the sample covariance matrix **S**_{i}
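A minimal sketch of a statistic of this kind is given below. It assumes the form $T = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)' [\mathrm{diag}(\mathbf{S}_1)/n_1 + \mathrm{diag}(\mathbf{S}_2)/n_2]^{-1} (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$, which is an assumption consistent with the description (diagonal elements only, a separate covariance matrix per group) rather than a quotation of the statistic used here. The function name `diagonal_stat` and the list-of-lists data layout are hypothetical.

```python
from statistics import mean, variance


def diagonal_stat(group1, group2):
    """Hotelling-type statistic that uses only per-gene variances.

    group1, group2: lists of samples; each sample is a list of p values.
    Assumed form: sum over genes of (mean difference)^2 / (s1^2/n1 + s2^2/n2).
    """
    n1, n2 = len(group1), len(group2)
    p = len(group1[0])
    total = 0.0
    for g in range(p):
        x1 = [sample[g] for sample in group1]
        x2 = [sample[g] for sample in group2]
        diff = mean(x1) - mean(x2)
        # The matrix being inverted is diagonal, so "inversion" reduces to
        # one scalar division per gene.
        denom = variance(x1) / n1 + variance(x2) / n2
        total += diff * diff / denom
    return total


if __name__ == "__main__":
    control = [[1.0, 2.0], [3.0, 4.0]]
    treated = [[0.0, 0.0], [2.0, 2.0]]
    print(diagonal_stat(control, treated))
```

Because only scalar divisions are involved, the computation stays well defined when the number of genes exceeds the sample size. A gene with zero sample variance in both groups would still cause a division by zero in this sketch.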

Since the underlying gene expression data are not necessarily multivariate normally distributed and the covariance matrices of these two groups are potentially unequal, the exact distribution of the above test statistic under the null hypothesis cannot be determined easily. We therefore adopt bootstrap methodology for simulating the null distribution of the test statistic such that the resulting methodology is not only robust to heteroscedasticity but also preserves the underlying dependence structure among genes. To do so, we draw simple random sample (with replacement) of _{i} subjects from the ^{th}group, _{i} from the resampled subject _{i}, where ^{jth}subject selected. For more details regarding the residual bootstrap methodology we refer the reader to
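The resampling idea can be sketched as follows, under stated assumptions: residuals are formed by subtracting each group's mean vector, whole subjects are resampled with replacement within each group (so the dependence among genes is preserved and the two groups keep their own covariance structure), and the p-value is the proportion of bootstrap statistics at least as large as the observed one. The `+1` correction, the guard for degenerate resamples, and all function names are choices made for this illustration and are not taken from the text.

```python
import random
from statistics import mean, variance


def diagonal_stat(group1, group2):
    """Assumed diagonal Hotelling-type statistic (see lead-in)."""
    n1, n2 = len(group1), len(group2)
    total = 0.0
    for g in range(len(group1[0])):
        x1 = [s[g] for s in group1]
        x2 = [s[g] for s in group2]
        diff = mean(x1) - mean(x2)
        denom = variance(x1) / n1 + variance(x2) / n2
        if denom == 0.0:
            # Degenerate resample: treat as extreme unless the means agree.
            if diff != 0.0:
                return float("inf")
            continue
        total += diff * diff / denom
    return total


def center(group):
    """Subtract the group mean vector from every subject (residuals)."""
    p = len(group[0])
    means = [mean(s[g] for s in group) for g in range(p)]
    return [[s[g] - means[g] for g in range(p)] for s in group]


def bootstrap_pvalue(group1, group2, n_boot=1000, seed=0):
    """Bootstrap p-value for the null hypothesis of equal mean vectors."""
    rng = random.Random(seed)
    observed = diagonal_stat(group1, group2)
    res1, res2 = center(group1), center(group2)
    exceed = 0
    for _ in range(n_boot):
        # Resample whole subjects so that gene-gene dependence is kept.
        boot1 = [rng.choice(res1) for _ in res1]
        boot2 = [rng.choice(res2) for _ in res2]
        if diagonal_stat(boot1, boot2) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Centering each group separately imposes the null hypothesis of equal means on the resampled data without pooling the two groups, which is what allows the covariance matrices to differ.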

The proposed strategy

For each _{k,j}. If we have only a single gene-set _{k}with _{k} gene-subsets _{k,1},· · ·, _{k}, the problem of testing the significance of _{k,1},· · ·, _{k} null hypotheses, _{k,j}’s. The gene-set _{k} is declared to be significant if and only if at least one _{k,j}is rejected in the above problem of multiple testing.

There are two popular notions of type I error rate when simultaneously testing multiple hypotheses: the FWER, which is the probability of falsely rejecting at least one true null hypothesis, and the FDR, which is the expected ratio of false rejections to the total number of rejections

There are several FWER controlling procedures available in the literature for testing the family of null hypotheses, _{k,j}≤_{k}. The corresponding Bonferroni-adjusted _{k},_{k}gene-subsets _{k,1},· · ·, _{k,1},…, _{k},_{k,j}in

For testing the

Step 1. Compute raw residual bootstrap

Step 2. Compute adjusted _{k}(adjusting for the number of subsets within the set) as described above.

Step 3. Declare a set _{k}to be significant if its adjusted _{k,j}within the set _{k}is declared to be significant if its raw _{k,j}is less than _{mk}.
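The three steps can be sketched as below, taking the raw subset p-values as given. Two details are assumptions made for this illustration: the adjusted p-value of a set is taken to be the Bonferroni adjustment $\min(1, m_k \min_j P_{k,j})$ compared against $\alpha / K$, and the threshold for a subset within a significant set is taken to be $\alpha / (K m_k)$, so that a set is significant exactly when at least one of its subsets is. The function name and the dictionary layout are hypothetical.

```python
def two_stage_bonferroni(subset_pvalues, alpha=0.05):
    """Two-stage set/subset decisions from raw subset p-values.

    subset_pvalues: dict mapping a set name to the list of raw p-values of
    its subsets. Thresholds follow the assumptions stated in the lead-in.
    """
    n_sets = len(subset_pvalues)
    results = {}
    for name, pvals in subset_pvalues.items():
        m_k = len(pvals)
        # Step 2: adjust for the number of subsets within the set.
        adjusted = min(1.0, m_k * min(pvals))
        # Step 3a: Bonferroni across the K sets.
        set_significant = adjusted < alpha / n_sets
        # Step 3b: subsets are examined only inside significant sets.
        threshold = alpha / (n_sets * m_k)
        subsets = (
            [j for j, p in enumerate(pvals) if p < threshold]
            if set_significant
            else []
        )
        results[name] = {
            "adjusted_p": adjusted,
            "significant": set_significant,
            "significant_subsets": subsets,
        }
    return results


if __name__ == "__main__":
    raw = {"A": [0.001, 0.2, 0.5], "B": [0.04, 0.3]}
    for name, res in two_stage_bonferroni(raw).items():
        print(name, res)
```

In this example set A is declared significant through its first subset, while set B is not, because its smallest subset p-value does not survive the adjustment for both the number of subsets and the number of sets.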

It is easy to see that the proposed procedure strongly controls the overall FWER, that is, the probability of falsely rejecting at least one true null hypothesis in some family, under arbitrary dependence among the test statistics.

When the number of gene sets and gene subsets is large, it might be preferable to control the FDR rather than the FWER. The proposed FWER-controlling strategy can easily be modified to control the FDR by replacing the Bonferroni procedure with the BH procedure when simultaneously testing the significance of the gene sets. Such a modified strategy is very similar to a two-stage test strategy developed in
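As an illustration of the set-level modification, the following sketch applies the standard Benjamini-Hochberg step-up rule to a list of adjusted set p-values. Only the first stage (selection of sets) is shown; how subsets are then tested under FDR control is not specified here, so it is left out. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if pvalues[idx] <= rank * alpha / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    adjusted_set_pvalues = [0.5, 0.01, 0.03, 0.02]
    print(benjamini_hochberg(adjusted_set_pvalues))
```

With these four adjusted p-values, three sets are selected, whereas the Bonferroni cut-off of 0.05/4 would select only the set with adjusted p-value 0.01.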

Simulation study

We evaluate the performance of the proposed methodology, in terms of power (the probability of rejecting at least one false null hypothesis) and FWER control, against Tsai and Chen’s method in

Study design

In the simulation study, we considered two values for the total number of gene sets, namely 5 and 10. Since, in practice, the number of subsets and the number of genes within a subset may be unknown a priori, we allowed the number of subsets within each set to be uniformly distributed over the range 5 to 16, and the number of genes within each subset was generated from a uniform distribution over the range 5 to 10. To understand the robustness of the two methods in terms of FWER control, we considered a variety of probability distributions for the gene expression as follows:

(1) Multivariate normal distribution, of appropriate dimension, with mean vectors **0** (for the control group) and **μ** (for the treatment group), and covariance matrices

(2) Multivariate log-normal distribution, where the vector of natural logarithms of the components follows a multivariate normal distribution with parameters as defined in the multivariate normal setting above, with $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$.

(3) Multivariate beta distribution. This distribution is motivated by CpG methylation data. Within each treatment group the multivariate beta vector was generated as follows. To generate _{1},_{2},…,_{p}with either 4 or 5 degrees of freedom and generated an additional independent chi-square random variable **Z**=(_{1},_{2},…,_{p})^{″}, where _{i}=_{i}/(_{i} +

(4) Mixtures of multivariate normal random vectors. For each treatment group we generated a mixture of multivariate normally distributed data **Z** as follows:

where

All our simulation results are based on a total of 1,000 simulation runs and 5,000 bootstrap samples.
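Two ingredients of this design can be sketched as follows. The first function draws the number of subsets per set (5 to 16) and the number of genes per subset (5 to 10) from discrete uniform distributions, as described above. The second is one way to generate a dependent beta-type vector from chi-square variables that share a common denominator term; the degrees of freedom of the shared variable (`df_shared`) and all function names are assumptions for this illustration.

```python
import random


def draw_set_structure(n_sets, rng):
    """For each set, draw its subsets and the gene count of each subset."""
    structure = []
    for _ in range(n_sets):
        n_subsets = rng.randint(5, 16)  # inclusive range, 5 to 16
        structure.append([rng.randint(5, 10) for _ in range(n_subsets)])
    return structure


def multivariate_beta(p, df, df_shared, rng):
    """Dependent beta-type vector Z_i = X_i / (X_i + Y).

    X_1..X_p are independent chi-square(df) variables and Y is one extra
    independent chi-square(df_shared) variable shared by all components,
    which is what induces the dependence among the Z_i.
    """
    # chi-square(k) is a gamma distribution with shape k/2 and scale 2.
    shared = rng.gammavariate(df_shared / 2.0, 2.0)
    z = []
    for _ in range(p):
        x = rng.gammavariate(df / 2.0, 2.0)
        z.append(x / (x + shared))
    return z


if __name__ == "__main__":
    rng = random.Random(2024)
    sets = draw_set_structure(5, rng)
    print([len(subsets) for subsets in sets])
    print(multivariate_beta(sets[0][0], df=4, df_shared=4, rng=rng))
```

Each component lies strictly between 0 and 1, as methylation proportions do, and its variance is a function of its mean, which matches the "Var. func. mean" label used for this distribution in the results.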

Results

In Table

| Distribution | Variance structure | Sample size | Number of sets | Proposed method | Tsai-Chen’s method |
|---|---|---|---|---|---|
| Normal | Homo. | | 5 | 0.027 | 0.019 |
| | | | 10 | 0.032 | 0.027 |
| Normal | Homo. | | 5 | 0.047 | 0.054 |
| | | | 10 | 0.047 | 0.061 |
| Normal | Hetero. | | 5 | 0.036 | 0.031 |
| | | | 10 | 0.044 | 0.021 |
| Normal | Hetero. | | 5 | 0.038 | 0.066 |
| | | | 10 | 0.052 | 0.087 |
| Log-Normal | Homo. | | 5 | 0.027 | 0.018 |
| | | | 10 | 0.024 | 0.022 |
| Log-Normal | Homo. | | 5 | 0.048 | 0.039 |
| | | | 10 | 0.050 | 0.062 |
| Mix. Normal | Homo. | n=10 | 5 | 0.018 | 0.009 |
| | | | 10 | 0.020 | 0.005 |
| Mix. Normal | Homo. | n=40 | 5 | 0.055 | 0.050 |
| | | | 10 | 0.050 | 0.054 |
| Mix. Normal | Hetero. | n=10 | 5 | 0.018 | 0.003 |
| | | | 10 | 0.017 | 0.003 |
| Mix. Normal | Hetero. | n=40 | 5 | 0.058 | 0.060 |
| | | | 10 | 0.049 | 0.057 |
| Multi. Beta | Var. func. mean | | 5 | 0.033 | 0.027 |
| | | | 10 | 0.031 | 0.031 |
| Multi. Beta | Var. func. mean | | 5 | 0.043 | 0.042 |
| | | | 10 | 0.053 | 0.042 |

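| Variance structure | Sample size | Number of sets | | Proposed method | Tsai-Chen’s method |
|---|---|---|---|---|---|
| Homo. | | 5 | 0.5 | 0.117 | 0.056 |
| | | 5 | 1 | 0.809 | 0.298 |
| | | 5 | 1.5 | 0.991 | 0.338 |
| Homo. | | 5 | 0.5 | 0.933 | 0.660 |
| | | 5 | 1 | 1.000 | 0.999 |
| | | 5 | 1.5 | 1.000 | 1.000 |
| Homo. | | 10 | 0.5 | 0.068 | 0.040 |
| | | 10 | 1 | 0.703 | 0.268 |
| | | 10 | 1.5 | 0.977 | 0.296 |
| Homo. | | 10 | 0.5 | 0.890 | 0.615 |
| | | 10 | 1 | 1.000 | 0.996 |
| | | 10 | 1.5 | 1.000 | 1.000 |
| Hetero. | | 5 | 0.5 | 0.147 | 0.037 |
| | | 5 | 1 | 0.842 | 0.188 |
| | | 5 | 1.5 | 0.997 | 0.222 |
| Hetero. | | 5 | 0.5 | 0.959 | 0.702 |
| | | 5 | 1 | 1.000 | 1.000 |
| | | 5 | 1.5 | 1.000 | 1.000 |
| Hetero. | | 10 | 0.5 | 0.090 | 0.029 |
| | | 10 | 1 | 0.743 | 0.164 |
| | | 10 | 1.5 | 0.988 | 0.181 |
| Hetero. | | 10 | 0.5 | 0.920 | 0.643 |
| | | 10 | 1 | 1.000 | 0.999 |
| | | 10 | 1.5 | 1.000 | 1.000 |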

We also compared the performance of the proposed procedure based on (1) with that based on the following Hotelling’s $T^2$ type statistic, which uses the entire sample covariance matrices **S**_{1} and **S**_{2}

To ensure that the sample covariance matrices are non-singular, we chose the sample size in each group to exceed the total number of genes in each subset. In Table

| Number of gene sets | | Non-diagonal | Diagonal |
|---|---|---|---|
| 5 | 0.5 | 0.298 | 0.637 |
| 5 | 1 | 0.860 | 0.997 |
| 5 | 1.5 | 0.998 | 1.000 |
| 10 | 0.5 | 0.236 | 0.517 |
| 10 | 1 | 0.780 | 0.993 |
| 10 | 1.5 | 0.998 | 1.000 |

Illustration

Intramuscular injections among children often result in a variety of problems, ranging from minor discomforts such as rash and pain to more serious complications resulting in emergency room visits

**Table S1.** Excel file containing gene sets, subsets and gene names.


According to our Bonferroni-based methodology, 36 out of 75 biological categories are significant at FWER level of 0.05 (see Additional file

**Table S2.** Excel file containing results of gene set analysis.


**Table S3.** Excel file containing results of gene subset analysis.


Conclusions

Since biologists are often interested in identifying a collection of genes involved in a biological function or a pathway rather than individual genes, there has been considerable interest in recent years in developing statistical methods for identifying significant sets of genes. Usually, each pathway or biological function consists of a collection of (not necessarily disjoint) sub-pathways or sub-functions. Thus, each set of genes can be further partitioned into biologically meaningful subsets of genes. In this paper we exploit such structural information and propose a two-stage test strategy for selecting significant sets and subsets of genes between two experimental conditions while controlling the overall FWER. The proposed strategy is a general hierarchical test methodology, in which significant sets of genes are first identified using the Bonferroni procedure and then, within each significant gene set, significant subsets of genes are identified.

Discussion

Although we do not discuss the problem of selecting significant gene sets and subsets when comparing multiple experimental conditions, the proposed methodology can be extended to such situations by replacing Hotelling’s $T^2$ statistic with commonly used statistics such as the Hotelling-Lawley trace test or Roy’s largest root test. Furthermore, if the experimental conditions are ordered, such as in a time-course or a dose-response study, one can exploit order-restricted inference based methods developed in

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

WGG and SP conceived the study and developed the methodology. SP, MAY and CHX designed and performed the simulation studies. SP and CHX analyzed the data. SP, WGG, MAY and CHX wrote the manuscript. All authors read and approved the manuscript.

Acknowledgements

The research of Wenge Guo is supported by NSF Grant DMS-1006021 and the research of Shyamal Peddada is supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01 ES101744). The authors thank Drs. Leping Li and Keith Shockley for carefully reading the manuscript and making numerous suggestions which substantially improved the presentation.