Abstract
Background
Detecting candidate markers in transcriptomic studies often encounters difficulties in complex diseases, particularly when overall signals are weak and sample size is small. Covariates including demographic, clinical and technical variables are often confounded with the underlying disease effects, which further hampers accurate biomarker detection. Our motivating example came from an analysis of five microarray studies in major depressive disorder (MDD), a heterogeneous psychiatric illness with mostly uncharacterized genetic mechanisms.
Results
We applied a random intercept model to account for confounding variables and casecontrol paired design. A variable selection scheme was developed to determine the effective confounders in each gene. Metaanalysis methods were used to integrate information from five studies and post hoc analyses enhanced biological interpretations. Simulations and application results showed that the adjustment for confounding variables and metaanalysis improved detection of biomarkers and associated pathways.
Conclusions
The proposed framework simultaneously considers correction for confounding variables, selection of effective confounders, random effects from paired design and integration by metaanalysis. The approach improved diseaserelated biomarker and pathway detection, which greatly enhanced understanding of MDD neurobiology. The statistical framework can be applied to similar experimental design encountered in other complex and heterogeneous diseases.
Background
Microarray experiment enables researchers to examine the expression of thousands of genes in parallel. Differentially expressed (DE) gene detection is one of the most common analyses in microarray. In such an analysis, genes differentially expressed under multiple conditions are detected and are used for generating further biological hypotheses, developing potential diagnostic tools, or investigating therapeutic targets. The extensive applications of microarray technology have led to an explosion of gene expression profiling studies publicly available. However, the noisy nature of microarray data, together with small sample size in each study, often results in inconsistent biological conclusions [13]. Therefore, metaanalysis, a set of statistical techniques to combine multiple studies under related research hypotheses, has been widely applied to microarray analysis to increase the reliability and robustness of results from individual studies. In the literature, three major categories of metaanalysis methods have been applied to genomic metaanalysis: combining effect sizes [4,5], combining pvalues [68] and combining rank statistics [9,10]. In general, different approaches have different underlying assumptions and pros and cons in the applications [11,12].
Major depressive disorder (MDD) is a heterogeneous illness with mostly uncharacterized pathology. Despite several gene expression studies of MDD [1317] published, the biological mechanisms of MDD remain mostly uncharacterized [18]. Although biomarkers and pathways have been identified in specific studies, the findings are not consistently observed from study to study. This variability may be due to several factors. Firstly, MDD is thought to be a complex and heterogeneous disease [19], associated with multiple genetic, posttranslational, and environmental factors. Furthermore, patients might have varying disease severity, with some having psychotic features as well as exposure to a variety of medications and dosage levels to control their illness. Secondly, the genetic disease effects are potentially confounded by many covariates, which include (1) demographic variables such as age, gender and race; (2) clinical variables such as antidepressant drug usage, death by suicide and alcohol dependence; (3) technical variables inherent in the use of postmortem brain samples, such as the pH level of brain tissues, brain region and postmortem interval (PMI). In statistical terms, confounding variables are defined as extraneous variables that can adversely affect the relationship between the independent variable (i.e. disease state) and dependent variable (i.e. gene expression). If the statistical models employed to identify differentially expressed genes fail to incorporate these sources of heterogeneity (potential confounding variables), not only can this reduce the statistical power, but also it will introduce sources of spurious signals to the gene detection. Finally, sample sizes for these studies are generally small (between 1025 pairs of MDDs and controls) due to the limited availability of suitable brain specimens and the significant costs associated with their collection.
In this paper, we propose a statistical framework to tackle overall weak signal expression profiles in MDD that have small sample size, casecontrol paired design and confounding covariates in each study. We use a set of five MDD expression profiles as an illustrative example. In the literature, most analyses of similar data structure either ignored the potentially confounding covariates by using paired or unpaired ttest [16,20,21] or applied simple linear regression model to incorporate all covariates [22,23]. The former approach undoubtedly ignored effects from confounding covariates; the latter approach was not efficient or even not applicable when the number of covariates is large and the number of samples in each study is small. In this paper, we will propose a framework that uses a random intercept model (RIM) to account for the casecontrol paired design and confounding covariates in single study analysis. An improved RIM with novel genespecific variable selection (namely RIM_minP or RIM_BIC to be introduced later) will be performed to accommodate the small sample size and relatively large number of covariates in individual studies. We will then apply and compare two popular metaanalysis methods: Fisher's method [24,25] and maximum pvalue method [2628]. The application of this approach to combine five MDD microarray studies show improved DE gene detection competency and superior statistical significance of pathway detection using our proposed method. Simulations considering various correlation structures among disease state, gene expression and covariates will be performed to demonstrate better performance of this framework. Our proposed framework is general and applicable in commonly encountered microarray metaanalysis of complex genetic diseases.
Methods
Description of motivating MDD data
This research is motivated from the metaanalysis of combining five MDD transcriptomic studies. Brain tissues of three patient cohorts (MD1, MD2 and MD3) obtained from different sources at different time were analyzed. For all three patient cohorts, tissues from the anterior cingulated cortex (ACC) brain region were analyzed by microarray experiments independently to generate three microarray studies: MD1_ACC, MD2_ACC and MD3_ACC. Similarly, tissues from the amygdala (AMY) brain region in MD1 and MD3 cohorts were analyzed to generate MD1_AMY and MD3_AMY. Details of the five patient cohorts and microarray studies are available in Additional file 1: Table S1. In each patient cohort, MDD patients were matched to control patients by three demographic variables: age, sex and race. Casecontrol paired design is proven to increase statistical power significantly, especially for expensive experiments that inhibit large sample size. This is exactly the case in most brain disorder studies using postmortem samples. Three additional clinical variables (alcohol dependence, evidence of taking antidepressant drugs and death by suicide) and two technical variables (pH level of brain tissues and postmortem interval PMI) were also available for each patient. Among the covariates described above, six variables (age, alcohol, drug, suicide, pH and PMI) were considered potential confounders in the DE gene detection of MDD. These six covariates were not highly correlated with each other in our analysis and thus the collinearity issue does not exist in the linear models below (see Additional file 1: Table S2).
Additional file 1. Supplement material.
Format: PDF Size: 336KB Download file
This file can be viewed with: Adobe Acrobat Reader
Data preprocessing, gene matching and gene filtering
Microarray images were scanned and summarized by manufacturers' defaults. Data from Affymetrix arrays were processed by RMA method and data from Illumina were processed by manufacturer's software for probe analysis. When samples in each study were processed in multiple batches, potential batch effects were evaluated and samples were normalized to have the same sample means and sample standard deviations to correct batch biases when necessary. Probes (or probe sets) were then matched to official gene symbols using Bioconductor packages. When multiple probes (or probe sets) matched to an identical gene symbol, the probe that presented the greatest interquartile range (IQR) was selected to represent the target gene symbol. Larger IQR represents greater variability (and thus greater information content) in the data and this probe matching method has been recommended in Bioconductor [29]. After genes were matched across five studies, 16,715 unique gene symbols were available across all five studies and intensities were all logtransformed (base 2). Two sequential steps of gene filtering were then performed. In the first step, we filtered out genes with very low gene expression that were identified with small average expression values across majority of studies. Specifically, mean intensities of each gene across all samples in each study were calculated and the corresponding ranks were obtained. The sum of such ranks across five studies of each gene was calculated and genes with the lowest 30% rank sum were considered unexpressed genes (i.e. small expression intensities) and were filtered out. Similarly, in the second step, we filtered out noninformative (small variation) genes by replacing mean intensity in the first step with standard deviation. Genes with the lowest 40% rank sum of standard deviations were filtered out. Additional file 1: Figure S1 shows the preprocessing diagram and the number of genes remained in each preprocessing step. Finally, 7,020 matched genes (16715 × 0.7 × 0.6 = 7020) in five studies were analysed. We note that the somewhat ad hoc gene filtering procedure is necessary and is commonly used in microarray analysis. Although it runs the risk to ignore important DE genes, it has benefits of reducing false positives from nonexpressed or noninformative genes and increasing statistical power in multiple comparison procedure. Filtering down to ~7000 genes is moderate and adequate since the size usually keeps majority of acting genes of the genome in the analysis.
Single study analysis for DE gene detection
Paired ttest Paired ttest was performed as a comparison to the method we developed below. This method took into the MDD and control paired design into consideration but ignored the confounding covariates. We also tested nonparametric Wilcoxon signed rank test. The results were similar to paired ttest and thus not shown in the result.
Random intercept model (RIM) and fixed effects model (FEM) To account for paired design (MDD samples paired with corresponding controls) and existence of MDD related covariates, we applied a random intercept model (RIM). For a given gene g, we fit the model
In the model, Y_{gik }was the gene expression value of gene g (1 ≤ g ≤ G) and disease status i (i = 1 for control and 2 for MDD) in pair k (1 ≤ k ≤ K). X_{0ik }was the disease label that took value one if the sample was MDD and zero if the sample was a control. X_{lik }represented values for potential confounding covariate l (1 ≤ l ≤ 6; 01 binary variables for alcohol, drug and suicide; numerical variables for age, pH and PMI). α_{k }was the random intercept from a normal distribution with mean zero and variance τ_{g}^{2}, which represented the deviation of averaged expression values in the k^{th }pair from the average in the whole population. Finally, ε_{gik }were independent random noises that followed a normal distribution with mean zero and variance σ_{g}^{2}. Under this model, β_{g0 }was the disease effect of gene g and was the parameter of major interest. To obtain an MDDassociated biomarker candidate list in a single study analysis, likelihood ratio test (LRT) was used to calculate the pvalues of testing H_{0}: β_{g0 }= 0 (vs H_{A}: β_{g0 }≠ 0). We denote this method as RIM_ALL (random intercept model containing all covariates) as opposed to the models with variable selection in the next section. The pvalues from RIM_ALL were then corrected by BenjaminiHochberg procedure [30] for multiple comparison to control false discovery rate (using "p.adjust" function in R).
Fixed effects model (FEM) below ignores the paired design while still considers the covariates in the model. It can be used when diseased and control samples are not paired. We used it to compare with RIM to evaluate the impact of casecontrol design.
RIM and FEM with variable selection Although RIM model can effectively adjust for confounding covariates in DE gene detection, the small sample size (1025 pairs) and relatively high number of potential confounders (6 covariates) can make the model inefficient and impractical. In this paper, we developed and evaluated two choices of variable selection procedures in the random intercept model (namely, RIM_BIC and RIM_minP). Specifically, all possible RIM models that included at most two (0, 1 or 2) clinical variables were computed and compared. In RIM_BIC, the model with the smallest Bayesian Information Criterion (BIC) [31] value was selected. For RIM_minP, we selected the model that yielded the smallest pvalue associated with the likelihood ratio test for testing the disease effect H_{0}: β_{g0 }= 0. Conceptually, BIC selected the model with the best overall model fitting and prediction while minP focused on the model that gave the best statistical significance of the disease effect β_{g0}. This additional variable selection avoided to include more than 2 clinical variables in the model and allowed assessment of biomarkers affected by different sets of covariates in each gene (e.g. disease effect of gene A may be confounded by alcohol and age while gene B may be confounded by antidepressant drug), which biologically gave more appealing interpretations and conclusions. Similar to RIM model, the likelihood ratio test was used to generate pvalues of testing H_{0}: β_{g0 }= 0 in each gene for the selected model by BIC or minP. The resulting pvalues from the LRT were, however, biased from the variable selection procedure and the type I error control was invalid. As a result, we performed a permutation test that randomly permuted the disease labels within each casecontrol pair to generate a null distribution for pvalue assessment. Additional file 1: Figure S2 shows the simulated null distribution from permutation analysis. The skewed distribution deviating from uniform distribution between 0 and 1 showed the need of the permutation analysis for pvalue correction. Similar to RIM_ALL, the pvalues from RIM_minP and RIM_BIC are corrected by BenjaminiHochberg procedure for multiple comparisons separately. Detailed algorithm of the permutation analysis and inference is described in Additional file 1: Part I.
Testing significance of interaction terms of each covariate: In the literature, age as well as other covariates has been found to be confounders of the disease effect with significant interaction term in some biomarkers [32,33]. In other words, the disease effect on gene expression may be affected by age differently in older and in younger cohorts. To evaluate the overall impact of the interaction terms in each covariate, we performed the following simple linear model
and random intercept model
where the notations were the same as in the FEM model and RIM model but now with only one covariate (i.e. a given variable l) included and a corresponding interaction term involved. We performed likelihood ratio test for H_{0}: γ_{gl }= 0 to test the statistical significance of the interaction term of gene g and covariate l. Additional file 1: Table S3 summarizes the number of significant interaction terms in the genome of each covariate. The result shows that the interaction terms between each covariate (Age, pH or PMI) and MDD disease effect were not significant in most of the genes under false discovery rate FDR = 5% (BenjaminiHochberg correction). As a result, we did not consider the interaction terms in our models throughout this paper.
Metaanalysis for DE gene detection
Among the many microarray metaanalysis methods used in the literature, all methods have their pros and cons depending on the data structure and biological goal [11,34]. According to Birnbaum [35], Li and Tseng [36], and Tseng et al. [12], metaanalysis methods can be categorized into two types: one detects gene markers differentially expressed "in one or more studies" and the other detects genes differentially expressed in "all studies". Fisher's method and maxP method are two popular methods in the two categories, respectively, and we will apply both methods in this paper. Fisher's method calculates the sum of logtransformed pvales in the metaanalysis: V_{g}^{Fisher }=  2Σ_{k}^{K }= _{1}log(Pgk) where K is the number of studies combined and P_{gk }is the pvalue of gene g and study k. Assuming independence among studies and that the model to derive pvalues is correctly specified, V_{g}Fisher follows a chisquared distribution with degree of freedom 2K under the null hypothesis. In contrast, the maxP method combines pvalues by taking the maximum of pvalues across studies: V_{g}^{maxP}= max_{1 ≤ k ≤ kPgk}V_{g}^{maxP}. It can be show that V_{g}^{maxP }follows a beta distribution with shape parameters K and 1 under the null hypothesis. Note that in Fisher's method, an extremely small pvalue of one study can drive strong statistical significance of that gene no matter whether pvalues of the other studies are small or not. On the other hand, the maxP method requires small pvalues of all studies for a gene to be detected.
Although the parametric inference by chisquared and beta distributions is convenient, we chose to perform permutation analysis for the inference to avoid violation of underlying assumptions. We also noted that two pairs of studies (MD1_ACC and MD1_AMY, MD3_ACC and MD3_AMY) used the same cohorts with different brain regions. The studies may be dependent from each other. To account for the dependence structure among studies from the same patients, we kept the same permutation order for each pair of studies of the same cohort in the permutation analysis of individual studies. BenjaminiHochberg procedure implemented in R is then used to control false discovery rate.
Evaluation by detection compentency and pathway analysis
For evaluation purpose, we compared three singlestudy DE gene detection methods (PT, RIM_minP and RIM_BIC). Each method was applied to five single studies and then the five single study results were combined by Fisher and maxP metaanalysis methods. As a result, a total of 3 × 7 = 21 DE gene detection results were generated and compared. To assess performance of these different methods, we applied two evaluation criteria. The first criterion compared the numbers of detected DE genes from different methods under different pvalue thresholds using detection competency curves (xaxis: pvalue threshold; yaxis: number of detected DE genes). This first criterion is a reflection of detection competency of different methods.
The detection competency curves alone, however, do not guarantee an improved biological finding. Thus, we evaluated the biological associations of detected DE genes from different methods with existing pathway knowledge databases in the second criterion. We collected 2,287 pathways from MsigDB [37]http://www.broadinstitute.org/gsea/msigdb/ webcite; 1,454 pathways from Gene ontology, 217 from Biocarta, 186 from KEGG and 430 from Reactome). KolmogorovSmirnov (KS) test was applied for each DE gene result and each pathway for pathway analysis (a.k.a. gene set analysis) [27,37]. KS test improved a commonly used Fisher's exact test in that it does not arbitrarily apply a DE gene cutoff but directly evaluate the DE evidence ordering in the genome. For the DE detection result of a given method, a smaller pvalue from KS test reflects a stronger biological validation of the detected DE genes. The detailed KS test algorithm and inference are described in the Additional file 1: Part II.
Since 2,287 pathways were tested, aggregating the total information to conclude one method being better than the other was not a trivial task. We denote by p_{rm }the pathway enrichment pvalue of method m for pathway r. Since the majority of the 2,287 pathways were irrelevant to MDD, we first identified the top V most "diseaserelated" pathways by committee decision of the 21 DE gene detection results under comparison. Specifically, disease relatedness of a pathway r in method m was derived as the rank of pvalues . The disease relatedness score of pathway r was then defined as , where M = 21 was the total number of DE gene detection methods under comparison. A small S_{r }reflected an overall significant pathway enrichment pvalue for pathway r in the 21 DE gene detection results and thus was believed to be a disease related pathway. The V pathways with the smallest S_{r }are selected as surrogates of "gold standard" diseaserelated pathways for evaluation purpose. For any two selected DE gene detection results, Wilcoxon signed rank test can be used to compare pathway enrichment pvalues of the V selected pathways to determine if one method is statistically better than the other method, in terms of association with the V diseaserelated pathways. In this paper, we use V = 100.
Post hoc analysis on confounding variables after metaanalysis
An important advantage of the variable selection scheme is the availability of post hoc analysis to compare selected confounders across studies. Three questions can be explored and answered: (1) Which variable(s) is the most or least frequently included in the model selection to confound with disease effect? (2) Are variables repeatedly selected across studies more frequently than by chance (e.g. alcohol is selected in most or all studies in a given gene)? (3) Are the directions of effect sizes of a variable consistent across studies (e.g. alcoholdependent patients have higher expression than nonalcoholdependent in most studies for a given gene)?
For the first question, we first generated a list of DE genes under a given FDR threshold and counted the frequency of each variable being selected in the gene list. The variables were ranked according to the frequencies in each study and a rank average of each variable was calculated across five studies. A small averaged rank of a covariate showed frequent appearance of the variable in the model selection and thus a frequent confounder.
For question (2), we computed a pairwise coappearance score (T_{1}) for a given gene set and assessed its statistical significance. For example, gene VGF in Table 1 had detected age effects in 2 studies, alcohol effects in 3 studies, antidepressant effect in 1 study and suicide effect in 3 studies using RIM_minP variable selection. By summing up coappearing pairs of the five studies in each variable, we obtained a T_{1, g }statistics of 7 for g = VGF where C_{2}^{k }is the binomial coefficient for the number of pairs out of k elements. Summing up all 9 genes, we obtained . Permutation test was then performed to assess the statistical significance of T_{1}.
Table 1. The direction of covariates effect in RIM_minP (Table 1A) and RIM_BIC (Table 1B) models for 9 MDD related genes selected from the literature
To answer question (3), we further computed rate of expression concordance among all coappearance pairs. Specifically, we examined all coappearing pairs that contributed to T_{1 }and count the number of pairs that are concordant (upregulation in both studies or downregulation in both studies). The total aggregated score for pairwise concordance was denoted as T_{2 }and the ratio of concordance was R= T_{2}/T_{1}. In the example of Table 2, 40 (T_{2}) out of 60 (T_{1}) coappearing pairs were concordant and R = 0.67. Similarly, permutation test was performed to assess the statistical significance of observed R scores. Detailed mathematical notation and permutation algorithm are outlined in Additional file 1: Part III.
Table 2. Number of detected DE genes using different single study analysis methods (PT, RIM_ALL, RIM_minP and RIM_BIC) in the five individual studies and by two metaanalysis methods (Fisher and maxP)
Simulation
In the first evaluation criterion previously described, we implicated that with adequate modelling and multiple comparison correction, detecting more DE genes showed better statistical power of a method and should be a preferred method. There was, however, no rigorous proof that detecting more DE genes guaranteed better performance of a method, in terms of its type I error control and statistical power. Since the type I error and statistical power could not be evaluated in real data analysis, we performed extensive simulations below to facilitate the evaluation. For a given gene, we considered three variables: one continuous variable of gene expression Y, a corresponding binary variable of disease state X and ten variables of potential binary confounding covariate Z (Z = (z_{1},...,z_{10})). Figure 1 shows three correlation structures of interest among (X, Y, Z) that are simulated. Scenario I demonstrated that both disease state X and the first two confounding variables in Z (z_{1 }and z_{2}) affect gene expression, a model of most interest in this paper. Scenario II and III showed situations when confounding variables Z did not directly affect gene expression Y. In these latter two scenarios, including confounding variables Y in the model should not improve performance. The detailed simulation scheme and evaluation criteria are available in the Additional file 1: Part IV. For each scenario, we simulated a data set with 1000 independent genes and 50 samples (25 diseased and 25 controls). Among the 1000 genes, 100 are true DE genes and 900 are nonDE genes. ttest, FEM_minP, FEM_BIC and FEM_ALL were applied to evaluate the effect of modelling confounding variables and variable selection in each correlation structure. We repeated the simulation 50 times. Type I error and statistical power were calculated for each method in each data set and averaged over 50 repeated simulations. By definition, the type I error was calculated as the average number of genes detected among the 900 nonDE genes. Conversely, the statistical power was obtained as the average number of genes detected among the 100 true DE genes. Note that for simplicity, we ignored paired design in the simulation and thus applied FEM instead of RIM. Through the simulation, we expect to answer whether including confounding variables improves statistical power (ttest versus FEM_minP, FEM_BIC and FEM_ALL) and whether variable selection in the model improves power (FEM_minP and FEM_BIC versus FEM_ALL) under different scenarios.
Figure 1. Three correlation structures of interest among disease variables X, gene expression variable Y and putative confounding covariates Z that are used in the simulation. Scenario I: gene expression depends on both disease state and covariates. Scenario II: gene expression depends only on disease state. Scenario III: gene expression depends on disease state directly and depends on covariates indirectly through disease state.
Results
Recommended statistical framework
From the motivating MDD example, we showed in Figure 2 a diagram of the statistical framework to consider potential confounding covariates, paired design and genespecific variable selection in the metaanalysis modelling. The framework consisted of four major steps: individual study analysis, metaanalysis, pathway analysis and post hoc analysis. In the first "individual study analysis" step, collinearity of confounders was assessed and RIM_minP or FEM_minP method with variable selection was applied depending on paired or unpaired design. One or multiple metaanalysis methods were applied and compared in Step II. Pathway analysis was then performed on the detected DE gene list(s) to identify enriched pathways in Step III. Finally, post hoc analysis was performed to summarize importance of each confounding variables and to evaluate the consistency of disease effects and confounders' effects across studies. This framework is general and applicable to any multistudy weaksignal data from a complex disease similar to the motivating MDD example.
Figure 2. An illustrative diagram of the proposed statistical framework.
Comparison of various methods in single study analysis
Confounder adjustment and variable selection to improve DE gene detection For each single study analysis, we compared the number of detected DE genes under different pvalue thresholds (p = 0.001, 0.005, 0.01 and 0.05) from different methods. In Figure 3, RIM_minP and RIM_BIC detected more DE genes than RIM_ALL in most studies, showing the fact that variable selection helped to ignore irrelevant clinical variables when sample size was small. Among the two variable selection methods, RIM_minP detected more genes than RIM_BIC, supporting that the focus of RIM_minP to obtain the most significant disease effect outperformed RIM_BIC's focus for best model fitting in this MDD example. Under p = 0.005, RIM_minP detected (0.8 to 1.3) times of DE genes than RIM_BIC and (0.8 to 5.5) times than RIM_ALL. The result suggested that RIM_minP is the most effective method in this data set to incorporate confounding variables in the model. In Figure 3, RIM_minP and RIM_BIC were also compared to paired ttest (PT) and were found to detect more DE genes, showing the advantage of incorporating confounding covariates in the model. RIM_minP identified (0.9 to 4.6) times DE genes than PT and RIM_BIC identified (0.8 to 4.4) times DE genes than PT under p = 0.005.
Figure 3. Comparison of number of detected DE genes in individual study analyses of RIM_minP, RIM_BIC, RIM_ALL, and PT. The result showed that RIM_minP detected the largest number of DE genes among the four methods.
Paired design to improve DE gene detection To evaluate the improvement of including paired design in the model, we compared RIM_minP and FEM_minP in Figure 4. We observed increased DE gene detection competency of RIM_minP compared to FEM_minP in three studies (MD2_ACC, MD3_AMY and MD3_ACC) but not in the other two studies (MD1_ACC and MD1_AMY). The result showed that pairing cases to controls by age, race and sex often helped increase statistical power but not always.
Figure 4. Comparison of number of detected DE genes in individual study analyses of RIM_minP and FEM_minP. The result showed that RIM_minP usually detected more DE genes.
Conclusion Incorporation of potential confounding covariates with variable selection and considering paired design in the model of single study analysis generally increased the detection competency of disease related biomarkers.
Comparing individual study analysis and metaanalysis
Smaller sample size in each study often results in a smaller statistical power of DE gene detection. In Table 2, the first five columns show the number of biomarkers detected by RIM_minP, RIM_BIC and PT under different false discovery rate (FDR) thresholds. After multiple comparison correction by BenjaminiHochberg procedure, the first four single study analyses detected almost none DE genes under FDR = 5, 10 or 15%. This motivated us to perform metaanalysis to increase statistical power and provide validated findings on DE gene detection. In Table 2, the last two columns show the number of biomarkers detected by Fisher method and maxP method, respectively. The two metaanalysis methods detected more biomarkers than individual study analysis based on RIM_minP except for MD3_AMY.
To further evaluate the biological implication of the detected DE genes by various methods, pathway analysis was performed. Figure 5 showed boxplots of the minus logtransformed pvalues (base 10) from pathway analysis in the top 100 diseaserelated surrogate pathways. DE gene detection methods were ordered by the median of the logtransformed pvalues in the plot. The squares and numbers in the upper part of the figure demonstrate the pvalues from Wilcoxon signedrank test when comparing two neighbouring DE gene detection methods (numbers show the pvalues and filled squares represent that the corresponding pvalue is smaller than 0.05). The result showed a clear pattern that both metaanalysis methods generally produced better DE gene detection results than the five single study analyses, no matter PT, RIM_minP or RIM_BIC was used in single study analyses. Interestingly, although MD3_AMY generated more DE genes than that produced by metaanalysis methods (Fisher and maxP) using either RIM_minP or RIM_BIC (see Table 2), its pathway analysis performed worse than the two metaanalyses result and even worse than the other four single studies (Figure 5). This evidence may raise concern of quality in the MD3_AMY study that will need additional investigation. RIM_BIC and RIM_minP did not appear to generate more biologically validated results than PT, probably because of the currently limited knowledge of MDD neurobiology and the still largely inaccurate pathway information.
Figure 5. Comparison of metaanalyse and individual analysis based on pathway analysis criterion across RIM_minP, RIM_BIC and paired ttest. The results showed that metaanalysis produced DE analysis results with stronger association with the top 100 diseaserelated surrogate pathways.
Comparing fisher and maxP
In the literature, many microarray metaanalysis methods have been proposed and compared [11,34,38]. As was discussed in the method section, different methods have different strength for detecting different types of differentially expressed genes. In Li and Tseng [8], genes that are differentially expressed in all studies were termed as HS_{A }type (hypothesis setting A) while genes differentially expressed in at least one study were called HS_{B }type. Among the three methods compared in this paper, maxP were methods that detect HS_{A }type DE genes, while Fisher's method detected HS_{B }type DE genes. The result showed that the two metaanalysis methods detected different sets of DE genes, suggesting different algorithms and assumptions behind the methods. Additional file 1: Figure S3 shows heatmaps on genes detected by Fisher alone (Additional file 1: Figure 3A), maxP alone (Additional file 1: Figure S3B) or both (Additional file 1: Figure S3C). In Additional file 1: Figure S3A, majority of DE genes detected by Fisher but not by maxP were dominated by strong differential expression in one or two studies (many in MD3_AMY and some in MD2_ACC or MD3_ACC). Although Fisher's method has been popularly applied in the microarray metaanalysis literature, the result showed its weakness to be dominated by single strong signal studies that included potential false positives. For example, Fisher's method detected 810 DE genes among which 445 DE genes (about 55%) could also be detected in MD3_AMY) while only 169 (about 24%) among 683 DE genes detected by maxP method could be detected in MD3_AMY. On the other hand, maxP had better statistical power to detect many genes with weak DE evidence in all studies (Additional file 1: Figure 3B) that Fisher's method cannot detect. Conceptually, we were more interested in identifying genes differentially expressed across all studies through maxP.
Post hoc analysis for confounding covariates
To evaluate the impact of covariates on the gene expression values and degree of confounding with disease effect, especially among DE genes, we counted the number of appearances of covariates in the RIM_minP model for DE genes detected by maxP method under FDR = 15%. We calculated the rank of each covariate in each study and computed rank averages of each covariate to indicate relative degree of frequency that a covariate impacted gene expression and confounded with disease effect (see Table 3). PMI (appeared in 1320% RIM_minP models of 683 DE genes) and pH (appeared 1230% in RIM_minP models) consistently had high rank, indicating that they seldom confounded and influenced the disease effect estimate. Suicide (appeared 2250% in RIM_minP models), alcohol (appeared 2954% in RIM_minP models) and antidepressant (appeared 1753% RIM_models) were three factors that consistently ranked among the most influential factors. Finally, age (appeared 2132% in RIM_minP models) was an intermediate confounding factor. The ranking of MD3_ACC and MD3_AMY was highly correlated (Spearman correlation = 0.89) and the correlation between rankings of MD1_ACC and MD1_AMY was also high (Spearman correlation = 0.54). The high within cohort correlations showed a cohort dependent structure and suggested that more studies may be needed to provide empirical evidence on the covariate impacts, particularly for the impact of antidepressant and suicide.
Table 3. Frequency of covariates appearing in RIM_minP model selection among 683 DE genes detected by maxP method under threshold FDR = 15%
To further explore effects of covariates, we analysed a set of 9 genes that have been previously associated with MDD in the literature (see Table 1A and 1B). Intuitively, we expected that a covariate should be included in the model across studies more frequently than by random and effects of a covariate should have consistent differential expression direction across studies. We constructed two hypothesis testing using the coappearing statistics T_{1 }and concordant ratio statistics R described in the Method section and performed the tests on the set of 9 MDDrelated genes. The result showed weak to marginal statistical significance of the first hypothesis (p = 0.397 based on RIM_minP and p = 0.011 based on RIM_BIC), suggesting covariates were consistently selected across studies by RIM_BIC but not RIM_minP. For the second hypothesis, tests for both 9 MDD gene list was statistically significant (p = 0.014 and 0.005). The effects sizes of selected confounding variables have concordant direction across studies. For example, age was found a confounding variable in gene NPY and TAC1 in three out of five studies and the effect sizes were all negative (in log scale). The moderate statistical significance is reasonable since the hypothesis tests were performed only on the nine selected genes. This result demonstrated that covariates overall impacted gene expression changes consistently and confounded with disease effects among the 9 MDDrelated gene list.
Simulation results
The results of simulations to evaluate Type I error control and statistical power of different methods are shown in Table 4. In Scenario I simulation, the effect of disease state X on gene expression Y was confounded by two out of ten clinical variables in Z (Z = (z_{1},...,z_{10}); z_{1 }and z_{2 }are confounders while the other eight variables are not). The result showed that ttest had low statistical power due to the confounders (power = 0.679). FEM_ALL also had low power due to the inclusion of all ten clinical variables in the model while in fact, only two of the ten were effective confounders (power = 0.697). Both FEM models with variable selection performed well. FEM_BIC performed slightly better than FEM_minP (power = 0.746 versus 0.729). The type I errors for all methods were close to the nominal 5% rate, showing adequacy of the models and statistical inference. For Scenario II, all clinical variables were independent from the gene expression. Not surprisingly, ttest performed the best with statistical power 0.938. FEM_minP and FEM_BIC both had similar high power at 0.929 and 0.925. FEM_ALL forced all variables in the model and obtained a low statistical power at 0.85. From Scenario I and Scenario II simulation, FEM_BIC and FEM_minP performed well in both extreme cases, demonstrating its sensitivity and robustness. Scenario III examined a special situation that variables in Z impacted gene expression Y through disease state X. Similar to Scenario II, ttest performed the best in this situation since Z is not confounded (power = 0.938). Both FEM_BIC and FEM_minP had similar high power (power = 0.925 and 0.916) but FEM_ALL again had low power (power = 0.851). For scenario I, we simulated two confounding variables, z_{1 }and z_{2}, where z_{1 }had a strong association with Y (gene expression) while z_{2 }had a weaker association with Y. In Table 4, an average of 1.97 variables was selected by RIM_minP and 1.27 variables were selected by RIM_BIC. Among them, 0.97 (by RIM_minP) and 0.78 (by RIM_BIC) variables belong to the true confounding variables (z_{1}, z_{2}). The result showed effectiveness of RIM_minP and RIM_BIC in variable selection compared to paired ttest (always contains no confounding variable) and RIM_ALL (always include all ten variables in Z). Overall, the simulation results confirmed our findings in MDD data analysis that modelling confounding variables with variable selection had better sensitivity and robustness in DE gene detection.
Table 4. Evaluation of ttest, FEM_minP, FEM_BIC and FEM_ALL methods by simulations
Discussion
In this paper, we described a statistical approach adjusted for confounding variables (i.e. a random intercept model with variable selection), to tackle weak signal expression profiles that have small sample size, casecontrol paired design and confounding covariates in each study. The results showed increased statistical power from confounding variable adjustment, paired design modelling and metaanalysis in this genomic setting and improved biological findings and interpretations have been discovered in MDD neurobiology. Pathway analysis and post hoc analysis of variable selection revealed insightful biological conclusions. Simulations under three correlation structures were performed to verify improved performance of the proposed model. In the literature, most psychiatric disease related microarray studies of similar design either ignored the clinical variables or applied simple linear regression to include all variables in the model. Our simulations clearly show limits to those two approaches. Our approach systematically considers the critical elements in the data structure in order to obtain more accurate DE gene and pathway detection. The framework is general and can be applied to microarray metaanalysis of other complex diseases with similar data structure. Specifically, this approach will be of great use in human postmortem studies of the brain, where confounding factors are intrinsic (1) to the nature of the cohorts (demographic parameters), (2) to their method of collection (postmortem interval) and (3) to the illness per se (clinical heterogeneity). Since dilution of expression signal is likely to occur in complex tissue such as the brain, DE genes often show small and weak effects and eliminating the effects of confounding factors is critical to detect disease associated markers.
In the variable selection of the RIM model, we tested both BIC and minP approaches. The real data analysis showed that minP seemed to identify more DE genes in the MDD example while simulations showed similar performance and statistical power of the two methods. Another potential alternative is to apply popular regularization and shrinkage methods, such as Lasso or ridge regression, in the variable selection. A prohibitive downside of such approaches is its expensive computational load for genomewide analysis, particularly in the estimation of the tuning parameters. In our analysis, BIC and minP procedures were limited to up to two covariates in the model, which balanced well in biological interpretation and computation feasibility.
The goal of this study was to determine optimal analytical approaches for complex datasets with multiple putative confounding variables. For this purpose, we focused on datasets produced by a single laboratory, in order to avoid additional confounding factors due to differences in laboratory protocols, brain bank collection, tissue treatment and sample handling. Now that we have established such analytical guidelines, the next step will be to increase the scope of metaanalyses by including additional datasets that are progressively made available in the literature. However, as expected, this also comes with added variability, which necessitates the development of complementary mathematical tools. For instance, we have designed a datadriven "metaQC" quality control approach to rigorously assess the quality of microarray datasets to be combined [39]. The quality control test is critical to assess whether the inclusion of additional datasets will increase the analytical power, or be detrimental to the metaanalysis. Finally, as briefly elucidated in this report, mechanisms underlying neurological and neuropsychiatric disorders are likely to involve a distributed sets of brain regions linked in functional neural networks. The detection of molecular pathologies associated with those disorders will thus also critically depend on a priori hypotheses for converging or opposing effects in selected brain regions, for the presence (or not) of control brain regions. For instance, genetic risk factors may be hypothesized to similarly affect biological pathways across brain regions, while compensatory mechanisms leading to pathological dysfunction may display regional specificity, depending on the respective activation or inhibition of different components of neural networks. Hence, the biological impact of the studies performed here will be investigated, validated and discussed more indepth elsewhere.
The studies combined in this paper have significant cohort features that may introduce significant heterogeneity. The five studies came from three distinct cohorts (MD1, MD2 and MD3), different sexes (male and female), array platforms (Affymetrix and Illumina) and brain regions (ACC and AMY). Future research is currently pursued to decipher such studyspecific features. A future direction is to collect more studies and apply metaregression techniques to identify sexspecific or brainregionspecific genes in a unified metaanalysis.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
The motivating data were generated from ES's lab. GCT and ES conceived the general concept and analytical approaches. XW, ES and GCT brainstormed the detailed framework, method and algorithms. XW performed most of the programming and data analysis. YL, XW and GCT discussed and performed the simulation. XW, CS and GCT discussed and performed the pvalue correction of the RIM_minP and RIM_BIC methods. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank David A. Lewis for access to brain specimen and Allan Sampson for discussion. This research was supported in part by Computational Resources on PittGrid http://www.pittgrid.pitt.edu. This work was support by National Institute of Health (NIH) grants MH094862, MH077159 and MH084060.
References

Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC: Evaluation of gene expression measurements from commercial microarray platforms.
Nucleic Acids Res 2003, 31(19):5676. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

EinDor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?
Bioinformatics 2005, 21(2):171178. PubMed Abstract  Publisher Full Text

Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z: Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.
Bioinformatics 2009, 25(13):16621668. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation.
Bioinformatics 2003, 19(Suppl 1):i84i90. PubMed Abstract  Publisher Full Text

Marot G, Foulley JL, Mayer CD, Jaffrezic F: Moderated effect size and Pvalue combinations for microarray metaanalyses.
Bioinformatics 2009, 25(20):26922699. PubMed Abstract  Publisher Full Text

Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Metaanalysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer.
Cancer Res 2002, 62(15):44274433. PubMed Abstract  Publisher Full Text

Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Largescale metaanalysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression.
Proc Natl Acad Sci USA 2004, 101(25):93099314. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li J, Tseng GC: An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies.
Annals of Applied Statistics 2011.
accepted

DeConde RP, Hawley S, Falcon S, Clegg N, Knudsen B, Etzioni R: Combining results of microarray experiments: a rank aggregation approach.
Stat Appl Genet Mol Biol 2006., 5
Article15

Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in metaanalysis.
Bioinformatics 2006, 22(22):28252827. PubMed Abstract  Publisher Full Text

Ramasamy A, Mondry A, Holmes CC, Altman DG: Key issues in conducting a metaanalysis of gene expression microarray datasets.
PLoS Med 2008, 5(9):e184. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tseng GC, Ghosh D, Feingold E: Comprehensive literature review and statistical considerations for microarray metaanalysis.

Sibille E, Wang Y, JoeyenWaldorf J, Gaiteri C, Surget A, Oh S, Belzung C, Tseng GC, Lewis DA: A molecular signature of depression in the amygdala.
Am J Psychiatry 2009, 166(9):10111024. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sequeira A, Gwadry FG, FfrenchMullen JM, Canetti L, Gingras Y, Casero RA Jr, Rouleau G, Benkelfat C, Turecki G: Implication of SSAT by gene expression and genetic variation in suicide and major depression.
Arch Gen Psychiatry 2006, 63(1):3548. PubMed Abstract  Publisher Full Text

Kang HJ, Adams DH, Simen A, Simen BB, Rajkowska G, Stockmeier CA, Overholser JC, Meltzer HY, Jurjus GJ, Konick LC, et al.: Gene expression profiling in postmortem prefrontal cortex of major depressive disorder.
J Neurosci 2007, 27(48):1332913340. PubMed Abstract  Publisher Full Text

Aston C, Jiang L, Sokolov BP: Transcriptional profiling reveals evidence for signaling and oligodendroglial abnormalities in the temporal cortex from patients with major depressive disorder.
Mol Psychiatry 2005, 10(3):309322. PubMed Abstract  Publisher Full Text

Sibille E, Arango V, Galfalvy HC, Pavlidis P, ErrajiBenchekroun L, Ellis SP, John Mann J: Gene expression profiling of depression and suicide in human prefrontal cortex.
Neuropsychopharmacology 2004, 29(2):351361. PubMed Abstract  Publisher Full Text

Belmaker RH, Agam G: Major depressive disorder.
N Engl J Med 2008, 358(1):5568. PubMed Abstract  Publisher Full Text

Nestler EJ, Barrot M, DiLeone RJ, Eisch AJ, Gold SJ, Monteggia LM: Neurobiology of depression.
Neuron 2002, 34(1):1325. PubMed Abstract  Publisher Full Text

Iwamoto K, Kakiuchi C, Bundo M, Ikeda K, Kato T: Molecular characterization of bipolar disorder by comparing gene expression profiles of postmortem brains of major mental disorders.
Mol Psychiatry 2004, 9(4):406416. PubMed Abstract  Publisher Full Text

Tochigi M, Iwamoto K, Bundo M, Sasaki T, Kato N, Kato T: Gene expression profiling of major depression and suicide in the prefrontal cortex of postmortem brains.
Neurosci Res 2008, 60(2):184191. PubMed Abstract  Publisher Full Text

Park T, Yi SG, Shin YK, Lee S: Combining multiple microarrays in the presence of controlling variables.
Bioinformatics 2006, 22(14):16821689. PubMed Abstract  Publisher Full Text

Mistry M, Pavlidis P: A crosslaboratory comparison of expression profiling data from normal human postmortem brain.
Neuroscience 2010, 167(2):384395. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fisher R: Statistic Methods for Research works. 4th edition. Edinburgh: Oliver and Boyd; 1932.

Fisher R: Combining independent tests of significance.
American Statistician 1948, 2(5):30. Publisher Full Text

Wilkinson B: A statistical consideration in psychological research.
Psychol Bull 1951, 48(3):156158. PubMed Abstract

Shen K, Tseng GC: Metaanalysis for pathway enrichment analysis when combining multiple genomic studies.
Bioinformatics 2010, 26(10):13161323. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lu S, Li J, Song C, Shen K, Tseng GC: Biomarker detection in the integration of multiple multiclass genomic studies.
Bioinformatics 2010, 26(3):333340. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (Eds): Bioinformatics and Computational Biology Solutions Using R and Bioconductor: Springer; 2005.

Benjamini Y, Hochberg Y: Controlling the false discovery rate  a practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society Series BMethodological 1995, 57(1):289300.

Schwarz GE: Estimating the dimension of a model.
Annals of Statistics 1978, 6(2):461464. Publisher Full Text

Glorioso C, Oh S, Douillard GG, Sibille E: Brain molecular aging, promotion of neurological disease and modulation by Sirtuin5 longevity gene polymorphism.

Glorioso C, Sibille E: Between destiny and disease: Genetics and molecular pathways of human central nervous system aging.

Hong F, Breitling R: A comparison of metaanalysis methods for detecting differentially expressed genes in microarray experiments.
Bioinformatics 2008, 24(3):374382. PubMed Abstract  Publisher Full Text

Birnbaum A: Combining independent tests of significance.
J Am Stat Assoc 1954, 49:559574. Publisher Full Text

Li J, Tseng GC: An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies.
Ann Appl Stat 2011, 5(2):9941019. Publisher Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci USA 2005, 102(43):1554515550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Campain A, Yang YH: Comparison study of microarray metaanalysis methods.

Kang DD, Sibille E, Kaminski N, Tseng GC: MetaQC: objective quality control and inclusion/exclusion criteria for genomic metaanalysis.