Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline

Chang, Lun-Ching; Lin, Hui-Min; Sibille, Etienne; Tseng, George C

doi:10.1186/1471-2105-14-368

Research article
Open access
Published: 21 December 2013

BMC Bioinformatics volume 14, Article number: 368 (2013)

Abstract

Background

As high-throughput genomic technologies become accurate and affordable, an increasing number of data sets have been accumulated in the public domain and genomic information integration and meta-analysis have become routine in biomedical research. In this paper, we focus on microarray meta-analysis, where multiple microarray studies with relevant biological hypotheses are combined in order to improve candidate marker detection. Many methods have been developed and applied in the literature, but their performance and properties have only been minimally investigated. There is currently no clear conclusion or guideline as to the proper choice of a meta-analysis method given an application; the decision essentially requires both statistical and biological considerations.

Results

We applied 12 microarray meta-analysis methods to combine multiple simulated expression profiles; these methods can be categorized by the hypothesis setting they target: (1) HS_A: DE genes with non-zero effect sizes in all studies, (2) HS_B: DE genes with non-zero effect sizes in one or more studies and (3) HS_r: DE genes with non-zero effect sizes in the "majority" of studies. We then performed a comprehensive comparative analysis through six large-scale real applications using four quantitative statistical evaluation criteria: detection capability, biological association, stability and robustness. We elucidated the hypothesis settings behind the methods and further applied multi-dimensional scaling (MDS) and an entropy measure to characterize the meta-analysis methods and data structure, respectively.

Conclusions

The aggregated results from the simulation study categorized the 12 methods into three hypothesis settings (HS_A, HS_B, and HS_r). Evaluation in real data and results from MDS and entropy analyses provided an insightful and practical guideline to the choice of the most suitable method in a given application. All source files for simulation and real data are available on the author’s publication website.

Background

Microarray technology has been widely used to identify differentially expressed (DE) genes in biomedical research in the past decade. Many transcriptomic microarray studies have been generated and made available in public domains such as the Gene Expression Omnibus (GEO) from NCBI (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress from EBI (http://www.ebi.ac.uk/arrayexpress/). From these databases, one can easily obtain multiple studies of a relevant biological or disease hypothesis. Since a single study often has a small sample size and limited statistical power, combining information across multiple studies is an intuitive way to increase sensitivity. Ramasamy et al. proposed seven-step practical guidelines for conducting microarray meta-analysis [1]: "(i) identify suitable microarray studies; (ii) extract the data from studies; (iii) prepare the individual datasets; (iv) annotate the individual datasets; (v) resolve the many-to-many relationship between probes and genes; (vi) combine the study-specific estimates; (vii) analyze, present, and interpret results". In the first step, although meta-analysis theoretically increases the statistical power to detect DE genes, performance can deteriorate if problematic or heterogeneous studies are combined. In many applications, the data inclusion/exclusion criteria are based on ad-hoc expert opinions, a naïve sample size threshold or selection of platforms without an objective quality control procedure. Kang et al. proposed six quantitative quality control measures (MetaQC) for deciding study inclusion [2]. Steps (ii)-(v) are related to data preprocessing. Finally, Steps (vi) and (vii) involve the selection of a meta-analysis method and interpretation of the results and are the foci of this paper.

Many microarray meta-analysis methods have been developed and applied in the literature. According to a recent review paper by Tseng et al. [3], popular methods mainly combine three different types of statistics: p-values, effect sizes and ranks. In this paper, we include 12 popular as well as state-of-the-art methods in the evaluation and comparison. Six methods (Fisher, Stouffer, adaptively weighted Fisher, minimum p-value, maximum p-value and r-th ordered p-value) belong to the p-value combination category, two methods (fixed effects model and random effects model) belong to the effect size combination category and four methods (RankProd, RankSum, product of ranks and sum of ranks) belong to the rank combination category. Details of these methods and citations are provided in the Methods section. Despite the availability of many methods, the pros and cons of these methods and a comprehensive evaluation remain largely missing in the literature. To our knowledge, Hong and Breitling [4] and Campain and Yang [5] are the only two comparative studies that have systematically compared multiple meta-analysis methods. The number of methods compared (three and five methods, respectively) and the number of real examples examined (two and three examples respectively, with each example covering 2-5 microarray studies) were, however, limited. The conclusions of the two papers were suggestive, with limited insights to guide practitioners. In addition, as we will discuss in the Methods section, different meta-analysis methods have different underlying hypothesis setting targets. As a result, the selection of an adequate (or optimal) meta-analysis method depends heavily on the data structure and the hypothesis setting needed to achieve the underlying biological goal.

In this paper, we compare 12 popular microarray meta-analysis methods using simulation and six real applications to benchmark their performance by four statistical criteria (detection capability, biological association, stability and robustness). Using simulation, we will characterize the strength of each method under three different hypothesis settings (i.e. detect DE genes in "all studies", "majority of studies" or "one or more studies"; see the Methods section for more details). We will compare the similarity and grouping of the meta-analysis methods based on their DE gene detection results (by using a similarity measure and a multi-dimensional scaling plot) and use an entropy measure to characterize the data structure to determine which hypothesis setting may be more adequate in a given application. Finally, we give a guideline to help practitioners select the best meta-analysis method under the choice of hypothesis setting in their applications.

Methods

Real data sets

Six example data sets for microarray meta-analysis were collected for evaluations in this paper. Each example contained 4-8 microarray studies. Five of the six examples were of the commonly seen two-group comparison and the last breast cancer example contained relapse-free survival outcome. We applied the MetaQC package [2] to assess quality of the studies for meta-analysis and determined the final inclusion/exclusion criteria. The principal component analysis (PCA) bi-plots and the six QC measures are summarized in Additional file 1: Figure S1, Tables S2 and S3. Details of the data sets are available in Additional file 1: Table S1.

Underlying hypothesis settings

Following the classical convention of Birnbaum [6] and Li and Tseng [7] (see also Tseng et al. [3]), meta-analysis methods can be classified into two complementary hypothesis settings. In the first hypothesis setting (denoted as HS_A), the goal is to detect DE genes that have non-zero effect sizes in all studies:

H_{0} : \cap_{k = 1}^{K} \{θ_{k} = 0\} versus H_{a} : \cap_{k = 1}^{K} \{θ_{k} \neq 0\} (H S_{A})

where θ_k is the effect size of study k. The second hypothesis setting (denoted as HS_B), however, aims to detect a DE gene if it has non-zero effect size in "one or more" studies:

H_{0} : \cap_{k = 1}^{K} \{θ_{k} = 0\} versus H_{a} : \cup_{k = 1}^{K} \{θ_{k} \neq 0\} (H S_{B})

In most applications, HS_A is more appropriate to detect conserved and consistent candidate markers across all studies. However, different degrees of heterogeneity can exist in the studies and HS_B can be useful to detect study-specific markers (e.g. studies from different tissues are combined and tissue-specific markers are expected and of interest). Since HS_A is often too conservative when many studies are combined, Song and Tseng (2012) proposed a more practical and robust hypothesis setting (namely HS_r) that targets DE genes with non-zero effect sizes in the "majority" of studies, where the majority of studies is defined as, for example, more than 50% of combined studies (i.e. r ≥ 0.5⋅K). The robust hypothesis setting considered was:

H_{0} : \cap_{k = 1}^{K} \{θ_{k} = 0\} versus H_{a} : \sum_{k = 1}^{K} I \{θ_{k} \neq 0\} \geq r (H S_{r})

A major contribution of this paper is to characterize meta-analysis methods suitable for different hypothesis settings (HS_A, HS_B and HS_r) using simulation and real applications and to compare their performance with four benchmarks to provide a practical guideline.

Microarray meta-analysis data pre-processing

Assume that we have K microarray studies to combine. For study k (1 ≤ k ≤ K), denote by x_gsk the gene expression intensity of gene g (1 ≤ g ≤ G) and sample s (1 ≤ s ≤ S_k; S_k the number of samples in study k), and y_sk the disease/outcome variable of sample s. The disease/outcome variable can be binary, multi-class, continuous or censored, representing the disease state, severity or prognosis outcome (e.g. tumor versus normal or recurrence survival time). The goal of microarray meta-analysis is to combine information of K studies to detect differentially expressed (DE) genes associated with the disease/outcome variable. Such DE genes serve as candidate markers for disease classification, diagnosis or prognosis prediction and help understand the genetic mechanisms underlying a disease. In this paper, before meta-analysis we first applied the penalized t-statistic to each individual study to generate p-values or DE ranks [8] for a binary outcome. In contrast to the traditional t-statistic, the penalized t-statistic adds a fudge parameter s₀ to stabilize the denominator ($T = (\bar{X} - \bar{Y}) / (\hat{s} + s_{0})$; $\bar{X}$ and $\bar{Y}$ are the means of the case and control groups) and to avoid a large t-statistic due to a small estimated variance $\hat{s}$. The p-values were calculated using the null distributions derived from conventional non-parametric permutation analysis by randomly permuting the case and control labels 10,000 times [9]. For censored outcome variables, the Cox proportional hazards model and log-rank test were used [10]. Meta-analysis methods (described in the next subsection) were then used to combine information across studies and generate meta-analyzed p-values. To account for multiple comparisons, the Benjamini and Hochberg procedure was used to control the false discovery rate (FDR) [11]. All methods were implemented using the "MetaDE" package in R [12]. Data sets and all programming codes are available at http://www.biostat.pitt.edu/bioinfo/publication.htm.
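As a rough illustration of the per-study preprocessing described above, the following Python sketch computes a penalized t-statistic and Benjamini-Hochberg adjusted p-values. It is not the MetaDE implementation: the function names are hypothetical, and taking $\hat{s}$ to be the pooled standard error of the mean difference is an assumption made here for concreteness.

```python
import math

def penalized_t(cases, controls, s0):
    """Penalized t-statistic T = (mean(cases) - mean(controls)) / (s_hat + s0)."""
    n1, n2 = len(cases), len(controls)
    m1 = sum(cases) / n1
    m2 = sum(controls) / n2
    ss1 = sum((x - m1) ** 2 for x in cases)
    ss2 = sum((y - m2) ** 2 for y in controls)
    # Assumption: s_hat is the pooled standard error of the mean difference.
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    s_hat = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / (s_hat + s0)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With s0 = 0 the statistic reduces to the ordinary pooled-variance t-statistic; a positive s0 shrinks it toward zero, most strongly for genes whose estimated variability is small.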

Microarray meta-analysis methods

According to a recent review paper [3], microarray meta-analysis methods can be categorized into three types: combine p-values, combine effect sizes and combine ranks. Below, we briefly describe 12 methods that were selected for comparison.

Combine p-values

Fisher The Fisher’s method [13] sums up the log-transformed p-values obtained from individual studies. The combined Fisher’s statistic $χ_{Fisher}^{2} = - 2 \sum_{i = 1}^{K} \log (P_{i})$ follows a χ² distribution with 2K degrees of freedom under the null hypothesis (assuming null p-values are uniformly distributed). Note that we perform permutation analysis instead of such parametric evaluation for Fisher and other methods in this paper. Smaller p-values contribute larger scores to the Fisher’s statistic.

Stouffer Stouffer’s method [14] sums the inverse normal transformed p-values. Stouffer’s statistic $T_{Stouffer} = \sum_{i = 1}^{K} z_{i} / \sqrt{K}$ (where $z_{i} = Φ^{- 1} (p_{i})$ and Φ is the standard normal c.d.f.) follows a standard normal distribution under the null hypothesis. Similar to Fisher’s method, smaller p-values contribute more to the Stouffer’s score, but to a lesser degree.

Adaptively weighted (AW) Fisher The AW Fisher’s method [7] assigns different weights to each individual study, $T_{AW} = - \sum_{k = 1}^{K} w_{k} \cdot \log (P_{k}), w_{k} = 0 \text{ or } 1,$ and it searches through all possible weights to find the best adaptive weight with the smallest derived p-value. One significant advantage of this method is its ability to indicate which studies contribute to the evidence aggregation and to elucidate heterogeneity in the meta-analysis. Details are provided in Additional file 1.

Minimum p-value (minP) The minP method takes the minimum p-value among the K studies as the test statistic [15]. It follows a beta distribution with degrees of freedom α = 1 and β = K under the null hypothesis. This method detects a DE gene whenever a small p-value exists in any one of the K studies.

Maximum p-value (maxP) The maxP method takes the maximum p-value as the test statistic [16]. It follows a beta distribution with degrees of freedom α = K and β = 1 under the null hypothesis. This method targets DE genes that have small p-values in "all" studies.

r-th ordered p-value (rOP) The rOP method takes the r-th order statistic among the sorted p-values of the K combined studies. Under the null hypothesis, the statistic follows a beta distribution with degrees of freedom α = r and β = K - r + 1. The minP and maxP methods are special cases of rOP. In Song and Tseng [17], rOP is considered a robust form of maxP (where r is set greater than 0.5∙K) to identify candidate markers differentially expressed in the "majority" of studies.
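The parametric null distributions quoted above can be sketched in a few lines of Python. This is only an illustration of the closed-form nulls (the paper itself evaluates significance by permutation), the function names are hypothetical, and the one-sided evaluation of Stouffer's statistic is an assumption that follows the convention $z_{i} = Φ^{- 1} (p_{i})$ used in the text. All input p-values are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist

def fisher_p(pvals):
    """Fisher: -2 * sum(log p) referred to a chi-square with 2K d.f."""
    K = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the Fisher statistic
    # Closed-form chi-square survival function for even degrees of freedom.
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(K))

def stouffer_p(pvals):
    """Stouffer: sum of z_i = Phi^{-1}(p_i) divided by sqrt(K), standard normal null."""
    nd = NormalDist()
    t = sum(nd.inv_cdf(p) for p in pvals) / math.sqrt(len(pvals))
    return nd.cdf(t)  # small input p-values give a very negative t

def rop_p(pvals, r):
    """rOP: r-th smallest p-value referred to Beta(r, K - r + 1)."""
    K = len(pvals)
    x = sorted(pvals)[r - 1]
    # Beta(r, K - r + 1) CDF at x equals P(Binomial(K, x) >= r).
    return sum(math.comb(K, j) * x ** j * (1 - x) ** (K - j) for j in range(r, K + 1))

def minp_p(pvals):
    return rop_p(pvals, 1)           # Beta(1, K)

def maxp_p(pvals):
    return rop_p(pvals, len(pvals))  # Beta(K, 1)
```

For a gene with p-values (0.01, 0.2, 0.5), minP is driven by the single small p-value while maxP is dominated by the largest one, which mirrors the "one or more studies" versus "all studies" targets of the two methods.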

Combine effect size

Fixed effects model (FEM) FEM combines the effect size across K studies by assuming a simple linear model with an underlying true effect size plus a random error in each study.

Random effects model (REM) REM [18] extends FEM by allowing random effects for the inter-study heterogeneity in the model. Detailed formulation and inference of FEM and REM are available in the Additional file 1.

Combine rank statistics

RankProd (RP) and RankSum (RS) RankProd and RankSum are based on the common biological belief that if a gene is repeatedly at the top of the lists ordered by up- or down-regulation fold change in replicate experiments, the gene is more likely a DE gene [19]. Detailed formulation and algorithms are available in the Additional file 1.

Product of ranks (PR) and Sum of ranks (SR) These two methods apply a naïve product or sum of the DE evidence ranks across studies [20]. Suppose R_gk represents the rank of p-value of gene g among all genes in study k. The test statistics of PR and SR methods are calculated as $P R_{g} = \prod_{k = 1}^{K} R_{gk}$ and $S R_{g} = \sum_{k = 1}^{K} R_{gk},$ respectively. P-values of the test statistics can be calculated analytically or obtained from a permutation analysis. Note that the ranks taken from the smallest to largest (the choice in the method) are more sensitive than ranking from largest to smallest in the PR method, while it makes no difference to SR.
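A minimal Python sketch of the PR and SR statistics follows. The function names and the layout of the input (one list of p-values per study) are illustrative assumptions, and ties among p-values are not handled.

```python
import math

def rank_within_study(pvals):
    """Rank genes within one study; rank 1 is the smallest p-value (no ties assumed)."""
    order = sorted(range(len(pvals)), key=lambda g: pvals[g])
    ranks = [0] * len(pvals)
    for rank, g in enumerate(order, start=1):
        ranks[g] = rank
    return ranks

def pr_sr(p_matrix):
    """p_matrix[k][g] is the p-value of gene g in study k; returns (PR, SR) per gene."""
    rank_rows = [rank_within_study(row) for row in p_matrix]
    n_genes = len(p_matrix[0])
    pr = [math.prod(row[g] for row in rank_rows) for g in range(n_genes)]
    sr = [sum(row[g] for row in rank_rows) for g in range(n_genes)]
    return pr, sr
```

Small PR or SR values indicate genes that rank near the top in every study; significance would still have to be assessed analytically or by permutation, as noted above.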

Characterization of meta-analysis methods

MDS plots to characterize the methods

The multi-dimensional scaling (MDS) plot is a useful visualization tool for exploring high-dimensional data in a low-dimensional space [21]. In the evaluation of 12 meta-analysis methods, we calculated the adjusted DE similarity measure for every pair of methods to quantify the similarity of their DE analysis results in a given example. A dissimilarity measure is then defined as one minus the adjusted DE similarity measure and the dissimilarity measure is used to generate an MDS plot of the 12 methods. In the MDS plot, methods that are clustered in a neighborhood indicate that they produce similar DE analysis results.

Entropy measure to characterize data sets

As indicated in the Section of "Underlying hypothesis settings", selection of the most suitable meta-analysis method(s) largely depends on their underlying hypothesis setting (HS_A, HS_B and HS_r). The selection of a hypothesis setting for a given application should be based on the experimental design, biological knowledge and the associated analytical objectives. There are, however, occasions that little prior knowledge or preference is available and an objective characterization of the data structure is desired in a given application. For this purpose, we developed a data-driven entropy measure to characterize whether a given meta-analysis data set contains more HS_A-type markers or HS_B-type markers [22]. The algorithm is described below:

1. Apply Fisher’s meta-analysis method to combine p-values across studies to identify the top H candidate markers. Here we used H = 1,000, where H represents the rough number of DE genes (in our belief) that are contained in the data.
2. For each selected marker, the standardized minus log p-value score for gene g in the k-th study is defined as $l_{gk} = - \log (p_{gk}) / ( - \sum_{j = 1}^{K} \log (p_{gj}) ) .$ Note that 0 ≤ l_gk ≤ 1, a large l_gk corresponds to a more significant p-value p_gk, and $\sum_{k = 1}^{K} l_{gk} = 1 .$
3. The entropy of gene g is defined as $e_{g} = - \sum_{k = 1}^{K} l_{gk} \log (l_{gk})$ . Box-plots of entropies of the top H genes are generated for each meta-analysis application (Figure 1(b)).

Intuitively, a high entropy value indicates that the gene has small p-values in all or most studies and is of HS_A or HS_r-type. Conversely, genes with small entropy have small p-values in one or only few studies where HS_B-type methods are more adequate. When calculating l_gk in step 2, we capped -log(p_gk) at 10 to avoid contributions of close-to-zero p-values that can generate near-infinite scores. The entropy box-plot helps determine an appropriate meta-analysis hypothesis setting if no pre-set biological objective exists.
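Steps 2 and 3 of the algorithm, including the cap on -log(p_gk), can be sketched as follows. The function name is hypothetical, natural logarithms are assumed, and the sketch assumes at least one of the K p-values is below 1 so that the normalizing sum is positive.

```python
import math

def gene_entropy(pvals, cap=10.0):
    """Entropy of the standardized -log(p) scores of one gene across K studies."""
    # Cap -log(p) so that near-zero p-values cannot dominate the scores.
    scores = [min(-math.log(p), cap) for p in pvals]
    total = sum(scores)
    l = [s / total for s in scores]
    # Terms with l_gk = 0 contribute 0 (the limit of x * log(x) as x -> 0).
    return -sum(x * math.log(x) for x in l if x > 0)
```

A gene with equal p-values in all K studies attains the maximum entropy log(K), whereas a gene that is significant in a single study and has p-values of 1 elsewhere has entropy 0.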

Evaluation criteria

For objective quantitative evaluation, we developed the following four statistical criteria to benchmark performance of the methods.

Detection capability

The first criterion considers the number of DE genes detected by each meta-analysis method under the same pre-set FDR threshold (e.g. FDR = 1%). Although detecting more DE genes does not guarantee better "statistical power", this criterion has served as a surrogate of statistical power in previous comparative studies [23]. Since we do not know the underlying true DE genes, we refer to this evaluation as "detection capability" in this paper. An implicit assumption underlying this criterion is that the statistical procedure to detect DE genes in each study and the FDR control in the meta-analysis are accurate (or roughly accurate). To account for data variability in the evaluation, we bootstrapped (i.e. sampled with replacement to obtain the same number of samples in each bootstrapped dataset) the samples in each study B = 50 times and show plots of the mean with standard error bars. In the bootstrapping, the entire sample is either selected or not, so the gene dependence structure is maintained. Denote by r_meb the rank of detection capability performance (the smaller the better) of method m (1 ≤ m ≤ 12) in example e (1 ≤ e ≤ 6) and in the b-th (1 ≤ b ≤ B) bootstrap simulation. The mean standardized rank (MSR) for method m and example e is calculated as ${MSR}_{me} = \sum_{b = 1}^{B} (r_{meb} / \text{\# of methods compared}) / B$ and the aggregated standardized rank (ASR) is calculated as ${ASR}_{m} = \sum_{e = 1}^{6} {MSR}_{me} / 6,$ representing the overall performance of method m across all six examples. Additional file 1: Table S4 shows the MSR and ASR of all 12 methods and Figure 2 (in the Results section) shows a plot of the mean with standard error bars for each method ordered by ASR. We note that MSR and ASR are both standardized between 0 and 1. The standardization in MSR is necessary because in the breast cancer survival example we cannot apply FEM, REM, RankSum and RankProd, as they are developed only for a two-group comparison.
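The two summary ranks can be written out directly. The short Python sketch below uses hypothetical function names and simply restates the MSR and ASR formulas; dividing by the number of methods compared is what keeps examples with fewer applicable methods on the same 0 to 1 scale.

```python
def mean_standardized_rank(ranks, n_methods):
    """MSR_me: average over bootstrap runs of (rank / number of methods compared)."""
    return sum(r / n_methods for r in ranks) / len(ranks)

def aggregated_standardized_rank(msr_by_example):
    """ASR_m: average of MSR_me over the examples."""
    return sum(msr_by_example) / len(msr_by_example)
```

For instance, a method ranked last among 12 methods and a method ranked last among 8 methods both receive a standardized rank of 1.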

Biological association

The second criterion requires that a good meta-analysis method should detect a DE gene list that has better association with pre-defined "gold standard" pathways related to the targeted disease. Such a "gold standard" pathway set should be obtained from biological knowledge for a given disease or biological mechanism under investigation. However, since most disease or biological mechanisms are not well-studied, obtaining such "gold standard" pathways is either difficult or questionable. To facilitate this evaluation without bias, we develop a computational and data-driven approach to determine a set of surrogate disease-related pathways out of a large collection of pathways by combining pathway enrichment analysis results from each single study. Specifically, we first collected 2,287 pathways (gene sets) from MSigDB (http://www.broadinstitute.org/gsea/msigdb/): 1,454 pathways from "GO", 186 pathways from "KEGG", 217 pathways from "BIOCARTA" and 430 pathways from "REACTOME", respectively. We filtered out pathways with less than 5 genes or more than 200 genes and 2,113 pathways were left for the analysis. DE analysis was performed in each single study separately and pathway enrichment analysis was performed for all the 2,113 pathways by the Kolmogorov-Smirnov (KS) association test. Denote by p_uk the resulting pathway enrichment p-value from KS test for pathway u (1 ≤ u ≤ 2,113) and study k (1 ≤ k ≤ K). For a given study k, enrichment ranks over pathways were calculated as r_uk = rank_u(p_uk). A rank-sum score for a given pathway u was then derived as $S_{u} = \sum_{k = 1}^{K} r_{uk} .$ Intuitively, pathways with small rank-sum scores indicate that they are likely associated with the disease outcome by aggregated evidence of the K individual study analyses. 
We chose the top |D| pathways that had the smallest rank-sum scores as the surrogate disease-related pathways and used these to proceed with the biological association evaluation of meta-analysis methods in the following.

Given the selected surrogate pathways D, the following procedure was used to evaluate performance of the 12 meta-analysis methods for a given example e (1 ≤ e ≤ 6). For each meta-analysis method m (1 ≤ m ≤ M = 12), the DE analysis result was associated with pathway d and the resulting enrichment p-value by KS test was denoted by ${\tilde{P}}_{med} (1 \leq d \leq | D |) .$ The rank of ${\tilde{P}}_{med}$ for method m among the 12 methods was denoted by $v_{med} = {rank}_{m} ({\tilde{P}}_{med}) .$ Similar to the detection capability evaluation, we calculated the mean standardized rank (MSR) for method m and example e as ${MSR}_{me} = \sum_{d = 1}^{| D |} (v_{med} / \text{\# of methods compared}) / | D |$ and the aggregated standardized rank (ASR) as ${ASR}_{m} = \sum_{e = 1}^{6} {MSR}_{me} / 6,$ representing the overall performance of method m. To select the parameter |D| for surrogate disease-related pathways, Additional file 1: Figure S4 shows the trend of MSR_me (on the y-axis) versus |D| (on the x-axis) as |D| increases. The result indicated that the performance evaluation using different |D| only minimally impacted the conclusion when |D| > 30. We chose |D| = 100 throughout this paper.

Note that we used KS test, instead of the popular Fisher’s exact test because each single study detected variable number of DE genes under a given FDR cutoff and the Fisher’s exact test is usually not powerful unless a few hundred DE genes are detected. On the other hand, the KS test does not require an arbitrary p-value cutoff to determine the DE gene list for enrichment analysis.

Stability

The third criterion examines whether a meta-analysis method generates stable DE analysis result. To achieve this goal, we randomly split samples into half in each study (so that cases and controls are as equally split as possible). The first half of each study was taken to perform the first meta-analysis and generate a DE analysis result. Similarly, the second half of each study was taken to perform a second meta-analysis. The generated DE analysis results from two separate meta-analyses were compared by the adjusted DE similarity measure (to be described in the next section). The procedure is repeated for B = 50 times. Denote by S_meb the adjusted DE similarity measure of method m of the b^th simulation in example e. Similar to the first two criteria, MSR and ASR were calculated based on S_meb to evaluate the methods.

Robustness

The final criterion investigates the robustness of a meta-analysis method when an outlying irrelevant study is mistakenly added to the meta-analysis. For each of the six real examples, we randomly picked one irrelevant study from the other five examples, added it to the specific example for meta-analysis and evaluated the change from the original meta-analysis. The adjusted DE similarity measure was calculated between the original meta-analysis and the new meta-analysis with an added outlier. A high adjusted DE similarity measure shows better robustness against inclusion of the outlying study. This procedure was repeated until all irrelevant studies were used. The MSR and ASR are then calculated based on the adjusted DE similarity measures to evaluate the methods.

Similarity measure between two ordered DE gene lists

To compare results of two DE detection methods (from single study analysis or meta-analysis), a commonly used method in the literature is to take the DE genes under a certain p-value or FDR threshold, plot the Venn diagram and compute the ratio of overlap. This method, however, greatly depends on the selection of the FDR threshold and is unstable. Another approach is to take the generated DE ordered gene lists from two methods and compute the non-parametric Spearman rank correlation [24]. This method avoids the arbitrary FDR cutoff but gives, say, the top 100 important DE genes and the bottom 100 non-DE genes equal contribution. To circumvent this pitfall, Li et al. proposed a parametric reproducibility measure for ChIP-seq data in the ENCODE project [25]. Yang et al. introduced an OrderedList measure to quantify similarity of two ordered DE gene lists [26]. For simplicity, we extended the OrderedList measure into a standardized similarity score for the evaluation purpose in this paper. Specifically, suppose G₁ and G₂ are two ordered DE gene lists (e.g. ordered by p-values) and small ranks represent more significant DE genes. We denote by O_n(G₁, G₂) the number of overlapped genes in the top n genes of G₁ and G₂. As a result, 0 ≤ O_n(G₁, G₂) ≤ n and a large O_n(G₁, G₂) value indicates high similarity of the two ordered lists in the top n genes. A weighted average similarity score is calculated as $S (G_{1}, G_{2}) = \sum_{n = 1}^{G} e^{- α n} \cdot O_{n} (G_{1}, G_{2}),$ where G is the total number of matched genes and the power α controls the magnitude of weights emphasized on the top ranked genes. When α is large, top ranked genes are weighted higher in the similarity measure. The expected value (under the null hypothesis that the two gene rankings are randomly generated) and maximum value of S can be easily calculated: $E_{null} (S (G_{1}, G_{2})) = \sum_{n = 1}^{G} e^{- α n} \cdot n^{2} / G$ and $max (S (G_{1}, G_{2})) = \sum_{n = 1}^{G} e^{- α n} \cdot n .$ We apply an idea similar to the adjusted Rand index [27] used to measure similarity of two clustering results and define the adjusted DE similarity measure as

S^{*} (G_{1}, G_{2}) = \frac{S (G_{1}, G_{2}) - E_{null} (S (G_{1}, G_{2}))}{Max (S (G_{1}, G_{2})) - E_{null} (S (G_{1}, G_{2}))}

This measure ranges between -1 and 1 and gives an expected value of 0 if two ordered gene lists are obtained by random chance. Yang et al. proposed resampling-based and ROC methods to estimate the best selection of α. Since the number of DE genes in our examples is generally high, we chose a relatively small α = 0.001 throughout this paper. We have tested different α and found that the results were similar (Additional file 1: Figure S7).
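The adjusted DE similarity measure can be computed in a single pass over the two ordered lists. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, both lists are assumed to contain the same G ≥ 2 matched genes exactly once each, and the default α = 0.001 follows the choice stated in the text.

```python
import math

def adjusted_de_similarity(list1, list2, alpha=0.001):
    """Adjusted similarity S* of two ordered gene lists (most significant first)."""
    G = len(list1)
    seen1, seen2 = set(), set()
    overlap = 0                      # O_n: genes shared by the two top-n lists
    s = e_null = s_max = 0.0
    for n in range(1, G + 1):
        g1, g2 = list1[n - 1], list2[n - 1]
        if g1 == g2:
            overlap += 1
        else:
            # Each new gene adds to the overlap if the other list already has it.
            overlap += (g1 in seen2) + (g2 in seen1)
        seen1.add(g1)
        seen2.add(g2)
        w = math.exp(-alpha * n)
        s += w * overlap             # S
        e_null += w * n * n / G      # E_null(S)
        s_max += w * n               # max(S)
    return (s - e_null) / (s_max - e_null)
```

Two identical orderings give a value of 1, and a fully reversed ordering gives a negative value because the top-ranked genes overlap less than expected by chance.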

Results

Simulation setting

We conducted simulation studies to evaluate and characterize the 12 meta-analysis methods for detecting biomarkers in the underlying hypothesis settings of HS_A, HS_B or HS_r. The simulation algorithm is described below:

1. We simulated 2,000 genes: 800 genes in 40 gene clusters (20 genes in each cluster) and another 1,200 genes that do not belong to any cluster. The cluster indexes C_g for gene g (1 ≤ g ≤ 2,000) were randomly sampled, such that ∑ I{C_g = 0} = 1,200 and ∑ I{C_g = c} = 20, 1 ≤ c ≤ 40.
2. For genes in cluster c (1 ≤ c ≤ 40) and in study k (1 ≤ k ≤ 5), we sampled $Σ_{ck}^{'} \sim W^{- 1} (Ψ, 60),$ where $Ψ = 0.5 I_{20 × 20} + 0.5 J_{20 × 20},$ $W^{- 1}$ denotes the inverse Wishart distribution, I is the identity matrix and J is the matrix with all elements equal to 1. We then standardized $Σ_{ck}^{'}$ into Σ_ck, where the diagonal elements are all 1’s.
3. The 20 genes in cluster c were denoted by the indexes g_c1, …, g_c20, i.e. $C_{g_{cj}} = c,$ where 1 ≤ c ≤ 40 and 1 ≤ j ≤ 20. We sampled gene expression levels of genes in cluster c for sample n as ${(X_{g_{c 1} nk}^{'}, \dots, X_{g_{c 20} nk}^{'})}^{T} \sim MVN (0, Σ_{ck}),$ where 1 ≤ n ≤ 100 and 1 ≤ k ≤ 5, and sampled the expression level of each gene g that is not in any cluster for sample n as $X_{gnk}^{'} \sim N (0, σ_{k}^{2}),$ where 1 ≤ n ≤ 100, 1 ≤ k ≤ 5 and $σ_{k}^{2}$ was uniformly distributed on [0.8, 1.2], which gives a different variance for each study k.
4. For the first 1,000 genes (1 ≤ g ≤ 1,000), k_g (the number of studies in which gene g is differentially expressed) was generated by sampling from {1, 2, 3, 4, 5}. For the next 1,000 genes (1,001 ≤ g ≤ 2,000), k_g = 0, representing non-DE genes in all five studies.
5. To simulate expression intensities for cases, we randomly sampled δ_gk ∈ {0, 1}, such that ∑_k δ_gk = k_g. If δ_gk = 1, gene g in study k was a DE gene; otherwise it was a non-DE gene. When δ_gk = 1, we sampled effect sizes μ_gk from a uniform distribution on [0.5, 3], which means we considered concordant (up-regulated) effects among all simulated studies. Hence, the expression levels for control samples are $X_{gnk} = X_{gnk}^{'}$ and for case samples are $Y_{gnk} = X_{g (n + 50) k}^{'} + μ_{gk} \cdot δ_{gk},$ for 1 ≤ g ≤ 2,000, 1 ≤ n ≤ 50 and 1 ≤ k ≤ 5.
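Steps 4 and 5 of the algorithm (the DE pattern and effect sizes) can be sketched with the Python standard library alone. This is a partial illustration, not the authors' simulation code: it omits the correlated expression values of steps 1 to 3, the function name and seed are hypothetical, and sampling k_g with equal probability from 1 to 5 is an assumption consistent with the roughly equal group sizes reported for the simulation.

```python
import random

def simulate_effects(n_genes=2000, n_de=1000, n_studies=5, seed=0):
    """Return k_g, the DE indicators delta[g][k] and the effect sizes mu[g][k]."""
    rng = random.Random(seed)
    k_g, delta, mu = [], [], []
    for g in range(n_genes):
        # Assumption: k_g is uniform on {1, ..., n_studies} for the first n_de genes.
        k = rng.randint(1, n_studies) if g < n_de else 0
        de_studies = set(rng.sample(range(n_studies), k))
        d_row = [1 if s in de_studies else 0 for s in range(n_studies)]
        # Concordant (up-regulated) effect sizes on [0.5, 3] where the gene is DE.
        m_row = [rng.uniform(0.5, 3.0) if d else 0.0 for d in d_row]
        k_g.append(k)
        delta.append(d_row)
        mu.append(m_row)
    return k_g, delta, mu
```

The case-sample values would then be obtained by adding mu[g][k] * delta[g][k] to the baseline expression values generated in steps 1 to 3.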

In the simulation study, we had 1,000 non-DE genes in all five studies (k_g = 0), and 1,000 genes were differentially expressed in 1 ~ 5 studies (k_g = 1, 2, 3, 4, 5). On average, we had roughly the same number (~200) of genes in each group of k_g = 1, 2, 3, 4, 5. See Additional file 1: Figure S2 for the heatmap of a simulated example (red colour represents up-regulated genes). We applied the 12 meta-analysis methods under FDR control at 5%. With the knowledge of the true k_g, we were able to derive the sensitivity and specificity for HS_A and HS_B, respectively. In HS_A, genes with k_g = 5 were the underlying true positives and genes with k_g = 0 ~ 4 were the underlying true negatives; in HS_B, genes with k_g = 1 ~ 5 were the underlying true positives and genes with k_g = 0 were the true negatives. By adjusting the decision cut-off, the receiver operating characteristic (ROC) curves and the resulting area under the curve (AUC) were used to evaluate the performance. We simulated 50 data sets and reported the means and standard errors of the AUC values. AUC values range between 0 and 1. AUC = 0.5 represents a random guess and AUC = 1 represents perfect prediction. The above simulation scheme only considered concordant effect sizes (i.e. all with up-regulation when a gene is DE in a study) among the five simulated studies. In many applications, some genes may have statistically significant p-values in the meta-analysis but discordant effect sizes (i.e. a gene is up-regulated in one study but down-regulated in another study). To investigate that effect, we performed a second simulation that considers random discordant cases. In step 5, μ_gk became a mixture of two uniform distributions: π_gk ⋅ Unif[-3, -0.5] + (1 - π_gk) ⋅ Unif[0.5, 3], where π_gk is the probability of gene g (1 ≤ g ≤ 2,000) in study k (1 ≤ k ≤ 5) having a discordant effect size (down-regulated). We set π_gk = 0.2 for the discordant simulation setting.
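The truth definitions and AUC evaluation described above can be illustrated as follows. The function names are hypothetical, the score is assumed to be any quantity for which larger values mean stronger DE evidence (for example, minus the log of the meta-analyzed p-value), and the AUC is computed through the equivalent Mann-Whitney pairwise comparison rather than by tracing the ROC curve.

```python
def truth_labels(k_g, n_studies, setting):
    """True positives under HS_A (DE in all studies) or HS_B (DE in at least one)."""
    if setting == "HS_A":
        return [k == n_studies for k in k_g]
    if setting == "HS_B":
        return [k >= 1 for k in k_g]
    raise ValueError("setting must be 'HS_A' or 'HS_B'")

def auc(scores, labels):
    """AUC as P(positive score > negative score), counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a toy example, a method that scores only the genes with k_g = 5 highly attains an AUC of 1 under HS_A but can fall to 0.5 under HS_B, since genes that are DE in only a few studies are then counted as positives that it fails to rank above the non-DE genes.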

Simulation results to characterize the methods

The simulation study provided the underlying truth needed to characterize the meta-analysis methods according to their strengths and weaknesses in detecting DE genes under different hypothesis settings. The performance of the 12 methods was evaluated by receiver operating characteristic (ROC) curves, a visualization tool that illustrates the trade-off between sensitivity and specificity, and by the resulting area under the ROC curve (AUC) under the two hypothesis settings HS_A and HS_B. Table 1 shows the number of DE genes detected at a nominal FDR of 5%, the true FDR and the AUC values under HS_A and HS_B for all 12 methods. The values were averaged over 50 simulations, and the standard errors are shown in parentheses.
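To make the distinction between nominal and true FDR concrete, the sketch below declares genes DE with the Benjamini-Hochberg step-up procedure and then computes the true FDR from the known k_g. The use of Benjamini-Hochberg is an assumption of this sketch (the FDR procedure is not specified in this passage), as are the function names.

```python
def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: return a list of booleans (True = declared DE)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


def true_fdr(rejected, is_true_de):
    """Fraction of declared DE genes that are true negatives (0.0 if none declared)."""
    detected = [truth for r, truth in zip(rejected, is_true_de) if r]
    if not detected:
        return 0.0
    return sum(1 for truth in detected if not truth) / len(detected)
```

With the true k_g in hand, `is_true_de` would be `[k == 5 for k in k_g]` under HS_A and `[k >= 1 for k in k_g]` under HS_B. The same set of detected genes can therefore have a low true FDR under HS_B and a high one under HS_A.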

Table 1 The detected number of DE genes (at FDR = 5%), the true FDR, AUC values under HS_A and HS_B and the concluding characterization of the targeted hypothesis setting of each method

Figure 3 shows the histogram of the true number of DE studies (i.e. k_g) among the DE genes detected at FDR = 5% for each method. minP, Fisher, AW, Stouffer and FEM clearly detected HS_B-type DE genes and had high AUC values under the HS_B criterion (0.98-0.99), compared with lower AUC values under the HS_A criterion (0.79-0.90). For these methods, the true FDR under HS_A was generally not controlled (0.41-0.44). On the other hand, maxP, rOP and REM had high AUC values under the HS_A criterion (0.96-0.99; true FDR = 0.068-0.117) compared with HS_B (0.75-0.92). maxP detected mostly HS_A-type markers, whereas rOP and REM detected mostly HS_r-type DE genes. PR and SR detected mostly HS_A-type DE genes but, surprisingly, had very high AUC values under both the HS_A and HS_B criteria. The RankProd method detected DE genes between the HS_r and HS_B types and had a good AUC value under HS_B. The RankSum method detected HS_B-type DE genes but had poor AUC values (0.5) under both HS_A and HS_B. Table 1 includes our concluding characterization of the targeted hypothesis setting for each meta-analysis method (see also Additional file 1: Figure S5 for the ROC curves and AUC values under HS_A and HS_B for the 12 meta-analysis methods).

Additional file 1: Figure S3 shows the results for the second, discordant simulation setting. The number of studies with an opposite effect size is represented by different colours in the histogram (green: all studies have concordant effect sizes; blue: one study has an effect size opposite to the remaining studies; red: two studies have effect sizes opposite to the remaining studies). In summary, almost none of the meta-analysis methods could avoid including genes with opposite effect sizes. In particular, methods that use p-values from two-sided tests (e.g. Fisher, AW, minP, maxP and rOP) could not distinguish the direction of effect sizes. Stouffer was the only method that accommodated the effect size direction in its z-transformation formulation, but its ability to avoid DE genes with discordant effect sizes still seemed limited. Owen (2009) proposed a one-sided correction procedure for Fisher’s method to avoid detection of discordant effect sizes in meta-analysis [28]. The null distribution of the new statistic, however, became difficult to derive. The approach can potentially be extended to other methods, and further research will be needed on this issue.
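The direction issue can be illustrated with a small sketch that takes per-study z-scores for one gene and compares three combined statistics: Fisher's method on two-sided p-values, a signed Stouffer sum, and the maximum of the two one-sided Fisher statistics. The last is written in the spirit of the one-sided correction mentioned above; whether it matches the cited procedure in every detail is an assumption, the function names are hypothetical, and the reported p-value for it is only a conservative union (Bonferroni) bound, not the exact null distribution.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_sf_even(q, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    half = q / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_two_sided(z):
    """Fisher's statistic on two-sided p-values: ignores the sign of each z."""
    p = [2.0 * (1.0 - _STD_NORMAL.cdf(abs(zi))) for zi in z]
    stat = -2.0 * sum(math.log(pi) for pi in p)
    return stat, chi2_sf_even(stat, 2 * len(z))


def stouffer_signed(z):
    """Stouffer's statistic on signed z-scores: opposite signs partly cancel."""
    stat = sum(z) / math.sqrt(len(z))
    return stat, 2.0 * (1.0 - _STD_NORMAL.cdf(abs(stat)))


def fisher_one_sided_max(z):
    """Max of the up- and down-regulation Fisher statistics (ASSUMED form)."""
    q_up = -2.0 * sum(math.log(1.0 - _STD_NORMAL.cdf(zi)) for zi in z)
    q_down = -2.0 * sum(math.log(_STD_NORMAL.cdf(zi)) for zi in z)
    stat = max(q_up, q_down)
    # Union bound: P(max >= q) <= P(q_up >= q) + P(q_down >= q).
    return stat, min(1.0, 2.0 * chi2_sf_even(stat, 2 * len(z)))


concordant = [3.0, 3.0, 3.0, 3.0, 3.0]    # up-regulated in all five studies
discordant = [3.0, 3.0, 3.0, 3.0, -3.0]   # one study in the opposite direction
for name, z in (("concordant", concordant), ("discordant", discordant)):
    print(name, fisher_two_sided(z)[0], stouffer_signed(z)[0], fisher_one_sided_max(z)[0])
```

In this toy example the two-sided Fisher statistic is identical for the two genes, while the signed Stouffer statistic and the one-sided maximum are smaller for the discordant gene. With only one opposing study, both remain large, which is consistent with the limited protection noted above.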