Abstract
Background
Metaanalysis has become increasingly popular in recent years, especially in genomic data analysis, due to the fast growth of available data and studies that target the same questions. Many methods have been developed, including classical ones such as Fisher’s combined probability test and Stouffer’s Ztest. However, not all metaanalyses have the same goal in mind. Some aim at combining information to find signals in at least one of the studies, while others hope to find more consistent signals across the studies. While many classical metaanalysis methods are developed with the former goal in mind, the latter goal has much more practicality for genomic data analysis.
Results
In this paper, we propose a class of metaanalysis methods based on summaries of weighted ordered pvalues (WOP) that aim at detecting significance in a majority of studies. We consider weighted versions of classical procedures such as Fisher’s method and Stouffer’s method where the weight for each pvalue is based on its order among the studies. In particular, we consider weights based on the binomial distribution, where the median of the pvalues are weighted highest and the outlying pvalues are downweighted. We investigate the properties of our methods and demonstrate their strengths through simulations studies, comparing to existing procedures. In addition, we illustrate application of the proposed methodology by several metaanalysis of gene expression data.
Conclusions
Our proposed weighted ordered pvalue (WOP) methods displayed better performance compared to existing methods for testing the hypothesis that there is signal in the majority of studies. They also appeared to be much more robust in applications compared to the rth ordered pvalue (rOP) method (Song and Tseng, Ann. Appl. Stat. 2014, 8(2):777–800). With the flexibility of incorporating different pvalue combination methods and different weighting schemes, the weighted ordered pvalues (WOP) methods have great potential in detecting consistent signal in metaanalysis with heterogeneity.
Keywords:
Fisher’s combined probability test; Metaanalysis; Ordered pvalues; Weighted order statisticBackground
Metaanalysis has long been used to integrate data and/or results from multiple studies targeting the same questions. It is commonly used in many areas of statistical applications such as clinical studies and psychology experiments. In recent years, metaanalysis has been frequently adopted in genomic data analysis, due to the fast development of highthroughput technology and the vast amounts of data available in public databases.
Many metaanalysis methods have been developed throughout the years. Roughly speaking, there are two main approaches to metaanalysis methods Song and Tseng [1]. The first directly combines the pvalues from the studies, while the second attempts to model the data or the effect sizes from the combined studies. The former includes methods such as the Fisher’s combined probability test [2] and the Stouffer’s Ztest [3], as well as weighted variations of these classical tests. The latter includes a variety of fixed effects and random effects models, such as GeneMeta [4]. Each approach has its advantages and disadvantages. The pvalue combining methods are relatively flexible in that they require minimal information and assumptions from the studies. In this paper we will mainly focus on methods that directly combine the pvalues.
Most of the traditional metaanalysis methods (e.g. Fisher’s and Stouffer’s methods) aim at testing the alternative hypothesis that at least one of the studies is nonnull. While this aligns with earlier goals of metaanalyses to gain power in detecting signals by combining multiple studies, it is frequently not the case with genomic data. In metaanalysis of genomic studies for example, the goal is often to identify genes that are differentially expressed in a consistent pattern across multiple studies. The extreme of this would be to test for the alternative that the null can be rejected for all the studies. A solution to this extreme alternative dates back to the maxP method by Wilkinson [5]. However, the maxP method is often considered too conservative. A recent approach by Phillips and Ghosh [6] improves the power of testing for this disjunction of nulls when the rejection of all pvalues associated with a gene is required. In practical metaanalyses, the goal of rejecting all studies may be considered too extreme. Ideally, we would want to target at detecting consistent signals across studies while avoiding being overly exclusive. This issue has gained attention in recent years, and a number of authors have tried to address this problem. Benjamini and Heller [7] discussed a framework for testing partial conjunction hypotheses, where they test for the alternative that at least u out of n null hypotheses are false against the partial conjunction null that no more than n−u+1 of the null hypotheses are true. Song and Tseng [1] proposed the rth ordered pvalue (rOP) method that aims at testing the alternative hypothesis that there is signal in at least a given percentage of studies. Other methods exist that address this problem from different approaches, such as RankProd by Hong et al. [8] that looks for consistently highly ranked genes, and a weighted approach by Li and Ghosh [9] that weights genes by its expression consistency across studies.
We consider the problem of detecting signals in the majority of studies. Our approach adopts one aspect of the rOP method Song and Tseng [1] in that we also consider ordered pvalues. But instead of using a single rth ordered pvalue as the statistic (as rOP does), we combine all or a subset of the ordered pvalues while weighting them based on their order (weighted ordered pvalues, WOP). Pvalues closer to the median are highly weighted and the smallest/largest pvalues are downweighted. The idea is that among the collection of pvalues, the median pvalues are likely to be a better reflection of the behavior of the majority of studies than the smallest or largest pvalues. Olkin and Saner [10] discussed a trimmed Fisher’s procedure that leaves out a number of the smallest and/or largest pvalues from the calculation of Fisher’s statistic to remove the effect of possible aberrant extremes. In our consideration, we still keep the smallest/largest pvalues because they do carry certain information, but we downweight them because they may be relatively less relevant when considering the “majority” of studies. To reflect the upweighting of the medians and downweighting of the extremes, we calculated our weights based on the binomial distribution. To summarize the weighted ordered pvalues, we mainly considered Fisher’s statistic and Stouffer’s Ztest. We also explored some other statistics, such as generalized Fisher’s test by Lancaster [11]. In general, other summary statistics can be used under this framework as well.
While many weighted variations of Fisher’s statistic and Stouffer’s statistic have been developed throughout the years, most of them distribute weights according to the sample sizes and/or effect sizes of the studies, or other similar considerations (e.g. Mosteller and Bush [12], Won et al. [13], Makambi [14]). Li and Tseng [15] proposed an interesting adaptively weighted statistic, where the weights are used to maximize the significance of the summary statistic. However, to the best of our knowledge, none of the weighting schemes are based on the ordered pvalues, which is what makes our method unique. Xie et al. [16] discussed a metaanalysis approach using confidence distributions, where they incorporated the use of medians and kernel functions, but their approach is under a completely different framework from ours.
By incorporating more than one order statistic into our summary statistic, our method can be considered an expansion of the rth ordered pvalue (rOP) method Song and Tseng [1]. In general, both the rOP method and the original summary statistic we use (e.g. Fisher’s or Stouffer’s method) are special cases under the WOP framework  one having all the weight on a single ordered pvalue and the other having evenly distributed weights. However, our WOP methods have respective advantages over both the traditional summary statistics and the rOP method. Compared to the traditional summary statistics, the WOP methods better focus on detecting signal in a majority of studies. On the other hand, the WOP methods appear to be more robust compared to the rOP method. These observations are based upon results from simulation studies as well as data applications.
Methods
Hypothesis settings for metaanalysis
Before performing any metaanalysis, it is always important to figure out the goal of combining multiple studies. When a single hypothesis test is conducted, it is clear what the null and alternative hypotheses are. In metaanalysis we usually combine studies designed to test the same set of null and alternative hypotheses. However, the null and alternative hypotheses of the metaanalysis test are not always obvious and largely depends on the researcher’s goals. Here we consider the example of metaanalysis of differential expression studies of genomic data. Sometimes a gene is of interest as long as it is differentially expressed in at least one study, while other times we hope to target genes that are differentially expressed in all studies. Li and Tseng [15] and Song and Tseng [1] both provided extensive discussions on the different scenarios that lead to different hypothesis settings. Following their notation, let θ_{gk} be the true effect size for gene g (1≤g≤G) in study k (1≤k≤K). As in Song and Tseng [1], for a given 1≤r≤K, a general hypothesis setting for the metaanalysis of the K studies for a given gene g can be formulated as:
When r=1, HS_{1} is the classical setting of testing for nonzero effect size in at least one study against the conjunction of nulls. HS_{1} is the hypothesis setting that Fisher’s method, Stouffer’s Ztest and many other traditional methods test for. When r=K, HS_{K} tests for the alternative that all the studies have nonzero effect size. For instance, the maxP method [5] tests for HS_{K}. When 1<r<K,HS_{r} provides a compromise between the two aforementioned hypothesis settings, and tests for at least a prespecified number of nonzero effects. For a given r, the rOP (rth ordered pvalue) method Song and Tseng [1] is used to test for HS_{r}.
In this paper, we test for nonzero effect sizes in a majority of studies against the null that the effects sizes are zero in all studies. Thus we are testing against the conjunction of nulls while trying to focus on a certain subset of the nonnull space. In general, the hypothesis for our metaanalysis approach can be considered to fall under the HS_{r} setting. Song and Tseng [1] suggested a few datadriven methods for selecting r. To prevent any potential issues of post hoc choices of r, we choose to fix r before any analysis is conducted. While it is hard to specify what the term “majority” exactly means, as a general rule, we choose to test the hypothesis setting HS_{m}, where m=⌈K/2⌉, ⌈x⌉ being the smallest integer no less than x. Essentially we are targeting the alternative that at least half of the studies have nonzero effect sizes. Under our weighted ordered pvalues (WOP) framework, we are able to develop methods for other hypothesis settings with r ranging from m+1 to K. But in general we hope to provide a simple to use method without having to put too much effort in selecting a particular r, and therefore we will focus mostly on testing HS_{m}. We will later show through data applications that using our WOP methods for testing HS_{m} provides more robust results compared to using the rOP method with different choices of r.
A framework for weighted ordered Pvalues (WOP) methods
We first describe the general framework for the weighted ordered pvalues (WOP) methods. Suppose we have pvalues p_{1},p_{2},⋯,p_{K} for testing the hypothesis of interest for each of the K studies. Let p_{(1)},p_{(2)},⋯,p_{(K)} be the ordered pvalues. Now consider a set of weights w_{1},w_{2},⋯,w_{K} associated with the corresponding ordered pvalues. Summary statistics of weighted ordered pvalues can be expressed in the following general form:
As mentioned by previous authors, many traditional pvalue combination methods can be expressed in the general form of (for example, see Zaykin [17]). For instance, Stouffer’s Ztest takes H(·) to be the inverse normal function, while Fisher’s method has H(p_{i})=−2 log(p_{i}). The difference between T^{′} and T, albeit subtle in notation, is the essence of the WOP framework. In the WOP framework, the weight w_{i} is associated with the ith ordered pvalue p_{(i)}. In other words, the ranking of a pvalue in relation to the pvalues from the other studies determines its weighting. In traditional weighted pvalue combining methods, the weight assigned to a pvalue is associated with the characteristics of that particular study, be it the sample size, the effect size, or other features.
Allowing the weights to depend on the ordering of the pvalues opens up a whole new area of considerations when combining multiple pvalues. We can consider giving more weights to the pvalues that are closer to the center of the distribution of the pvalues since they might hold more credibility, and at the same time downweight the outlying pvalues. For instance, if the entirety of weights is placed on the rth ordered pvalue, the statistic reduces to the rOP method. However, using the WOP framework, if we highly weight the rth ordered pvalue but still distribute some weight to the other pvalues, the method can be viewed as a more robust version of the rOP method for testing HS_{r}.
In this paper, we shall develop a few specific methods under the WOP framework. We consider weights based on the binomial distribution, which will be described in more detail in the next section. As for the pvalue combining methods, we shall focus on Fisher’s method, with H_{F}(p_{(i)})=−2 log(p_{(i)}), and Stouffer’s method, with H_{Z}(p_{(i)})=ϕ^{−1}(1−p_{(i)}), ϕ(·) being the standard normal distribution function.
Binomial weights and halfbinomial weights
In this section, we discuss two possible weighting schemes for the WOP framework. We mainly consider testing the alternative hypothesis that the effect sizes are nonzero in the majority of studies, which is the hypothesis setting HS_{m}. We will also briefly discuss extending the weighting schemes to testing other hypothesis settings HS_{r}, for m<r≤K.
Inspired by the rOP method, which uses the rth ordered pvalue for testing the hypothesis setting HS_{r}, we consider placing the highest weight on the median pvalues for testing HS_{m}. This makes intuitive sense, since if a consensus does exist among the studies, we have reason to believe that the behavior of the majority of studies should be best captured by the pvalues that are closer to the center of the distribution. Since we do not insist on nonzero effect sizes for every single study, we consider downweighting the largest pvalues among the studies. On the other hand, pvalue combining methods such as Fisher’s method are known to be very sensitive to single extremely small pvalues, thus the smallest pvalues should also be downweighted, to avoid a small number of extremely small pvalues biasing the results of the majority of studies. In summary, we would like our weighting scheme w_{i}, as a function of i, to reflect a unimodal shape, with the highest weights being w_{m} (when K is odd) or w_{m−1} and w_{m} (when K is even), and such that w_{i} decreases as i goes to 1 or K.
To reflect the above properties of the weights, we constructed the weights based on the binomial distribution. Let f(x;n,p) be the probability mass function of the binomial distribution B(n,p), for x=0,1,⋯,n. We define the binomial weighting scheme such that
In the binomial weighting scheme, all the weights are nonzero, thus every pvalue contributes to the combined statistic. To further reduce the influence of the smallest pvalues on the summary statistic, we may argue that only p_{(m)},p_{(m+1)},⋯,p_{(K)} matters in testing the alternative that at least m studies have nonzero effect size. With these considerations, we define what we call the halfbinomial weighting scheme such that
We will discuss more on the effects of these two different weighting schemes through simulation studies in later sections.
So far we have constructed the binomial weighting scheme and halfbinomial weighting scheme for testing the hypothesis setting HS_{m}. We can extend the ideas of these two weighting schemes to testing HS_{r}, for m<r≤K. Instead of placing the highest weights on the medians, the highest weight can now be assigned to p_{(r)}. When r≠m, we lose the natural symmetry of the weights. However, we can still base the weights on the binomial distribution. A few possible weighting schemes are defined below:
The weighting scheme w^{b1} is based on the same binomial distribution as w^{b}, except that the values are shifted so that the center of the distribution falls on r instead of m. Because of the shift, the first few weights are set to be 0, while the last few values in the probability mass function of the binomial distribution are truncated. The weighting scheme w^{b2} increases the parameter n to K+2(r−m)−1 in the binomial distribution to ensure that the first few weights are nonzero. The weighting scheme w^{b3} decreases the parameter n to K−2(r−m)−1 in the binomial distribution so that the distribution is not truncated on the right. See Figure 1 for an illustration of the idea of these three weighting schemes. Corresponding halfbinomial weights for these three binomial weighting schemes: w^{hb1}, w^{hb2} and w^{hb3}, can be easily constructed by setting the weights to be 0 for i=1,⋯,r−1.
Figure 1. Illustration of weighting schemes. Intuition behind the three weighting schemes w^{b1}, w^{b2} and w^{b3} for m<r≤K. The plot reflects an example of K=9. The weighting scheme w^{b} for testing HS_{5} is plotted as a reference. The other three weighting schemes are for testing HS_{7}. From the plot we can easily see that w^{b1} is a direct shift of w^{b}, which is based on the distribution of B(8,0.5). w^{b2} is based on the distribution of B(12,0.5), and w^{b3} is based on the distribution of B(4,0.5).
Implementation of the WOP methods
Under the null hypothesis, the p_{k}’s are assumed to follow a uniform (0,1) distribution. For Stouffer’s Ztest, H_{Z}(p_{i})=ϕ^{−1}(1−p_{i}) follows a standard normal distribution. Therefore the traditional weighted Ztest in the form of still follows a normal distribution. For Fisher’s method, H_{F}(p_{i})=−2 log(p_{i}) has a chisquare distribution with two degrees of freedom. Therefore the distribution of the traditional weighted Fisher’s test is essentially weighted sums of exponential distributions. The distribution of weighted sums of exponential variables is not as straightforward, though many authors have researched on both the exact and approximations of this distribution, a summary of which can be found in Olkin and Saner [10].
When we consider weighted ordered pvalues in the form of , however, the problem becomes much more complicated. Even for Stouffer’s method, the distribution of the sum of weighted ordered normal variables is not readily available. As for Fisher’s method, Olkin and Saner [10] studied the distribution of the trimmed Fisher’s statistic, which is for the special case of w_{i}=0 for i=1,⋯,s_{1} and i=s_{2},⋯,K. They transformed the distribution of the sum of ordered chisquared variables back to a weighted sum of exponential variables without order. But as discussed in their paper, the exact distribution of weighted sums of exponential variables are generally impractical for practitioners, and we would therefore have to use approximation methods.
Considering the complexity of the exact distributions of weighted sums of ordered variables, as well as the fact that the uniformness of the original pvalues is not always guaranteed in practice, we recommend two methods for obtaining the pvalues for the WOP statistics: (1) permutation analysis, in the case that original data from all the studies are available; and (2) comparing to the numerical distribution, in the case that only the pvalues for each gene and each study are known.
We first explain the steps of obtaining the WOP pvalues through permutation analysis:
•Let denote the WOP statistic for gene g, where p_{g(i)} is the ith ordered pvalue of p_{g1},⋯,p_{gK}.
•Permute group labels in each study B times, and recalculate the pvalues for the permuted data , for 1≤k≤K, 1≤g≤G, 1≤b≤B.
•Calculate the WOP statistics for the permuted pvalues for 1≤g≤G, 1≤b≤B.
•The pvalue for the WOP statistic T_{g} (1≤g≤G) is then computed as
Once the pvalues for the WOP statistics for each gene are obtained, we may apply the BenjaminiHochberg (BH) method [18] on the ’s (1≤g≤G) to account for multiple testing across the genes and control the false discovery rate (FDR).
If the original data are not available, we can simulate the distribution of the WOP statistics numerically, by simulating U(0,1) random variables, the distribution of pvalues under the null distribution. The WOP statistics calculated from the data can then be compared to the numerical distribution to obtain the WOP pvalues. We simulated numerical distributions of the WOP statistics for testing HS_{m} based on the Fisher’s and Stouffer’s combination methods with binomial and halfbinomial weighting schemes respectively, for study numbers ranging from 4 to 23. We conducted simulation studies to compare the WOP pvalues obtained either though comparing with the numerical distribution or by performing permutation analysis. Results show that the two methods provide perfectly correlated pvalues and that the number of rejections obtained from the two methods after applying the BenjaminiHochberg adjustment are very similar (data not shown). Therefore both methods are reliable choices for obtaining the WOP pvalues in practice. The numerical distribution provides an option for when the original data are not available and is also more timeefficient. The permutation analysis can be used if the uniformness of the original pvalues are questionable but that the original data is available.
Considerations of one or twosided tests
Previously we used twosided alternatives as an example when setting up the hypothesis. The hypothesis setup HS_{r} can be similarly developed for onesided alternatives. In fact, the interpretation of the metaanalysis results is easier for onesided tests, since we do not need to worry about the concordance of the directions of the effect sizes as we do for twosided tests. Since the WOP methods directly combine the pvalues, the direction of the effect sizes are not taken into account for twosided tests. Thus a significant result from the WOP metaanalysis of twosided tests indicate that there are at least r studies with nonzero effect size, but without any implications about the concordance or discordance of the directions of the effects. This may not be an issue in the case that the direction of effect is not of great importance. However, in genomic data analysis, it is often desirable to distinguish between up and downregulated genes, and a result stating that the gene is differentially expressed across many studies but with possible opposite directions of expression change may be confusing. In these cases, it might be problematic to directly apply the WOP methods to the twosided pvalues. On the other hand, since both up and downregulated genes may be of interest at the same time, we cannot prespecify one particular onesided test for all genes. For such scenarios we recommend using the test of Pearson [19] in combination with the WOP methods. To do so, for each gene we need to conduct two WOP metaanalyses on onesided pvalues: one on the lefttailed pvalues for all studies, and the other on the righttailed pvalues for all studies. Let and be the WOP metaanalysis pvalues for the lefttailed and righttailed tests respectively. We shall then adopt the idea of Pearson’s test and define , where the superscript “C” stands for “concordant”. As discussed in Owen [20], the equation for obtaining provides a conservative pvalue for Pearson’s test. By adopting the Pearson’s test, the results are more interpretable. A significant result now indicates that the gene is consistently up or downregulated in at least r studies.
Results and discussion
A simulation study
We conducted a simulation study to compare the performances of the WOP methods with the original Fisher’s and Stouffer’s method, as well as with the rOP method Song and Tseng [1]. We shall also demonstrate the differences between the binomial and halfbinomial weighting schemes through the simulation.
We simulate the setting of a metaanalysis of differential expression studies, with 2000 genes and 7 studies. Out of the 2000 genes, 1650 genes are assumed to be not differentially expressed in any study, while 50 genes are assumed to be differentially expressed in 1,2,⋯,7 studies respectively. The sample sizes for the treatment and control groups are randomly generated for each study, varying from 5 to 20. Gene expression values are randomly generated from normal distributions. Control samples are generated from a N(0,1) distribution, as well as treatment samples that are not differentially expressed. Treatment samples that are differentially expressed are generated from a N(1,1) distribution. Twosample Ttests are used to obtain the pvalues p_{gk} for each gene and each study. Our WOP methods aim at testing the hypothesis that the gene is differentially expressed in the majority of studies. In this case, m=4, corresponding to the hypothesis setting HS_{4}. The rOP statistic Song and Tseng [1] for testing HS_{4} is the 4th ordered pvalue. Note that the original Fisher’s and Stouffer’s method are supposed to test for HS_{1}. We used permutation analysis to obtain the WOP pvalues for binomial and halfbinomial weighted Fisher’s and Stouffer’s statistic. Pvalues for the rOP method were also computed by permutation analysis as recommended in Song and Tseng [1]. Pvalues for the original Fisher’s and Stouffer’s method are computed directly via their respective distributions. To obtain a list of significant genes, the BenjaminiHochberg procedure is applied to the pvalues with the FDR controlled at the 0.05 level. Results are averaged over 100 replications.
Figure 2 compares the power of the methods for different categories of genes. Genes are categorized based on the number of studies that they are differentially expressed in (0 to 7). The proportion of genes rejected within each category are plotted for the different methods. Before we compare the methods, it is important to note that the reading of these plots may be different from traditional power plots. Normally when looking at a power plot, the higher the power the better. In our case, however, this is not true for all categories. Remember that our goal is to focus on a particular subset of the nonnull space  genes that are differentially expressed in at least 4 studies. Therefore, all nonnull genes are not created equal. In particular, rejections of genes in categories 4 through 7 are desirable, whereas rejections of genes in categories 1 through 3 are undesirable. Rejections of genes in category 0 are of course considered false discoveries. As expected, regardless of the method, the proportion of genes rejected within a category increases from 0 to close to 1 as the number of studies that the genes are differentially expressed in increases from 0 to 7 (out of 7). Fisher’s method, being a very powerful method for testing HS_{1}, has the highest proportion of rejections for every category from 1 to 7. Stouffer’s method also has the highest proportion of rejections when compared to the corresponding weighted versions and the rOP method. However, as mentioned earlier, rejections in categories 0 to 3 are considered undesirable when testing HS_{4}. Both the WOP methods and the rOP method reject much smaller proportions of genes in categories 1 to 3 compared to the original Fisher’s or Stouffer’s method. Less rejections in categories 1 to 3 come at the expense of less power for categories 4 to 7, which is true for both WOP and rOP methods, but particularly so for the rOP method. For the binomial weighted WOP methods, the rejections for categories 1 to 3 are much lower compared to Fisher’s or Stouffer’s method, but the power for categories 4 to 7 gradually increases to catch up with Fisher’s or Stouffer’s. When it comes to categories 6 and 7, especially category 7, the binomial weighted WOP methods have virtually the same power as Fisher’s and Stouffer’s. On the other hand, the rOP method has the lowest power for categories 6 and 7, which are the genes that are differentially expressed in all or almost all the studies. The halfbinomial weighted WOP methods have the lowest rejection rates for categories 1 to 3, even lower than the rOP method, but their power surpasses the power of the rOP method when it comes to categories 6 and 7.
Figure 2. Comparison of power. Power comparison for the different methods. The xaxis denotes the 8 categories of genes, categorized by the number of studies that the genes are differentially expressed in. There are 1650 genes in the category 0 (no differential expression in any studies). The rest of the categories contain 50 genes each. For each category, the proportion of genes found significant within that category are plotted for each method.
To better look at the trade off between rejection rates for categories 0 to 3 versus categories 4 to 7, we plotted the ROC curves for the methods, as seen in Figure 3. Since our goal is to test for HS_{4}, we treat rejections in categories 0 to 3 to be false positives, while rejections in categories 4 to 7 are considered true positives. From the ROC curves we can clearly see that in terms of the trade off between true and false positives the binomial weighted WOP methods beat the original Fisher’s or Stouffer’s method, while the halfbinomial weighted WOP methods beat the rOP method. We also gain better insight into the comparison between the binomial and halfbinomial weighting schemes. The halfbinomial weighting scheme is relatively more conservative, with relatively higher true positive rates at very low false positive rates. On the other hand, the binomial weighting scheme achieves higher power when slightly higher false positive rates are allowed.
Figure 3. ROC curves. ROC curves for the different methods. Rejections of genes differentially expressed in less than 4 studies are considered false positives. Rejections of genes differentially expressed in 4 or more studies are consider true positives.
In summary, the binomial weighted WOP methods show improvement over the original Fisher’s or Stouffer’s method for testing differential expression in a majority of studies, with lower rejection rates for genes that are differentially expressed in a small number of studies, and just as high power for genes that are differentially expressed in almost all studies. On the other hand, the halfbinomial weighted WOP methods are more robust versions of the rOP method, having similar properties to the rOP method, but even lower rejections rates for categories 1 to 3 and higher power for categories 6 and 7. In practice, the binomial weighting scheme is recommended if the user wishes to have a larger pool of significant genes, and when the control of false positives is relatively less important. For better false positive control, the halfbinomial weighting scheme is recommended. We note that because of our hypothesis setup, false positives are not the same as type I error. A type I error would be rejecting a gene that is not differentially expressed in any studies. On the other hand, a false positive would be rejecting a gene that is differentially expressed in less than r studies.
An application to metaanalysis of a set of stem cell studies
As an application of the proposed methodology, we conduct metaanalysis on a set of microarray data studies from four stem cell papers: Chin et al. [21], Guenther et al. [22], Newman and Cooper [23] and Chin et al. [24]. We wish to find probesets that are differentially expressed between human induced pluripotent stem (hiPS) cells and human embryonic stem (hES) cells in the majority of studies. Some of the studies contain other samples such as human fibroblasts, but we only used samples from hiPS cells and hES cells. We included studies that had at least two samples for each group (hiPS and hES), giving us a total of 9 studies. All the studies used the Affymetrix Human Genome U133 Plus 2.0 Array platform, which contains 54675 probesets. We directly used the data preprocessed by the original contributors and did not perform any additional normalization, except for taking the log for data that were not already on the log2 scale. We performed a two sample Ttest between the hiPS cell samples and the hES cell samples to obtain the original pvalues for differential expression for each probeset and each study. The hypothesis setting for the metaanalysis is HS_{5}, i.e. we aim at testing the alternative that the probeset is differentially expressed in at least 5 out of the 9 studies. We applied the proposed WOP methods, in particular the binomial and halfbinomial weighted Fisher’s and Stouffer’s statistic to the pvalues. The pvalues for the WOP statistics are obtained by comparing the statistics to the corresponding numerical distributions. We also applied the original Fisher’s and Stouffer’s method, and the rOP method (in this case the 5th ordered pvalue). The pvalues for the rOP method are obtained via its theoretical distribution. To adjust for multiple testing, we applied the BenjaminiHochberg (1995) procedure afterwards, controlling the false discovery rate at the 0.05 level.
The number of probesets significant for the metaanalysis using the different methods are summarized in Table 1. The original Fisher’s and Stouffer’s methods found the most number of significant probesets. It is likely that they picked out many probesets that are differentially expressed in a few, but less than half the studies. The halfbinomial weighted WOP methods and the rOP method are relatively more conservative and focused. Figure 4 shows a Venn diagram of the probesets found significant by the binomial and halfbinomial weighted Fisher’s method and the rOP method. We can see that the probesets detected by both the halfbinomial weighted Fisher’s method and the rOP method are mostly detected by the binomial weighted Fisher’s method as well. However, there are still a number of unique probesets that are only detected by either the halfbinomial weighted Fisher’s method or the rOP method. To get an idea of the types of probesets detected by only one of the three aforementioned methods, we randomly selected some of these probesets and plotted the ordered original pvalues from the 9 studies for these probesets. As seen in Figure 5, probesets exclusively detected by the binomial weighted Fisher’s method tend to have very small values for the two or three smallest pvalues, but relatively larger values starting the 5th ordered pvalue. This shows that the binomial weighted Fisher’s method is more prone to influences by the smallest pvalues, since it takes into account all the pvalues in the statistics. The rOP method, which uses the 5th ordered pvalue as the statistic, exclusively identifies probesets that are guaranteed to have relatively small pvalues up to the 5th ordered pvalue, but tend to have very large values starting the 6th ordered pvalue. This shows the sensitivity of the rOP method to the particular value of r chosen. On the other hand, the halfbinomial weighted Fisher’s method weights in the 5th through the 9th ordered pvalues, and thus is able to identify probesets that have relatively small values through larger ordered pvalues. In other words, probesets that have relatively small pvalues for most of the studies can be exclusively identified by the halfbinomial weighted method, even if the smallest pvalues are not very small.
Table 1. Number of significant probesets for the metaanalysis of stem cell studies by different methods
Figure 4. Venn diagram for stem cell studies. Venn diagram for the probesets found significant by the binomial weighted Fisher’s method, the halfbinomial weighted Fisher’s method and the rOP method.
To look at the pathways associated with the significant lists of probesets, we performed functional annotation clustering analysis using DAVID (Huang et al, 2009), which is available at http://david.abcc.ncifcrf.gov/home.jsp webcite. Some of the top functions that show up include metalbinding, nucleoplasm, ubl conjugation, vasculature development and head/face development. We also looked at the pathways for the probesets that were exclusively detected by one of the methods. Functions such as neuron projection, neuron differentiation and development, which are meaningful in the stem cell study setting, were found to be associated with the probesets exclusively identified by the halfbinomial weighted Fisher’s method. These functions did not show up in pathway analyses of the other lists.
Figure 5. Pvalue patterns for different methods. Pattern of the original ordered pvalues from the 9 studies for probesets detected by one of the three methods only. The xaxis is the order of the pvalues from the 9 studies. The yaxis is the pvalues. The plot includes a random subset of 20 probesets that are detected exclusively by each of the three methods.
Further applications and comparisons with the rOP method
To compare the performances of the WOP methods and the rOP method Song and Tseng [1] in real data application, we applied our WOP methods to the three microarray metaanalysis applications in Song and Tseng [1]. The first application consists of comparisons of two subtypes of brain tumors  anaplastic astrocytoma (AA) and glioblastoma multiforme (GBM), from 7 studies. The second application combines 9 studies comparing postmortem brain tissues between MDD patients and control samples. In the third application, 16 diabetes microarray studies consisting of different organisms and tissues were combined. See Song and Tseng [1] for more details on the contexts of these three metaanalysis applications.
To ensure that the results are directly comparable, for each metaanalysis we directly used the twosided pvalues for each gene and each individual study calculated in Song and Tseng [1]. In Song and Tseng [1], permutation analysis is used for the brain cancer studies and the MDD studies, while theoretical distributions are used to obtain results for the diabetes studies. We follow Song and Tseng [1] and also directly use the twosided pvalues for the permuted datasets provided by Song and Tseng [1] for the brain cancer studies and the MDD studies. See Song and Tseng [1] for more details on the preprocessing of the data and the calculation of the original pvalues.
We applied our WOP methods, namely the binomial and halfbinomial weighted Fisher’s and Stouffer’s statistic, to the three sets of studies. For the brain cancer studies, we test for HS_{4} out of 7 studies. For the MDD studies we test for HS_{5} out of 9 studies. For the diabetes studies, we test for HS_{9} out of 16 studies. We also applied the corresponding rOP methods for testing the same hypotheses, using r=m=4, 5 and 9 respectively for the three metaanalyses. In addition, we applied the rOP methods using the selected r values in Song and Tseng [1] for comparison. To be specific, Song and Tseng [1] used r=5 for the brain cancer studies, r=7 for the MDD studies, and r=12 for the diabetes studies. Permutation analysis is used for the brain cancer studies and the MDD studies for all methods. For the diabetes studies, we used our numerically simulated distribution to obtain the pvalues for the WOP statistics. Table 2 shows the numbers of significant genes found using the different methods for the three metaanalyses with the FDR controlled at the 0.05 level. We can observe that the WOP methods using the binomial weighting scheme generally detects more significant genes than the halfbinomial weighting scheme. In most cases, the rOP methods (using either r=m or selected r) detect less genes than the WOP methods, although in general closer in number to the WOP methods using the halfbinomial weighting scheme.
Table 2. Number of significant genes for the three metaanalyses by different methods
One interesting observation is that for the MDD studies, the rOP method based on r=m=5 detects less genes than based on the selected r=7. This result is counterintuitive, since one would expect that genes that are differentially expressed in at least 7 studies would be a subset of the genes that are differentially expressed in at least 5 studies. The result could be due to the fact that r is selected in Song and Tseng [1] to optimize the number of significant genes and therefore outperforms a general choice of r=m. Nonetheless, this reflects the fact that the rOP method is sensitive to the choice of r. To further investigate this problem, we looked at the overlap of the detected genes by the rOP methods using either r=m or selected r, as well as with the detected genes by the halfbinomial weighted Fisher’s method. See Figure 6 for a Venn diagram of the genes detected by the aforementioned three methods for the MDD studies. As shown in Figure 6, only 269 genes overlap between the rOP methods with two different choices of r, which is less than half of the genes detected by either method. However, all 269 genes are detected by the halfbinomial weighted Fisher’s method. In addition, the WOP method also picked up 251 of the genes only detected by rOP based on selected r and 239 of the genes only detected by rOP based on r=m, which accounts for most of the genes detected by either method. Further, we noticed that even for the brain cancer studies and the diabetes studies, where rOP based on r=m did detect more genes than rOP based on selected r, there are still a large number of genes detected by rOP based on selected r that are not detected by rOP based on r=m. On the other hand, most of these genes that are detected by only one of the rOP methods are detected by the halfbinomial weighted Fisher’s method.
Figure 6. Venn diagram for MDD studies. Venn diagram for the genes found significant in the metaanalysis of the MDD studies by the rOP method based on r=m, the rOP method based on selected r, and the half binomial weighted Fisher’s method. In this case, m=5 and the selected r=7.
For comparison, to see how the results would differ for different choices of r for the WOP methods, we applied the WOP method to the MDD studies again  this time testing HS_{7}. We used the halfbinomial versions of the three weighting schemes for m<r≤K that were discussed earlier: w^{hb1}, w^{hb2} and w^{hb3}. We used Fisher’s summary statistic in the WOP method. The numbers of genes found by using the three weighting schemes are 760, 770 and 774 respectively. The three weighting schemes are fairly consistent with each other, since for each weighting scheme more than 90% of the genes found were also found by the other two weighting schemes. Comparing to the number of genes found by the halfbinomial weighted Fisher’s method for testing HS_{5}, which is 930, notice that the corresponding WOP methods for testing HS_{7} yield smaller numbers, conforming to our expectations. As discussed earlier, the ideal result would be that the genes detected for HS_{7} be a subset of the genes detected for HS_{5}. In reality, over 70% of the genes detected by WOP methods for HS_{7} were also detected for HS_{5} (the percentage being 72.6%, 71.4% and 74.2% for the three weighting schemes respectively). Recall that for the rOP method, only 43.6% of genes detected for r=7 were also detected for r=5. Even though the WOP methods are not perfect, we can still see the great improvement in robustness compared to the rOP method.
In summary, our observations confirm that the results of the rOP method are indeed heavily dependent on the choice of r. Whereas our WOP methods show much higher robustness. In particular, the WOP methods for testing HS_{m} has shown superior robustness by being able to cover most of the genes detected by the rOP methods using different r. Since in practice it is not often clear which particular HS_{r} should be tested, we believe the WOP methods for testing HS_{m} is a better choice when the goal is to detect signal in the majority of studies.
Conclusions
Metaanalysis is a useful tool in integrating data from different sources to test a particular hypothesis. While this paper mainly discussed the application of metaanalysis on microarray differential expression studies, other areas of genomic studies have increasingly relied on the use of metaanalysis, such as genomewide association studies (GWAS). Some seminal studies in this area include Scott et al. [25] and Willer et al. [26]. Metaanalysis is also frequently used in clinical studies, psychological studies and statistical applications in other social sciences. More and more metaanalyses nowadays aim at detecting consistent findings across a number of studies. While most of the traditional metaanalysis methods test for significance in at least one of the studies, it is important to develop new metaanalysis methods that focus on testing for significance in the majority of studies.
The weighted ordered pvalue (WOP) method provides such a framework. It is unique in its use of weights that are based on the order of the pvalues. The rOP method Song and Tseng [1], which is also based on ordered pvalues, can be considered a very special case under the WOP framework, where all the weight is placed on one single ordered pvalue. The WOP methods do not require prespecification of r and is less sensitive to the choice of its value. The halfbinomial weighted WOP methods have been shown to be more robust and have better receiver operating characteristics compared to the corresponding rOP method.
As pointed out by a reviewer, one of our previously published metaanalysis methods also considers the issue of heterogeneity and utilizes weights. However, these two methods are quite different in concept. The method in [9] applies weights to the genes, essentially to rerank the genes by adding in information about heterogeneity. On the other hand, the WOP method applies weights to the multiple pvalues for each gene. Although, conceptually, we can first apply the WOP method to each gene and then use the method in [9] on top of that to rerank the genes.
One advantage of the WOP framework is its flexibility. The framework allows for different weighting schemes and summary statistics to be used. Even though this paper mainly focused on two particular weighting schemes based on the binomial distribution and two summary statistics (Fisher’s and Stouffer’s statistics), in general, other summary statistics and weighting schemes can be used. Future research can be done to try to optimize the weighting scheme to suit specific metaanalysis purposes.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YL developed the method’s framework, carried out the simulation studies and data analysis, and drafted the manuscript. DG conceived of the weighting schemes, supervised the research process and revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We acknowledge the support of the NSF grant ABI1262538.
References

Song C, Tseng GC: Hypothesis setting and order statistic for robust genomic metaanalysis.

Fisher RA: Statistical methods for research workers. Oliver and Boyd: Edinburgh; 1925.

Stouffer SA, Suchman EA, Devinney LC, Star SA, Williams RM: The American soldier: adjustment during army life. Princeton, NJ: Princeton University Press; 1949.

Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation.

Wilkinson B: A statistical consideration in psychological research.

Phillips D, Ghosh D: Testing the disjunction hypothesis using Voronoi diagrams with applications to genetics.

Benjamini Y, Heller R: Screening for partial conjunction hypotheses.

Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in metaanalysis.

Li Y, Ghosh G: Assumption weighting for incorporating heterogeneity into metaanalysis of genomic data.

Olkin I, Saner H: Approximations for trimmed Fisher procedures in research synthesis.

Lancaster H: The combination of probabilities: an application of orthonormal functions.

Mosteller F, Bush RR: Selected quantitative techniques. In Handbook of Social Psychology. Cambridge, MA: AddisonWesley; 1954.

Won S, Morris N, Lu Q, Elston R: Choosing an optimal method to combine Pvalues.

Makambi KH: Weighted inverse chisquare method for correlated significance tests.

Li J, Tseng GC: An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies.

Xie M, Singh K, Strawderman WE: Confidence distributions and a unifying framework for metaanalysis.

Zaykin DV: Optimally weighted Ztest is a powerful method for combining probabilities in metaanalysis.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Pearson K: On a new method of determining “goodness of fit”.

Chin MH, Mason MJ, Xie W, Volinia S, Singer M, Peterson C, Ambartsumyan G, et al.: Induced pluripotent stem cells and embryonic stem cells are distinguished by gene expression signatures.

Guenther MG, Frampton GM, Soldner F, Hockemeyer D, Mitalipova M, Jaenisch R, Young RA: Chromatin structure and gene expression programs of human embryonic and induced pluripotent stem cells.

Newman AM, Cooper JB: Labspecific gene expression signatures in pluripotent stem cells.

Chin MH, Pellegrini M, Plath K, Lowry WE: Molecular analyses of human induced pluripotent stem cells and embryonic stem cells.

Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, ProkuninaOlsson L, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, et al.: A genomewide association study of type 2 diabetes in Finns detects multiple susceptibility variants.

Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM, Strait J, Duren WL, Maschio A, Busonero F, Mulas A, Albai G, Swift AJ, Morken MA, Narisu N, Bennett D, Parish S, Shen H, Galan P, Meneton P, Hercberg S, Zelenika D, Chen WM, Li Y, Scott LJ, Scheet PA, et al.: Newly identified loci that influence lipid concentrations and risk of coronary artery disease.