Abstract
Background
Network metaanalysis can be used to combine results from several randomized trials involving more than two treatments. Potential inconsistency among different types of trial (designs) differing in the set of treatments tested is a major challenge, and application of procedures for detecting and locating inconsistency in trial networks is a key step in the conduct of such analyses.
Methods
Network metaanalysis can be very conveniently performed using factorial analysisofvariance methods. Inconsistency can be scrutinized by inspecting the design × treatment interaction. This approach is in many ways simpler to implement than the more common approach of using treatmentversuscontrol contrasts.
Results
We show that standard regression diagnostics available in common linear mixed model packages can be used to detect and locate inconsistency in trial networks. Moreover, a suitable definition of factors and effects allows devising significance tests for inconsistency.
Conclusion
Factorial analysis of variance provides a convenient framework for conducting network metaanalysis, including diagnostic checks for inconsistency.
Keywords:
Analysis of variance; Baseline contrast; Heterogeneity; Inconsistency; Linear mixed model; Network metaanalysis; Pairwise treatment contrast; PRESS residual; Studentized residualBackground
Results from several randomized trials can be combined by metaanalysis methods. In the simplest case, all trials comprise the same set of treatments, typically only two, i.e., a new treatment and a control or baseline treatment. When trials differ in design, i.e., in the sets of treatments tested, joint analysis may be done by what has come to be called network metaanalysis (NMA). Such analyses combine different sources of pairwise treatment comparisons across trials, i.e., direct comparisons from trials that jointly test both treatments of interest and indirect comparisons from trials that only test one of the two treatments, but are connected through other treatments via the trial network. A key assumption of many methods for NMA is consistency of treatment effect estimates across designs, defined by the set of treatments tested. In particular, consistency implies agreement between direct and indirect evidence on a treatment contrast. Several methods have been proposed for detecting inconsistency in trial networks [14].
Most methods for analysis of NMA operate on pairwise contrasts of treatments with a baseline treatment or control, henceforth denoted as baseline contrasts. Some methods for detecting inconsistency in metaanalysis networks based on baseline contrasts are relatively complex on account of the fact that baseline treatments may vary among trials and sources of inconsistency have to be traced through loops of the network [35]. It has been shown by Piepho et al. [6] that NMA can be greatly simplified by modelling treatment means rather than treatment contrasts using factorial analysisofvariance (ANOVA) models, and that such analyses can produce identical or essentially the same results as analyses using baseline contrasts. The present paper will therefore focus on the ANOVA approach and illustrate its versatility. Specifically, we will explore ways to detect inconsistency using standard procedures for linear models available in most statistical packages. The methods will be illustrated using the diabetes example published by Senn et al. [7]. This example has also been used by Krahn et al. [1] to illustrate their proposed methods for detection of inconsistency using a baseline contrast parameterization, so our results can be compared directly to that paper in order to appreciate the degree of agreement between both model formulations and the resulting tests and diagnostic checks for inconsistency. The presentation assumes that the reader has access to a mixed model package using restricted maximum likelihood (REML) to estimate variance components and is familiar with the essentials of the underlying theory [8]. Program code in SAS for all analyses presented is given in Additional file 1.
Additional file 1. Contains all SAS code that was used to obtain the results presented in this paper.
Format: ZIP Size: 4KB Download file
Methods
In this section, we describe the basic models we are using. In the Results section, minor extensions and associated statistics derived from the various models, such as influence diagnostics, are introduced as needed.
A twoway ANOVA model for metaanalysis can be written as
where η_{ij} is the expected value of the jth treatment in the ith trial, β_{i} is the main effect of the ith trial, γ_{j} is the main effect of the jth treatment, and u_{ij} is a trial × treatment interaction effect, which models heterogeneity between trials. For implementation it is convenient to represent the linear model (1a) in symbolic notation that is akin to model syntax used in linear model packages. We here use a notation originally proposed by Wilkinson and Rogers [9], which has hence been used by many authors [10,11] and has also been implemented in some linear model packages. The factors used for representing the model are given in Table 1.
Table 1. Description of factors used for representing factorial ANOVA models for NMA
The twoway ANOVA model (1) can be represented as [9]
where × is an operator for crossing two factors or model terms, S is a factor identifying the individual trial, and T denotes the treatment factor. Effects in (1b) can be equated with those in (1a) as follows: S ≡ β_{i}, T ≡ γ_{j}, and S.T ≡ u_{ij}.
In NMA, trials can be classified into groups of trials according to the set of treatments tested. These categories will henceforth be denoted as designs (Table 1). Procedures for detecting inconsistency can be easily implemented by expanding the ANOVA model (1) to reflect the nesting of trials within designs. The extended model is
where α_{h} is a main effect for the hth design, β_{hi} is an effect for the ith trial nested within the hth design, γ_{j} is the main effect of the jth treatment, v_{hj} is an interaction effect for the hth design and the jth treatment, and u_{hij} is an interaction effect for the ith trial (nested within the hth design) and the jth treatment. The interaction effect v_{hj} represents inconsistency, whereas u_{hij} represents heterogeneity as in model (1) [6,12]. Using the factor G to represent the design (Table 1), the symbolic version of the extended model (2a) is
where / is a nesting operator [9]. Note that the nesting relation G/S in (2b) is resolved as G/S = G + G.S. This structure is then fully crossed with T, as indicated by the crossing operator × on the lefthand side of eq. (2b). Effects in (2a) and (2b) can be equated as follows: G ≡ α_{h}, G.S ≡ β_{hi}, T ≡ γ_{j}, G.T ≡ v_{hj}, and G.S.T ≡ u_{hij}.
Linear predictors (1) and (2) can be used either in models for individual patient data or in models for treatment summaries per trial (e.g., empirical logits or treatment means) [6]. When individual patient data are modelled, then depending on the outcome variable it may be appropriate to use the linear predictor in a generalized linear (mixed) model [GL(M)M], e.g., when the response is binomial so that a logit or probit link is required. When summary measures are available, it is customary to model the response by a linear (mixed) model assuming normality and accounting for possible heterogeneity in precision by weighting. In the diabetes example by Senn et al. [7], we have at our disposal mean responses per treatment and trial as well as the associated sample standard deviations and sample sizes, from which the variance of a mean can be computed. Thus, the models used for our analyses are of the form
where y is the observed treatment mean in a trial, η is the linear predictor, modelled, e.g., using (1) or (2), and e is the random normal error associated with the observed mean. The errors are modelled to have zero mean and variance var(e) equal to the observed squared standard error of a mean, assumed to be a known constant when fitting (3). This analysis is easily performed using linear mixed model software with weighting facility by defining the inverse of var(e) as weight and fixing the residual variance at unity [13]. The approach is fully efficient, because the variancecovariance matrix of the vector of means y is diagonal with elements equal to var(e) [14].
Following Krahn et al. [1], we will initially consider analyses by models (1) and (2) when all effects in the linear predictor (eq. 1b or 2b) are taken as fixed. Subsequently, we will consider analyses that model heterogeneity [i.e., the interaction effects S.T and G.S.T in eqs. (1b) and (2b), respectively] as random, which is common practice (see, e.g., [6] and [15]). One may argue that if heterogeneity is detected, then the effect for heterogeneity may be used as an error term for testing inconsistency because heterogeneity effects are nested within the effects for inconsistency. This leads to an analysis with random interaction effects S.T or G.S.T. Conversely, one may insist that heterogeneity be modelled as a fixed effect. Then if heterogeneity is detected, it may be concluded that there is no further basis for testing inconsistency because of the nested structure of effects for heterogeneity in relation to inconsistency. In this situation, one may try to find subsets of trials that do not display heterogeneity and analyse these subsets separately [2]. This philosophy is in agreement with that put forward by Nelder [16], who argued that testing main effects in a twoway fixedeffects ANOVA is justified only when the interaction is deemed to be absent and the model is reduced accordingly. Here, we will present results for both approaches (interactions for heterogeneity fixed or random) and compare the results. Our favoured approach is to model heterogeneity as random when performing checks and tests for inconsistency as well as when comparing treatment means.
Results
The diabetes data comprise a total of 26 trials, most of which involve a glucose lowering agent added to a baseline sulfonylurea therapy. The continuous outcome variable is blood glucose change as measured by the marker HbA1c in patients with type two diabetes. There were fifteen different designs, including one threearmed trial and fourteen trials involving only two treatments. The network provides direct evidence for fifteen out of 45 possible pairwise contrasts. Eight of these contrasts involve the placebo. The ten different treatments are given in Table 2.
Fitting models (1) and (2)
We start by fitting models (1) and (2) as purely fixed effects models, which is equivalent to the models used by Krahn et al. [1]. Generally, throughout the example, we adhere to the order of effects as stated in models (1) and (2) and use sequential (incremental) fitting of terms, corresponding to Type I hypotheses in linear model procedures of the SAS system, which is used for all analyses presented in this paper. There are five designs that have more than one trial and so allow testing for heterogeneity. Thus, we first fit model (1) separately to each of these designs. The resulting Waldtype chisquared statistics for significance of the trial × treatment interaction (u_{ij}), along with the associated pvalues, are shown in Table 3. There is significant heterogeneity for four out of five designs.
Table 3. Waldtype chisquared tests for heterogeneity (u_{ij})
An overall test for heterogeneity is obtained by fitting model (2). The trial × treatment interaction (G.S.T) yields a chisquared statistic of 74.45 on 11 d.f. (p < 0.0001). This chisquared statistic for overall heterogeneity is equal to the sum of the chisquared statistics for heterogeneity for the five designs in Table 3. When dropping the effect G.T from the model, the Waldtest for the effect G.S.T becomes a joint test for inconsistency and heterogeneity. The chisquared statistic for this test equals 96.98 on 18 d.f. (p < 0.0001), and it is equal to Generalized Cochran’s Q [1]. Further note that the model T + S + T.S produces the same overall Q of 96.98. At this point, we can conclude that there is significant heterogeneity.
All chisquared statistics presented so far are identical to those in Table 3 of Krahn et al. [1], who used a model based on baseline contrasts. We also obtain their chisquared statistic for inconsistency, when we fit G.S.T as fixed and test the effect G.T (chisquared = 22.53, d.f. = 7, p = 0.0021). But we favor a mixed model analysis with random trial × treatment interaction (G.S.T), because we consider it the major error term for testing the design × treatment interaction (G.T), which assesses inconsistency. At the same time, the trial effect needs to be modelled as fixed in order to maintain equivalence with the baseline contrast approach [6,7]. When we take the interaction effect for heterogeneity (G.S.T) as random, assuming a constant variance for this effect, the chisquared statistic for inconsistency (G.T) drops to 2.27. The REML estimate of the variance for heterogeneity is . Note that this estimate corresponds to half the variance for heterogeneity with the baseline contrast approach [6,15] (usually denoted as τ^{2}). Since the test for inconsistency now involves an estimated variance component, we use the KenwardRoger method for approximating the denominator d.f. of a Waldtype Fstatistic [17]. We find F = 0.32 on 7 numerator and 11 denominator d.f. and p = 0.9268. By this analysis, there is no significant inconsistency, which is in contrast to the analysis with fixed effects for G.S.T. Note that this analysis treats the residual variances of the individual trials as known constants, although they are, in fact, estimated when analysing individual trials. The added uncertainty associated with these variance estimates could be accounted for by using the KenwardRoger method in a singlestage analysis modelling individual patient data [14], but differences compared to the twostage analysis employed here are expected to be small so long as the sample sizes per treatment and trial are large enough, as is usually the case.
A very simple further check for inconsistency is to fit both G.T and G.S.T as random. The best linear unbiased predictors (BLUPs) of the G.T effects give a direct indication which treatment × design combinations contribute most to the inconsistency. With the diabetes example, the variance component for G.T is estimated to be zero, so the BLUPs for all G.T effects are zero, which is in agreement with the nonsignificant Waldtest for inconsistency.
Locating inconsistency by detachment of individual designs
Locating inconsistency in the network may be based on a detachment of an individual design from the others by a suitable model formulation that allows testing the contribution of that individual design to inconsistency in the network as well as the inconsistency that remains after detaching that design. Krahn et al. [1] showed how to code a detachment model for baseline contrasts. Here, we show how to implement this approach based on a straightforward extension of the factorial model (2).
To illustrate, consider the first design in the diabetes example, which has fifteen designs, coded by a factor G. We may define a new factor D1 for the first design, which has two levels, one for the first design and another common level for the remaining fourteen designs (Table 4). Obviously, factors D1 and G have a hierarchical relationship, with G nested in D1. Thus, the interaction effect G.T, which assesses inconsistency, may be partitioned as
Table 4. Definition of detachment factors for testing inconsistency [D k .T; k ∈ (1,…,15)]
Fitting both effects on the righthand side of (4) simultaneously, the effect D1.T captures the contribution of the first design to the overall design × treatment interaction, i.e., to overall inconsistency, while the remainder of the interaction/inconsistency after detachment of the first design is captured by the effect D1.G.T. Using the syntax of Wilkinson and Payne [9] and observing the nesting of factors D1, G and S, the full model can be developed as follows:
The same partitioning can be done, in turn, for each of the other fourteen designs. The coding for factors Dk [k ∈ (1,…,15)], where k indexes the designs, is shown in Table 4. Table 5 shows the results of analysis by model (5) for the eleven out of fifteen designs which contribute to the design × treatment interaction of the network. The analysis was done taking the interaction for heterogeneity (D1.G.S.T) either as fixed or random. Note that in case of a fixed effect for heterogeneity the Waldtype chisquared statistics for D1.T and D1.G.T in (5) (see Table 5) add up to the chisquared statistic for overall inconsistency (G.T) in (2) (chisquared = 22.53), but not when heterogeneity is modelled as random. When the heterogeneity effect is modelled as fixed, there are five designs with a significant contribution to the inconsistency (Table 5). The strongest contribution comes from the design rosi:SUal, which was also detected by Krahn et al. [1] as being the major source of inconsistency. For this design, as well as for the design metf:SUal, the test of the remainder of the inconsistency (Dk.G.T) is nonsignificant, suggesting that one of these designs could be removed to instate consistency. When heterogeneity is modelled as random, however, there is no indication of inconsistency for any of the designs.
Table 5. Waldtype chisquared tests for inconsistency using detachment factors [D k .T; k ∈ (1,…,15)]
Using influence diagnostics for design × treatment means
In order to detect influential or outlying observations in the network, we use a twostage approach. In the first stage, we compute design × treatment means using model (2). In the second stage, we fit an additive twoway model of the form G + T to the design × treatment means. From this analysis, we can obtain residual and influence diagnostics [18,19] by standard procedures with most linear mixed model packages. The key idea is that observations contributing substantially to inconsistency will display strong G.T interaction effects, which in turn will be captured by the residuals of the additive model G + T.
Three different options are considered for handling the effect for heterogeneity (G.S.T) in the firststage analysis based on model (2): (i) taking it fixed, (ii) taking it random and (iii) dropping it. It turns out that with options (ii) and (iii), the treatment means of designs 3, 4, 7, 11 and 12 are correlated, meaning that weighting by the inverse of the squared standard errors is only an approximate method (note that the designs in question are precisely the ones represented by several trials). An exact analysis requires carrying the full variancecovariance matrix of design × treatment means forward and specifying this as the residual variancecovariance matrix of the model fitted at the second stage [14]. This is easily done in SAS using the REPEATED statement with the option TYPE = LIN(1). Note that option (iii) is in line with common practice when the baseline contrast formulation is used [1] and heterogeneity is deemed absent. But heterogeneity was found to be significant for the diabetes data, so one may argue that this effect should be in the model for checking consistency. If the effect is in the model and taken as fixed (option i), effectively all trials are given the same weight, whereas when the effect is dropped (option iii), each trial is weighted according to the variances of the means, which explains the differences in results. Both analyses are not fully satisfactory because heterogeneity is not appropriately taken into account. Taking heterogeneity as random (option ii) is common practice in metaanalysis [6,15], and this is also our preferred approach over option (i) for the reasons stated at the end of the Methods section.
To compute influence diagnostics in stage two, we here use the output generated by the INFLUENCE option to the MODEL statement of the MIXED procedure of SAS (Version 9.4). The PRESS residuals and studentized residuals are shown in Table 6. The PRESS residual for the mth observation is the raw residual when the mth observation has been deleted for estimating the fitted value. Large residuals indicate design × treatment combinations contributing substantially to the overall inconsistency. When in the first stage the effect for heterogeneity in model (2) (G.S.T) is modelled as fixed, or when the effect for heterogeneity is dropped, then designs 6 and 13 stand out as having conspicuously large studentized residuals relative to the standard normal distribution and relatively large PRESS residuals, which is in agreement with the tests in Table 5 and also with the results by Krahn et al. [1]. When heterogeneity is modelled as random, however, the studentized residuals of all designs are inconspicuous, which also agrees with our results in Table 5. The agreement of results based on PRESS residuals with the tests in Table 5 is expected because the Dk factor essentially invokes a deletion operation that separates the effect of the kth design from the rest, which is exactly the effect of PRESS residuals computed here (Table 6). It is noted that the residuals of twotreatment designs are equal in absolute value and of opposite sign, as expected. Observations with no residuals correspond to designs that do not contribute to the design × treatment interaction in the network.
Table 6. Studentized residuals and PRESS residuals
A further set of useful diagnostic statistics are the casedeletion estimates of treatment means. Figure 1 shows a casedeletion plot for all treatment means against observations that contribute to the design × treatment interaction. The analysis is based on design × treatment means computed with random effects G.S.T in (2). The plot identifies the same observations as influential that also showed up by relatively large studentized and PRESS residuals in Table 6. For example, the treatment mean for SUal is largely driven by observations 12 and 13 from design 6 (metf:SUal) and observations 26 and 27 from design 13 (rosi:SUal). Also, the mean of treatment piog is mostly governed by observations 16 to 19 from designs 8 (piog:plac) and 9 (piog:metf).
Figure 1. Casedeletion plot of treatment means. Casedeletion means based on a fit of the model G + T using design × treatment mean estimates obtained from fitting model (2) taking heterogeneity G.S.T as random. To obtain diagnostics for treatment means (factor T), we prevented an intercept from being fitted and imposed a sumtozero restriction on the design effects G.
Presenting multiple comparisons of treatment means
Since the inconsistency has been found to be nonsignificant when modelling heterogeneity as a random effect, we drop the inconsistency interaction (design × treatment) from model (2) and then compute adjusted treatment means. We perform all pairwise comparisons using the simulationbased method of Edwards and Berry [20] at a familywise significance level of 5%. Results are shown in Table 7. For ease of interpretation, we also compute a letter display using the algorithm described in Piepho [21]. According to the letter display, means sharing a common letter are not significantly different. It is seen that treatments acar, metf, migl, piog and rosi are significantly different from the placebo. Among these superior treatments, rosi has the smallest estimated mean, but this is not significantly different from the other treatments outperforming the placebo.
Table 7. Adjusted means for the ten treatments
In order to emphasize that the ANOVA implementation also yields estimates of pairwise treatment contrasts and the associated standard errors, as does the baseline contrast implementation, we report these statistics in Table 8. This information is part of the standard output of mixed model packages, but is not convenient for routine reporting in case of a larger number of treatments. The table of treatment means and the associated letter display provide a more compact summary of the network metaanalysis.
Table 8. Pairwise differences of the ten treatment means
Discussion and conclusions
This paper has illustrated how a factorial ANOVA approach can be used to perform NMA and to locate inconsistency in a given network. It was shown in Piepho et al. [6] that this analysis is either fully equivalent (summary measures, normal response in case of individual patient data) or very similar (individual patient data with nonnormal responses and nonidentity link functions in a GL(M)M framework) to the more commonly used approach to metaanalysis based on baseline contrasts. We think that the ANOVA approach has some practical advantages. Interpretation of results is facilitated by the focus on t treatment means rather than on t(t  1)/2 pairwise contrasts. Standard procedures for multiple comparison of treatment means further aid the communication of results. Also, the approach may be appealing to those familiar with ANOVA of factorial experiments. It has been demonstrated that standard diagnostic procedures for linear models can be used to identify influential designs in the network and to detect sources of inconsistency. The results obtained for the diabetes example agree very closely with those obtained using recently proposed procedures based on a baselinecontrast approach [1]. We hope that this paper will help to popularize the ANOVA approach as a viable and easytouse approach to NMA.
Competing interests
The author declares that there are no competing interests.
References

Krahn U, Binder H, König J: A graphical tool for locating inconsistency in network metaanalysis.
BMC Med Res Methodol 2013, 13:35. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Donegan S, Williamson P, D’Allesandro U, Smith CT: Assessing key assumptions of network metaanalysis: a review of methods.
Res Synth Meth 2013, 4(4):291323. Publisher Full Text

Lu G, Ades AE: Combining direct and indirect evidence in mixed treatment comparisons.
Stat Med 2004, 23:31053124. PubMed Abstract  Publisher Full Text

Higgins JPT, Jackson D, Barrett JK, Lu G, Ades AE, White IR: Consistency and inconsistency in network metaanalysis: concepts and models for multiarm studies.
Res Synth Meth 2012, 3:98110. Publisher Full Text

White IR, Barrett JK, Jackson D, Higgins JPT: Consistency and inconsistency in network metaanalysis: model estimation using multivariate metaregression.
Res Synth Meth 2012, 3:111125. Publisher Full Text

Piepho HP, Williams ER, Madden LV: The use of twoway mixed models in multitreatment metaanalysis.
Biometrics 2012, 68:12691277. PubMed Abstract  Publisher Full Text

Senn S, Gavini F, Magrez D, Scheen A: Issues in performing network metaanalysis.
Stat Methods Med Res 2013, 22(2):169189. PubMed Abstract  Publisher Full Text

Searle SR, Casella G, McCulloch CE: Variance Components. New York: John Wiley & Sons; 1992.

Wilkinson GN, Rogers CE: Symbolic description of factorial models for analysis of variance.
Appl Stat 1973, 22:392399. Publisher Full Text

McCullagh P, Nelder JA: Generalized Linear Models. 2nd edition. London: Chapman & Hall; 1989.

Piepho HP, Büchse A, Richter C: A mixed modelling approach to randomized experiments with repeated measures.
J Agron Crop Sci 2004, 190:230247. Publisher Full Text

Lu G, Welton NJ, Higgins JPT, White IR, Ades AE: Linear inference for mixed treatment comparison metaanalysis: A twostage approach.
Res Synth Meth 2011, 2:4360. Publisher Full Text

Möhring J, Piepho HP: Comparison of weighting in twostage analyses of series of experiments.
Crop Sci 2009, 49:19771988. Publisher Full Text

Piepho HP, Möhring J, SchulzStreeck T, Ogutu JO: A stagewise approach for analysis of multienvironment trials.
Biom J 2012, 54:844860. PubMed Abstract  Publisher Full Text

Whitehead A: MetaAnalysis of Controlled Clinical Trials. New York: Wiley & Sons; 2002.

Nelder JA: Linear models: back to basics.
Stat Comput 1994, 4:221234. Publisher Full Text

Kenward MG, Roger JH: Small sample inference for fixed effects from restricted maximum likelihood.
Biometrics 1997, 53:983997. PubMed Abstract  Publisher Full Text

Belsley DA, Kuh E, Welsch RE: Regression Diagnostics; Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons; 1980.

Cook RD, Weisberg S: Residuals and Influence in Regression. New York: Chapman and Hall; 1982.

Edwards D, Berry JJ: The efficiency of simulationbased multiple comparisons.
Biometrics 1987, 43:913928. PubMed Abstract  Publisher Full Text

Piepho HP: A SAS macro for generating letter displays of pairwise mean comparisons.
Prepublication history
The prepublication history for this paper can be accessed here: