Abstract
Background
In network meta-analyses, several treatments can be compared by connecting evidence from clinical trials that have investigated two or more treatments. The resulting trial network allows the relative effects of all pairs of treatments to be estimated, taking indirect evidence into account. For a valid analysis of the network, consistent information from different pathways is assumed. Consistency can be checked by contrasting effect estimates from direct comparisons with the evidence of the remaining network. Unfortunately, one deviating direct comparison may have side effects on the network estimates of others, thus producing hot spots of inconsistency.
Methods
We provide a tool, the net heat plot, to render transparent which direct comparisons drive each network estimate and to display hot spots of inconsistency: this permits singling out which of the suspicious direct comparisons are sufficient to explain the presence of inconsistency. We base our methods on fixed-effects models. For disclosure of potential drivers, the plot comprises the contribution of each direct estimate to network estimates resulting from regression diagnostics. In combination, we show heat colors corresponding to the change in agreement between direct and indirect estimate when relaxing the assumption of consistency for one direct comparison. A clustering procedure is applied to the heat matrix in order to find hot spots of inconsistency.
Results
The method is shown to work with several examples, which are constructed by perturbing the effect of single study designs, and with two published network meta-analyses. Once the possible sources of inconsistencies are identified, our method also reveals which network estimates they affect.
Conclusion
Our proposal is seen to be useful for identifying sources of inconsistencies in the network together with the interrelatedness of effect estimates. It opens the way for a further analysis based on subject matter considerations.
Keywords:
Network meta-analysis; Inconsistency; Cochran’s Q; Hat matrix
Background
Evidence from various treatment comparisons in different randomized trials can be combined by a network meta-analysis. This method not only aggregates evidence from direct comparisons, but also involves indirect comparisons, i.e. relative effect inferences for previously observed or not observed contrasts. References [1-4] give an overview of the recent methodological development. The validity of a network meta-analysis and, in particular, that of the indirect comparisons depends on a consistent network of treatment effects. However, there might be specific treatment effects in the network that lead to inconsistency, e.g. due to being based on studies with divergent patient or trial characteristics [5,6] or due to bias [7]. Perturbed treatment effects can strongly affect other network estimates, which induces further inconsistency between direct and indirect estimates. This calls for tools that can identify the flow of evidence in the network, i.e. that can highlight direct comparisons that strongly drive other treatment effect estimates and hot spots of network inconsistency.
In this context, inconsistency means disagreement between direct and indirect evidence that can occur in addition to heterogeneity between studies with the same treatment arms. A network meta-analysis can be visualized by a graph, whereby the set of nodes corresponds to the considered treatments and the edges display the treatment comparisons of all included trials. If corresponding treatment effect estimates of various connections, or so-called paths, differ between two treatments, there is inconsistency. Since the start and end point for different alternative network paths are the same, inconsistency can only be detected in such network loops [8,9]. It is not possible to trace inconsistency back to a single comparison in a network that only includes one loop, but comparisons that are included in several loops may be identifiable as a unique source for a hot spot of inconsistency.
In the following, we therefore provide methods for identifying such hot spots, which might consist of loops, parts of loops or even just single comparisons. We also investigate the influence of individual comparisons on the network estimates that might drive further perturbation and invalid network estimates due to the network design.
Different approaches to assess inconsistency have been discussed. The series of Technical Support Documents produced by the NICE Decision Support Unit [10] provides a detailed review of methods on this topic. The oldest method to assess inconsistency is to examine it in three-treatment loops [11]. For example, Cipriani et al. [12] apply it to every three-treatment loop in the network. While including larger loops as well, Salanti et al. [6] systematically repeat the method for every loop in the network. Another method to assess inconsistency is to set up a mixed model with a sparse covariance structure that allows for one extra variance component to capture inconsistency; this was performed in a classical likelihood framework [8] and in a Bayesian framework [9,13].
Finally, consistency can be assessed by comparing a model that satisfies only some consistency restrictions (or no restrictions at all) with the consistency model. The node-splitting method [14] extends the consistency model by only one parameter that captures the difference between a contrast, e.g. treatment A versus treatment B, that is assessed in all direct comparisons and the same contrast assumed to be valid from the indirect evidence. Unfortunately, the definition of the indirect evidence is not quite clear for multi-armed studies, and the node-splitting methods were recognized as depending on the choice of reference treatment in multi-armed studies [15,16]. Recently, Higgins et al. [15] and White et al. [16] have set up a modeling paradigm where studies are distinguished by design, i.e. by the full set of treatments compared. In this case, the effect of a contrast, e.g. between treatment A and treatment B, may differ in the full inconsistency model depending on being estimated in two-armed studies or e.g. in three-armed studies containing additionally treatment C, or treatment D. In their model, inconsistency is no longer a violation of some equations that reflect loops, but rather model parameters reflecting design-by-treatment interactions. Lu et al. [9] and White et al. [16] have used the term inconsistency degrees of freedom for the difference in the number of parameters between the full inconsistency model and the consistency model, but Lu et al. [9] defined them without distinguishing direct evidence from two-armed and multi-armed studies.
Lu and Ades [9] gave preference to a Bayesian approach and favored random-effects models that include inconsistency factors as random effects. Senn et al. [17] cautioned against random-effects analysis and pointed out (as did [18]) that in fixed-effects models with variances assumed to be known, a Cochran-type chi-squared statistic results for the overall heterogeneity in the network. Caldwell et al. [19] proposed a chi-squared statistic for testing the consistency of independent network paths between one pair of treatments. White et al. [16] proposed a global Wald chi-squared test for all design-by-treatment interaction parameters (treated as fixed effects), applied in a model with random effects for heterogeneity within designs that is fitted via restricted maximum likelihood. This test may lack power by implicitly attributing part of the inconsistency to heterogeneity.
In this paper, we will define another global chi-squared test for inconsistency that results from comparing a fixed-effects model for inconsistency with a consistency model. It will emerge as a part of the decomposition of Cochran’s Q statistic into components accounting for heterogeneity among studies sharing the same design and inconsistency.
Once inconsistency has been assessed globally, means are needed to find its sources. Senn et al. [17] inspected the squared Pearson residuals at the study level, which sum up to the overall Q chi-squared statistic. The design-by-treatment interaction parameters introduced by [15,16] may be used to spot inconsistency. Unfortunately, the definition of these parameters relies on the ordering of treatments. More generally, all regression diagnostic methods (see [20]) can be applied. References [21,22] have discussed this for classical meta-analysis and [23] for network meta-analysis. There are some attempts to visualize network meta-analysis for assessing heterogeneity, including inconsistency [6,9,14,17,23,24]. None of these have met with general acceptance as yet, and they do not address the needs as well as the forest plot does in classical meta-analysis, which simultaneously discloses each study’s weight and deviation from the pooled estimate. Notably, for this purpose, the Galbraith plot [25], although far less commonly used, is even better suited.
In the following, we systematically develop a graphical tool for highlighting hot spots of inconsistency by considering the detailed change in inconsistency when detaching the effect of studies with the same treatment arms. Furthermore, we identify drivers for the network estimates. Highlighting of inconsistency will provide more information than just singling out inconsistent loops. We provide a matrix display that summarizes network drivers and inconsistency in two dimensions, such that it may be possible to trace inconsistency back to single deviating direct comparisons. Naturally, it is difficult to display detailed network properties in just two dimensions, but we propose a clustering approach that automatically groups comparisons for highlighting hot spots.
Section “Methods” provides a detailed description of the different building blocks of our proposal: We present a fixed-effects model for network meta-analyses within the framework of general linear models with known variances in Section “Parameterization and two-stage analysis of a fixed-effects model in network meta-analysis”. Based on this model, we discuss the resulting hat matrix in Section “Identifying drivers via the hat matrix”, which we use as an instrument for identifying drivers. We suggest using a chi-squared statistic for the heterogeneity in the network, which we decompose into a test statistic for the inconsistency and a test statistic for the heterogeneity within groups of studies, classified according to which treatments are involved. A graphical tool that visualizes the network drivers and inconsistency hot spots is given in Section “Identifying hot spots of inconsistency”. Specifically, we use the inconsistency information along with detaching of single component meta-analyses to locate inconsistency hot spots. All the steps in Section “Methods” are illustrated using artificial examples. Section “Results” then provides results for two published network analyses. Finally, we discuss our methods and results in Section “Discussion”, and we provide concluding remarks in Section “Conclusions”.
Methods
In the following, we provide a fixed-effects model for network meta-analyses, on which we base our further analysis. We present tools to identify hot spots of inconsistency in the network and drivers with a high impact on network estimates. Using these two tools, we provide a graphical display to locate potential sources of inconsistency.
Parameterization and two-stage analysis of a fixed-effects model in network meta-analysis
We consider a network meta-analysis with T+1 treatments A_{0},…,A_{T}, in which A_{0} represents a reference treatment. A total of S studies compare these treatments, such that a graphical representation of the comparison network, with treatments as nodes and edges linking treatments directly compared in some study, forms a connected graph (see e.g. Figure 1a). We collect all studies s (s=1,…,S) in the set 𝒮 and classify each study by the number of included treatments N_{s} and by a design index d=1,…,D according to which treatments are involved (see [15,16,23] for a similar approach). We define 𝒮_{d} as the subset of studies with the same design d, which includes N_{d} different treatments.
Figure 1. Network design and hat matrix of an illustrative network meta-analysis. In a), the network design of an illustrative example is given: six treatments and eight different observed designs based on two-armed studies. The nodes correspond to the treatments, and the edges show which treatments are directly compared. The thickness of an edge represents the inverse standard error (V_{d}^{dir})^{−1/2}, which is equal to one for all designs. In b), the resulting hat matrix at the design level is given in percent, which indicates the contribution of the direct estimate in design d (shown in the column) to the network estimate in design d’ (shown in the row). In addition, the absolute values of the matrix elements are visualized by the area of the gray squares.
For a fixed-effects analysis, this network can be written in matrix notation as the following general linear model with heteroscedastic sampling variances:
$$Y = X\theta^{net} + \epsilon \qquad (1)$$
Y is a vector of observed treatment effects of all S studies, e.g. log odds ratios for a binary outcome, and the design matrix X with T columns contains the structure of the network at the study level. For all studies of one design, we choose the same reference treatment. Assuming a consistent network, we estimate the vector of basic parameters θ^{net} (in the terminology of [9]) corresponding to the treatment effects of all T comparisons to the reference treatment. By considering linear combinations of them, we can then infer all other effects of the network. The vector ε comprises all error terms of the model with E(ε)=0 and known covariance matrix V, which has a diagonal form. The length of the vectors Y and ε as well as the number of rows of X and V depend on the number and design of the included studies. Each two-armed study provides one entry to Y, one entry to the diagonal of V, and one row to X. We deal with the case of multi-armed studies separately in Section “Multi-armed studies”.
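To exemplify the model components, we consider a simple network meta-analysis with three treatments A_{0},A_{1},A_{2} (T=2) and four observed studies (S=4): two studies (s=1,2) for comparison A_{1} versus A_{0} (d=1), one study (s=3) for comparison A_{2} versus A_{0} (d=2), and one study (s=4) for comparison A_{2} versus A_{1} (d=3). Then the basic parameters are θ_{0:1}^{net} and θ_{0:2}^{net} for the contrast of treatment A_{1} versus A_{0} (d=1, named 0:1) and the contrast of treatment A_{2} versus A_{0} (d=2, named 0:2). Under the consistency assumption, it follows that the effect of A_{2} versus A_{1} is θ_{1:2}^{net}=θ_{0:2}^{net}−θ_{0:1}^{net}. Let Y_{s} be the observed effect and V_{s} the corresponding sampling variance in study s. We then have:
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{pmatrix}, \quad \theta^{net} = \begin{pmatrix} \theta^{net}_{0:1} \\ \theta^{net}_{0:2} \end{pmatrix}, \quad V = \mathrm{diag}(V_1, V_2, V_3, V_4)$$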
The vector of the basic parameters θ^{net} can be estimated in a classical frequentist manner by generalized least squares as follows:
$$\hat{\theta}^{net} = (X' V^{-1} X)^{-1} X' V^{-1} Y \qquad (2)$$
which is sometimes referred to as the Aitken estimator [26].
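As a minimal numerical sketch of the generalized least squares step, the three-treatment example can be coded as follows. The effect sizes and variances are made-up illustrative numbers, and the helper name `gls` is our own; only the structure of X follows the example in the text.

```python
import numpy as np

# Hypothetical log odds ratios and sampling variances for the four studies
# (illustrative values only, not taken from any trial).
Y = np.array([0.40, 0.60, 0.90, 0.35])      # Y_1, ..., Y_4
V = np.diag([0.04, 0.09, 0.05, 0.06])       # known diagonal covariance matrix

# Study-level design matrix; columns are theta_{0:1} and theta_{0:2}.
X = np.array([
    [ 1.0, 0.0],   # s=1: A1 versus A0
    [ 1.0, 0.0],   # s=2: A1 versus A0
    [ 0.0, 1.0],   # s=3: A2 versus A0
    [-1.0, 1.0],   # s=4: A2 versus A1 = theta_{0:2} - theta_{0:1}
])

def gls(X, V, Y):
    """Generalized least squares estimate and its covariance matrix."""
    W = np.linalg.inv(V)
    cov = np.linalg.inv(X.T @ W @ X)
    return cov @ X.T @ W @ Y, cov

theta_net, cov_net = gls(X, V, Y)
theta_12 = theta_net[1] - theta_net[0]      # derived contrast A2 versus A1
print(theta_net, theta_12)
```

The derived contrast A_{2} versus A_{1} is obtained as a linear combination of the basic parameters, as the consistency assumption requires.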
This estimation can equivalently be performed in two steps (as discussed in [18,23]). First, D meta-analyses with inverse-variance weighting provide pooled estimates and their variances per study design. Second, model (1) treats the results of these component meta-analyses just like single study observations. The inverse-variance weighted estimates of the first step are:
$$\hat{\theta}^{dir}_d = V^{dir}_d \sum_{s \in \mathcal{S}_d} V_s^{-1} Y_s \qquad (3)$$
$$V^{dir}_d = \Big( \sum_{s \in \mathcal{S}_d} V_s^{-1} \Big)^{-1} \qquad (4)$$
Thus, evidence of all studies with the same treatment arms is initially summarized, resulting in estimated treatment effects and covariance matrices of so-called direct comparisons, since these comparisons are actually observed. In the second stage of the estimation, a linear model is fitted to the effect vector θ̂^{dir}=(θ̂_{1}^{dir}’,…,θ̂_{D}^{dir}’)’ of all summarized direct comparisons:
$$\hat{\theta}^{dir} = X_a \theta^{net} + \epsilon_a \qquad (5)$$
with E(ε_{a})=0 and Cov(ε_{a})=:V_{a}. The covariance matrix is given by V_{a}=diag(V_{1}^{dir},…,V_{D}^{dir}), and X_{a} is the compressed design matrix containing one set of rows for each design. In the case of two-armed studies, X_{a} contains exactly one row per design. In the example above we have:
$$X_a = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{pmatrix}$$
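The equivalence of the one-stage and the two-stage estimation can be checked numerically. The following sketch reuses the three-treatment example with made-up effects and variances; the variable names and the helper `gls` are our own assumptions.

```python
import numpy as np

# Hypothetical study-level data (illustrative values only).
Y = np.array([0.40, 0.60, 0.90, 0.35])
v = np.array([0.04, 0.09, 0.05, 0.06])           # sampling variances V_s
design = np.array([0, 0, 1, 2])                   # design index of each study
X = np.array([[1, 0], [1, 0], [0, 1], [-1, 1]], dtype=float)
X_a = np.array([[1, 0], [0, 1], [-1, 1]], dtype=float)

# Stage 1: inverse-variance pooling within each design.
V_dir = np.array([1.0 / np.sum(1.0 / v[design == d]) for d in range(3)])
theta_dir = np.array([V_dir[d] * np.sum(Y[design == d] / v[design == d])
                      for d in range(3)])

def gls(X, var, y):
    """Weighted least squares with a diagonal covariance given as a vector."""
    W = np.diag(1.0 / var)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Stage 2: fit the network model to the pooled direct estimates.
two_stage = gls(X_a, V_dir, theta_dir)
one_stage = gls(X, v, Y)
print(two_stage, one_stage)
```

Both fits return the same basic parameters, which is why the subsequent diagnostics can work at the design level.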
Multi-armed studies
We distinguish each set of multi-armed studies sharing the same set of treatments as a different design. That means that if we add a three-armed study for A_{2} versus A_{1} versus A_{0} to the example above, we consider a further design (d=4).
Since the effects observed in one multi-armed study cannot be inconsistent, we use one design-specific treatment as a study reference for each multi-armed study, e.g. A_{0} in all studies comparing A_{2} versus A_{1} versus A_{0}. Then, a study with M+1 arms adds to the vector Y of model (1) a vector Y_{s} of M treatment effects, one for each comparison to the reference. In our example we have the vector Y_{s}=(Y_{0:1},Y_{0:2})’ of comparison A_{1} versus A_{0} and comparison A_{2} versus A_{0}. Furthermore, the multi-armed study gives M rows for X with the corresponding contrasts. Since pairwise treatment effects of one study are correlated, the multi-armed study adds a block V_{s} of size M×M to the covariance matrix V of the sampling error ε (compare to [13,27]). In the case of multi-armed studies of design d, the summarized treatment effect θ̂_{d}^{dir} is a vector of length M with covariance matrix V_{d}^{dir} of size M×M. These summaries can be calculated in accordance with equations (3,4) [28] and can be used as observations in model (5). The design matrix X_{a} then contains M rows for the corresponding design. For the simple example above with four two-armed studies and one three-armed study (D=4), this means that:
$$X_a = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}$$
Identifying drivers via the hat matrix
In linear models, the hat matrix contains the linear coefficients that present each predicted outcome as a function of all observations. Its diagonal elements are known as leverages. They summarize the importance of the respective observation for the whole estimation. Observations with both high leverage and large residual are recognized as being highly influential [29].
In the context of network meta-analyses and model (5), the hat matrix is:
$$H = X_a (X_a' V_a^{-1} X_a)^{-1} X_a' V_a^{-1} \qquad (6)$$
Its rows are the linear coefficients of θ̂_{d}^{dir} (d=1,…,D) for the network estimate θ̂_{d’}^{net}, where θ̂_{d’}^{net} is the subvector of X_{a}θ̂^{net} corresponding to design d’. The coefficients are a generalization of the study weights in simple meta-analyses but do not sum to one. They take values between −1 and 1. While in simple meta-analyses the contribution of a study (or weight of a study) to the pooled estimate is proportional to the precision of the study, in network meta-analyses the contribution of direct estimates to a network estimate is a function not only of their precision, but also of the network structure. Depending on the agreement of direct and indirect evidence, a large absolute entry in H indicates a strong influence of the respective direct estimate. Note that H is not necessarily symmetric, and for multi-armed studies the choice of the reference treatment affects the corresponding coefficients.
In network meta-analyses, the diagonal elements of H have a special role. In a connected network, the trace of H equals T, the number of parameters of model (5). In fact, each network estimate can be written as a weighted mean of a direct estimate, which is based on all comparisons involving only the given two treatments, and the indirect estimate, which is based on all other studies. The diagonal element of H is identical to the weight of the direct estimate in this presentation. Unlike in many regression applications, the off-diagonal elements of H deserve special attention in network meta-analyses. The smaller the diagonal element, the more weight is given to indirect evidence. This is also discussed in [18]. The off-diagonals indicate which study designs contribute in an essential way to the indirect part of the network estimate.
As an illustration of the hat matrix, we use an example of a network meta-analysis with six treatments (T=5) and eight different observed designs (D=8) based on two-armed studies (N_{d}=2 for all d=1,…,8). The corresponding network is shown in Figure 1a), where the nodes correspond to the treatments and the edges show which treatments are directly compared. The thickness of an edge represents the inverse standard error (V_{d}^{dir})^{−1/2}, which is equal to one for all d in our example (V_{a}=I_{8}, where I_{8} is the identity matrix of size eight). For one design there might, for example, be one study with V_{s}=1 or 100 studies with V_{s}=100 each. The resulting hat matrix at the design level is given in percent in Figure 1b). In addition, the absolute values of the matrix elements are visualized by the area of the gray squares.
The diagonal squares indicate that the network estimates are predominantly driven by their corresponding direct estimates, each contributing more than 50%. The diagonal squares are largest for the edges 1:6 and 3:4, which connect the two triangles. Their direct estimates drive 70% of their network estimates. The smallest diagonal squares are seen for the edges 1:3 and 4:6 (direct estimates drive 53%), since these are paralleled by two independent indirect paths, whereas 1:6 and 3:4 are paralleled by only one. Inspecting the off-diagonal squares, we learn that aside from their own direct estimates, the network estimates θ̂_{1:2}^{net} and θ̂_{2:3}^{net} are driven by the other corresponding direct estimate and then by θ̂_{1:3}^{dir}. Due to symmetry, the same holds for the edges involved in the triangle {4:5, 5:6, 4:6}.
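The hat matrix of this example can be reproduced with a few lines of NumPy. The sketch below assumes the eight designs listed in the code (two triangles joined by the edges 1:6 and 3:4), treatment 1 as reference and unit variances; the helper `design_matrix` and the row orientation "b versus a" are our own choices.

```python
import numpy as np

# The eight two-armed designs of the illustrative network (assumed ordering).
designs = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def design_matrix(designs, n_treat, ref=1):
    """One row per design, coding 'b versus a' via basic parameters versus ref."""
    X = np.zeros((len(designs), n_treat))
    for row, (a, b) in enumerate(designs):
        X[row, a - 1] = -1.0
        X[row, b - 1] = 1.0
    return np.delete(X, ref - 1, axis=1)    # the reference has no parameter

X_a = design_matrix(designs, 6)
V_a = np.eye(len(designs))                  # all direct variances equal one
W = np.linalg.inv(V_a)

# Hat matrix at the design level: rows are network estimates,
# columns are direct estimates.
H = X_a @ np.linalg.inv(X_a.T @ W @ X_a) @ X_a.T @ W

for (a, b), h in zip(designs, np.diag(H)):
    print(f"{a}:{b}  direct estimate drives {100 * h:.0f}%")
print("trace:", round(float(np.trace(H)), 6))
```

The diagonal reproduces the percentages quoted in the text (63%, 53% and 70%), and the trace equals T=5.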
Identifying hot spots of inconsistency
Decomposition of Cochran’s Q
An important aspect in meta-analysis is to investigate statistical heterogeneity. In network meta-analysis, inconsistency arises as another aspect of heterogeneity. In a classical meta-analysis comparing two treatments, Cochran’s Q [30] is a well-accepted tool for assessing heterogeneity between studies; it is the sum of squared Pearson residuals. We use the generalized Cochran’s Q statistic for multivariate meta-analysis [27,31] in the context of network meta-analyses:
$$Q^{net} = (Y - X\hat{\theta}^{net})' V^{-1} (Y - X\hat{\theta}^{net}) \qquad (7)$$
To examine the heterogeneity of the whole network in more detail, particularly considering the inconsistency in the model, we decompose the Q^{net} statistic into two parts (similar to [32], who used a decomposition by study group in the context of classical meta-analysis):
$$Q^{net} = Q^{het} + Q^{inc} \qquad (8)$$
The first is a sum of within-design Q statistics
$$Q^{het} = \sum_{d=1}^{D} Q^{het}_d \qquad (9)$$
$$Q^{het}_d = \sum_{s \in \mathcal{S}_d} (Y_s - \hat{\theta}^{dir}_d)' V_s^{-1} (Y_s - \hat{\theta}^{dir}_d) \qquad (10)$$
The second is a between-designs Q statistic
$$Q^{inc} = (\hat{\theta}^{dir} - X_a\hat{\theta}^{net})' V_a^{-1} (\hat{\theta}^{dir} - X_a\hat{\theta}^{net}) \qquad (11)$$
The heterogeneity of the whole network can be attributed to the heterogeneity between studies by Q^{het}, related to each design d by Q_{d}^{het}, and otherwise to the inconsistency of the network by Q^{inc}. Under the null hypothesis of both homogeneity and consistency, all Q statistics ((7), (9), (10), (11)) are approximately chi-squared distributed with the respective degrees of freedom given in Table 1. The degrees of freedom of the chi-squared distribution corresponding to Q^{inc} are identical to those defined in [16]. All Q statistics are independent of the choice of design-specific reference treatment.
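The decomposition of Q^{net} into a within-design and a between-designs part can be verified numerically. The sketch below again uses the three-treatment example with made-up data; all variable names are our own, and only two-armed studies are covered.

```python
import numpy as np

# Hypothetical study-level data (illustrative values only).
Y = np.array([0.40, 0.60, 0.90, 0.35])
v = np.array([0.04, 0.09, 0.05, 0.06])
design = np.array([0, 0, 1, 2])
X = np.array([[1, 0], [1, 0], [0, 1], [-1, 1]], dtype=float)
X_a = np.array([[1, 0], [0, 1], [-1, 1]], dtype=float)

# Direct (pooled) estimates and variances per design.
V_dir = np.array([1.0 / np.sum(1.0 / v[design == d]) for d in range(3)])
theta_dir = np.array([V_dir[d] * np.sum(Y[design == d] / v[design == d])
                      for d in range(3)])

# Network estimate from the study-level model.
W = np.diag(1.0 / v)
theta_net = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Whole network, within designs, and between designs.
Q_net = np.sum((Y - X @ theta_net) ** 2 / v)
Q_het = np.sum((Y - theta_dir[design]) ** 2 / v)
Q_inc = np.sum((theta_dir - X_a @ theta_net) ** 2 / V_dir)
print(Q_net, Q_het, Q_inc)
```

In this small example only design 0:1 contains more than one study, so Q^{het} stems entirely from that design, while Q^{inc} measures the disagreement in the single loop.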
Table 1. The network Q statistics and the degrees of freedom of their corresponding chi-squared distribution
For example, for the network design in Figure 1a) we assume inconsistent treatment effects by θ̂^{dir}=(5,0,0,0,0,0,0,0)’, where each component meta-analysis corresponds to one study. The perturbation effect of five means that the contrast differs by five standard errors of a direct effect estimate. This may be a lot if the precision of the component meta-analysis is small. This effect was chosen here in order to achieve a reasonable power for illustration purposes.
In real applications, the power may be small [33] and a failure to detect inconsistency does not automatically imply consistency. Note, however, that a deviating effect cannot be absorbed into a heterogeneity variance component, unlike in random-effects models. Depending on the number of studies that inform a design, a single deviating study may inflate either Q^{inc} or Q^{het}. That is why inconsistency and heterogeneity must be considered jointly.
From the resulting network estimates, we obtain in the example an inconsistency statistic Q^{inc}=3.36+1.78+0.25+3.36+0.25+0.03+0.11+0.03=9.17, which is chi-squared distributed with 8−5=3 degrees of freedom. Since each design comprises a single study, there cannot be heterogeneity between studies, and Q^{inc} is identical to Q^{net} in this example.
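The summands of Q^{inc} in this example can be recomputed directly. The sketch assumes the design ordering 1:2, 1:3, 1:6, 2:3, 3:4, 4:5, 4:6, 5:6 and a perturbation of five in design 1:2; the helper names are our own, and the p-value uses the closed form of the chi-squared tail for three degrees of freedom.

```python
import math
import numpy as np

designs = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def design_matrix(designs, n_treat, ref=1):
    """One row per two-armed design, with treatment ref as reference."""
    X = np.zeros((len(designs), n_treat))
    for row, (a, b) in enumerate(designs):
        X[row, a - 1] = -1.0
        X[row, b - 1] = 1.0
    return np.delete(X, ref - 1, axis=1)

X_a = design_matrix(designs, 6)
V_dir = np.ones(8)                                   # unit variances
theta_dir = np.array([5.0, 0, 0, 0, 0, 0, 0, 0])     # perturbed design 1:2

W = np.diag(1.0 / V_dir)
theta_net = np.linalg.solve(X_a.T @ W @ X_a, X_a.T @ W @ theta_dir)
Q_parts = (theta_dir - X_a @ theta_net) ** 2 / V_dir  # squared Pearson residuals
Q_inc = Q_parts.sum()

def chi2_sf_3df(x):
    """Upper tail probability of a chi-squared distribution with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(np.round(Q_parts, 2), round(float(Q_inc), 2), round(chi2_sf_3df(Q_inc), 3))
```

With unit variances, Q^{inc} equals 25 times one minus the leverage of design 1:2, which ties the inconsistency statistic back to the hat matrix.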
If some of the component meta-analyses are heterogeneous, the others can still validly be tested by their Q_{d}^{het}. Even Q^{inc} has some interpretation in this case: The direct estimates are estimates of the inverse-variance weighted averages of different true but unknown study-specific treatment effects. Then, Q^{inc} with the same reference distribution provides a valid test of the hypothesis of consistency of these averaged treatment effects.
Detaching a single design
Once inconsistency is indicated by a large Q^{inc}, formula (11) can be used to assess the contribution of each component meta-analysis of design d to the inconsistency. In fact, Q^{inc} is the sum of quadratic forms of residuals over all designs. For simple comparisons between two treatments, the summands are squared Pearson residuals. Unfortunately, a deviating effect of one component meta-analysis can simultaneously inflate several residuals. Therefore, we fit a set of extended models allowing for a deviating effect of each study design in turn and recalculate the Q statistic. This procedure is equivalent to a ‘leave-one-out’ approach: Once per fit, studies with one design are left out of the network estimate to obtain an independent estimate of the treatment effect in design d and to obtain a network model fit independent of studies with design d (for another leave-one-out approach, see [14]).
More formally, we modify model (5) by inserting N_{d}−1 new parameters into the parameter vector, one for each pairwise treatment comparison in design d to the design-specific reference. The design matrix of the new model needs an extra column for each new parameter, here denoted as the set of indicator vectors forming the columns of U_{d}, with entry one for the respective pairwise comparison in d and entry zero for all other comparisons. So we add N_{d}−1 columns for a design with N_{d} treatments. Each additional column corresponds to one of the non-reference treatments. We have the following model:
$$\hat{\theta}^{dir} = X_a \theta^{net} + U_d \theta^{dirind}_d + \epsilon \qquad (12)$$
with E(ε)=0 and Cov(ε)=V_{a} as previously. In this model, the parameters θ^{net} capture all network evidence without the information from studies with design d, and the parameters θ_{d}^{dirind} denote the difference between direct and indirect effect estimate in design d. The latter is called a design-by-treatment interaction in White et al. [16], but in contrast to White et al., we only add extra columns for one design at a time. Remaining inconsistency in this model can be tested by the corresponding Q statistic:
$$Q^{inc}_{(d)} = R_{(d)}' V_a^{-1} R_{(d)}$$
which is chi-squared distributed with Σ_{d’}(N_{d’}−1)−T−(N_{d}−1) degrees of freedom (see Table 1). Here, the vector
$$R_{(d)} = \hat{\theta}^{dir} - X_a \hat{\theta}^{net}_{(d)} - U_d \hat{\theta}^{dirind}_d$$
of length Σ_{d’}(N_{d’}−1) contains the residuals, which are identical to those of a consistency model fitted after holding out design d. For design d itself, the residuals equal zero.
For illustration purposes, we successively introduce one new parameter for each of the eight possible detachments of one component meta-analysis into the inconsistent network example from Section “Decomposition of Cochran’s Q” corresponding to Figure 1. For design 1:2, we use a parameter vector extended by θ_{1:2}^{dirind} in model (12), in combination with the design matrix X_{a} extended by the indicator column for design 1:2. With θ̂^{dir}=(5,0,0,0,0,0,0,0)’ this results in a remaining inconsistency statistic Q_{(1:2)}^{inc}=0, which is chi-squared distributed with two degrees of freedom. For design 1:3, we correspondingly obtain Q_{(1:3)}^{inc}=5.36, which is also chi-squared distributed with two degrees of freedom.
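Detaching a design amounts to appending an indicator column to the compressed design matrix and refitting. The following sketch does this for all eight designs of the illustrative network; it is restricted to two-armed designs, and the function names `residual_summands` and `detach` are our own.

```python
import math
import numpy as np

designs = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def design_matrix(designs, n_treat, ref=1):
    """One row per two-armed design, with treatment ref as reference."""
    X = np.zeros((len(designs), n_treat))
    for row, (a, b) in enumerate(designs):
        X[row, a - 1] = -1.0
        X[row, b - 1] = 1.0
    return np.delete(X, ref - 1, axis=1)

def residual_summands(X, V_dir, theta_dir):
    """Squared Pearson residuals of a weighted least squares fit."""
    s = 1.0 / np.sqrt(V_dir)
    theta, *_ = np.linalg.lstsq(X * s[:, None], theta_dir * s, rcond=None)
    return (theta_dir - X @ theta) ** 2 / V_dir

def detach(X_a, d):
    """Append the indicator column that relaxes consistency for design d."""
    u = np.zeros((X_a.shape[0], 1))
    u[d, 0] = 1.0
    return np.hstack([X_a, u])

X_a = design_matrix(designs, 6)
V_dir = np.ones(8)
theta_dir = np.array([5.0, 0, 0, 0, 0, 0, 0, 0])    # perturbed design 1:2

Q_detached = {f"{a}:{b}": float(residual_summands(detach(X_a, d), V_dir, theta_dir).sum())
              for d, (a, b) in enumerate(designs)}
# With two degrees of freedom the chi-squared tail is exp(-Q/2).
p_detached = {k: math.exp(-q / 2) for k, q in Q_detached.items()}
print(Q_detached)
```

Detaching design 1:2 or 2:3 removes the inconsistency completely, whereas detaching 1:6 leaves a statistic of about 8.33 with a p-value of about 0.016.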
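Finally, to locate the inconsistency in the network, we compare the remaining inconsistency after exclusion of design d studies to the inconsistency before exclusion for all designs d’=1,…,D by:
$$Q^{diff}_{d,d'} = Q^{inc}_{d'} - Q^{inc}_{d'(d)}$$
Here,
$$Q^{inc}_{d'} = (\hat{\theta}^{dir}_{d'} - \hat{\theta}^{net}_{d'})' (V^{dir}_{d'})^{-1} (\hat{\theta}^{dir}_{d'} - \hat{\theta}^{net}_{d'})$$
is the summand in Q^{inc} belonging to design d’ (it is Q^{inc}=Σ_{d’}Q_{d’}^{inc}), and Q_{d’(d)}^{inc} is the corresponding part from Q_{(d)}^{inc} and model (12). Since Q_{d’(d)}^{inc}=0 for d’=d, it follows that in this case Q_{d,d}^{diff}=Q_{d}^{inc}. In other words, Q_{d,d’}^{diff} is the reduction of the squared standardized residual for design d’ due to elimination of design d studies.
In the example, holding out design 1:2 results in a perfect fit of model (12), and we obtain Q_{1:2,d’}^{diff}=Q_{d’}^{inc} for all d’ in {1:2,…,5:6} since Q_{(1:2)}^{inc}=0. For the detachment of the component meta-analysis with design 1:3, we obtain Q_{1:3,1:2}^{diff}=3.36−1.15=2.21, Q_{1:3,1:3}^{diff}=1.78−0=1.78, and so on.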
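The full matrix of changes in inconsistency can be assembled by repeating the detachment for every design. In the sketch below, rows correspond to d’ and columns to the detached design d; the name `Q_diff` and the helpers are our own labels, and only two-armed designs are handled.

```python
import numpy as np

designs = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

def design_matrix(designs, n_treat, ref=1):
    """One row per two-armed design, with treatment ref as reference."""
    X = np.zeros((len(designs), n_treat))
    for row, (a, b) in enumerate(designs):
        X[row, a - 1] = -1.0
        X[row, b - 1] = 1.0
    return np.delete(X, ref - 1, axis=1)

def residual_summands(X, V_dir, theta_dir):
    """Squared Pearson residuals of a weighted least squares fit."""
    s = 1.0 / np.sqrt(V_dir)
    theta, *_ = np.linalg.lstsq(X * s[:, None], theta_dir * s, rcond=None)
    return (theta_dir - X @ theta) ** 2 / V_dir

def detach(X_a, d):
    """Append the indicator column that relaxes consistency for design d."""
    u = np.zeros((X_a.shape[0], 1))
    u[d, 0] = 1.0
    return np.hstack([X_a, u])

X_a = design_matrix(designs, 6)
V_dir = np.ones(8)
theta_dir = np.array([5.0, 0, 0, 0, 0, 0, 0, 0])    # perturbed design 1:2

Q_before = residual_summands(X_a, V_dir, theta_dir)
# Column d holds the reduction of each squared residual after detaching design d.
Q_diff = np.column_stack([
    Q_before - residual_summands(detach(X_a, d), V_dir, theta_dir)
    for d in range(len(designs))
])
print(np.round(Q_diff, 2))
```

Positive entries would be drawn in warm colors and negative entries in blue; in the column of design 1:3 the rows 1:6, 3:4 and 4:6 are negative, in line with the increase in their residuals described for scenario a).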
The net heat plot
For a graphical inspection of network inconsistency, we use a color visualization of the square matrix Q^{diff}=(Q_{d,d’}^{diff}), which we call a net heat plot in the following. Warm colors in this plot (yellow through orange to red) indicate a positive Q_{d,d’}^{diff}. A negative Q_{d,d’}^{diff} is illustrated by blue colors. Because the scalars on the diagonal of the matrix, which sum to the Q^{inc} statistic, are non-negative, the corresponding diagonal elements of the plot have non-blue colors. Warm colors on the off-diagonal of the plot indicate that a detachment of the component meta-analysis with design d (shown in the columns) reduces the inconsistency at design d’ (shown in the rows). The inconsistency between direct and indirect evidence at design d’ before the detachment is indicated by the color of the diagonal element d’. An increase in inconsistency is indicated by blue colors. The stronger the intensity of the color, the greater the difference between the inconsistency before and after the detachment of studies with design d. The color scale of the whole plot is implemented to reach its maximum intensity for absolute values greater than or equal to eight.
Designs where only one treatment is involved in other designs of the network (for example design 6:7 in Figure 2) or whose removal would split the network (for example design 3:4 in Figure 2) do not contribute to the inconsistency assessment and are not incorporated into the net heat plot.
Figure 2. Network design of an illustrative network meta-analysis. The nodes correspond to eight treatments and the edges display observed treatment comparisons. Designs 6:7 and 3:4 do not contribute to the inconsistency assessment and are not incorporated into a net heat plot.
For the arrangement of the rows and columns of the plotted matrix, we use the sum of the absolute distances between the rows and the absolute distances between the columns of Q^{diff} for complete linkage clustering (see for example [34]). This results in colored block structures that potentially indicate hot spots of inconsistency.
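The ordering step can be sketched as follows. The distance sums the absolute differences between two rows and between the two corresponding columns, and a plain agglomerative loop with complete linkage returns one leaf order by concatenating merged clusters. This is a simplified stand-in rather than the implementation behind the published plots; the function names and the small example matrix are our own.

```python
import numpy as np

def row_col_distance(M):
    """d(i,j) = sum_k |M[i,k]-M[j,k]| + sum_k |M[k,i]-M[k,j]|."""
    rows = np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)
    cols = np.abs(M.T[:, None, :] - M.T[None, :, :]).sum(axis=2)
    return rows + cols

def complete_linkage_order(D):
    """Merge the two closest clusters until one is left; return its leaf order."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                dist = max(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]

# Hypothetical 4x4 heat matrix with a hot block formed by designs 0 and 2.
M = np.array([[4.0, 0.1, 3.5, 0.0],
              [0.1, 0.2, 0.0, 0.1],
              [3.5, 0.0, 4.0, 0.2],
              [0.0, 0.1, 0.2, 0.3]])
order = complete_linkage_order(row_col_distance(M))
print(order, M[np.ix_(order, order)], sep="\n")
```

After reordering, the two designs of the hot block sit next to each other, so their warm entries form a contiguous block on the diagonal.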
In the plot we also draw gray squares, as shown in Figure 1b), with areas proportional to the corresponding absolute elements of the hat matrix from equation (6). The larger the square is, the stronger the direct estimate of design d drives the network estimate of design d’. Consequently, a design d with large squared Pearson residuals strongly influences design d’. The combination of the color for the inconsistency and the differently sized squares results in the visual appearance of a halo that relays both types of information at the same time (see for example [35] for use of such halo visualizations in a different context).
Further illustrative examples
To illustrate the application of the net heat plot, we consider the network example from the previous sections and Figure 1 as well as four additional network meta-analysis examples with six treatments and six, eight, or all possible fifteen component meta-analyses based on two-armed studies (d in {1:2, 1:3,…,5:6} with N_{d}=2). These networks are displayed as graphs in Figures 3a) to e) on the left side, where the edges correspond to the different direct comparisons. The thickness of an edge represents the inverse standard error (V_{d}^{dir})^{−1/2}, which is equal to one for all d. We have produced an inconsistent network of treatment effects by adding δ=5 to one treatment effect θ̂_{d}^{dir}, while all other effects of the network remain zero.
Figure 3. Five illustrative network meta-analyses with net heat plot. In a) to e), the network design is shown on the left: six treatments and six, eight or fifteen different observed designs based on two-armed studies. The nodes are placed on the circumcircle and are labeled according to the treatments. The edges show which treatments are directly compared. The thickness of an edge represents the inverse standard error (V_{d}^{dir})^{−1/2}, which is equal to one for all designs. We introduced inconsistency by perturbing the effect of one edge (marked in red) by five standard errors of the direct effect estimate. The corresponding net heat plots are shown on the right side: The area of the gray squares displays the contribution of the direct estimate in design d (shown in the column) to the network estimate in design d’ (shown in the row). The colors are associated with the change in inconsistency between direct and indirect evidence in design d’ (shown in the row) after detaching the effect of design d (shown in the column). Blue colors indicate an increase and warm colors indicate a decrease (the stronger the intensity of the color, the stronger the change).
Because the network structures and the assumed precisions of the direct effects are the same in scenarios a) to c), they share the same hat matrix, which is discussed in Section “Identifying drivers via the hat matrix” and illustrated in Figure 1b). That is why the net heat plots in Figures 3a) to c) contain the same gray squares, just ordered differently due to the clustering.
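As a rough check of this shared hat matrix, the self-driving proportions quoted for scenarios a) to c) can be recomputed. The following Python sketch assumes that the hat matrix is the usual weighted least squares projection H = X (X'WX)^{-1} X'W, that treatment 1 serves as the reference, and that the eight designs are those named in the scenario descriptions (1:2, 1:3, 2:3, 1:6, 3:4, 4:6, 4:5, 5:6), each with inverse variance one. The function names are illustrative and are not taken from the netheat.R code.

```python
from fractions import Fraction

# Eight two-armed designs of scenarios a) to c), as named in the text (assumption).
EDGES = [(1, 2), (1, 3), (2, 3), (1, 6), (3, 4), (4, 6), (4, 5), (5, 6)]
N_TREAT = 6


def design_row(a, b):
    """Design matrix row for contrast a:b, with treatment 1 as reference."""
    row = [Fraction(0)] * (N_TREAT - 1)
    if a > 1:
        row[a - 2] = Fraction(-1)
    if b > 1:
        row[b - 2] = Fraction(1)
    return row


def invert(m):
    """Gauss-Jordan inverse of a small square matrix of Fractions."""
    n = len(m)
    aug = [list(r) + [Fraction(int(i == j)) for j in range(n)]
           for i, r in enumerate(m)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [r[n:] for r in aug]


def hat_matrix(edges, weights):
    """H = X (X'WX)^{-1} X'W for direct estimates with inverse variances `weights`."""
    X = [design_row(a, b) for a, b in edges]
    p = N_TREAT - 1
    xtwx = [[sum(w * x[i] * x[j] for x, w in zip(X, weights)) for j in range(p)]
            for i in range(p)]
    inv = invert(xtwx)
    H = []
    for xr in X:
        t = [sum(xr[i] * inv[i][j] for i in range(p)) for j in range(p)]
        H.append([sum(t[j] * xc[j] for j in range(p)) * w
                  for xc, w in zip(X, weights)])
    return H


H = hat_matrix(EDGES, [Fraction(1)] * len(EDGES))
for k, (a, b) in enumerate(EDGES):
    print(f"design {a}:{b}  self-driving proportion = {float(H[k][k]):.0%}")
```

Under these assumptions the diagonal elements for designs 1:2, 1:3, and 1:6 are 19/30, 8/15, and 7/10, which round to the 63%, 53%, and 70% reported for the scenarios. Exact fractions are used so that the result does not depend on floating-point rounding.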
In scenario a), inconsistency is introduced through the treatment effect in design 1:2. The overall inconsistency statistic is Q^{inc}=9.17 (p=0.027, see Table 2). In the net heat plot, the color intensities of the diagonal elements indicate that the squared Pearson residual for design 1:3 and especially the residuals for the designs 1:2 and 2:3 almost solely contribute to Q^{inc}. The latter ones have higher residuals, although their direct estimates drive their network estimates more strongly, with 63% in contrast to 53% in the case of design 1:3. This can be seen in the hat matrix elements that are displayed here by the area of the squares. The warm-colored off-diagonal elements in the column of design 1:2 or 2:3 are equal to the colors on the diagonal, which indicates a complete elimination of inconsistency in the whole network after relaxing design 1:2 or 2:3. This is also recognizable by the inconsistency statistics after detaching design 1:2 or design 2:3 in Table 2, each with a p value of one. A detachment of design 1:3 does not reduce all residuals but increases those of the designs 1:6, 3:4, and 4:6, as indicated by the blue colors. Relaxing other designs causes only little change to the squared Pearson residuals. For example, relaxing design 1:6 weakly reduces the residuals of designs 1:2 and 2:3 but inflates the residual of design 1:3 and increases the inconsistency in the whole network (p=0.016 for the inconsistency statistic after detaching design 1:6). Due to the arrangement of the rows and columns in the plot (as explained in Section “The net heat plot”), we can see a hot spot of inconsistency between the effects of the component metaanalyses with designs 1:2, 2:3, and 1:3 by the warm-colored block on the diagonal; however, the effect of 1:3 is supported by other evidence of the network, shown by the blue-colored elements in the row and column of design 1:3. Altogether, designs 1:2 and 2:3 can be identified as a source of inconsistency in the network.
However, to be able to understand whether the effects of the component metaanalyses of both designs are the source or whether only one of them is, we need more network connectivity so that they are included solely in network loops. The squares in the columns of the two identified designs show that the corresponding treatment effects drive the network estimate of design 1:3, which is therefore perturbed. Although attenuated, driving is also observed in designs 1:6, 3:4, and 4:6, as far as the influence of the effect in design 1:2 (and 2:3) is sufficient.
Table 2. The inconsistency in the illustrative examples
In scenario b), we shifted the effect in design 1:6 analogously to scenario a) by δ=5 from the rest of the network. This causes a Q^{inc} of only 7.50 with p=0.058, which is mainly composed of the squared Pearson residuals of designs 1:3 and 4:6 and especially of the residuals of designs 1:6 and 3:4. Contrasting the colors and the size of squares on the matrix diagonal shows that the latter two hold the strongest inconsistency contribution, although their corresponding direct estimates drive their network estimates the most strongly. In this scenario, a detachment of the effect in designs 1:6 or 3:4 eliminates the inconsistency of the network. In contrast, relaxing one of the designs 1:2, 2:3, 4:5, or 5:6 only slightly reduces the inconsistency of the whole network (each with p=0.033), and a detachment of designs 1:3 or 4:6 even increases the inconsistency (each with p=0.033). In addition, in all six cases the squared Pearson residual of at least one other design is inflated. So in this scenario, we see a hot spot of inconsistency between designs 1:6, 3:4, 1:3, and 4:6 by the intense warm-colored block on the diagonal (4×4). The strongest inconsistency is between the effects in designs 1:6 and 3:4. Weaker inconsistency can be observed between the effects in the designs 1:2 and 2:3 as well as between the effects in 4:5 and 5:6. The effects of designs 1:3 and 4:6 are supported by the evidence of the designs 1:2 and 2:3 as well as 4:5 and 5:6, respectively. So in this scenario, designs 1:6 and 3:4 can be identified as a plausible source of inconsistency, and analogous to scenario a), the inconsistency-causing edge 1:6 cannot be distinguished from the jointly acting edge 3:4, although in this example these two are not adjacent edges.
The squared Pearson residuals for the two identified designs, shown on the diagonal of the plot, are smaller than the residuals of the designs 1:2 and 2:3 in scenario a), although in both scenarios a perturbation is introduced with δ=5. This is because the corresponding network estimates are more strongly driven by their direct estimates, with 70% each rather than 63% as in designs 1:2 and 2:3. The squares in the columns of the two identified designs indicate that they drive the network estimates in designs 1:3 and 4:6 and, a little more weakly, those of the remaining designs, which therefore differ from their direct estimates. Contrasting the colors and the size of squares on the off-diagonal elements of all 2×2 blocks on the diagonal implies that the smallest treatment effect deviation is observed between the effects in designs 1:2 and 2:3 as well as between the effects in 4:5 and 5:6, since the largest hat matrix elements coincide there with the least intense colors. Altogether, the influence of the perturbed treatment effect in design 1:6 is broader but overall less severe than that of the equally perturbed effect in scenario a).
In scenario c), we changed the effect in design 1:3 with δ=5 and found the highest network inconsistency statistic, Q^{inc}=11.67 (p=0.009), in comparison to both previous scenarios. The squared Pearson residual for design 1:3 provides the largest contribution to the Q^{inc} statistic. Smaller residuals are observed for the adjacent edges 1:2, 2:3, 1:6, and 3:4. A detachment of the effect in design 1:3 eliminates the inconsistency of the network. Relaxing other designs causes only a little change to the squared Pearson residuals and increases residuals for some designs. A hot spot of inconsistency can be seen between the effects in designs 1:3, 1:2, and 2:3. However, the effect in design 1:2 is supported by the effects in designs 1:6, 3:4, and 4:6, and vice versa, the latter ones are supported by the effect in design 1:2. The same holds for the effect in design 2:3 and the effects in the three designs. Altogether, edge 1:3 can be distinctly identified as a plausible source of inconsistency since this edge is nested in two loops. The squared Pearson residual for this design is higher than the residuals for the inconsistency-generating designs in the previous two scenarios, although in all scenarios an equally strong perturbation is introduced. This is because 1:3 is the least self-driving design. Since the effect of design 1:3 strongly drives the network estimates of the designs 1:2, 2:3, 1:6, and 3:4, they are also influenced by the perturbation.
In scenario d), we analyze a sparsely connected network that forms one loop. In such a network, with the observed inverse standard errors being the same for each direct estimate, each network estimate is composed of 83% of its own direct estimate and 17% of the indirect estimate, to which all other direct estimates contribute equally. So, in the net heat plot we see only large squares on the diagonal. A perturbation of the effect in design 1:2 results in a network inconsistency statistic of Q^{inc}=4.17 (p=0.041), which is the sum of equally sized squared Pearson residuals. A detachment of any design interrupts the loop and the flow of evidence so that the network estimates correspond, if existing, to their direct estimates and the inconsistency of the network is dissolved. In this scenario, we can recognize inconsistency but cannot locate its source since we have insufficient degrees of freedom. Nevertheless, several indirect estimates were affected by the perturbation of design 1:2.
In network scenario e), all fifteen possible pairwise comparisons are observed with the same precision. Because of this tight linkage, each network estimate is driven one-third by its corresponding direct estimate. The remaining two-thirds of indirect estimation is based on all eight adjacent edges in a balanced way. The disturbance of the network consistency by adding δ=5 to the treatment effect in design 1:2 does not produce as much inconsistency in the whole network as seen in the other scenarios (Q^{inc}=16.67 with p=0.082). Almost exclusively, the squared Pearson residual for design 1:2 is increased so that a detachment of design 1:2 eliminates the inconsistency. A detachment of one of the eight adjacent edges causes only a little change and even weakly increases the inconsistency in the whole network, which results each time in a p value of 0.075. In the case of non-adjacent edges, the corresponding p values are even 0.054. So in this scenario, the source of inconsistency is uniquely identifiable in the net heat plot, even more easily than in scenario c). It only weakly drives and affects the network estimates of its adjacent edges so that the perturbation of the effect in design 1:2 has only a little influence on the network.
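The Q^{inc} values quoted for the five scenarios can be cross-checked with a short calculation. When a single direct estimate is shifted by δ, all other effects are zero, and every standard error equals one, the sum of squared Pearson residuals under a least squares fit reduces to δ²(1−h), where h is the self-driving proportion (diagonal hat matrix element) of the perturbed design. The Python sketch below rests on that simplification. The exact fractions used for h are assumptions of this sketch that round to the percentages given in the text (63%, 70%, 53%, 83%, one-third), and the degrees of freedom are taken as the number of designs minus five.

```python
import math
from fractions import Fraction


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return total


DELTA = 5  # perturbation in units of the standard error

# scenario: (assumed self-driving proportion of the perturbed design, df)
SCENARIOS = {
    "a": (Fraction(19, 30), 3),   # design 1:2, eight designs
    "b": (Fraction(7, 10), 3),    # design 1:6, eight designs
    "c": (Fraction(8, 15), 3),    # design 1:3, eight designs
    "d": (Fraction(5, 6), 1),     # design 1:2, single loop of six designs
    "e": (Fraction(1, 3), 10),    # design 1:2, all fifteen designs
}


def q_inc(delta, leverage):
    """Q^inc for one perturbed design with unit standard errors."""
    return float(delta ** 2 * (1 - leverage))


RESULTS = {}
for name, (leverage, df) in SCENARIOS.items():
    q = q_inc(DELTA, leverage)
    RESULTS[name] = (q, chi2_sf(q, df))
    print(f"scenario {name}: Q_inc = {q:.2f}, df = {df}, p = {RESULTS[name][1]:.3f}")
```

The sketch makes explicit why the same perturbation yields different amounts of inconsistency: the more strongly a design drives its own network estimate, the smaller the residual it can show.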
The examples show that perturbation of a single design may have side effects on residuals, more or less spread out in the network. Our clustering proved successful in grouping together designs with interrelated residuals that were simultaneously affected by one perturbation. The resulting hot spots facilitate the identification of sources of inconsistency, which may or may not be uniquely identifiable. While related large residuals are obviously grouped together, it may also occur that large residuals emerging from two independent perturbations are also grouped in proximity. In this case we expect to find two diagonal blocks, each signaling the local side effects of one perturbation and each representing one hot spot of inconsistency.
Software
We implemented our methods in the open-source statistical environment R [36]. While multivariate metaanalysis for the aggregation step of studies with the same design can be carried out using standard statistical software [28,37], we provide a preliminary stand-alone R function for the net heat plot, available on the website http://www.unimedizinmainz.de/fileadmin/kliniken/imbei/Dokumente/Biometrie/Software/netheat.R. An R package is in preparation and will be available from the standard CRAN repository for the R environment.
Results
An example of a network metaanalysis in diabetes
We applied our methods to a network metaanalysis example by Senn et al. [17]. They examined the continuous outcome of blood glucose change, measured by the marker HbA1c, in patients with type 2 diabetes after adding one treatment out of ten different groups of glucose-lowering agents to a baseline sulfonylurea therapy. As effect measures, we consider mean differences.
The ten different treatment groups are abbreviated as follows by their first four letters: acar: Acarbose, benf: Benfluorex, metf: Metformin, migl: Miglitol, plac: Placebo, piog: Pioglitazone, rosi: Rosiglitazone, sita: Sitagliptin, SUal: Sulfonylurea alone, vild: Vildagliptin. This network metaanalysis involved 26 randomized controlled trials including one threearmed trial for plac:acar:metf and 15 different designs, of which ten are used in only one study. In the network, 15 out of 45 possible different pairwise contrasts are observed, of which eight involve a placebo (see Figure 4).
Figure 4. Network design in the diabetes example. The nodes are placed on the circumcircle and are labeled according to the treatments. The edges display the observed treatment comparisons. The thickness of the edges is proportional to the inverse standard error of the treatment effects, aggregated over all studies including the two respective treatments. The network includes 25 twoarmed studies on fourteen different designs and one threearmed study of design plac:acar:metf.
Across the entire network (analogous to the result of Senn et al. [17]) as well as exclusively within designs, we observed heterogeneity with p values <0.001 (see Table 3). Regarding the design-specific heterogeneity statistics, the component metaanalyses with designs plac:benf, plac:metf, plac:migl, and plac:rosi contribute the most to the heterogeneity within designs.
Table 3. Heterogeneity and inconsistency in the diabetes example
To have a closer look at the inconsistency of the network, we use the net heat plot in Figure 5. Studies with design plac:benf, plac:migl, plac:sita, or plac:vild are not included in this plot because they do not contribute to the inconsistency assessment. There are direct treatment effects that strongly drive other network estimates in a consistent manner. For example, the treatment effects in designs plac:acar and acar:SUal agree with the existing direct evidence of each other. Nevertheless, we observe a Q^{inc} statistic with a p value of 0.002, which is composed of the squared Pearson residuals for the designs metf:SUal, rosi:SUal, plac:piog, metf:piog, and plac:rosi. The first two have higher residuals in comparison to plac:piog, although their direct estimates more strongly drive their network estimates, with 56% and 41% in contrast to 36% in the case of design plac:piog. We can observe a hot spot of inconsistency between the effects in designs metf:SUal, rosi:SUal, plac:piog, and metf:piog, for which only one study is observed in each case. The effects in designs plac:piog and metf:piog as well as, in particular, the designs metf:SUal and rosi:SUal are especially inconsistent. Although the direct estimate in design plac:rosi is affected by large heterogeneity (p=0.001), it has a large evidence base of six studies and hence strongly drives its network estimate with 83% and other network estimates as well. Note that the contribution of single studies is easily disclosed by splitting the amount of 83% into a sum according to the inverse variances of the estimates of each study (83%=11%+18%+20%+22%+4%+8%). A detachment of the corresponding design reduces the residuals of designs metf:SUal, rosi:SUal, and plac:piog but inflates the residuals of designs metf:rosi and piog:rosi.
Overall, a detachment of the effects for each of the five inconsistent component metaanalyses mentioned increases the squared Pearson residuals for some other designs in the network and results in blue entries in the plot.
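The split of a design's contribution among its studies, as described above for plac:rosi, can be sketched in a few lines of Python. The standard errors below are hypothetical placeholders and not the values of the six plac:rosi trials, and the function name is illustrative. The sketch only assumes that the studies of one design are pooled with inverse-variance weights, as in a fixed-effects model.

```python
def split_contribution(design_share, standard_errors):
    """Split a design's contribution to a network estimate among its studies.

    Each study receives a part proportional to its inverse variance.
    """
    inverse_variances = [1.0 / se ** 2 for se in standard_errors]
    total = sum(inverse_variances)
    return [design_share * w / total for w in inverse_variances]


# Hypothetical standard errors of six studies sharing one design.
hypothetical_ses = [0.30, 0.24, 0.22, 0.21, 0.50, 0.35]

# 0.83 is the contribution of the pooled direct estimate to its network estimate.
shares = split_contribution(0.83, hypothetical_ses)
for i, share in enumerate(shares, start=1):
    print(f"study {i}: {share:.1%}")
print(f"sum: {sum(shares):.0%}")
```

By construction the study shares add up to the contribution of the design, and the most precise study receives the largest share.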
Figure 5. Net heat plot in the diabetes example. The area of the gray squares displays the contribution of the direct estimate in design d (shown in the column) to the network estimate in design d’ (shown in the row). The colors are associated with the change in inconsistency between direct and indirect evidence in design d’ (shown in the row) after detaching the effect of design d (shown in the column). Blue colors indicate an increase and warm colors indicate a decrease (the stronger the intensity of the color, the stronger the change). The two contrasts of the threearmed study with design plac:acar:metf are marked with ^{∗}.
The strongest reduction in the whole network inconsistency is achieved with a detachment of the effect in design rosi:SUal. In this case, the net heat plot in Figure 6 results. The inconsistency between the effects in designs plac:piog and metf:piog remains, but in an attenuated form. Now, the effect of design metf:SUal is inconsistent with the effect of the designs plac:acar and acar:SUal, which were supported by the effect in design rosi:SUal in the previous version of the network. However, with a p value of 0.342 for the Q^{inc} statistic, there is no longer strong evidence for inconsistency. The hot spot of inconsistency detected included designs with only one study. Indeed, one or a few biased studies may either cause heterogeneity when paralleled by other studies of the same design (which is observed within the plac:rosi studies) or may cause inconsistency when solely representing a design.
Figure 6. Net heat plot in the diabetes example after exclusion of the study with design rosi:SUal. The area of the gray squares displays the contribution of the direct estimate in design d (shown in the column) to the network estimate in design d’ (shown in the row). The colors are associated with the change in inconsistency between direct and indirect evidence in design d’ (shown in the row) after detaching the effect of design d (shown in the column). Blue colors indicate an increase and warm colors indicate a decrease (the stronger the intensity of the color, the stronger the change). The two contrasts of the threearmed study with design plac:acar:metf are marked with ^{∗}.
An example of a network metaanalysis in antidepressants
Cipriani et al. [12] performed a network metaanalysis to compare the efficacy of twelve new-generation antidepressants as monotherapy for the acute-phase treatment of major depression. The twelve antidepressants are abbreviated as follows: bupr: Bupropion, cita: Citalopram, dulo: Duloxetine, esci: Escitalopram, fluo: Fluoxetine, fluv: Fluvoxamine, miln: Milnacipran, mirt: Mirtazapine, paro: Paroxetine, rebo: Reboxetine, sert: Sertraline, venl: Venlafaxine. The efficacy was defined as a reduction of at least 50% from the baseline depression rating score after 8 weeks. For the network metaanalysis, they included 111 randomized controlled trials, including two three-armed trials of design fluo:paro:sert. In these studies, 42 of 66 possible pairwise contrasts between the 12 treatments are observed (see Figure 7) in D=43 different designs, of which 16 are observed in only one study.
Figure 7. Network design in the antidepressants example. The nodes are placed on the circumcircle and are labeled according to the treatments. The edges display the observed treatment comparisons. The thickness of the lines is proportional to the inverse standard error of the treatment effect, aggregated over all studies including these two respective treatments. The network includes 109 twoarmed studies with 42 different designs and two threearmed studies, both with design fluo:paro:sert.
Analogous to [12], we used log odds ratios as effect measures, but for combining study estimates we used the fixedeffects model (5) instead of a randomeffects model within the Bayesian framework. The treatment effects and respective standard errors of our model are very similar to the results of Cipriani et al. [12], and the standard errors are not systematically smaller as could be expected, because we observed only little heterogeneity in the whole network (p=0.113) as well as within designs (p=0.125) and no significant inconsistency (p=0.293). This results from the calculated Q statistics corresponding to Section “Decomposition of Cochran’s Q” (see Table 4). Regarding the heterogeneity within the designs, only the two studies with design paro:sert are conspicuous, with a p value of 0.006.
Table 4. Heterogeneity and inconsistency in the antidepressants example
The net heat plot presented in Figure 8 provides a detailed assessment of the slight inconsistency in this quite tightly connected network. As seen from the color on the diagonal of the plot, the squared Pearson residuals for designs cita:esci, cita:paro, fluo:bupr, and mirt:venl contribute the most to Q^{inc}. There is a small hot spot of inconsistency between the effects in designs cita:esci and cita:paro as well as between the effects in fluo:bupr and bupr:sert. The largest squared Pearson residual is observed for design cita:esci, although the direct estimate in this design drives the corresponding network estimate comparatively strongly with 51% (maximum self-driving is observed in design dulo:esci with 61%). In contrast to the other four designs mentioned, the direct estimate of cita:esci also strongly drives network estimates for some other designs in the network, which can be seen from the square sizes in the corresponding column. A detachment of the effect in design cita:esci results in the strongest reduction of the inconsistency in the whole network (resulting in a p value of 0.591). While the direct evidence contributes more than 50% of the network estimate of this contrast, the direct estimate is larger than the network estimate (log odds ratio 0.39 vs. 0.17), and publication bias may be affecting the former one. The squared Pearson residuals for the designs cita:paro, cita:mirt, esci:paro, and esci:sert are particularly reduced. In contrast, the direct treatment effects of designs fluo:venl and fluo:paro have the smallest standard errors and drive the network estimates of many other designs (see large squares in the corresponding columns in Figure 8); however, a detachment of one of these designs causes only small changes in the squared Pearson residuals.
Figure 8. Net heat plot in the antidepressants example. The area of the gray squares displays the contribution of the direct estimate in design d (shown in the column) to the network estimate in design d’ (shown in the row). The colors are associated with the change in inconsistency between direct and indirect evidence in design d’ (shown in the row) after detaching the effect of design d (shown in the column). Blue colors indicate an increase and warm colors indicate a decrease (the stronger the intensity of the color, the stronger the change). The two contrasts of the two threearmed trials with design fluo:paro:sert are marked with ^{∗}.
Discussion
To ensure the validity and robustness of the conclusion from a network metaanalysis, it is important to assess the consistency of the network and the contribution of each component metaanalysis to the estimates. Our intention was to develop a sensitivity analysis tool that allows the identification of which component metaanalyses drive which network estimates and to locate the drivers that may have generated a hot spot of inconsistency. The net heat plot serves both purposes simultaneously: the first one by graphically showing elements of the hat matrix and the latter one by colored block structures in the plot. We have shown that the net heat plot allows the identification of a single deviating design that induces inconsistency in artificial examples. In the case of stronger network connectivity, increased location specificity might be possible. In networks that only include one loop, it is not possible to trace inconsistency back to a single design, but designs that are part of several loops may be identifiable as a unique source for a hot spot of inconsistency. We also demonstrated the applicability of the plot in two published network metaanalyses.
It is well known in regression diagnostics (see for example [29]) that the influence of an observation (on parameter estimates and prediction) is driven by both the respective residual and the diagonal element of the hat matrix. Analogous to classical metaanalyses, outlier effect estimates of single studies or a few highly weighted studies play an important role, which can be inspected in forest plots. Influence measures are usually displayed as index plots with observation numbers on the horizontal axis; this has been successfully exploited for simple metaanalysis [21,22]. We felt that this is insufficient in network metaanalyses and thus proposed the net heat plot as an additional tool. We display all elements of the hat matrix in the net heat plot and point out that the rows of the hat matrix are the linear coefficients for a specific network estimate. As such they represent the natural generalization of simple metaanalysis weights. They quantify the contribution of a component metaanalysis to the network estimate of a given contrast and may therefore be of interest, even in a consistent network metaanalysis. Simultaneously, the changes in the squared Pearson residuals after allowing for a deviating effect of one single component metaanalysis are visualized in the net heat plot to detect outlying direct estimates. In passing, we have shown that Cochran’s chi-squared statistic, the sum of squared Pearson residuals, can be generally used in network metaanalyses in a fixed-effects model framework to assess the heterogeneity of the whole network and can be decomposed to separate out the inconsistency of the network. In particular, we have shown how multi-armed studies can be incorporated both into the inconsistency chi-squared statistic via a quadratic form of Pearson residuals and into the net heat plot.
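The decomposition of Cochran’s Q mentioned above can be illustrated numerically for the simplest case. The Python sketch below uses a triangle network A, B, C with hypothetical two-armed studies (all effect estimates and standard errors are invented for illustration) and a fixed-effects consistency model with treatment A as reference. It does not cover multi-armed studies, which require the quadratic form of Pearson residuals. The names are illustrative and are not taken from the authors' software.

```python
# Hypothetical two-armed studies: (design, effect estimate, standard error).
STUDIES = [
    ("A:B", 0.50, 0.20), ("A:B", 0.30, 0.25),
    ("A:C", 0.90, 0.30), ("A:C", 1.20, 0.20),
    ("B:C", 0.10, 0.25), ("B:C", 0.35, 0.30),
]
# Design matrix rows for the basic parameters (B vs A, C vs A).
X_ROW = {"A:B": (1.0, 0.0), "A:C": (0.0, 1.0), "B:C": (-1.0, 1.0)}


def wls_fit(rows):
    """Weighted least squares for two parameters; rows are (x, y, weight)."""
    s11 = sum(w * x[0] * x[0] for x, y, w in rows)
    s12 = sum(w * x[0] * x[1] for x, y, w in rows)
    s22 = sum(w * x[1] * x[1] for x, y, w in rows)
    t1 = sum(w * x[0] * y for x, y, w in rows)
    t2 = sum(w * x[1] * y for x, y, w in rows)
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


def q_stat(rows, theta):
    """Sum of squared Pearson residuals under the consistency model."""
    return sum(w * (y - x[0] * theta[0] - x[1] * theta[1]) ** 2
               for x, y, w in rows)


# Whole network: consistency model fitted to all single studies.
study_rows = [(X_ROW[d], y, 1.0 / se ** 2) for d, y, se in STUDIES]
q_total = q_stat(study_rows, wls_fit(study_rows))

# Within designs: inverse-variance pooling and heterogeneity per design.
q_het, design_rows = 0.0, []
for design in X_ROW:
    pairs = [(y, 1.0 / se ** 2) for d, y, se in STUDIES if d == design]
    weight = sum(w for _, w in pairs)
    pooled = sum(w * y for y, w in pairs) / weight
    q_het += sum(w * (y - pooled) ** 2 for y, w in pairs)
    design_rows.append((X_ROW[design], pooled, weight))

# Between designs: consistency model fitted to the pooled direct estimates.
q_inc = q_stat(design_rows, wls_fit(design_rows))

print(f"Q total = {q_total:.3f}")
print(f"Q het   = {q_het:.3f}")
print(f"Q inc   = {q_inc:.3f}")
```

In this setting the whole-network statistic equals the sum of the within-design and between-design parts, because the weighted residuals around each pooled direct estimate sum to zero.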
Overall, inconsistency testing has also been discussed in large complex networks by comparing a consistency model with an unrestricted inconsistency model [10], as we have done in turn for each single component metaanalysis. However, the authors essentially only consider inconsistency between twoarmed component metaanalyses because they do not analyze independent effects for multiarmed studies. We included multiarmed studies and thereby opened a way for dividing overall heterogeneity exhaustively into heterogeneity within designs and inconsistency. Within a Bayesian framework, the authors discuss models with and without a random component for heterogeneity within component metaanalyses. We advocate the fixedeffects model, not only for the sake of simplicity but, more importantly in the diagnostic framework, because it potentially provides a clearer picture and allows for better recognition and location of inconsistency. In contrast to testing loops for inconsistency [6], which leads to redundant testing of many dependent hypotheses or is confined to simple networks composed of independent loops (as argued in [10]), our approach is applicable in large and complex networks. The approaches that capture inconsistency by a single extra variance component in a mixed effects model [8,9] only aim at quantifying inconsistency and at providing conservative confidence intervals. The assumptions are difficult to justify or falsify and, more importantly, the approach contains no straightforward way to locate inconsistency.
The recently published designbytreatment interaction model by [15,16] is most similar in spirit to our approach. In contrast to White et al. [16] and Higgins et al. [15], we do not include random effects for heterogeneity within designs. The advantage is that one or a few deviating or biased studies are treated equally, whether they are paralleled by many other studies of the same design or are the sole representative of their design. In a randomeffects model, the former studies would add to the heterogeneity variance whereas the latter studies would inform fixed designbytreatment interaction parameters. In a fixedeffects model, inconsistency is indicated by the Q statistic more sensitively than in the randomeffects model of [16]. If heterogeneity or inconsistency is detected and not explained by single outliers, the model should be extended with study level covariates, along the lines explored by [6,38]. Ideally we should end up with a homogeneous model, thereby explaining rather than modeling heterogeneity and inconsistency.
However, the fact that failure to detect heterogeneity does not constitute proof of homogeneity must be taken into account in the assessment of inconsistencies in network metaanalyses. This already holds for a simple metaanalysis and is even more relevant for network metaanalyses. In a network without loops, inconsistency cannot be detected at all. In this context, we point to the importance of the hat matrix. It allows for the assessment of the contribution of each component metaanalysis to a network estimate and directs attention to the crucial components. We have illustrated that often only a few components are important.
Often, when inconsistency is observed, some component metaanalyses are heterogeneous, too. We point out that the inconsistency assessment is still valid in this context. However, then the direct effect estimates are no longer estimates of a single parameter, but are rather weighted averages of estimates of different parameters: the study-specific treatment effects. Nevertheless, inconsistency assessment and the investigation of heterogeneity within component metaanalyses may interfere in this case, and it may be necessary to exclude single studies and repeat the net heat plot in order to find satisfactory explanations of overall heterogeneity. In fact, inspection of both coefficients (entries of the hat matrix) and of residuals was proposed by Senn et al. [17] at the study level, and this may be more appropriate if heterogeneity within designs is large. However, when applied at the study level, the net heat plot also has the additional advantage of pointing to influential studies, i.e. studies with large weight and large residuals.
Heterogeneity and inconsistency can be broadly viewed as different aspects of heterogeneity, the latter being understood as any discrepancy between results of single studies and predictions based on a consistency model for a network. This fact is not only reflected in the decomposition of the Q statistic, but also underlines that our tools can be applied either at an aggregate level or at a study level. We presented the aggregate level approach here for its parsimony. The study level approach may be more appropriate, particularly if component metaanalyses are strongly heterogeneous. In fact, a visual display of the hat matrix at study level has been proposed and discussed in [17]. Another potentially viable approach would be to complement our tools at an aggregate level with ordinary forest plots for component metaanalyses.
Some caution is due when interpreting a net heat plot. Different from usual regression diagnostics, a single component metaanalysis may stand for a large body of evidence in network metaanalyses. If a component metaanalysis is recognized as deviating from the rest or is identified as a major source of heterogeneity, it may or may not provide the more reliable part of the whole body of evidence. Song et al. [39] argued that sometimes the indirect part of evidence may be more reliable than the direct part. That is why tracking heterogeneity should only be the starting point for focusing on subject matter details of component metaanalyses and, hopefully, single studies for finding subject matter reasons for the observed heterogeneity, as argued by [40] for classical metaanalyses. In fact, this process of investigation was demonstrated in one example without using a formal tool to sort out inconsistency; this was a simple inspection of squared Pearson residuals [17] and has been elaborated upon in worked examples (e.g. in [38,41]). In large and complex networks, we feel that the two-step approach, separately investigating inconsistency and heterogeneity within designs, is necessary in order to limit efforts. Furthermore, it can specifically answer whether a set of studies sharing the same design is influential.
More than in classical regression diagnostics, there are model diagnostic challenges in network metaanalyses. The first, masking, is a phenomenon already known but may be more pronounced here because we have inherently small numbers of observations: the component metaanalyses. Masking may occur if more than one observation deviates from the true model. In this case, parameter estimates are affected by outliers even after holding out one observation, and outliers may be obscured, i.e. masked [29]. To tackle this, we combined the technique of withholding one observation with a graphical display. While this is clearly adequate if only one outlier exists, it may also facilitate the detection of more outliers. For a more rigorous approach, methods of holding out several observations will have to be explored. The second problem, uniqueness, is particularly virulent in network metaanalyses: several component metaanalyses could be the explanation for all observed inconsistency. We discussed the extreme case of a circular network where inconsistency is completely unidentifiable. The ability to track down inconsistency to only one or at least a few component metaanalyses depends, as we illustrated, on the connectedness of the network. Recognizing a lack of network connectivity can be useful for planning further studies, but the challenges for future research are twofold: to find rules for the identifiability of deviating components and to find tools for economically displaying the ambiguity if it exists.
Searching for influential component meta-analyses or influential studies is not the only way of responding to inconsistency and heterogeneity. As mentioned in [16] and worked out in [38], the consistency model can be extended to allow for study-level covariates (and treatment-by-covariate interactions), and this model extension can explain inconsistency and heterogeneity. Both approaches are complementary. Of note, the net heat plot could again be applied to an extended consistency model.
One core component of our approach is to allow component meta-analyses to have deviating treatment effects. This idea of extending the model by relaxing parameter constraints carries over easily to generalized linear models for binary outcomes as well as to random-effects models. The approach is not confined to withholding the effects of one design, but is naturally applicable to allowing an arbitrary number of designs to have specific deviating effects, e.g. all designs containing a specific treatment. In all types of generalization, the challenge remains to perform these model relaxations in a systematic way and to provide tools to transparently display the multitude of results, for which the net heat plot presented here can be a useful starting point.
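As a rough sketch of what relaxing the consistency assumption for one design can look like in a linear fixed-effects setting, the example below adds one extra parameter per relaxed design and compares the resulting Q statistics. Everything in it is an assumption made for illustration: a hypothetical network of four treatments with one two-arm design per pair, made-up effects and variances, a perturbation added to design AB, and the helper name `q_stat`. It is not the implementation behind the net heat plot.

```python
import numpy as np

def q_stat(y, var, X):
    """Weighted least-squares fit; returns the Q statistic of the residuals."""
    w = 1.0 / np.asarray(var, dtype=float)
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    resid = y - X @ beta
    return float(np.sum(w * resid ** 2))

designs = ["AB", "AC", "AD", "BC", "BD", "CD"]
# Basic parameters: d_AB, d_AC, d_AD (effects relative to treatment A).
X = np.array([[ 1.0,  0.0, 0.0],   # AB
              [ 0.0,  1.0, 0.0],   # AC
              [ 0.0,  0.0, 1.0],   # AD
              [-1.0,  1.0, 0.0],   # BC
              [-1.0,  0.0, 1.0],   # BD
              [ 0.0, -1.0, 1.0]])  # CD
d_true = np.array([0.2, 0.5, 0.9])  # made-up consistent effects
y = X @ d_true
y[0] += 1.0                         # perturb design AB only
var = np.full(len(designs), 0.05)   # made-up variances

q_full = q_stat(y, var, X)

# Relax consistency for one design at a time: an extra column lets that
# design have its own deviating effect, equivalent to withholding it.
q_relaxed = []
for j in range(len(designs)):
    dummy = np.zeros((len(designs), 1))
    dummy[j, 0] = 1.0
    q_relaxed.append(q_stat(y, var, np.hstack([X, dummy])))

for name, q in zip(designs, q_relaxed):
    print(f"design {name} relaxed: Q = {q:.3f} (reduction {q_full - q:.3f})")
```

In this constructed example, relaxing the perturbed design AB removes the inconsistency entirely, whereas relaxing any other design leaves part of it in place. Allowing several designs to deviate at once, e.g. all designs containing one treatment, would amount to adding several such columns.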
Conclusions
We have illustrated the importance of assessing consistency in network meta-analyses, where, for example, one deviating component meta-analysis may induce a hot spot of inconsistency. As a tool for this task, we have developed the net heat plot, which displays drivers of the network estimates, plausible sources of inconsistency, and possibly disturbed network estimates, and we have illustrated its usefulness in several artificial and real data examples.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
UK, HB and JK developed the method. UK produced the results and wrote the first draft of the manuscript. HB and JK contributed to the writing. All authors read and approved the final manuscript.
Acknowledgements
This work contains part of the PhD thesis of UK. A grant from the Mainzer Forschungsförderungsprogramm (MAIFOR) supported UK.
We thank Katherine Taylor for proofreading and Nadine Binder for pointing us to the halo visualization.
We thank all reviewers for their numerous comments and suggestions that greatly helped to improve the paper.
References

1. Wells GA, Sultan SA, Chen L, Khan M, Coyle D (Eds): Indirect Evidence: Indirect Treatment Comparisons in Meta-Analysis. Ottawa: Canadian Agency for Drugs and Technologies in Health; 2009.
2. Hoaglin DC, Hawkins N, Jansen JP, Scott DA, Itzler R, Cappelleri JC, Boersma C, Thompson D, Larholt KM, Diaz M, Barrett A: Conducting indirect-treatment-comparison and network-meta-analysis studies: report of the ISPOR task force on indirect treatment comparisons good research practices: part 2. Value Health 2011, 14(4):429-437. [http://dx.doi.org/10.1016/j.jval.2011.01.011]
3. Dias S, Welton NJ, Sutton AJ, Ades AE (Eds): A Generalised Linear Modelling Framework for Pairwise and Network Meta-Analysis of Randomised Controlled Trials. NICE DSU: Technical Support Document 2; 2011. [http://www.nicedsu.org.uk]
4. Salanti G: Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Syn Meth 2012, 3(2):80-97. [http://doi.wiley.com/10.1002/jrsm.1037]
5. Baker SG, Kramer BS: The transitive fallacy for randomized trials: if A bests B and B bests C in separate trials, is A better than C? BMC Med Res Methodol 2002, 2:13.
6. Salanti G, Marinho V, Higgins JPT: A case study of multiple-treatments meta-analysis demonstrates that covariates should be considered. J Clin Epidemiol 2009, 62(8):857-864. [http://dx.doi.org/10.1016/j.jclinepi.2008.10.001]
7. Jorgensen AW, Maric KL, Tendal B, Faurschou A, Gotzsche PC: Industry-supported meta-analyses compared with meta-analyses with non-profit or no support: differences in methodological quality and conclusions. BMC Med Res Methodol 2008, 8:60. [http://dx.doi.org/10.1186/1471-2288-8-60]
8. Lumley T: Network meta-analysis for indirect treatment comparisons. Stat Med 2002, 21(16):2313-2324. [http://dx.doi.org/10.1002/sim.1201]
9. Lu G, Ades AE: Assessing evidence inconsistency in mixed treatment comparisons. J Am Stat Assoc 2006, 101(474):447-459.
10. Dias S, Welton NJ, Sutton AJ, Caldwell DM, Guobing L, Ades AE (Eds): Inconsistency in Networks of Evidence Based on Randomised Controlled Trials. NICE DSU: Technical Support Document 4; 2011. [http://www.nicedsu.org.uk]
11. Bucher HC, Guyatt GH, Griffith LE, Walter SD: The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J Clin Epidemiol 1997, 50(6):683-691.
12. Cipriani A, Furukawa TA, Salanti G, Geddes JR, Higgins JP, Churchill R, Watanabe N, Nakagawa A, Omori IM, McGuire H, Tansella M, Barbui C: Comparative efficacy and acceptability of 12 new-generation antidepressants: a multiple-treatments meta-analysis. Lancet 2009, 373(9665):746-758. [http://dx.doi.org/10.1016/S0140-6736(09)60046-5]
13. Salanti G, Higgins JPT, Ades AE, Ioannidis JPA: Evaluation of networks of randomized trials. Stat Methods Med Res 2008, 17(3):279-301. [http://dx.doi.org/10.1177/0962280207080643]
14. Dias S, Welton NJ, Caldwell DM, Ades AE: Checking consistency in mixed treatment comparison meta-analysis. Stat Med 2010, 29(7–8):932-944. [http://dx.doi.org/10.1002/sim.3767]
15. Higgins JPT, Jackson D, Barrett JK, Lu G, Ades AE, White IR: Consistency and inconsistency in network meta-analysis: concepts and models for multi-arm studies. Res Syn Meth 2012, 3(2):98-110. [http://doi.wiley.com/10.1002/jrsm.1044]
16. White IR, Barrett JK, Jackson D, Higgins JPT: Consistency and inconsistency in network meta-analysis: model estimation using multivariate meta-regression. Res Syn Meth 2012, 3(2):111-125. [http://doi.wiley.com/10.1002/jrsm.1045]
17. Senn S, Gavini F, Magrez D, Scheen A: Issues in performing a network meta-analysis. Stat Methods Med Res 2012. (Epub ahead of print). [http://dx.doi.org/10.1177/0962280211432220]
18. Rücker G: Network meta-analysis, electrical networks and graph theory. Res Syn Meth 2012, 3(4):312-324. [http://doi.wiley.com/10.1002/jrsm.1058]
19. Caldwell DM, Welton NJ, Ades AE: Mixed treatment comparison analysis provides internally coherent treatment effect estimates based on overviews of reviews and can reveal inconsistency. J Clin Epidemiol 2010, 63(8):875-882. [http://dx.doi.org/10.1016/j.jclinepi.2009.08.025]
20. Chatterjee S, Hadi AS: Influential Observations, High Leverage Points, and Outliers in Linear Regression. Statist Sci 1986, 1(3):379-393.
21. Viechtbauer W, Cheung WL: Outlier and influence diagnostics for meta-analysis. Res Syn Meth 2010, 1(2):112-125.
22. Gumedze FN, Jackson D: A random effects variance shift model for detecting and accommodating outliers in meta-analysis. BMC Med Res Methodol 2011, 11:19. [http://dx.doi.org/10.1186/1471-2288-11-19]
23. Lu G, Welton NJ, Higgins JPT, White IR, Ades A: Linear inference for mixed treatment comparison meta-analysis: A two-stage approach. Res Syn Meth 2011, 2:43-60.
24. Chung H, Lumley T: Graphical exploration of network meta-analysis data: the use of multidimensional scaling. Clin Trials 2008, 5(4):301-307. [http://dx.doi.org/10.1177/1740774508093614]
25. Galbraith RF: A note on graphical presentation of estimated odds ratios from several clinical trials. Stat Med 1988, 7:889-894.
26. Aitken AC: On least squares and linear combination of observations.
27. Gleser LJ, Olkin I: Stochastically dependent effect sizes. In The Handbook of Research Synthesis and Meta-Analysis. Edited by Cooper H, Hedges LV, Valentine JC. New York: Russell Sage Foundation; 2009:357-376.
28. Jackson D, Riley R, White IR: Multivariate meta-analysis: Potential and promise. Stat Med 2011, 30(20):2481-2498. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470931/]
29. Belsley DA, Kuh E, Welsch RE: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (Wiley Series in Probability and Statistics). New Jersey: John Wiley & Sons; 2004. ©1980
30. Cochran W: The combination of estimates from different experiments. Biometrics 1954, 10:101-129.
31. Raudenbush SW, Becker BJ, Kalaian H: Modeling multivariate effect sizes.
32. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR: Introduction to Meta-Analysis. Chichester: John Wiley & Sons; 2009.
33. Song F, Clark A, Bachmann MO, Maas J: Simulation evaluation of statistical properties of methods for indirect and mixed treatment comparisons. BMC Med Res Meth 2012, 12:138. [http://www.ncbi.nlm.nih.gov/pubmed/22970794]
34. Gordon AD: Classification. London: Chapman and Hall/CRC; 1999.
35. Oelke D, Janetzko H, Simon S, Neuhaus K, Keim D: Visual boosting in pixel-based visualizations.
36. R Core Team: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2012. [http://www.R-project.org/]. [ISBN 3-900051-07-0]
37. Gasparrini A, Armstrong B, Kenward MG: Multivariate meta-analysis for non-linear and other multi-parameter associations. Stat Med 2012, 31(29):3821-3839.
38. Cooper NJ, Sutton AJ, Morris D, Ades AE, Welton NJ: Addressing between-study heterogeneity and inconsistency in mixed treatment comparisons: Application to stroke prevention treatments in individuals with non-rheumatic atrial fibrillation. Stat Med 2009, 28(14):1861-1881.
39. Song F, Harvey I, Lilford R: Adjusted indirect comparison may be less biased than direct comparison for evaluating new pharmaceutical interventions. J Clin Epidemiol 2008, 61(5):455-463. [http://www.ncbi.nlm.nih.gov/pubmed/18394538]
40. Thompson SG: Why sources of heterogeneity in meta-analysis should be investigated. BMJ 1994, 309(6965):1351-1355.
41. Salanti G, Dias S, Welton NJ, Ades AE, Golfinopoulos V, Kyrgiou M, Mauri D, Ioannidis JPA: Evaluating novel agent effects in multiple-treatments meta-regression. Stat Med 2010, 29(23):2369-2383. [http://dx.doi.org/10.1002/sim.4001]