Abstract
Background
Consider a meta-analysis in which a 'head-to-head' comparison of diagnostic tests for a disease of interest is intended. Assume there are two or more tests available for the disease, where each test has been studied in one or more papers. Some of the papers may have studied more than one test; hence the results are not independent. Also, the collection of tests studied may change from one paper to the other; hence the matched groups are incomplete.
Methods
We propose a model, the proportional odds ratio (POR) model, which makes no assumptions about the shape of OR_{p}, a baseline function capturing the way the OR changes across papers. The POR model does not assume homogeneity of ORs, but merely specifies a relationship between the ORs of the two tests.
One may expand the domain of the POR model to cover dependent studies, multiple outcomes, multiple thresholds, multicategory or continuous tests, and individual-level data.
Results
In the paper we demonstrate how to formulate the model for a few real examples, and how to use widely available or popular statistical software (such as SAS, R or S-Plus, and Stata) to fit the models and estimate the discrimination accuracy of tests. Furthermore, we provide code for converting ORs into other measures of test performance, such as predictive values, post-test probabilities, and likelihood ratios, under mild conditions. We also provide code to convert numerical results into graphical ones, such as forest plots, heterogeneous ROC curves, and post-test probability difference graphs.
Conclusions
The flexibility of the POR model, coupled with the ease with which it can be estimated in familiar software, suits the daily practice of meta-analysis and improves clinical decision-making.
Background
A diagnostic test, in its simplest form, tries to detect the presence of a particular condition (disease) in a sample. Usually there are several studies in which the performance of the diagnostic test is measured by some statistic. One may want to combine such studies to get a good picture of the performance of the test: a meta-analysis. Also, for a particular disease several diagnostic tests may have been invented, where each test is the subject of one or more studies. One may also want to combine all such studies to see how the competing tests perform with respect to each other, and choose the best for clinical practice.
To pool several studies and estimate a summary statistic, some assumptions are made. One such assumption is that the differences seen between individual study results are due to chance (sampling variation). Equivalently, this means all study results reflect the same "true" effect [1]. However, meta-analyses of studies of some diagnostic tests show that this assumption, in some cases, is not empirically supported. In other words, there is more variation between the studies than could be explained by random chance alone, the so-called "conflicting reports". One solution is to relax the assumption that every study is pointing to the same value. In other words, one accepts explicitly that different studies may correctly give "different" values for the performance of the same test.
For example, sensitivity and specificity are a pair of statistics that together measure the performance of a diagnostic test. One may want to compute an average sensitivity and an average specificity for the test across the studies, hence pooling the studies together. Instead, one may choose to extract the odds ratio (OR) from each paper (as the test performance measure), and then estimate the average OR across the studies. The advantage is that widely different sensitivities (and specificities) can point to the same OR. This means one relaxes the assumption that all the studies are pointing to the same sensitivity and specificity, and accepts that different studies are reporting "truly different" sensitivities and specificities, and that their between-study variation is not due to random noise alone, but to differences in the choice of decision threshold (the cutoff value used to dichotomize the results). Therefore the major advantage of the OR, and its corresponding receiver-operating-characteristic (ROC) curve, is that it provides a measure of diagnostic accuracy unconfounded by decision criteria [2]. An additional problem when pooling sensitivities and specificities separately is that it usually underestimates the test performance [[3], p.670].
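As a small numerical illustration of why the OR is robust to threshold choice, the sketch below (plain Python; the sensitivity/specificity pairs are hypothetical) shows two quite different operating points that imply the same diagnostic OR:

```python
def diagnostic_or(sens, spec):
    """Odds ratio implied by a (sensitivity, specificity) pair:
    odds of a positive result in diseased vs non-diseased subjects."""
    tpr_odds = sens / (1 - sens)   # odds of a positive result among diseased
    fpr = 1 - spec
    fpr_odds = fpr / (1 - fpr)     # odds of a positive result among non-diseased
    return tpr_odds / fpr_odds

# Two studies using very different thresholds (hence different
# sensitivity/specificity) can still lie on the same OR-characterized curve:
print(round(diagnostic_or(0.75, 0.75), 6))  # 9.0
print(round(diagnostic_or(0.90, 0.50), 6))  # 9.0
```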
The above process may be used once more to relax the assumption that every study is pointing to the same OR, thus relaxing the "OR-homogeneity" assumption. In other words, in some cases the remaining variation between studies, after adopting the OR as the summary performance measure, is still too great to be attributed to random noise. This suggests the OR may vary from study to study. Therefore one explicitly assumes different studies are measuring different ORs, and that they are not pointing to the same OR. This difference in test performance across studies may be due to differences in study design, patient population, case difficulty, type of equipment, abilities of raters, and the dependence of the OR on the threshold chosen [4]. Nelson [5] explains generating ROC curves that allow for the possibility of "inconstant discrimination accuracy": a heterogeneous ROC curve (HetROC). This means the ROC curve represents different ORs at different points. This contrasts with the homogeneous ROC curve, which is completely characterized by a single OR.
There are a few implementations of the heterogeneous ROC, which may be classified into two groups. The first group is exemplified by Tosteson and Begg [6], who show how to use ordinal regression with two equations corresponding to location and scale. The latent-scale binary logistic regression of Rutter and Gatsonis [4] belongs to this group. The second group contains the implementations of Kardaun and Kardaun [7] and Moses et al. [8]. Moses et al. explain a method to plot such a heterogeneous ROC curve under some parametric assumptions, which they call the summary ROC (SROC).
When comparing two (or more) diagnostic tests, where each study reports results on more than one test, the performance statistics (in the study results) are correlated, and the standard errors computed by SROC are invalid. Toledano and Gatsonis [9] use the ordinal regression model and account for the dependency of measurements by generalized estimating equations (GEE). However, to fit the model they suggest using FORTRAN code.
We propose a regression model that accommodates more general heterogeneous ROC curves than SROC. The model accommodates complex missing patterns and accounts for correlated results [10]. Furthermore, we show how to implement the model using widely available statistical software packages. The model relaxes the OR-homogeneity assumption. In the model, when comparing two (or more) tests, each test has its own trend of ORs across studies, while the trends of the two tests are (assumed to be) proportional to each other: the "proportional odds ratio" assumption. We alleviate the dilemma of choosing weighting schemes that do not bias the estimates [[11], p.123] by fitting the POR model to 2-by-2 tables. The model assumes a binomial distribution, which is more realistic than the Gaussian used by some implementations of HetROC. Also, it is fairly easy to fit the model to (original) patient-level data (if available).
Besides accounting better for between-study variation, we show how to use the POR model to "explain why" such variation exists. This potentially gives valuable insights and may have direct clinical applications. It may help define when, where, how, and on what patient population to use which test, to optimize performance.
We show how to use the "deviation" contrast, in the parameterization of categorical variables, to relax the restriction that a summary measure may be reported only if the respective interaction terms in the model are insignificant. This is similar to using the grand mean in a "factor effects" ANOVA model (as compared to a "cell means" ANOVA model).
We show how to use nonparametric smoothers, instead of parametric functions of the true positive rate (TPR) and/or false positive rate (FPR), to generate a heterogeneous ROC for a single diagnostic test across several studies.
Our proposed POR model assumes the shape of the heterogeneous ROC curve is the same from one test to the other, but that the curves differ in their locations in ROC space. This assumption facilitates the comparison of the tests. However, one may want to relax the POR assumption, so that each test is allowed a heterogeneous ROC curve with a different shape. One may implement such a generalized comparison of the competing diagnostic tests with a mixed-effects model. This may improve the generalizability of meta-analysis results to all (unobserved) studies. A mixed-effects model may also better absorb the remaining between-study variation.
Methods
Average difference in performances
To compare two diagnostic tests i and j, we want to estimate the difference in their performance. However, in reality such a difference may vary from one paper (study) to the other. Therefore Δ_{i,j,p} = PERF_{i,p} - PERF_{j,p}, where the difference Δ depends on the paper index p, and PERF_{i,p} is the observed performance of test i in paper p. To simplify notation, assume that a single number measures the performance of each test in each paper. We relax this assumption later, allowing for the distinction between the two types of mistakes (FNR and FPR, or equivalently TPR and FPR). We decompose the differences
(1) Δ_{i,j,p} = PERF_{i,p} - PERF_{j,p} = δ_{i,j} + δ_{i,j,p},
where δ_{i,j} is the 'average' difference between the two tests, and δ_{i,j,p} is the deviation of the observed difference within paper p from the average δ_{i,j}. The δ_{i,j} is an estimator of the difference between the performances of the two tests. Note that by using the deviation parameterization (similar to an ANOVA model) [[12], pp.51 & 45] we explicitly accept, and account for, the fact that the observed difference varies from one paper to the other, while still estimating the 'average' difference. This is similar to a random-effects approach, where a random distribution is assumed for the Δ_{i,j,p} and the mean parameter of that distribution is estimated. In other words, one does not need to assume a 'homogeneous' difference between the two tests across all the papers and then estimate the 'common' difference [13].
The observed test performance, PERF, may be measured on several different scales, such as the paired measures sensitivity and specificity, positive and negative predictive values, likelihood ratios, post-test odds, and post-test probabilities for normal and abnormal test results, as well as single measures such as accuracy, risk or rate ratio or difference, Youden's index, area under the ROC curve, and the odds ratio (OR). When using the OR as the performance measure, the marginal logistic regression model
(2) logit(Result_{pt}) = β_{0} + β_{1}*Disease_{pt} + β_{2}*PaperID_{pt} + β_{3}*Disease_{pt}*PaperID_{pt} + β_{4}*TestID_{pt} + β_{5}*Disease_{pt}*TestID_{pt} + β_{6}*TestID_{pt}*PaperID_{pt} + β_{7}*Disease_{pt}*TestID_{pt}*PaperID_{pt}
implements the decomposition of the performance. Model (2) is fitted to the (repeated-measures) grouped binary data, where the 2-by-2 tables of gold standard versus test results are extracted from each published paper. In model (2), Result is an integer-valued variable for a positive test result (depending on the software, for grouped binary data Result is usually replaced by the number of positive test results over the total sample size, for each group); Disease is an indicator for the actual presence of disease, as ascertained by the gold standard; PaperID is a categorical variable for the papers included in the meta-analysis; and TestID is a categorical variable for the tests included. The regression coefficients β_{2} to β_{7} can be vector-valued, i.e. have several components, so the corresponding categorical variables should be represented by a suitable number of indicator variables in the model. The indexes p and t signify paper p and test t; they define the repeated-measures structure of the data [10]. Note that model (2) fits the general case where there are two or more tests available for the disease, each studied in one or more papers. Some of the papers may have studied more than one test; hence the results are not independent. Also, the collection of tests studied may change from one paper to the other; hence the matched groups are incomplete.
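To make the grouped binary data structure concrete, the sketch below (plain Python; the counts are hypothetical, not from any of the cited papers) expands each extracted 2-by-2 table into two rows, one per disease group, of the form the model is fitted to:

```python
# Each extracted 2-by-2 table (one per paper-test combination) becomes two
# grouped-binary rows: one for the diseased group, one for the non-diseased.
tables = [
    # (paper_id, test_id, TP, FN, FP, TN) -- hypothetical counts
    (1, "A", 45, 5, 10, 40),
    (1, "B", 40, 10, 5, 45),
    (2, "A", 30, 10, 8, 52),
]

rows = []
for paper, test, tp, fn, fp, tn in tables:
    rows.append({"PaperID": paper, "TestID": test, "Disease": 1,
                 "Positives": tp, "Total": tp + fn})   # diseased group
    rows.append({"PaperID": paper, "TestID": test, "Disease": 0,
                 "Positives": fp, "Total": fp + tn})   # non-diseased group

for r in rows:
    print(r)
```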
From model (2) one can show that
LOR_{pt} = β_{1} + β_{3}*PaperID_{pt} + β_{5}*TestID_{pt} + β_{7}*TestID_{pt}*PaperID_{pt}
and therefore the difference between performance of two tests i and j, measured by LOR, is
LOR_{pi} - LOR_{pj} = β_{5}*TestID_{pi} - β_{5}*TestID_{pj} + β_{7}*TestID_{pi}*PaperID_{pi} - β_{7}*TestID_{pj}*PaperID_{pj}
where we identify δ_{i,j} of the decomposition model (1) with β_{5}*TestID_{pi} - β_{5}*TestID_{pj}, and identify δ_{i,j,p} with β_{7}*TestID_{pi}*PaperID_{pi} - β_{7}*TestID_{pj}*PaperID_{pj}.
If there is an obvious and generally accepted diagnostic test that can serve as a reference category (RefCat) to which the other tests can be compared, then a "simple" parameterization for tests is sufficient. However, usually this is not the case. When there is no perceived referent test to which the other tests are to be compared, a "deviation from means" coding is preferred for the tests. Using the deviation parameterization for both TestID and PaperID in model (2), one can show that β_{5}*TestID_{pt} is the average deviation of the LOR of test t from the overall LOR (β_{1}), where the overall LOR is the average over all tests and all papers. Therefore β_{5}*TestID_{pt} of model (2) is equivalent to δ_{i,j} of the decomposition model (1), and β_{7}*TestID_{pt}*PaperID_{pt} to δ_{i,j,p}.
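A minimal sketch of the "deviation from means" (sum-to-zero) coding follows; the helper is illustrative (not the paper's SAS/R code) and mirrors what R's contr.sum or SAS's PARAM=EFFECT option produces:

```python
def deviation_coding(levels):
    """Sum-to-zero ("deviation from means") indicator columns for a
    categorical variable: the last level is coded -1 on every column,
    so each coefficient is a deviation from the overall mean rather
    than from a reference category."""
    cols = levels[:-1]                       # one column per non-last level
    rows = {}
    for lev in levels:
        if lev == levels[-1]:
            rows[lev] = [-1] * len(cols)     # last level: -1 everywhere
        else:
            rows[lev] = [1 if lev == c else 0 for c in cols]
    return cols, rows

cols, rows = deviation_coding(["TestA", "TestB", "TestC"])
print(cols)
print(rows)  # each column sums to zero across the levels
```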
Proportional odds ratio model
Model (2) expands each study to its original sample size and uses patients as the primary analysis units. Compared to a random-effects model, where papers are the primary analysis units, it has more degrees of freedom. However, in a real case not every test is studied in every paper; rather, the majority of tests are not studied in any given paper. Therefore the data structure of tests-by-papers is incomplete, with many unmeasured cells, and the three-way interaction model (2) may become overparameterized. One may want to drop the terms involving TestID_{pt}*PaperID_{pt} (the β_{6} and β_{7} terms of model (2)). Then for the reduced model
(3) logit(Result_{pt}) = β_{0 }+ β_{1}*Disease_{pt }+ β_{2}*PaperID_{pt }+ β_{3}*Disease_{pt}*PaperID_{pt }+ β_{4}*TestID_{pt }+ β_{5}*Disease_{pt}*TestID_{pt}
we have LOR_{pt} = β_{1} + β_{3}*PaperID_{pt} + β_{5}*TestID_{pt}, where the paper and test effects are completely separated. We call this reduced model the proportional odds ratio (POR) model: the ratio of the odds ratios of two tests is assumed to be constant across papers, while the odds ratio of each test is allowed to vary across papers. Note the difference from the proportional odds model, where the ratio of odds is assumed to be constant [14]. In the POR model
(4) OR_{pt} = OR_{p} * exp(β_{5}*TestID_{pt}), t = 1, 2, ..., k, p = 1, 2, ..., m
where t is an index for the k diagnostic tests, and p is an index for the m papers included in the analysis. OR_{p} is a baseline function capturing the way the OR changes across papers. Then, to compare two diagnostic tests i and j,
OR_{pi} / OR_{pj} = exp(β_{5}*TestID_{pi} - β_{5}*TestID_{pj})
where the ratio of the two ORs depends only on the difference between the effect estimates of the two tests, and is independent of the underlying OR_{p} across the papers. Thus the model makes no assumptions about the shape of OR_{p} (and in particular does not assume homogeneity of ORs), but merely specifies a relationship between the ORs of the two tests.
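The POR structure in (4) can be checked numerically. In this sketch the baseline OR_{p} values and the test effects are hypothetical; the point is only that the ratio of the two tests' ORs is the same in every paper, however wildly OR_{p} varies:

```python
import math

# Hypothetical baseline function OR_p (free to vary across papers) and
# hypothetical per-test log-OR effects beta5:
or_p  = {1: 4.0, 2: 9.5, 3: 22.0}
beta5 = {"i": 0.40, "j": -0.40}

def or_pt(p, t):
    """Equation (4): OR_pt = OR_p * exp(beta5_t)."""
    return or_p[p] * math.exp(beta5[t])

# The ratio OR_pi / OR_pj is constant across papers:
ratios = [or_pt(p, "i") / or_pt(p, "j") for p in or_p]
print(ratios)  # every entry equals exp(0.40 - (-0.40)) = exp(0.8)
```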
One may want to replace the PaperID variable with a smooth function of FPR or TPR, such as restricted (natural) cubic splines. There are two potential advantages. First, this may preserve some degrees of freedom, which one can then spend by adding covariates to the model to measure their potential effects on the performance of the diagnostic tests; thus one would be able to explain why the performance of the same test varies across papers. Second, this allows plotting a ROC curve where the OR is not constant along the curve: a flexible (heterogeneous) ROC curve.
(5) logit(Result_{pt}) = β_{0} + β_{1}*Disease_{pt} + β_{2}*S(FPR_{pt}) + β_{3}*Disease_{pt}*S(FPR_{pt}) + β_{4}*TestID_{pt} + β_{5}*Disease_{pt}*TestID_{pt} + β_{6}*X_{pt} + β_{7}*Disease_{pt}*X_{pt}
To test the POR assumption one may use model (2), where the three-way interaction of Disease and TestID with PaperID is included. However, for the majority of real datasets this would mean an overparameterized model. Graphics can be used for a qualitative check of the POR assumption. For instance, the y-axis can be LOR while the x-axis is paper number. To produce such a plot, it may be better to have the papers ordered in some sense. One choice is to compute an unweighted average of the (observed) ORs of all the tests a paper studied, use it as the OR of that paper, and then sort the papers by these ORs. The OR of a test may vary from one paper to the other (with no restriction), but the POR assumption is that the ratio of the ORs of two tests remains the same from one paper to another. If one shows the ORs of a test across papers by a smooth curve, then one expects the curves of the two tests to be proportional to each other. On the log-OR scale, this means the vertical distance between the two curves remains the same across the x-axis. To compute the observed LOR for a test in a paper, one may need to add some value (such as 1/2) to the cell counts if some cell counts are zero. However, this could introduce some bias into the estimates.
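The observed LOR with the 1/2 continuity correction just described can be sketched as follows (the correction rule and the example counts are illustrative):

```python
import math

def observed_lor(tp, fn, fp, tn, correction=0.5):
    """Observed log odds ratio of a 2-by-2 table. Adds `correction`
    to every cell when any cell is zero; as noted in the text, this
    makes the LOR computable but introduces some bias."""
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + correction for x in (tp, fn, fp, tn))
    return math.log((tp * tn) / (fn * fp))

print(round(observed_lor(45, 5, 10, 40), 3))   # no zero cells
print(round(observed_lor(20, 0, 4, 36), 3))    # zero cell, corrected
```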
Among the approaches for modeling repeated-measures data, we use generalized estimating equations to estimate the marginal logistic regression [15]. Software for estimating the parameters of a marginal POR model is widely available, including SAS (the GENMOD procedure), R (the geese function), and Stata (the xtgee command), with R being freely available open-source software [16].
One may use a nonlinear mixed-effects modeling approach on the cell-count data to estimate the parameters of the POR model. The paper effect is declared as random, and the interaction of the random effect with Disease is included in the model, as indicated in model (2). However, such mixed-effects nonlinear models are hard to converge, especially for datasets where many papers study only one or a small number of the included tests (such as the dataset presented as the example in this paper). If convergence is good, it may be possible to fit a mixed model with the interaction of Disease, Test, and the paper random effect. Such a model relaxes the POR assumption, besides relaxing the OR-homogeneity assumption; in other words, one can use it to test the POR assumption quantitatively. One should understand that the LOR estimate from a marginal model has a population-average interpretation, while that from a mixed model is a conditional average, so there is a slight difference in their meaning.
Expanding the proportional odds ratio model
One may use the frameworks of generalized linear models (GLM) and generalized estimating equations (GEE) to extend the POR model and apply it to different scenarios. By using a suitable GLM link function and random component [[17], p.72], one may fit the POR model to multicategory diagnostic tests, via baseline-category, cumulative, adjacent-categories, or continuation-ratio logits [[17], chapter 8]. A log-linear 'proportional performance' (PP) regression may be fitted to the cell counts, treating them as Poisson. Also, one may fit the PP model to the LORs directly, assuming a Gaussian random component with an identity link function. Comparing GEE estimates from fitting the model to 2-by-2 tables, GEE estimates of the model fitted directly to the LORs, and a mixed model fitted to the LORs, statistical power usually decreases across the three. There is also the issue of incorporating sample sizes that differ across studies. Note that some nuisance parameters, such as the intercept and the coefficients of all main effects, no longer need to be estimated, as they are not present in the model fitted directly to LORs.
One may avoid dichotomizing the results of the diagnostic test by using the 'likelihood ratio' as the performance measure and fitting a PP model to this continuous outcome. For a scenario where the performance of a single test has been measured multiple times within the same study, for example with different diagnostic calibrations (multiple thresholds), the POR estimated by GEE incorporates the data dependencies. When there is multilayer and/or nested clustering of repeated measures, software to fit a mixed-effects POR model may be more available than an equivalent GEE POR.
When the POR model is implemented by a logistic regression on 2-by-2 tables, it uses a grouped binary data structure. It takes minimal effort to fit the same logistic model to "ungrouped" binary data, the so-called "individual-level" data.
Methods of meta-analysis that allow for different outcomes (and different numbers of outcomes) to be measured per study, such as those of Gleser and Olkin [18] or DuMouchel [19], may be used to implement the POR model. This avoids conducting parallel meta-analyses, which is usually less efficient.
Results
Deep vein thrombosis
To demonstrate how to fit the POR model, we use a recent meta-analysis of diagnostic tests for deep vein thrombosis (DVT) by Heim et al. [20]. In this meta-analysis there are 23 papers and 21 tests, comprising 483 potential performance measurements, of which only 66 are actually observed; thus 86% of the cells are not measured. We fitted the reduced marginal logistic regression model (3). Table 1 shows the parameter estimates for the test effects. SAS code to estimate the parameters is provided [see Additional file 1]. Data files are provided in Additional file 2.
Table 1. Parameter estimates for test effects
Additional file 1. In this file we present sample code for a few of the models presented in the paper. The estimation has mostly been done in SAS, while the graphing (and some model-fitting) has been done in R.
Since we have used the deviation contrast for the variables, the estimate of β_{1} is the "overall mean" of the log-OR. This is similar to an ANOVA analysis, where the overall mean is estimated by the model. Therefore the average OR equals exp(2.489) = 12.049. The components of β_{5} estimate the deviation of the LOR of each test from the overall LOR. The software gives estimates of the SEs, plus confidence intervals and p-values, so inference is straightforward.
A forest plot may be used to present the results of the modeling graphically. This may connect better with a clinically oriented audience. In Figure 1 we have sorted the 21 tests by their LOR estimates.
Figure 1. Comparing performance of each diagnostic test to the overall LOR
The horizontal axis is the log-OR, representing test performance. The dashed vertical line shows the overall mean LOR. For each diagnostic test the solid square shows the LOR, while the horizontal line shows the corresponding 95% CI. If the horizontal line does not intersect the vertical line, the test is significantly different from the overall mean LOR.
Note that the CIs in the plot are computed by adding the overall LOR to the CI for the deviation effect of each particular test. This ignores the variability of the overall LOR estimate. One can estimate the LOR of a test and its CI more accurately with some extra computation, or by fitting a slightly modified model; a method is illustrated and implemented [see Additional file 1]. However, the gain in accuracy was small in this particular example. The model also estimates paper effects, though one may not be primarily interested in those.
One can translate the LOR into other measures of test performance. There are numerous such measures, and we provide code to convert the LOR estimated by the POR model into them. Note that the majority of them, unlike the LOR, come in pairs, meaning that to compare two tests one needs two numbers to represent each test. For example, sensitivity-specificity is such a pair: if one test has a higher sensitivity but a lower specificity than the other, it is not immediately clear which test is better. Also note that some performance measures are independent of disease prevalence, while others depend on it; the same test performs differently in populations with different disease prevalences.
Note that in order to compute some of the performance measures, one needs to assume a prevalence and a sensitivity or specificity. For Table 2 we assumed a disease prevalence of 40% and a specificity of 90%, as the tests are mainly used for ruling out DVT.
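Under those stated assumptions (prevalence 40%, specificity 90%), the conversion of an OR into the other measures can be sketched as follows; this is an illustrative re-derivation in plain Python, not the paper's own conversion code:

```python
import math

def measures_from_or(or_value, prevalence=0.40, specificity=0.90):
    """Convert an OR into other performance measures under an assumed
    prevalence and specificity (the assumed values are inputs here,
    not outputs of the POR model)."""
    fpr = 1 - specificity
    # solve odds(TPR) = OR * odds(FPR) for the sensitivity
    odds_tpr = or_value * fpr / (1 - fpr)
    sens = odds_tpr / (1 + odds_tpr)
    lr_pos = sens / fpr                    # likelihood ratio, abnormal result
    lr_neg = (1 - sens) / specificity      # likelihood ratio, normal result
    pre_odds = prevalence / (1 - prevalence)
    # post-test probabilities via post-odds = pre-odds * LR
    p_dis_pos = pre_odds * lr_pos / (1 + pre_odds * lr_pos)
    p_dis_neg = pre_odds * lr_neg / (1 + pre_odds * lr_neg)
    return {"sensitivity": sens, "LR+": lr_pos, "LR-": lr_neg,
            "P(D|+)": p_dis_pos, "P(D|-)": p_dis_neg}

# e.g. for the overall average OR exp(2.489) from the DVT example:
for k, v in measures_from_or(math.exp(2.489)).items():
    print(k, round(v, 3))
```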
Table 2. Other performance measures for the 21 diagnostic tests of DVT
We suggest graphs to compare tests when using such "prevalence-dependent paired performance measures" [21]. In Figure 2 we have used a pair of measures, 'probability of disease given a normal test result' and 'probability of disease given an abnormal test result': the dashed red curve and the dot-and-dash blue curve, respectively.
Figure 2. Post-test probability difference for diagnostic test VIDAS
The way one may read the graph is as follows. Given a particular population with a known disease prevalence, say 40%, we perform the diagnostic test on a person picked randomly from the population. If the test is normal, the probability that the person has the disease decreases from the average 40% to about 4% (draw a vertical line from the point 0.4 on the x-axis to the dashed red curve, then a horizontal line from the curve to the y-axis). If the test is abnormal, the probability that the person is diseased increases from 40% to about 57%. The dotted green diagonal line represents a test no better than flipping a coin, an uninformative test. The farther the two curves are from the diagonal line, the more informative, i.e. the better performing, the test.
One can summarize the two curves of a test in a single curve by computing the vertical distance between them. The solid black curve in the figure is such a "difference" curve. It seems this particular test performs best in populations with a disease prevalence of around 75%.
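Such a difference curve can be traced by scanning prevalences. In the sketch below the likelihood ratios are hypothetical (so the peak will not match any particular test in the example); the peak prevalence is found by a simple grid search:

```python
def posttest_difference(prev, lr_pos, lr_neg):
    """P(D | abnormal result) - P(D | normal result) at a given prevalence,
    via post-odds = pre-odds * LR."""
    pre_odds = prev / (1 - prev)
    p_pos = pre_odds * lr_pos / (1 + pre_odds * lr_pos)
    p_neg = pre_odds * lr_neg / (1 + pre_odds * lr_neg)
    return p_pos - p_neg

# Hypothetical likelihood ratios for a single test:
lr_pos, lr_neg = 6.0, 0.10

# Scan prevalences 1%..99% to find where the difference curve peaks:
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda p: posttest_difference(p, lr_pos, lr_neg))
print(best, round(posttest_difference(best, lr_pos, lr_neg), 3))
```

(Analytically, the peak is at pre-test odds 1/sqrt(LR+ * LR-), which the grid search approximates.)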
One can use the difference curve to compare several tests, and study the effect of prevalence on how the tests compare to each other. In Figure 3, the two tests VIDAS and D-Dimer from the DVT example are compared. From the model estimates we know that both tests perform better than average, and that VIDAS performs better than D-Dimer.
Figure 3. Comparing post-test probability difference for VIDAS – D-Dimer
The black solid curve compares the two tests. For populations with low disease prevalence (around 17%), D-Dimer performs better than VIDAS. However, when the prevalence is higher (around 90%), VIDAS is preferred. Simultaneous confidence bands around the comparison curve would make formal inference possible.
Random effects
A nonlinear mixed-effects POR model fitted to the cell counts of the DVT dataset does not converge satisfactorily. We fitted the mixed model to a subset of the data in which only two tests and seven papers are included (Table 3). For code, see Additional file 1.
Table 3. Data structure for two diagnostic tests
Five of the seven papers studied both tests. The results of SAS PROC NLMIXED are still sensitive to the initial values of the parameters. The three-way interaction term of disease, test, and paper in the mixed model (where POR is not assumed) is insignificant (Table 4). A POR assumption for the two tests may be acceptable.
Table 4. Comparing parameter estimates from three models
The estimates of the overall LOR from both the POR-mixed model and the POR-marginal model are significantly different from zero. However, the mixed-model estimate of the LOR is much smaller than the marginal one. For nonlinear models, the marginal model describes the population parameter, while the mixed model describes an individual's [[15], p.135]. The estimate of the deviation of the test (NycoCard) from the overall LOR is closer in the two models, with the marginal estimate closer to 0 than the mixed estimate. One expects the coefficient estimates of a mixed model to be closer to zero than those of the fixed-effects model, while the mixed-model CIs are wider.
Meta-analysis of a single test: the baseline OR_{p} function
Sometimes one may be interested in constructing the ROC curve for a diagnostic test. A homogeneous ROC curve assumes the performance of the test (as measured by the LOR) is the same across the whole range of specificity. This assumption may be relaxed in a HetROC. We fitted a simplified version of model (5) for the test SimpliRED,
logit(Result_{pt}) = β_{0 }+ β_{1}*Disease_{pt }+ β_{2}*S(FPR_{pt}) + β_{3}*Disease_{pt}*S(FPR_{pt})
where the index t is fixed, and then used the estimates of the coefficients to plot the corresponding HetROC (Figure 4).
Figure 4. Heterogeneous ROC curve for diagnostic test SimpliRED
The eleven papers that studied the SimpliRED test are shown by circles whose areas are proportional to the sample sizes of the studies. The black dashed curve is the ROC curve assuming a homogeneous OR. The red solid curve relaxes that assumption: a heterogeneous ROC curve. The amount of smoothing of the curve can be controlled by the degrees-of-freedom (DF) parameter; here we used a DF of 2. Code to make such plots is presented in Additional file 1.
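For reference, the homogeneous (constant-OR) curve, like the dashed one in Figure 4, is determined entirely by the relation odds(TPR) = OR * odds(FPR); a minimal sketch, with the OR value chosen for illustration near the DVT overall average:

```python
import math

def roc_tpr(fpr, or_value):
    """TPR on the homogeneous ROC curve with a constant odds ratio:
    solve odds(TPR) = OR * odds(FPR) for TPR."""
    odds = or_value * fpr / (1 - fpr)
    return odds / (1 + odds)

# A few points on the homogeneous curve for OR = 12 (illustrative value);
# a heterogeneous curve would instead let the OR change along the x-axis.
for fpr in (0.05, 0.10, 0.25, 0.50):
    print(fpr, round(roc_tpr(fpr, 12.0), 3))
```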
Model checking
To check the POR assumption, model (2) may be used to test the significance of the three-way interaction term. However, the dataset gathered for the DVT meta-analysis is such that no single paper covers all the tests. Moreover, 7 of the 21 tests have been studied in only one paper. For Figure 5 we chose the tests that were studied in at least 5 of the 23 papers; there are 5 such tests. Note that even for these "popular" tests, of the 10 pairwise comparisons, 3 are based on only one paper (so there is no way to test POR). Four comparisons are based on 4 papers, one on 3 papers, and the remaining two on 2 papers.
Figure 5. Observed logoddsratios of each diagnostic test
We sorted the papers (the x-axis) by the average LOR within each paper. We fitted lowess smooth lines to the observed LORs of each test separately. Figure 5 shows the smooth curves are relatively parallel. Note the range of the LORs of a single test: the LORs vary considerably from one paper to the other. Indeed, the homogeneity-of-ORs assumption is violated in four of the five tests.
Also, to verify how well the model fits the data, one may use an observed-versus-fitted plot. Plots or lists of standardized residuals may help find papers or tests that are not fitted well. This may provide a starting point for further investigation.
Discussion
A comparison of the relative accuracy of several diagnostic tests should ideally be based on applying all the tests to each of the patients, or on randomly assigning tests to patients in each primary study. Obtaining diagnostic accuracy information for different tests from different primary studies is a weak design [3]. Comparison of the accuracy of two or more tests within each primary study is more valid than comparison between primary studies [22]. Although a head-to-head comparison of diagnostic tests provides more valid results, there are real-world practical questions for which meta-analysis provides an answer more quickly and efficiently than a single big study [23]. Meta-analysis can potentially provide better understanding by examining the variability in estimates, hence validity versus generalizability (applicability). Also, there may be tests that have never been studied simultaneously in a single study, and a meta-analysis can "reconstruct" such a study of diagnostic tests.
Relaxing the assumption of OR homogeneity
In a meta-analysis of two (or more) diagnostic tests, where attention is mainly on the difference between the performances of the tests, having a homogeneous estimate of the performance of each single test is of secondary importance, and it may be treated as a nuisance. The POR model assumes the difference between the LORs of two tests is the same across all papers, but does not assume the OR of a test is the same in every paper. Hence homogeneity of the OR of a test across the papers that reported it is not needed; the assumption is shifted one level higher, to the POR.
Common versus average effect size
The POR model uses a "deviation from means" parameterization. Then one does not need to drop the interaction coefficient β_3 in the model logit(Result) = β_0 + β_1*Disease + β_2*PaperID + β_3*Disease*PaperID in order to interpret β_1, the overall LOR. This means the POR model explicitly accepts that the performance of the diagnostic test varies across the papers, but at the same time estimates its mean value. McClish explains that if a test of OR homogeneity shows heterogeneity, there may be no 'common' measure to report, but there is still an 'average' measure one can report [13].
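As a concrete illustration of this parameterization (our own sketch, not the paper's appendix code; the function name `effect_code` and the paper labels are hypothetical), effect coding of a categorical PaperID can be written as follows:

```python
# "Deviation from means" (effect) coding for a categorical PaperID with k
# levels: k-1 columns, coded 1 for the observation's own level, -1 for the
# reference level, 0 otherwise.  With this coding the interaction
# coefficients sum to zero across papers, so beta_1 stays interpretable as
# the LOR averaged over papers even with Disease*PaperID in the model.
def effect_code(level, levels):
    """Return the k-1 effect-coded columns for one observation."""
    if level == levels[-1]:          # last level serves as reference
        return [-1] * (len(levels) - 1)
    return [1 if level == x else 0 for x in levels[:-1]]

rows = [effect_code(p, ["P1", "P2", "P3"]) for p in ["P1", "P2", "P3"]]
# Each column sums to zero over the papers, which is what makes the main
# effect an average rather than a reference-level contrast.
col_sums = [sum(col) for col in zip(*rows)]
```

With dummy (reference-cell) coding, by contrast, β_1 would be the LOR in the reference paper only, and the interaction terms would have to be dropped to read it as an overall effect.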
Advantages of using 2by2 tables
We demonstrated how to fit the POR model to the cell counts, rather than to the OR values. We believe this has several advantages. 1. One does not need to assume normality of some summary measure; the binomial distributional assumption is more realistic. 2. Different study sample sizes are incorporated into the POR model without the faulty, bias-introducing weighting schemes described by Mosteller & Chalmers [25], and extension of the POR model to individual-level patient data is much easier. 3. The effective sample size for a meta-analysis by a random model is the number of papers included, which is usually quite small; there is a great danger of overfitting, and the number of explanatory variables one can include in the model is very restricted. Since we use the grouped binary data structure, the patients are the effective sample size, giving much larger degrees of freedom.
The random-effects model is usually implemented by extracting the OR from each paper and assuming the LOR is normally distributed. Then the distinction between the two types of mistakes (FNR and FPR, or equivalently TPR and FPR) is lost, since one enters the LORs as data points into the model. The bivariate model of van Houwelingen et al [26] tries to fix this by entering two data points into the model for each test from each paper. A fourth advantage of fitting the POR model to the cell counts is that both types of mistakes are included in the model. Consider the logistic regression logit(Result) = β_0 + β_1*Disease + β_2*PaperID. Then log(true positive/false negative) = β_0 + β_1 + β_2*PaperID. Substituting a value for the covariate (here PaperID), such as a modal or average value, and using the model estimates for the betas, one gets the log-odds; exponentiating it gives TP/FN, call it Q. It is then easy to verify that sensitivity = Q/(1+Q). Likewise, log(false positive/true negative) = β_0 + β_2*PaperID, which we call log(W); then specificity = 1/(1+W). Also, one can apply separate weights to log(true positive/false negative) and log(false positive/true negative), to balance the true positive and false positive rates for decision-making in a particular clinical practice.
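The back-conversion just described can be sketched in a few lines. This is an illustrative Python translation of the algebra above, not the paper's SAS/R code, and the coefficient values plugged in are made up:

```python
import math

def sens_spec(b0, b1, b2, paper_cov):
    """Convert logistic-regression estimates into (sensitivity, specificity)
    at a chosen value of the paper-level covariate."""
    # log(TP/FN) = b0 + b1 + b2*x, so Q = TP/FN and sensitivity = Q/(1+Q)
    q = math.exp(b0 + b1 + b2 * paper_cov)
    # log(FP/TN) = b0 + b2*x, so W = FP/TN and specificity = 1/(1+W)
    w = math.exp(b0 + b2 * paper_cov)
    return q / (1 + q), 1 / (1 + w)

# Hypothetical estimates: intercept b0 = -2, LOR b1 = 3, evaluated at the
# average (centered) covariate value 0.
sens, spec = sens_spec(-2.0, 3.0, 0.0, 0.0)
```

Evaluating the pair at several covariate values traces out the test's operating points across papers, which is what the HetROC plots use.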
When collecting papers from the biomedical literature for a meta-analysis of a few diagnostic tests, it is hard to come up with a complete, square dataset where every paper has included all the tests of interest. Usually the dataset contains missing values, and casewise deletion of papers with missing tests throws a lot of data away. A method of analysis that can utilize incomplete matched groups is therefore helpful. The POR model allows complex missing patterns in the data structure. Convergence of the marginal POR model seems much better than that of the nonlinear mixed model when fitted to cell counts of incomplete matched groups; this is an advantage of using GEE to estimate the POR.
The fact that one can use popular free or commercial software to fit the proposed models facilitates incorporation of POR modeling into the practice of meta-analysis.
Unwanted heterogeneity versus valuable variability
The POR model utilizes the variation in the observed performance of a test across papers. Explaining when and how the performance of the test changes, and finding the influential factors, is an important step in advancing science. In other words, rather than calling it 'heterogeneity', treated as unwanted and unfortunate, one calls it 'variability' and uses the observed variability to estimate and explain when and how to use the agent or the test in order to optimize their effects.
Victor [32] emphasizes that the results of a meta-analysis can only be interpreted if existing heterogeneities can be adequately explained by methodological heterogeneities. The POR model estimates the effect of potential predictors on between-study variation, hence trying to 'explain' why such variation exists.
The POR model incorporates the risk of events in the control group via a predictor, such as observed prevalence, hence a 'control rate regression' [26].
ROC curve
Although implementing the HetROC means one accepts that the diagnostic test performs differently at different FPRs along the ROC curve, some implementations of the HetROC, such as the summary ROC method, compare tests at a single point of their respective ROCs. This is not optimal. (The Q test of the SROC method is a single-point test, and that point on the ROC may not be the relevant point for a specific cost-benefit case.) In such a method, although one produces a complete SROC, one does not use it in comparing the diagnostic tests. In the POR model, one uses the LOR as the measure of diagnostic discrimination accuracy and builds the statistical test on the LOR ratio, so the test corresponds to whole ROCs (of general form).
The ROC graph was designed in the context of the theory of signal detectability [27,28]. An ROC can be generated in two ways: by assuming probability distribution functions (PDFs) for the two populations of 'diseased' and 'healthy', or by algebraic formulas [29]. Nelson claims the (algebraic) ROC framework is more general than signal detection theory (and its PDF-based ROC) [5]. The location-scale regression models implement the ROC via PDFs, while the summary-ROC method uses the algebraic approach. The POR model uses a hybrid approach: while POR may be implemented by logistic regression, the smoothing covariate resembles the algebraic method. Unlike location-scale regression models that use two equations, POR uses one equation, so it is easier to fit with the usual statistical packages. One may use a five-parameter logistic to implement the HetROC; however, that model cannot be linearized, and then, according to McCullagh [14], it will not have good statistical properties. The POR model relaxes not only the assumption that Var1/Var2 = 1, where Var1 and Var2 are the variances of the two underlying distributions for the two populations, but even the monotonicity of the ROC. Hence the model can represent both asymmetric ROCs and nonregular ROCs (singular detection).
In building the HetROC curve, the POR model accommodates more general heterogeneous ROCs than the SROC, because it uses a nonparametric smoother instead of the arbitrary parametric functions used in the SROC method. When the smoother covariate in the POR model is replaced by log{TPR*FPR/[(1-TPR)*(1-FPR)]}, a HetROC similar to the SROC of Moses et al is produced.
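The replacement covariate has a convenient closed form: it equals logit(TPR) + logit(FPR). A minimal sketch (our own illustration, with hypothetical operating points, not code from the paper's appendix):

```python
import math

def sroc_covariate(tpr, fpr):
    """Covariate log{TPR*FPR / [(1-TPR)*(1-FPR)]}, i.e.
    logit(TPR) + logit(FPR); substituting it for the nonparametric
    smoother yields a Moses-style SROC rather than a general HetROC."""
    return math.log(tpr * fpr / ((1.0 - tpr) * (1.0 - fpr)))

# At a symmetric operating point the covariate vanishes:
# logit(0.8) + logit(0.2) = 0.
s = sroc_covariate(0.8, 0.2)
```

Requires 0 < TPR < 1 and 0 < FPR < 1; cells with observed rates of exactly 0 or 1 need a continuity correction before this covariate can be computed.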
When one uses a smooth function of FPR in the POR model, it is equivalent to using a function of the outcome as a predictor; this resembles a 'transition model'. Ogilvie and Creelman [30] argue that for estimating the parameters of a best-fitting curve through observed points in ROC space, least squares is not appropriate, since both axes are dependent variables and subject to error; they recommend maximum likelihood estimation instead. Crouchley and Davies [31] warn that, although GEE is fairly robust, it becomes inconsistent if any of the covariates are endogenous, like a previous or related outcome or a baseline outcome; they claim a mixed model is better for studying micro-level dynamics. We have observed that the smooth HetROC curve may become decreasing at the right end, due to some outlier points; using less smoothing in the splines may be a solution.
When there is only one diagnostic test, and one is mainly interested in pooling several studies of the same test, the POR model estimates effect sizes that are more generalizable. By using the smoother (instead of PaperID), one fits a sub-saturated model that allows inclusion of other covariates, making it possible to estimate the effect of study-level factors on performance and to explain the heterogeneity. Also, it does not assume any a priori shape of the ROC, including monotonicity, and it enables graphing of the HetROC. It does not need omission of interaction terms to estimate the overall performance, and it does not need the assumption of OR homogeneity. If several performance measurements of the same test are made in a single study, for example evaluating the same test at different diagnostic calibrations, the POR model provides more accurate estimates by incorporating the dependence structure of the data.
Random effects
When there is heterogeneity between the studies of the same diagnostic test, one solution to absorb the extra between-study variation is to use a random/mixed-effects model. However, Greenland [33] offers cautions when working with random-effects models: 1. if adding a random effect changes the inference substantially, it may indicate large heterogeneity that needs to be explained; 2. specific distributional forms for random effects have no empiric, epidemiologic, or biologic justification, so their assumptions should be checked; 3. the summary statistic from a random-effects model has no population-specific interpretation; it represents the mean of a distribution that generates effects. Random models estimate unit-specific coefficients, while marginal models estimate population averages. The choice between unit-specific and population-average estimates depends on the specific research questions of interest: if one were primarily interested in how a change in a covariate affects a particular individual cluster's mean, one would use the unit-specific model; if one were interested in how a change in a covariate can be expected to affect the overall population mean, one would use the population-average model. The difference between unit-specific and population-average models arises only in the case of a nonlinear link function. In essence, the random-effects model exchanges a questionable homogeneity assumption for a fictitious random distribution of effects. An advantage of a random model is that its standard errors and confidence intervals reflect unaccounted-for sources of variation; its drawback is that simplicity of interpretation is lost. When residual heterogeneity is small, fixed and random models should give the same conclusions. Inference about the fixed effects (in a mixed model) applies to an entire population of cases defined by the random effect, while the same coefficient from a fixed model applies only to the particular units in the dataset.
Crouchley and Davies [31] explain that one of the drawbacks of their random model is that it rapidly becomes overparameterized, and it may also encounter multiple optima.
Followups
We suggest these follow-ups: 1. the POR model has been implemented by both marginal and mixed models; it would be useful to implement a marginalized mixed POR model; 2. in clinical practice, usually a group of diagnostic tests is performed on an individual for a particular disease, some requested simultaneously and some in sequence; it would be useful, and practically important, to extend the POR model so that it incorporates such a sequence of testing and prior results; 3. the utility of the POR model may be extended to meta-analysis of therapeutics.
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
MSS conceived of the model, and participated in its design and implementation. JS participated in implementation of the model and performing of the example analysis. Both authors read and approved the final manuscript.
Additional File 2. This zipped file contains 8 data files, in the .csv (comma separated value) and .xls (MS Excel) formats. They are to be used with the SAS and R codes we presented in the Appendix [additional file 1]. Five files are for the SAS codes presented in the Appendix. The file names are "data5.xls", "data5_t12&17.xls", "u125.xls", "data5_t18.xls", "data6.xls". Three files are for the R codes presented in the Appendix. The file names are "obsVSfit.csv", "dataNewExcerpt2.csv", and "data6_lor2.csv".
Format: ZIP Size: 37KB
References

L'Abbe KA, Detsky AS, O'Rourke K: Meta-analysis in clinical research.
Ann Intern Med 1987, 107:224-33.

Dorfman DD, Berbaum KS, Metz CE: Receiver operating characteristic rating analysis.
Invest Radiol 1992, 27(9):723-731.

Irwig L, Tosteson ANA, Gatsonis C, Lau J, Colditz G, Chalmers TC, Mosteller F: Guidelines for meta-analyses evaluating diagnostic tests.
Ann Intern Med 1994, 120:667-676.

Rutter CM, Gatsonis CA: Regression methods for meta-analysis of diagnostic test data.
Acad Radiol 1995, 2:S48-S56.

Nelson TO: ROC curves and measures of discrimination accuracy: A reply to Swets.

Tosteson AN, Begg CB: A general regression methodology for ROC curve estimation.
Med Decis Making 1988, 8:204-215.

Kardaun JW, Kardaun OJWF: Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation.
Meth Inform Med 1990, 29:12-22.

Moses LE, Shapiro D, Littenberg B: Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations.
Stat Med 1993, 12(14):1293-316.

Toledano A, Gatsonis CA: Regression analysis of correlated receiver operating characteristic data.
Acad Radiol 1995, 2:S30-S36.

Siadaty MS, Philbrick JT, Heim SW, Schectman JM: Repeated-measures modeling improved comparison of diagnostic tests in meta-analysis of dependent studies.
J Clin Epidemiol 2004, 57(7):698-710.

Irwig L, Macaskill P, Glasziou P, Fahey M: Meta-analytic methods for diagnostic test accuracy.
J Clin Epidemiol 1995, 48(1):119-130.

Hosmer DW, Lemeshow S: Applied Logistic Regression. New York: Wiley-Interscience; 1989.

McClish DK: Combining and comparing area estimates across studies or strata.
Med Decis Making 1992, 12:274-279.

Diggle P, Heagerty P, Liang KY, Zeger S: Analysis of Longitudinal Data. New York: Oxford University Press; 2002.

Ihaka R, Gentleman R: R: A language for data analysis and graphics.
Journal of Computational and Graphical Statistics 1996, 5:299-314.

Agresti A: An Introduction to Categorical Data Analysis. New York: Wiley-Interscience; 1996.

Gleser LJ, Olkin I: Stochastically dependent effect sizes. In The Handbook of Research Synthesis. Edited by Cooper H, Hedges LV. New York: Russell Sage Foundation; 1994:339-56.

DuMouchel W: Repeated measures meta-analyses.
Bulletin of the International Statistical Institute, Session 51, Tome LVII, Book 1 1998, 285-288.

Heim SW, Schectman JM, Siadaty MS, Philbrick JT: D-dimer testing for deep venous thrombosis: a meta-analysis.
Clin Chem 2004, 50(7):1136-47.

Hamilton GW, Trobaugh GB, Ritchie JL, Gould KL, DeRouen TA, Williams DL: Myocardial imaging with Thallium 201: an analysis of clinical usefulness based on Bayes' theorem.
Semin Nucl Med 1978, 8(4):358-364.

Cochrane Methods Group on Systematic Review of Screening and Diagnostic Tests: recommended methods.

Spitzer WO: The challenge of meta-analysis.
J Clin Epidemiol 1995, 48(1):1-4.

Neter J, Kutner MH, Wasserman W, Nachtsheim CJ: Applied Linear Statistical Models. Boston: McGraw-Hill/Irwin; 1996.

Mosteller F, Chalmers T: Some progress and problems in meta-analysis of clinical trials.

van Houwelingen HC, Arends LR, Stijnen T: Advanced methods in meta-analysis: multivariate approach and meta-regression.
Stat Med 2002, 21(4):589-624.

Peterson WW, Birdsall TG, Fox WC: The theory of signal detectability.
Transactions of the IRE Professional Group on Information Theory 1954, 4:171-212.

Tanner WP, Swets JA: A decision-making theory of visual detection.
Psychol Rev 1954, 61(6):401-409.

Swets JA: Indices of discrimination or diagnostic accuracy: Their ROCs and implied models.
Psychol Bull 1986, 99(1):100-117.

Ogilvie JC, Creelman CD: Maximum likelihood estimation of receiver operating characteristic curve parameters.

Crouchley R, Davies RB: A comparison of population average and random-effect models for the analysis of longitudinal count data with baseline information.
J R Statist Soc A 1999, 162:331-347.

Victor N: "The challenge of meta-analysis": Discussion. Indications and contraindications for meta-analysis.
J Clin Epidemiol 1995, 48(1):5-8.

Greenland S: Quantitative methods in the review of epidemiologic literature.
Epidemiol Rev 1987, 9:1-30.
Prepublication history
The prepublication history for this paper can be accessed here: