Interval estimation and optimal design for the within-subject coefficient of variation for continuous and binary variables

1 Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, P.O. Box 3354, Riyadh 11211, Saudi Arabia
2 Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada
3 Department of Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

BMC Medical Research Methodology 2006, 6:24 doi:10.1186/1471-2288-6-24
The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2288/6/24
© 2006 Shoukri et al; licensee BioMed Central Ltd.

Abstract

Background
In this paper we propose the use of the within-subject coefficient of variation as an index of a measurement's reliability. For continuous variables, and based on its maximum likelihood estimation, we derive a variance-stabilizing transformation and discuss confidence interval construction within the framework of a one-way random effects model. We investigate sample size requirements for the within-subject coefficient of variation for continuous and binary variables.

Methods
We investigate the validity of the approximate normal confidence interval by Monte Carlo simulations. In designing a reliability study, a crucial issue is the balance between the number of subjects to be recruited and the number of repeated measurements per subject. We discuss efficiency of estimation and cost considerations for the optimal allocation of the sample resources. The approach is illustrated by an example on Magnetic Resonance Imaging (MRI). We also discuss the issue of sample size estimation for dichotomous responses with two examples.

Results
For continuous variables, we found that the variance-stabilizing transformation improves the asymptotic coverage probabilities of the confidence interval on the within-subject coefficient of variation. For binary variables, the maximum likelihood estimation and the sample size estimation based on a pre-specified confidence interval width are novel contributions to the literature.

Conclusion
Using the sample size formulas, we hope to help clinical epidemiologists and practicing statisticians to efficiently design reliability studies using the within-subject coefficient of variation, whether the variable of interest is continuous or binary.

Background
Measurement errors can seriously affect statistical analysis and interpretation; it therefore becomes important to assess the magnitude of such errors by calculating a reliability coefficient and assessing its precision.
For instance, in medical diagnosis, clinicians have become cognizant of the paramount importance of obtaining accurate measurements to ensure safe and efficient delivery of care to their patients. Experiments designed to measure the validity and precision of instruments used in biomedical and epidemiological research are ubiquitous. For example, Ashton et al. [1] demonstrated the importance of evaluating the reliability of manual and automated methods for quantifying total white matter lesion burden in multiple sclerosis patients. They compared the coefficients of variation of three methods. In oncology, Schwartz et al. [2] used the coefficient of variation to evaluate the repeatability of bi-dimensional computed tomography measurements for three techniques: hand-held calipers on film, electronic calipers on a workstation, and an auto-contour technique on a workstation. The coefficient of variation of the auto-contour technique differed significantly from those of the other techniques.

The coefficient of variation is often used to compare variables measured on different scales. For example, in the social sciences, when the intent is to compare the variability in school performance with the variability of household income, a comparison of standard deviations makes no sense because income and school performance are measured on different scales. The correct comparison may be based on the coefficient of variation because it adjusts for scale. Other applications of the coefficient of variation are given in Tian [3].

Scientists have developed several indices to assess the reliability and reproducibility of quantitative measurements. The intra-class correlation (ICC), the proportion of the between-subject variance to the total variance, has been widely used as an index of measurement reliability. For a comprehensive review of the ICC and its applications, we refer the reader to Fleiss [4], Dunn [5] and Shoukri [6].
One of the criticisms of the ICC is that its value depends on the population from which the study subjects have been obtained, and this may lead to difficulties in comparing results from different studies. Accordingly, Quan and Shih [7] (QS) considered the Within-Subject Coefficient of Variation (WSCV) as an alternative to the ICC for assessing measurement reproducibility or test-retest reliability. Because of the requirement that repeated observations are made on each subject, they used the one-way random effects model (REM) as a mechanism to describe the data. Although the use of the WSCV as a measure of reproducibility is long standing, the issue of sample size determination has not been adequately investigated.

Sample size estimation is one of the most important issues in the design of any study that uses inferential statistics. When the ICC is used as the index of reliability, Donner and Eliasziw [8] provided contours of exact power for selected numbers of subjects (k) and numbers of replicates (n). These power results were then used to identify optimal designs that minimize the study costs. Assuming a constant number of replicates per subject, Walter et al. [9] considered an approximation to determine the number of subjects required to achieve fixed levels of power. Bonett [10] calculated the sample size required to achieve a prescribed expected width for the confidence interval on the ICC. Shoukri et al. [11] derived the values of k and n that allocate the sample resources optimally and minimize the variance of the estimated ICC under cost constraints. The cost structure considered was general and followed the guidelines identified by Flynn et al. [12].

In this paper, we derive the optimal allocation for the number of subjects and the number of repeated measurements needed to minimize the variance of the maximum likelihood estimator (MLE) of the WSCV.
In Section 2 we present the random effects model, the definition of the WSCV, and the asymptotic distribution of its MLE for continuous data. In Section 3, we use the calculus of optimization to find the optimal combinations (n, k) that minimize the variance of the MLE of the WSCV for normally distributed variables. The use of the WSCV for dichotomous data has not been investigated before; a novel contribution of this paper is the estimation of the WSCV for binary outcome measurements, together with the sample size requirements, with emphasis on the case of two ratings per subject (i.e. n = 2). We devote Section 4 to binary data, and a general discussion is presented in Section 5.

Methods

Estimating the WSCV for continuous variables

Assumptions
Consider a random sample of k subjects with n repeated measurements of a continuous variable Y, and denote by Yij the jth reading made on the ith subject under identical experimental conditions (i = 1, 2, ..., k; j = 1, 2, ..., n). In a test-retest scenario, and under the assumption of no reader effect (i.e. the readings within a specific subject are exchangeable), Yij denotes the reading of the jth trial made on the ith subject. A useful model for analyzing such data is:

Yij = μ + si + eij, i = 1, 2, ..., k; j = 1, 2, ..., n (1)

where μ is the mean of Yij, the random subject effects si are normally distributed with mean 0 and variance σs², the measurement errors eij are normally distributed with mean 0 and variance σe², and the si and eij are independent. Quan and Shih [7] defined the WSCV parameter in the above model as:

θ = σe/μ (2)

With model (1), it is assumed that the within-subject variance is the same for all subjects.

Maximum likelihood estimator
Under the above set-up, the log-likelihood function has the form [...] Define σ² = σs² + σe², so that the intra-class correlation is ρ = σs²/σ² [...]

Results
The asymptotic variance-covariance matrix of the MLEs is obtained by inverting Fisher's information matrix.
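The point estimate of θ in (2) can be sketched in a few lines. This is a minimal illustration, assuming a balanced layout (every subject has the same number of readings), taking the grand mean as the estimate of μ and the within-subject mean square (MSW) as the estimate of σe². The function name `wscv_estimate` is hypothetical and not taken from the paper.

```python
import math


def wscv_estimate(data):
    """Estimate theta = sigma_e / mu from a balanced k x n layout.

    `data` is a list of k rows, one per subject, each holding n readings.
    Assumption: sigma_e^2 is estimated by the within-subject mean square
    and mu by the grand mean.
    """
    k = len(data)
    n = len(data[0]) if k else 0
    if k < 1 or n < 2 or any(len(row) != n for row in data):
        raise ValueError("need a balanced layout with n >= 2 readings per subject")

    subject_means = [sum(row) / n for row in data]
    grand_mean = sum(subject_means) / k

    # Within-subject sum of squares, pooled over subjects.
    ssw = sum((y - m) ** 2 for row, m in zip(data, subject_means) for y in row)
    msw = ssw / (k * (n - 1))

    return math.sqrt(msw) / grand_mean
```

For three subjects measured twice, `wscv_estimate([[10, 12], [20, 22], [30, 32]])` has MSW = 2 and grand mean 21, so the estimate is √2/21, about 0.067.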
The large sample variance of θ̂ is

var(θ̂) = θ²/[2k(n - 1)] + θ⁴[1 + (n - 1)ρ]/[kn(1 - ρ)] (3)

To construct an approximate confidence interval on θ [...] Due to the dependence of the variance of θ̂ [...]

Variance Stabilizing Transformation (VST)
To improve the estimated coverage proportion, we propose a variance stabilizing transformation g (see Kendall vol. 1, page 541 [13]), where g = ∫(var(θ̂))^(-1/2) dθ [...] where ξ1 = f([...]

Note that the limits of the interval depend on the unknown value of the intra-class correlation, which can be replaced by its MLE as defined in Section 2.1. To examine the finite sample behavior of the VST-based confidence interval estimator, a Monte Carlo study was conducted under model (1) using the S-Plus program. The values of ρ were 0.3, 0.4, 0.6, 0.7 and 0.8; μ = 10; and θ = 4%, 10% or 20%. The numbers of subjects were k = 12, 25, 50 and 75, and the numbers of replicates were n = 2, 3 and 5. The number of repetitions for each simulation was 1000. Tables 1, 2 and 3 show the coverage proportions for the 95% nominal level confidence interval on the WSCV. The estimated coverage proportions were close to the 95% nominal level.

Table 1. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.04)
Table 2. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.01)
Table 3. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.02)
Table 4. Results for 10 replicates on each of three patients' total lesion burden. Values are volumes in cubic centimeters.

Example 1
Accurate and reproducible quantification of brain lesion count and volume in multiple sclerosis (MS) patients using magnetic resonance imaging (MRI) is a vital tool for evaluating disease progression and patient response to therapy. Current standard methods for obtaining these data are largely manual and subjective, and are therefore error-prone and subject to inter- and intra-operator variability. There is therefore a need for a rapid automated lesion quantification method. Ashton et al.
[1] compared manual measurements and an automated technique known as Geometrically Constrained Region Growth (GEORG) of the brain lesion volume of 3 MS patients, each measured 10 times by a single operator for each method. The data are presented in Table 4.

Table 5. Summary analysis of data in Table 4 and 95% confidence intervals

Based on the guidelines for the levels of reliability provided by Fleiss [4], an ICC above 80% indicates excellent reliability, and from Table 5 both methods cross this threshold. However, based on the WSCV values, the manual method is clearly less reproducible than the automated method (the GEORG is 5 times more reproducible than the manual). This example demonstrates the advantage of the WSCV over the ICC as a measure of reproducibility. Clearly, one should construct a formal test of the significance of the difference between two correlated within-subject coefficients of variation. There are several competing methods to construct such a test (e.g. LRT, Wald, and Score tests), but this issue is quite involved and we intend to report our findings in a future publication.

Sample size estimation
In the following development we discuss the second objective of this paper. We assume that the investigator is interested in the number of replicates, n, per subject, so that the variance of the estimate of θ is minimized, given that the total number of measurements is fixed a priori at N = nk.

Efficiency criterion
For a fixed total number of measurements N = nk, equation (3) gives:

var(θ̂) = θ²n/[2N(n - 1)] + θ⁴[1 + (n - 1)ρ]/[N(1 - ρ)]

The necessary condition for var(θ̂) to have a minimum is ∂var(θ̂)/∂n = 0, which gives

n* = 1 + (1/θ){(1 - ρ)/(2ρ)}^(1/2)

The required number of subjects is thus k* = N/n*. Table 6 shows a few optimal allocations of (n, k) for ρ = 0.6, 0.7, and 0.8, and θ = 0.1, 0.2, 0.3 and 0.4, when N = 24. Note that in practice only integer values of (n, k) are used; because N = nk is fixed a priori, we first round the optimum value of n to the nearest integer and then round k = N/n to the nearest integer.
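The efficiency criterion above can be sketched numerically. The sketch assumes that the large-sample variance has the delta-method form var(θ̂) = θ²/[2k(n - 1)] + θ⁴[1 + (n - 1)ρ]/[kn(1 - ρ)]; with k = N/n, setting the derivative with respect to n to zero gives n* = 1 + (1/θ){(1 - ρ)/(2ρ)}^(1/2). The function names are hypothetical, and the rounding follows the two-step rule described above.

```python
import math


def var_theta(theta, rho, n, k):
    """Assumed large-sample variance of the WSCV estimate (delta method)."""
    within_part = theta ** 2 / (2 * k * (n - 1))
    mean_part = theta ** 4 * (1 + (n - 1) * rho) / (k * n * (1 - rho))
    return within_part + mean_part


def optimal_replicates(theta, rho):
    """Real-valued n* minimizing var_theta when N = n * k is held fixed."""
    return 1 + math.sqrt((1 - rho) / (2 * rho)) / theta


def allocate(theta, rho, total):
    """Round n* to the nearest integer (at least 2), then round k = N / n."""
    n = max(2, round(optimal_replicates(theta, rho)))
    k = max(1, round(total / n))
    return n, k
```

For θ = 0.1 and ρ = 0.8 the real-valued optimum is about 4.54 replicates, and `allocate(0.1, 0.8, 24)` returns (5, 5). Because both n and k are rounded, the product nk need not equal N exactly.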
The values of var(θ̂) [...]

Fixed width confidence interval approach
Bonett [10] discussed the issue of sample size requirements that achieve a pre-specified expected width for a confidence interval about the ICC. This approach is useful in planning a reliability study in which the focus is on estimation rather than hypothesis testing. He demonstrated that the effect of an inaccurate planning value of the ICC is more serious in hypothesis testing applications. Shoukri et al. [11] argued that the hypothesis testing approach might not be appropriate when planning a reproducibility study, because in most cases values of the coefficient under the null and alternative hypotheses may be difficult to specify. An alternative approach is to focus on the width of the CI for θ. Since the approximate width of a (1 - α)100% CI on θ is 2zα/2{var(θ̂)}^(1/2), setting this width equal to w and solving for k using (3) gives

k = [4(zα/2)²/w²]{θ²/[2(n - 1)] + θ⁴[1 + (n - 1)ρ]/[n(1 - ρ)]}

We observe that, for fixed n and θ, larger values of ρ require a larger number of subjects to satisfy the criterion. As an example, suppose that it is of interest to construct a 95% CI on θ with expected width w = 0.05, ρ = 0.3, and an affordable number of replicates n = 2. If the hypothesized value of θ is 0.10, then k = 31, and if θ is 0.3 (i.e. lower reliability), then k = 323.

Cost criterion
Funding constraints will often determine the cost of recruiting subjects for a reliability study. Although too small a sample may lead to a study that produces an imprecise estimate of the reproducibility coefficient, too large a sample may result in a waste of resources. Thus, an important decision in a typical reliability study is to balance the cost of recruiting subjects with the need for a precise estimate of the parameter summarizing reliability. In this section, we determine the combinations (n, k) that minimize the variance of θ̂ subject to the total cost

C = c0 + kc1 + nkc2 (6)

where c0 is the fixed cost, c1 is the cost of recruiting a single subject, and c2 is the cost of making one observation. Using the method of Lagrange multipliers and following Shoukri et al.
[11], we write the objective function Ψ in the form

Ψ = var(θ̂) + λ(C - c0 - kc1 - nkc2)

where var(θ̂) is given by (3) and λ is the Lagrange multiplier. The necessary conditions for a minimum lead to the following equation in n:

2θ²ρ*n⁴ - 4θ²ρ*n³ - (2θ²r + r - 2θ²ρ* + 1)n² + 4θ²rn - 2θ²r = 0 (8)

where r = c1/c2 and ρ* = ρ/(1 - ρ). Although an explicit solution to (8) is available, the resulting expression is complicated and does not provide any useful insight. The fourth-degree polynomial on the left side of (8) has two imaginary roots, one negative root and one admissible (positive) root for n. Table 7 summarizes the results of the optimization procedure, where we provide the optimal n for various values of θ, ρ, and r, noting that:

kopt = (C - c0)/[c2(r + nopt)] (9)

Results
From Table 7, it is apparent that when r = c1/c2 increases, the required number of replicates per subject (n) increases, because the cost of making a single observation (c2) falls relative to the cost of recruiting a subject (c1). When r is fixed, an increase in ρ results in a decline in the required value of n and accordingly an increase in k. An increase in θ also results in a decrease in n. The general conclusion is that it is sensible to decrease the number of items associated with a higher cost, while increasing those with a lower cost.

Table 6. Optimal combinations of (nopt, kopt) which minimize the variance of θ̂
Table 7. Optimal replications (rounded to the nearest integer) of n that minimize the variance of θ̂

We note that by setting c1 = 0 in Equation (8), we obtain n = 1 + (1/θ){(1 - ρ)/(2ρ)}^(1/2), the optimal number of replicates under the efficiency criterion.

Example 2
To assess the accuracy of Doppler Echocardiography (DE) in determining aortic valve area (AVA) in a prospective evaluation of patients with aortic stenosis, an investigator wishes to demonstrate a high degree of reliability (ρ = 0.80) in estimating AVA using the "velocity integral method", with a planned value for the WSCV of 0.10. Suppose that the total cost of the study is fixed at $1600.0. It is assumed that the overhead fixed cost c0 is absorbed by the hospital. Moreover, we assume that the travel cost is $15.0, and the administrative cost of using the DE is $15.0 per visit.
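The two planning calculations of this section, the fixed-width criterion and the cost criterion, can be sketched as follows. The sketch assumes that k·var(θ̂) = θ²/[2(n - 1)] + θ⁴[1 + (n - 1)ρ]/[n(1 - ρ)], solves equation (8) as printed by bisection on n > 1 (where the polynomial equals -(r + 1) at n = 1 and eventually turns positive), and obtains k from the cost function (6). All function names are hypothetical, and no claim is made that the output reproduces the tabulated values.

```python
from statistics import NormalDist


def unit_variance(theta, rho, n):
    """k * var(theta_hat) under the assumed delta-method variance."""
    return (theta ** 2 / (2 * (n - 1))
            + theta ** 4 * (1 + (n - 1) * rho) / (n * (1 - rho)))


def subjects_for_width(theta, rho, n, width, alpha=0.05):
    """Real-valued k giving an expected CI width of `width` on theta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 4 * z ** 2 * unit_variance(theta, rho, n) / width ** 2


def cost_polynomial(n, theta, rho, r):
    """Left-hand side of equation (8), with r = c1/c2."""
    rho_star = rho / (1 - rho)
    t2 = theta ** 2
    return (2 * t2 * rho_star * n ** 4
            - 4 * t2 * rho_star * n ** 3
            - (2 * t2 * r + r - 2 * t2 * rho_star + 1) * n ** 2
            + 4 * t2 * r * n
            - 2 * t2 * r)


def cost_optimal_replicates(theta, rho, r, tol=1e-10):
    """Root of equation (8) with n > 1, located by bisection."""
    lo, hi = 1.0, 2.0
    while cost_polynomial(hi, theta, rho, r) <= 0:
        hi *= 2  # expand until the polynomial changes sign
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cost_polynomial(mid, theta, rho, r) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def subjects_for_budget(total_cost, c0, c1, c2, n):
    """Solve C = c0 + k*c1 + n*k*c2 for k."""
    return (total_cost - c0) / (c1 + n * c2)
```

With w = 0.05, ρ = 0.3 and n = 2, `subjects_for_width` gives about 31.3 for θ = 0.10 and about 322.8 for θ = 0.30, in line with the k = 31 and k = 323 quoted in the fixed-width example. With r = 0 the bisection returns the closed-form optimum of the efficiency criterion.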
From Table 7, nopt for r = 1, ρ = 0.8, and θ = 0.10 is 6. From (9), kopt = (1600/15)/(1 + 6) ≈ 15. That is, we need 15 patients, with 6 measurements each, to minimize var(θ̂).

Estimating the WSCV for dichotomous responses

Assumptions
Consider a random sample of k subjects, each blindly evaluated n times by the same rater. We assume that all subject responses yij (where j = 1, 2, ..., n) are dichotomous and are conditionally independent with probabilities P(yij = 1) = pi (i = 1, 2, ..., k) and P(yij = 0) = 1 - pi. Thus, for fixed pi, the conditional distribution of the random variable [...]

A case of special interest to clinical epidemiologists is when n = 2, that is, a test-retest reliability study involving two readings per subject. For this case we investigate the sample size issue in the following section.

Results

The special case n = 2
Under the above set-up, the common correlation model (CCM) (see Mak [15], Bloch and Kraemer [16]) provides an appropriate description for the joint distribution of (Yij, Yil):

P11 = P(Yij = 1, Yil = 1) = π² + ρπ(1 - π),
P10 = P01 = P(Yij = 1, Yil = 0) = P(Yij = 0, Yil = 1) = (1 - ρ)π(1 - π), (10)
P00 = P(Yij = 0, Yil = 0) = (1 - π)² + ρπ(1 - π).

The data layout can be summarized as in Table 8.

Table 8. Data layout for a 2 × 2 binary classification

where k = k11 + k10 + k01 + k00. Since for the ith subject, the mean of two measurements is μi = [...] Therefore, an unbiased estimator of the population mean π is: [...] and the MSW is S² = [...] Since E([...]

var(υ̂) = k⁻¹(a1 + a2 + a3),

where a1 = υ²(1 - υ²π)(1 - π + υ²π²)/π, a2 = (1 - 2πυ²)²(1 - 2π²υ²)/(8π²), and a3 = υ²(1 - 2υ²π)(1 - υ²π). (12)

We suggest an approximate (1 - α)100% confidence interval as [...]

Example 3
To illustrate the methodology discussed in this section, we use data from an investigation of mammography by Powell et al. [17] concerning the equivalence of film-screen (FS) and digital images (DI). Two readings were made on the presence/absence (1/0) of malignancy by each rater on the same set of k = 58 patients.
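Returning briefly to the common correlation model in (10), the cell probabilities for a pair of readings can be sketched as below. The sketch only evaluates (10) for given π and ρ; the function name `ccm_cell_probabilities` and the dictionary keys are hypothetical.

```python
def ccm_cell_probabilities(pi, rho):
    """Cell probabilities of the common correlation model for n = 2.

    `pi` is the probability of a positive reading and `rho` the
    intra-class correlation between the two readings on a subject.
    """
    if not 0 < pi < 1:
        raise ValueError("pi must lie strictly between 0 and 1")

    base = pi * (1 - pi)
    p11 = pi ** 2 + rho * base          # both readings positive
    p10 = (1 - rho) * base              # discordant pair (either order)
    p00 = (1 - pi) ** 2 + rho * base    # both readings negative
    return {"11": p11, "10": p10, "01": p10, "00": p00}
```

The four probabilities sum to one, the marginal probability of a positive reading is P11 + P10 = π, and (P11 - π²)/[π(1 - π)] recovers ρ, which is a quick check that the parameterization matches (10).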
The data and the results of the analysis are summarized in Table 9. Both methods seem to have the same level of reliability in terms of the ICC and the WSCV. We note that the 95% confidence intervals are relatively wide, which may be because the sample size is not large enough.

Table 9. Data analysis from a mammography study by Powell et al. (1999).

Note that if the observed frequencies in the sample of k subjects are given as in Table 10, we can write a simpler estimator of the WSCV as [...]

Table 10. Data layout for the CCM

Sample size estimation

Methods
Increasing attention has been given recently to estimating sample size using a confidence interval rather than a significance testing approach (e.g. Gardner and Altman [19]). This is consistent with recent arguments made by many authors, including Goodman and Berlin [20], who state that "confidence intervals should play an important role when setting sample size" and that "the size of a confidence interval can be predicted in the planning stages of an experiment and this can be a great help in understanding the implications of different sample size choices". For comparative interest we also present the sample size requirements needed to test H0: υ = υ0, where υ0 is some hypothesized value of the WSCV υ.

Fixed width confidence interval (CI) on υ
Following the approach described in Section 3.2, the approximate width of a (1 - α)100% CI on υ is 2zα/2{var(υ̂)}^(1/2) [...]

Hypothesis testing procedure
Donner and Eliasziw (DE) [18] developed the goodness-of-fit (GOF) approach to efficiently construct a confidence interval and to estimate the sample size required to test a specific hypothesis on the intra-class kappa value. Here we use the GOF approach to estimate the sample size needed to ensure enrollment of a sufficient number of subjects in a reproducibility study.
This follows from the observation that, to test the null hypothesis H0: υ = υ0, [...] has a non-central chi-square distribution with one degree of freedom under the alternative hypothesis H1: υ = υ1, with non-centrality parameter [...] Following DE, it can be shown that the sample size needed to conduct a two-sided test with significance level α and power 1 - β is: [...] where z1-α/2 and z1-β are the critical values of the standard normal distribution corresponding to α and β.

As an example, suppose it is of interest to test H0: υ = 0.04 versus H1: υ = 0.1, where υ0 corresponds to high reliability. To ensure with 80 percent probability a significant result at α = 5% and π = 0.30 when υ1 = 0.10, we compute the required number of subjects from the above equation as k = 986; when π = 0.50, k = 355. For the sake of comparison with the fixed width CI procedure, suppose it is of interest to construct a 95% CI on υ with expected width w = 0.10. If the hypothesized value of υ is 0.10 and π = 0.30, then from (13) k = 1100, and if π = 0.5, then k = 400.

Discussion
The ICC has traditionally been used to assess the reliability of a measurement. QS considered the WSCV as an alternative measure of reproducibility for continuous scale measurements. It should be emphasized that our investigation has not allowed for forms of systematic error (e.g. measurement bias, or a trend that is unaccounted for in the model). A reviewer of this paper indicated that this is beyond our scope. In this paper we have dealt with the issue of sample size estimation for the WSCV from continuous and binary scale measurements, focusing on random measurement error, in the conventional way that reliability is usually discussed.

As in any reliability study, a crucial decision that a researcher faces at the design stage is the determination of the number of subjects, k, and the number of measurements per subject, n. We have discussed two alternative statistical techniques to determine an optimal allocation.
When we have prior knowledge of what constitutes an acceptable level of reproducibility, a hypothesis testing approach may be used. We used this approach in the case of a binary outcome variable, following the GOF approach proposed by DE. The application of the GOF approach was straightforward because the number of replicates, n = 2, was fixed. However, there are situations in which appropriate values of the reliability coefficient under the null and alternative hypotheses may be difficult to specify. An alternative to hypothesis testing is the efficient allocation of the sample, and the guidelines provided in this article for continuous scale measurements allow selection of the pair (n, k) that maximizes the precision of the estimated coefficient under cost constraints. We note that cost implications for dichotomous assessments are quite important, particularly when n is larger than two, and we intend to report on them in a future paper.

Finally, we note that in practice the optimal allocation must consist of integer values, and the net loss or gain in precision as a result of rounding the values of (n, k) was negligible. Ideally, one should adopt one of the available optimization algorithms, often referred to as integer programming models. These models are suited to optimal allocation problems, since the main concern is to find the best solution(s) in a well-defined discrete space.

Conclusion
The WSCV is a useful index of measurement reliability. Investigators may design reliability studies using either efficiency or cost considerations. For continuous measurements, optimal allocation of the sample may be achieved with as few as two replications per subject. For dichotomous data, when each subject is measured twice, investigators may use either fixed-length confidence interval or power considerations in estimating the sample size. Both methods produce comparable results.

Competing interests
The author(s) declare that they have no competing interests.
The authors contributed equally to this work.

Acknowledgements
Drs. M. Shoukri and N. ElKum acknowledge the support of the Research Centre of the King Faisal Specialist Hospital. Dr. Walter acknowledges the support of NSERC Canada.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/6/24/prepub























































