Abstract
Background
In this paper we propose the use of the within-subject coefficient of variation as an index of a measurement's reliability. For continuous variables, and based on its maximum likelihood estimator, we derive a variance-stabilizing transformation and discuss confidence interval construction within the framework of a one-way random effects model. We investigate sample size requirements for the within-subject coefficient of variation for continuous and binary variables.
Methods
We investigate the validity of the approximate normal confidence interval by Monte Carlo simulations. In designing a reliability study, a crucial issue is the balance between the number of subjects to be recruited and the number of repeated measurements per subject. We discuss efficiency of estimation and cost considerations for the optimal allocation of the sample resources. The approach is illustrated by an example on Magnetic Resonance Imaging (MRI). We also discuss the issue of sample size estimation for dichotomous responses with two examples.
Results
For the continuous variable, we found that the variance-stabilizing transformation improves the asymptotic coverage probabilities of the confidence interval on the within-subject coefficient of variation. For the binary variable, the maximum likelihood estimation and the sample size estimation based on a pre-specified width of the confidence interval are novel contributions to the literature.
Conclusion
Using the sample size formulas, we hope to help clinical epidemiologists and practicing statisticians to efficiently design reliability studies using the within-subject coefficient of variation, whether the variable of interest is continuous or binary.
Background
Measurement errors can seriously affect statistical analysis and interpretation; it is therefore important to assess the magnitude of such errors by calculating a reliability coefficient and assessing its precision. In medical diagnosis, for instance, clinicians have become cognizant of the paramount importance of obtaining accurate measurements to ensure safe and efficient delivery of care to their patients. Experiments designed to measure the validity and precision of instruments used in biomedical and epidemiological research are ubiquitous. For example, Ashton et al. [1] demonstrated the importance of evaluating the reliability of manual and automated methods for quantifying total white matter lesion burden in multiple sclerosis patients, comparing the coefficients of variation of three methods. In oncology, Schwartz et al. [2] used the coefficient of variation to evaluate the repeatability of bidimensional computed tomography measurements made with three techniques: hand-held calipers on film, electronic calipers on a workstation, and an auto-contour technique on a workstation. The coefficient of variation of the auto-contour technique differed significantly from those of the other two techniques. The coefficient of variation is also often used to compare variables measured on different scales. In the social sciences, for example, when the intent is to compare the variability of school performance with the variability of household income, a comparison of standard deviations makes no sense because income and school performance are measured on different scales. The correct comparison may be based on the coefficient of variation because it adjusts for scale. Other applications of the coefficient of variation are given in Tian [3].
Scientists have developed several indices to assess the reliability and reproducibility of quantitative measurements. The intraclass correlation coefficient (ICC), the proportion of the between-subject variance to the total variance, has been widely used as an index of measurement reliability. For a comprehensive review of the ICC and its applications, we refer the reader to Fleiss [4], Dunn [5] and Shoukri [6]. One criticism of the ICC is that its value depends on the population from which the study subjects have been obtained, and this may lead to difficulties in comparing results from different studies. Accordingly, Quan and Shih [7] (QS) considered the within-subject coefficient of variation (WSCV) as an alternative to the ICC for assessing measurement reproducibility or test-retest reliability. Because repeated observations are made on each subject, they used the one-way random effects model (REM) to describe the data. Although the use of the WSCV as a measure of reproducibility is long-standing, the issue of sample size determination has not been adequately investigated. Sample size estimation is one of the most important issues in the design of any study that uses inferential statistics.
When the ICC is used as the index of reliability, Donner and Eliasziw [8] provided contours of exact power for selected numbers of subjects (k) and numbers of replicates (n). These power results were then used to identify optimal designs that minimize the study costs. Assuming a constant number of replicates per subject, Walter et al. [9] considered an approximation to determine the required number of subjects to achieve fixed levels of power. Bonett [10] calculated the sample size required to achieve a prescribed expected width for the confidence interval on the ICC. Shoukri et al. [11] derived the values of k and n that allocate the sample resources optimally and minimize the variance of the estimated ICC under cost constraints. The cost structure that was considered was general and followed the general guidelines identified by Flynn et al. [12].
In this paper, we derive the optimal allocation for the number of subjects and the number of repeated measurements needed to minimize the variance of the maximum likelihood estimator (MLE) of the WSCV. In Section 2 we present the random effects model, the definition of the WSCV, and the asymptotic distribution of its MLE for continuous data. In Section 3, we use the calculus of optimization to find the optimal combinations (n, k) that minimize the variance of the MLE of WSCV for normally distributed variables. The use of the WSCV for dichotomous data has never been investigated before, and a novel contribution in this paper is the estimation of WSCV for binary outcome measurements, and sample size requirements, with emphasis on the case of two ratings per subject (i.e. n = 2). We devote Section 4 to the binary data, and general discussion is presented in Section 5.
Methods
Estimating the WSCV for continuous variables
Assumptions
Consider a random sample of k subjects with n repeated measurements of a continuous variable Y, and denote by Y_{ij} the j^{th} reading made on the i^{th} subject under identical experimental conditions (i = 1,2,...,k; j = 1,2,...,n). In a test-retest scenario, and under the assumption of no rater effect (i.e. the readings within a specific subject are exchangeable), Y_{ij} denotes the reading of the j^{th} trial made on the i^{th} subject. A useful model for analyzing such data is given by:
Y_{ij }= μ + s_{i }+ e_{ij } i = 1,2,...k; j = 1,2,...n (1)
where μ is the mean of Y_{ij}, the random subject effects s_{i} are normally distributed with mean 0 and variance σ_{s}^{2}, i.e. N(0, σ_{s}^{2}), the measurement errors e_{ij} are N(0, σ_{e}^{2}), and the s_{i} and e_{ij} terms are mutually independent. We assume that the subjects are randomly drawn from some population of interest.
Quan and Shih [7] defined the WSCV parameter in the above model as:
θ = σ_{e}/μ (2)
Under model (1), it is assumed that the within-subject variance σ_{e}^{2} is the same for all subjects.
Maximum likelihood estimator
Under the above setup, the log-likelihood function based on the k independent response vectors has the form

l(μ, σ^{2}, ρ) = −(nk/2)log(2πσ^{2}) − (k(n − 1)/2)log(1 − ρ) − (k/2)log(1 + (n − 1)ρ) − (1/2σ^{2})[SSW/(1 − ρ) + n Σ_{i}(Ȳ_{i·} − μ)^{2}/(1 + (n − 1)ρ)]

where SSW = Σ_{i}Σ_{j}(Y_{ij} − Ȳ_{i·})^{2} and Ȳ_{i·} = Σ_{j}Y_{ij}/n.
Define σ^{2} = σ_{s}^{2} + σ_{e}^{2} and ρ = σ_{s}^{2}/(σ_{s}^{2} + σ_{e}^{2}), the intraclass correlation coefficient, so that σ_{s}^{2} = ρσ^{2} and σ_{e}^{2} = σ^{2}(1 − ρ). Because the design is balanced, the maximum likelihood estimators (MLEs) for μ, σ_{e}^{2}, and ρ are given in closed form by μ̂ = Ȳ = Σ_{i}Σ_{j}Y_{ij}/(nk), σ̂_{e}^{2} = MSW, and the estimated ICC is ρ̂ = (MSB − MSW)/(MSB + (n − 1)MSW), where MSW = Σ_{i}Σ_{j}(Y_{ij} − Ȳ_{i·})^{2}/(k(n − 1)) and MSB = n Σ_{i}(Ȳ_{i·} − Ȳ)^{2}/(k − 1) are, respectively, the within-subject and between-subject mean squares obtained from the usual one-way ANOVA table, and the WSCV is estimated by θ̂ = (MSW)^{1/2}/Ȳ. Note that the MSB does not exist for k = 1, which means that to obtain a sensible estimate of ρ as an index of reliability, the study should include more than one subject.
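In a balanced design these closed-form estimates are straightforward to compute. The sketch below (our illustration, not the authors' code; the function name and data layout are our own) computes MSW, MSB, the estimated ICC, and the plug-in WSCV estimate θ̂ = √MSW/Ȳ from a k × n data matrix:

```python
from math import sqrt

def wscv_anova(data):
    """One-way ANOVA estimates for model (1).

    data: list of k lists, each holding the n repeated measurements
    made on one subject.
    Returns (MSW, MSB, rho_hat, theta_hat)."""
    k = len(data)
    n = len(data[0])
    subject_means = [sum(row) / n for row in data]
    grand_mean = sum(subject_means) / k
    # Within- and between-subject mean squares from the usual one-way ANOVA
    msw = sum((y - m) ** 2
              for row, m in zip(data, subject_means) for y in row) / (k * (n - 1))
    msb = n * sum((m - grand_mean) ** 2 for m in subject_means) / (k - 1)
    rho = (msb - msw) / (msb + (n - 1) * msw)   # intraclass correlation
    theta = sqrt(msw) / grand_mean              # within-subject CV, eq. (2)
    return msw, msb, rho, theta

msw, msb, rho, theta = wscv_anova([[10, 12], [9, 9], [11, 13]])
```

For this toy data set the subject means are 11, 9, and 12, so MSW = 4/3, MSB = 14/3, and ρ̂ = 5/9.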
Results
The asymptotic variance-covariance matrix of the MLEs is obtained by inverting Fisher's information matrix. The large-sample variance of θ̂ can be obtained using the delta method (see Kendall and Stuart, vol. 1 [13]) and was shown by Quan and Shih [7] to be:

var(θ̂) = A(ρ,n,θ)/k, where A(ρ,n,θ) = θ^{2}[1/(2(n − 1)) + θ^{2}(1 + (n − 1)ρ)/(n(1 − ρ))]. (3)
To construct an approximate confidence interval on , it is assumed that for large k, (  θ) follows a normal distribution with mean 0 and variance A(ρ,n,θ). An approximate 100(1  α)% confidence interval on θ can be given as , where Z_{α/2 }is the 100(1  α/2)% cut off point of the standard normal distribution.
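As a sketch, the Wald-type interval can be computed directly. The expression for A(ρ,n,θ) below is our reconstruction of the large-sample variance (it is consistent with the downstream allocation formulas) and should be checked against Quan and Shih [7] before serious use:

```python
from math import sqrt
from statistics import NormalDist

def wald_ci_wscv(theta_hat, rho_hat, n, k, alpha=0.05):
    """Approximate 100(1 - alpha)% Wald CI for the within-subject CV.

    Uses the large-sample variance A(rho, n, theta)/k as reconstructed
    in the text (an assumption, not validated software)."""
    a = theta_hat ** 2 * (1 / (2 * (n - 1))
        + theta_hat ** 2 * (1 + (n - 1) * rho_hat) / (n * (1 - rho_hat)))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    half = z * sqrt(a / k)
    return theta_hat - half, theta_hat + half

lo, hi = wald_ci_wscv(theta_hat=0.10, rho_hat=0.6, n=2, k=50)
```

The interval is symmetric about θ̂; for θ̂ = 0.10, ρ̂ = 0.6, n = 2 and k = 50 it is roughly (0.080, 0.120).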
Because the variance of θ̂ depends on the true parameter value θ itself, we found that the asymptotic coverage deviates from its nominal level for some values of θ. To improve the coverage probability, we suggest a variance-stabilizing transformation that removes the dependence of var(θ̂) on θ.
Variance Stabilizing Transformation (VST)
To improve the estimated coverage proportion, we propose a variance-stabilizing transformation g (see Kendall and Stuart, vol. 1, page 541 [13]), where g(θ) = ∫{var(θ̂)}^{−1/2} dθ. With θ defined as in equation (2) and A(ρ,n,θ) as in equation (3), carrying out the integration gives, up to an additive constant,

f(θ,n,ρ) = {2(n − 1)}^{1/2} log[θ/{a^{1/2} + (a + bθ^{2})^{1/2}}]

where a = 1/(2(n − 1)), b = (1 + nρ*)/n, and ρ* = ρ/(1 − ρ). Assuming the function g is bounded and differentiable, f(θ̂,n,ρ) is asymptotically normally distributed with mean f(θ,n,ρ) and variance 1/k. Therefore, we can construct 100(1 − α)% confidence limits on θ based on this transformation. The upper and lower confidence bounds on θ are respectively given by f^{−1}(ξ_{1}) and f^{−1}(ξ_{2}),
where ξ_{1} = f(θ̂,n,ρ) + z_{α/2}/k^{1/2}, ξ_{2} = f(θ̂,n,ρ) − z_{α/2}/k^{1/2}, and f^{−1}(ξ) = 2a^{1/2}u/(1 − bu^{2}) with u = exp[ξ/{2(n − 1)}^{1/2}].
Note that the limits of the interval depend on the unknown value of the intraclass correlation, which can be replaced by its MLE as defined in section 2.1.
To examine the finite-sample behavior of the VST-based confidence interval estimator, a Monte Carlo study was conducted under model (1) using the S-Plus program. The values of ρ were 0.3, 0.4, 0.6, 0.7, and 0.8; μ = 10; and θ = 4%, 10% or 20%. The number of subjects (k) was 12, 25, 50, or 75, and the number of replicates (n) was 2, 3, or 5. The number of repetitions for each simulation was 1000. Tables 1, 2, and 3 show the coverage proportions of the 95% nominal level confidence interval on the WSCV. The estimated coverage proportions were close to the 95% nominal level.
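The design of such a simulation can be sketched as follows. This Python analogue (the paper used S-Plus) checks the coverage of the untransformed Wald interval under model (1), with the variance expression as reconstructed above; all function and parameter names are ours:

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_coverage(mu=10.0, theta=0.10, rho=0.6, k=50, n=2,
                      reps=1000, alpha=0.05, seed=1):
    """Monte Carlo coverage of the Wald interval under model (1).

    sigma_e = theta * mu; sigma_s is chosen so that the ICC equals rho."""
    rng = random.Random(seed)
    sigma_e = theta * mu
    sigma_s = sigma_e * sqrt(rho / (1 - rho))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    cover = 0
    for _ in range(reps):
        # generate Y_ij = mu + s_i + e_ij for k subjects, n replicates
        data = []
        for _i in range(k):
            s = rng.gauss(0, sigma_s)
            data.append([mu + s + rng.gauss(0, sigma_e) for _j in range(n)])
        means = [sum(row) / n for row in data]
        gm = sum(means) / k
        msw = sum((y - m) ** 2
                  for row, m in zip(data, means) for y in row) / (k * (n - 1))
        msb = n * sum((m - gm) ** 2 for m in means) / (k - 1)
        # clamp rho_hat away from the boundary to keep the variance finite
        rho_hat = max(0.0, min((msb - msw) / (msb + (n - 1) * msw), 0.999))
        th = sqrt(msw) / gm
        var = th ** 2 * (1 / (2 * (n - 1))
              + th ** 2 * (1 + (n - 1) * rho_hat) / (n * (1 - rho_hat))) / k
        half = z * sqrt(var)
        cover += (th - half <= theta <= th + half)
    return cover / reps
```

With the VST interval substituted for the Wald interval, the same loop reproduces the design of Tables 1 to 3.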
Table 1. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.04)
Table 2. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.10)
Table 3. Estimated coverage probabilities under the VST. The nominal level is 95%. (θ = 0.20)
Table 4. Results for 10 replicates on each of three patients' total lesion burden. Values are volumes in cubic centimeters.
Example 1
Accurate and reproducible quantification of brain lesion count and volume in multiple sclerosis (MS) patients using magnetic resonance imaging (MRI) is a vital tool for evaluating disease progression and patient response to therapy. Current standard methods for obtaining these data are largely manual and subjective, and are therefore error-prone and subject to inter- and intra-operator variability. Therefore, there is a need for a rapid automated lesion quantification method. Ashton et al. [1] compared manual measurements and an automated technique known as geometrically constrained region growth (GEORG) for measuring the brain lesion volume of 3 MS patients, each measured 10 times by a single operator with each method. The data are presented in Table 4.
Table 5. Summary analysis of the data in Table 4 and 95% confidence intervals
Based on the guidelines for levels of reliability provided by Fleiss [4], an ICC above 80% indicates excellent reliability, and from Table 5 both methods cross this threshold. However, based on the WSCV values, the manual method is clearly less reproducible than the automated method (GEORG is about 5 times more reproducible than the manual method). This example demonstrates the usefulness of the WSCV over the ICC as a measure of reproducibility. Ideally, one should construct a formal test of the significance of the difference between two correlated within-subject coefficients of variation. There are several competing methods for constructing such a test (e.g. likelihood ratio, Wald, and score tests), but this issue is quite involved, and we intend to report our findings in a future publication.
Sample size estimation
In the following development we address the second objective of this paper. We assume that the investigator wishes to choose the number of replicates per subject, n, so that the variance of the estimate of θ is minimized, given that the total number of measurements is fixed a priori at N = nk.
Efficiency criterion
For a fixed total number of measurements N = nk, equation (3) gives:

var(θ̂) = (θ^{2}/N)[n/(2(n − 1)) + θ^{2}(1 + nρ*)], (4)

where ρ* = ρ/(1 − ρ).
A necessary condition for var(θ̂) to have a unique minimum is that ∂var(θ̂)/∂n = 0; this and the additional condition ∂^{2}var(θ̂)/∂n^{2} > 0 are both satisfied so long as 0 < ρ < 1. Differentiating (4) with respect to n, equating to zero, and solving for n, we obtain

n* = 1 + (1/θ){(1 − ρ)/(2ρ)}^{1/2}. (5)
The required number of subjects is thus k* = N/n*.
Table 6 shows a few optimal allocations of (n, k) for ρ = 0.6, 0.7, and 0.8 and θ = 0.1, 0.2, 0.3 and 0.4, when N = 24.
Note that in practice only integer values of (n, k) can be used, and because N = nk is fixed a priori, we first round the optimum value of n to the nearest integer and then round k = N/n to the nearest integer. The values of var(θ̂) at the rounded optimal allocations for different values of ρ, θ and n showed that the net loss or gain in efficiency due to rounding is negligible. It is clear that to estimate the WSCV efficiently for large values of θ, we need a smaller number of replicates and a larger number of subjects.
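The allocation rule can be sketched in a few lines, using the optimum n* = 1 + (1/θ){(1 − ρ)/(2ρ)}^{1/2} implied by the efficiency criterion (our reconstruction of equation (5)) together with the rounding convention just described:

```python
from math import sqrt

def optimal_allocation(theta, rho, N):
    """Efficiency-based allocation for a fixed total N = n*k.

    Returns the continuous optimum n_star and the rounded (n, k);
    both n and k are kept at 2 or more so the ANOVA estimates exist."""
    n_star = 1 + sqrt((1 - rho) / (2 * rho)) / theta
    n = max(2, round(n_star))
    k = max(2, round(N / n))
    return n_star, n, k

n_star, n, k = optimal_allocation(theta=0.10, rho=0.6, N=24)
```

For θ = 0.10, ρ = 0.6 and N = 24 this gives n* ≈ 6.77, so n = 7 replicates on k = 3 subjects.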
Fixed width confidence interval approach
Bonett [10] discussed the issue of sample size requirements that achieve a prespecified expected width for a confidence interval about the ICC. This approach is useful in planning a reliability study in which the focus is on estimation rather than hypothesis testing. He demonstrated that the effect of an inaccurate planning value of the ICC is more serious in hypothesis testing applications. Shoukri et al. [11] argued that the hypothesis testing approach might not be appropriate when planning a reproducibility study, because in most cases values of the coefficient under the null and alternative hypotheses may be difficult to specify. An alternative approach is to focus on the width of the CI for θ. Since the approximate width of a 100(1 − α)% CI on θ is 2z_{α/2}{var(θ̂)}^{1/2}, an approximate sample size that yields a 100(1 − α)% CI for θ with a desired width w is obtained by setting w = 2z_{α/2}{var(θ̂)}^{1/2} and then solving for k:

k = 4z_{α/2}^{2}A(ρ,n,θ)/w^{2}

where A(ρ,n,θ) is defined in equation (3).
We observe that, for fixed n and θ, larger values of ρ require a larger number of subjects to satisfy the criterion. As an example, suppose that it is of interest to construct a 95% CI on θ with expected width w = 0.05, with ρ = 0.3 and an affordable number of replicates n = 2. If the hypothesized value of θ is 0.10, then k = 31; if θ is 0.3 (i.e. lower reliability), then k = 323.
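A minimal sketch of the fixed-width calculation, assuming the variance A(ρ,n,θ) as reconstructed above; it reproduces the k = 31 and k = 323 figures quoted in the example:

```python
from statistics import NormalDist

def k_for_width(theta, rho, n, w, alpha=0.05):
    """Number of subjects giving a Wald CI of width w for the WSCV.

    Implements k = 4 z^2 A(rho, n, theta) / w^2 with A as reconstructed
    in the text (an assumption to verify against the source)."""
    a = theta ** 2 * (1 / (2 * (n - 1))
        + theta ** 2 * (1 + (n - 1) * rho) / (n * (1 - rho)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return round(4 * z ** 2 * a / w ** 2)   # rounded, as in the worked example

k_for_width(theta=0.10, rho=0.3, n=2, w=0.05)   # → 31
k_for_width(theta=0.30, rho=0.3, n=2, w=0.05)   # → 323
```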
Cost criterion
Funding constraints will often determine the cost of recruiting subjects for a reliability study. Although too small a sample may lead to a study that produces an imprecise estimate of the reproducibility coefficient, too large a sample may result in a waste of resources. Thus, an important decision in a typical reliability study is to balance the cost of recruiting subjects with the need for a precise estimate of the parameter summarizing reliability.
In this section, we determine the combinations (n, k) that minimize the variance of θ̂ subject to cost constraints. Constructing a flexible cost function starts with identifying sampling and overhead costs. The sampling cost depends primarily on the size of the sample and includes costs for data collection, compensation to volunteers, management, and evaluation. Overhead costs, on the other hand, are independent of the sample size. Following Sukhatme et al. [14], we assume that the overall cost function is given by:
C = c_{0 }+ kc_{1 }+ nkc_{2 } (6)
where c_{0} is the fixed cost, c_{1} the cost of recruiting a single subject, and c_{2} the cost of making one observation. Using the method of Lagrange multipliers and following Shoukri et al. [11], we write the objective function Ψ in the form
Ψ = var(θ̂) + λ(C − c_{0} − kc_{1} − nkc_{2}) (7)
where var(θ̂) is given by equation (3) and λ is the Lagrange multiplier. Differentiating Ψ with respect to n, k and λ, and equating the derivatives to zero, we obtain
2θ^{2}ρ*n^{4} − 4θ^{2}ρ*n^{3} − (2θ^{2}r + r − 2θ^{2}ρ* + 1)n^{2} + 4θ^{2}rn − 2θ^{2}r = 0 (8)
where r = c_{1}/c_{2}, and ρ* = ρ/(1  ρ)
Although an explicit solution to (8) is available, the resulting expression is complicated and does not provide useful insight. The 4^{th}-degree polynomial on the left side of (8) has two imaginary roots, one negative root, and one admissible (positive) root for n. Table 7 summarizes the results of the optimization procedure, giving the optimal n for various values of θ, ρ, and r, noting that the corresponding number of subjects follows from the cost constraint as

k_{opt} = (C − c_{0})/(c_{1} + n_{opt}c_{2}). (9)
Results
From Table 7, it is apparent that when r = c_{1}/c_{2} increases, the required number of replicates per subject (n) increases, because recruiting a subject (c_{1}) becomes relatively more expensive than making a single observation (c_{2}). When r is fixed, an increase in ρ results in a decline in the required value of n and, accordingly, an increase in k. An increase in θ also results in a decrease in n. The general conclusion is that it is sensible to decrease the number of units associated with the higher cost, while increasing those with the lower cost.
Table 6. Optimal combinations (n_{opt}, k_{opt}) that minimize the variance of θ̂ for N = 24.

Table 7. Optimal replications (rounded to the nearest integer) n that minimize the variance of θ̂ subject to cost constraints.

We note that by setting c_{1} = 0 (i.e. r = 0) in equation (8), we obtain n* = 1 + (1/θ){(1 − ρ)/(2ρ)}^{1/2}, as in equation (5). The situation c_{1} = 0 is quite plausible, at least approximately, if the major cost lies in actually making the observations (e.g. expensive equipment, or the cost of interviews with free volunteer subjects). This means that a special cost structure is implied by the optimal allocation procedure discussed earlier.
Example 2
To assess the accuracy of Doppler echocardiography (DE) in determining the aortic valve area (AVA) in a prospective evaluation of patients with aortic stenosis, an investigator wishes to demonstrate a high degree of reliability (ρ = 0.80) in estimating the AVA using the "velocity integral method", with a planned value of the WSCV of 0.10. Suppose that the total cost of conducting the study is fixed at $1600.0, and that the overhead fixed cost c_{0} is absorbed by the hospital. Moreover, we assume that the travel cost is $200.0 and that the administrative cost of using the DE is $200.0 per visit, so that r = 1. From Table 7, n_{opt} for r = 1, ρ = 0.8, and θ = 0.10 is 6. From (9), k_{opt} = (1600/15)/(1 + 6) = 15. That is, we need 15 patients, with 6 measurements each, to minimize var(θ̂) subject to the given cost.
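Rather than solving quartic (8) directly, one can search over integer n. The objective below, (r + n)[1/(2(n − 1)) + θ²/n + θ²ρ*], is the cost-weighted variance whose stationary point is equivalent to equation (8); this equivalence is our derivation and should be verified before use:

```python
def cost_optimal_n(theta, rho, r, n_max=50):
    """Integer n minimizing var(theta_hat) under the cost constraint (6).

    r = c1/c2 is the ratio of recruitment cost to per-observation cost.
    Minimizes the cost-weighted variance, which is proportional to
    (r + n) * [1/(2(n-1)) + theta^2/n + theta^2 * rho_star]."""
    rho_star = rho / (1 - rho)

    def h(n):
        return (r + n) * (1 / (2 * (n - 1))
                          + theta ** 2 / n + theta ** 2 * rho_star)

    return min(range(2, n_max + 1), key=h)

cost_optimal_n(theta=0.10, rho=0.8, r=1)   # → 6, matching Example 2
```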
Estimating the WSCV for dichotomous responses
Assumptions
Consider a random sample of k subjects, each blindly evaluated n times by the same rater. We assume that the subject responses y_{ij} (j = 1,2,...,n) are dichotomous and conditionally independent, with probabilities P(y_{ij} = 1) = p_{i} (i = 1,2,...,k) and P(y_{ij} = 0) = 1 − p_{i}. Thus, for fixed p_{i}, the conditional distribution of the random variable y_{i·} = Σ_{j}y_{ij} follows a binomial distribution with parameters n and p_{i}. To account for the variation of response probabilities between subjects, as considered by Mak [15], we assume further that the probabilities p_{i} are independently and identically distributed as a beta distribution, Beta(α,β), with mean π = α/(α + β) and variance π(1 − π)ρ. Given these assumptions, one can show that the correlation between y_{ij} and y_{il} (j ≠ l) is in fact ρ. Define ȳ_{i·} = y_{i·}/n, the within-subject variance estimate s_{i}^{2} = nȳ_{i·}(1 − ȳ_{i·})/(n − 1), the pooled within-subject mean square S^{2} = Σ_{i}s_{i}^{2}/k, and the overall mean π̂ = Σ_{i}ȳ_{i·}/k. We therefore estimate the WSCV for binary assessments by υ̂ = S/π̂.
A case of special interest to clinical epidemiologists is when n = 2, or a test retest reliability study involving two readings per subject. For this case we investigate the sample size issue in the following section.
Results
The special case n = 2
Under the above setup, the common correlation model (CCM) (see Mak [15], Bloch and Kraemer [16]) provides an appropriate description for the joint distribution of (Y_{ij}, Y_{il}):
P_{11 }= P(Y_{ij }= 1,Y_{il }= 1) = π^{2 }+ ρπ(1  π).
P_{10 }= P_{01 }= P(Y_{ij }= 1,Y_{il }= 0) = P(Y_{ij }= 0,Y_{il }= 1) = (1  ρ)π(1  π). (10)
P_{00 }= P(Y_{ij }= 0,Y_{il }= 0) = (1  π)^{2 }+ ρπ(1  π).
The data layout can be summarized as in Table 8.
Table 8. Data layout for a 2 × 2 binary classification
where k = k_{11} + k_{10} + k_{01} + k_{00}.
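The cell probabilities in (10) fully specify the joint distribution, so paired binary ratings are easy to simulate. The sketch below (our illustration, with our own function names) draws k pairs from the CCM and tallies the 2 × 2 counts:

```python
import random

def ccm_cell_probs(pi, rho):
    """Cell probabilities of the common correlation model, eq. (10)."""
    p11 = pi ** 2 + rho * pi * (1 - pi)
    p10 = (1 - rho) * pi * (1 - pi)          # = p01 by symmetry
    p00 = (1 - pi) ** 2 + rho * pi * (1 - pi)
    return p11, p10, p10, p00

def simulate_pairs(pi, rho, k, seed=1):
    """Draw k pairs (Y_i1, Y_i2) from the CCM and return the 2x2 counts."""
    rng = random.Random(seed)
    p11, p10, p01, p00 = ccm_cell_probs(pi, rho)
    counts = {"11": 0, "10": 0, "01": 0, "00": 0}
    for _ in range(k):
        u = rng.random()
        if u < p11:
            counts["11"] += 1
        elif u < p11 + p10:
            counts["10"] += 1
        elif u < p11 + p10 + p01:
            counts["01"] += 1
        else:
            counts["00"] += 1
    return counts
```

Such simulated counts are useful for checking the delta-method variance in (12) before applying it to real data.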
Since for the i^{th} subject the mean of the two measurements is μ_{i} = (Y_{i1} + Y_{i2})/2, summing over all subjects gives Σ_{i}μ_{i} = k_{11} + (k_{10} + k_{01})/2.
Therefore, an unbiased estimator of the population mean π is:

π̂ = (2k_{11} + k_{10} + k_{01})/(2k)
and the MSW is S^{2} = (k_{10} + k_{01})/(2k). Hence the sample coefficient of variation for binary responses is:

υ̂ = S/π̂ = {(k_{10} + k_{01})/(2k)}^{1/2}/π̂.
Since E(π̂) = π and E(S^{2}) = π(1 − π)(1 − ρ), we define υ = {π(1 − π)(1 − ρ)}^{1/2}/π as the population coefficient of variation for a dichotomous outcome, with its MLE given by υ̂ = S/π̂. The CCM can be reparameterized by substituting ρ = 1 − υ^{2}π/(1 − π) for the reliability coefficient ρ. Applying the delta method, the first-order approximation to the variance of υ̂ is shown to be:
var(υ̂) = k^{−1}(a_{1} + a_{2} + a_{3}),

where a_{1} = υ^{2}(1 − υ^{2}π)(1 − π + υ^{2}π^{2})/π,

a_{2} = (1 − 2πυ^{2})^{2}(1 − 2π^{2}υ^{2})/(8π^{2}), and

a_{3} = υ^{2}(1 − 2υ^{2}π)(1 − υ^{2}π). (12)
We suggest an approximate 100(1 − α)% confidence interval given by υ̂ ± z_{α/2}{var(υ̂)}^{1/2}.
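Putting the pieces together, the following is a sketch of the point estimate and interval for paired binary ratings. The moment formulas for π̂ and S² follow the development above, and the a₁ to a₃ terms are transcribed from (12); treat this as an illustration, not validated software:

```python
from math import sqrt
from statistics import NormalDist

def wscv_binary(k11, k10, k01, k00, alpha=0.05):
    """WSCV estimate and approximate CI for paired (n = 2) binary ratings.

    k11, k10, k01, k00 are the 2x2 table counts over k subjects."""
    k = k11 + k10 + k01 + k00
    pi = (2 * k11 + k10 + k01) / (2 * k)     # unbiased estimate of pi
    s2 = (k10 + k01) / (2 * k)               # within-subject mean square
    v = sqrt(s2) / pi                        # estimated within-subject CV
    # delta-method variance components, eq. (12), with (v, pi) plugged in
    a1 = v ** 2 * (1 - v ** 2 * pi) * (1 - pi + v ** 2 * pi ** 2) / pi
    a2 = (1 - 2 * pi * v ** 2) ** 2 * (1 - 2 * pi ** 2 * v ** 2) / (8 * pi ** 2)
    a3 = v ** 2 * (1 - 2 * v ** 2 * pi) * (1 - v ** 2 * pi)
    se = sqrt((a1 + a2 + a3) / k)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return v, (v - z * se, v + z * se)

v, ci = wscv_binary(k11=20, k10=5, k01=5, k00=28)
```

For these hypothetical counts (k = 58), π̂ = 50/116 and υ̂ = √(10/116)/π̂ ≈ 0.681.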
Example 3
To illustrate the methodology discussed in this section, we use data from an investigation of mammography by Powell et al. [17] concerning the equivalence of film-screen (FS) and digital images (DI). Two readings of the presence/absence (1/0) of malignancy were made by each rater on the same set of k = 58 patients. The data and the results of the analysis are summarized in Table 9. Both methods appear to have the same level of reliability in terms of the ICC and the WSCV. We note that the 95% confidence interval is relatively wide, which may be due to the modest sample size.
Table 9. Data analysis from a mammography study by Powell et al. (1999).
Note that if the observed frequencies in the sample of k subjects are given as in Table 10, we can write a simpler estimator of the WSCV as υ̂ = (2kn_{2})^{1/2}/(n_{2} + 2n_{1}). To construct an estimate of the confidence interval on υ, the MLEs of ρ and π should be substituted in equation (12), where ρ̂ = 1 − n_{2}/{2kπ̂(1 − π̂)} from Donner and Eliasziw [18].
Table 10. Data Layout for the CCM
Sample size estimation
Methods
There has been increasing attention given recently to estimation of sample size using a confidence interval rather than a significance testing approach (e.g. Gardner and Altman [19]). This is consistent with recent arguments made by many authors, including Goodman and Berlin [20] who state that "confidence intervals should play an important role when setting sample size" and that " the size of a confidence interval can be predicted in the planning stages of an experiment and this can be a great help in understanding the implications of different sample size choices".
For comparative purposes we also present the sample size required to test H_{0}: υ = υ_{0}, where υ_{0} is some hypothesized value of the WSCV υ.
Fixed width confidence interval (CI) on υ
Following the approach described in Section 3.2, the approximate width of a 100(1 − α)% CI on υ is 2z_{α/2}{var(υ̂)}^{1/2}. An approximate sample size that yields a 100(1 − α)% CI for υ with a desired width w is obtained by setting w = 2z_{α/2}{var(υ̂)}^{1/2} and then solving for k:

k = 4z_{α/2}^{2}(a_{1} + a_{2} + a_{3})/w^{2}. (13)
Hypothesis testing procedure
Donner and Eliasziw (DE) [18] developed a goodness-of-fit (GOF) approach to construct confidence intervals efficiently and to estimate the sample size required to test a specific hypothesis about the intraclass kappa. Here we use the GOF approach to estimate the sample size needed to ensure enrollment of a sufficient number of subjects in a reproducibility study. This follows from the observation that the GOF statistic for testing the null hypothesis H_{0}: υ = υ_{0} has a noncentral chi-square distribution with one degree of freedom under the alternative hypothesis H_{1}: υ = υ_{1}, with a noncentrality parameter proportional to k.
Following DE, it can be shown that the sample size needed to conduct a two-sided test with significance level α and power 1 − β is obtained by equating the noncentrality parameter to (z_{1−α/2} + z_{1−β})^{2} and solving for k, where z_{1−α/2} and z_{1−β} are the quantiles of the standard normal distribution corresponding to α and β.
As an example, suppose it is of interest to test H_{0}: υ = 0.04 versus H_{1}: υ = 0.1, where υ_{0} corresponds to high reliability. To ensure with 80 per cent probability a significant result at α = 5% and π = 0.30 when υ_{1} = 0.10, we compute the required number of subjects from the above equation as k = 986; when π = 0.50, k = 355. For comparison with the fixed-width CI procedure, suppose it is of interest to construct a 95% CI on υ with expected width w = 0.10. If the hypothesized value of υ is 0.10 and π = 0.30, then from (13) k = 1100; if π = 0.5, then k = 400.
Discussion
The ICC has traditionally been used to assess the reliability of a measurement. QS considered the WSCV as an alternative measure of reproducibility for continuous-scale measurements. It should be emphasized that our investigation has not allowed for forms of systematic error (e.g. measurement bias, or a trend that is unaccounted for in the model); as a reviewer of this paper indicated, this is beyond our scope. In this paper we have dealt with the issue of sample size estimation for the WSCV from continuous and binary scale measurements, focusing on random measurement error in the conventional way that reliability is usually discussed.
As in any reliability study, a crucial decision that a researcher faces at the design stage is the determination of the number of subjects, k, and the number of measurements per subject, n. We have discussed two alternative statistical techniques to determine an optimal allocation. When we have prior knowledge of what constitutes an acceptable level of reproducibility, a hypothesis testing approach may be used; we used this approach in the case of a binary outcome variable, following the GOF approach proposed by DE. The application of the GOF approach was straightforward because the number of replicates, n = 2, was fixed. However, there are situations in which appropriate values of the reliability coefficient under the null and alternative hypotheses may be difficult to specify. An alternative to hypothesis testing is the efficient allocation of the sample, and the guidelines provided in this article for continuous-scale measurements allow selection of the pair (n, k) that maximizes the precision of the estimated coefficient under cost constraints. We note that cost implications for dichotomous assessments are quite important, particularly when n is larger than two; we intend to report on this in a future paper.
Finally, we note that in practice the optimal allocation must take integer values, and the net loss or gain in precision as a result of rounding the values of (n, k) was negligible. Ideally, one should adopt one of the available optimization algorithms, often referred to as integer programming models. These models are suited to optimal allocation problems, since the main concern is to find the best solution(s) in a well-defined discrete space.
Conclusion
The WSCV is a useful index of measurement reliability. Investigators may design reliability studies using either efficiency or cost considerations. For continuous measurements, optimal allocation of the sample may be achieved with as few as two replications per subject. For dichotomous data, when each subject is measured twice, investigators may use either a fixed-width confidence interval or power considerations in estimating the sample size. Both methods produce comparable results.
Competing interests
The author(s) declare that they have no competing interests.
The authors contributed equally to this work.
Acknowledgements
Drs. M. Shoukri and N. ElKum acknowledge the support by the Research Centre of The King Faisal Specialist Hospital. Dr. Walter acknowledges the support by NSERC Canada.
References
1. Ashton E, Takahashi C, Berg M, Goodman A, Totterman S, Ekholm S: Accuracy and reproducibility of manual and semi-automated quantification of MS lesions in MRI. Technical Report, Department of Radiology, University of Rochester Medical Center, Rochester, NY; 2003.
2. Schwartz L, Ginsberg M, DeCorato D: Evaluation of tumor measurements in oncology: use of film-based and electronic techniques. Journal of Clinical Oncology 2000, 18(10):2179-2184.
3. Tian L: Inference on the common coefficient of variation. Statistics in Medicine 2005, 24:2213-2220.
4. Fleiss J: Design and analysis of clinical experiments. Wiley & Sons, New York; 1986.
5. Dunn G: Design and analysis of reliability studies. Oxford University Press, New York; 1989.
6. Shoukri MM: Measures of interobserver agreement. Chapman & Hall/CRC Press, Boca Raton, Florida; 2004.
7. Quan H, Shih WJ: Assessing reproducibility by the within-subject coefficient of variation with random effects models. Biometrics 1996, 52:1195-1203.
8. Donner A, Eliasziw M: Sample size requirements for reliability studies. Statistics in Medicine 1987, 6:441-448.
9. Walter D, Eliasziw M, Donner A: Sample size and optimal design for reliability studies. Statistics in Medicine 1998, 17:101-110.
10. Bonett DG: Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 2002, 21:1331-1335.
11. Shoukri M, Asyali M, Walter S: Issues of cost and efficiency in the design of reliability studies. Biometrics 2003, 59:1107-1112.
12. Flynn N, Whitely E, Peters T: Recruitment strategy in a cluster randomized trial: cost implications. Statistics in Medicine 2002, 21:397-405.
13. Kendall M, Stuart A: The advanced theory of statistics. Volume I. London: Griffin; 1986.
14. Sukhatme P, Sukhatme B, Sukhatme S, Asok C: Sampling theory of surveys with applications. Ames, IA: Iowa State University Press; 1984.
15. Mak TK: Analyzing the intraclass correlation for dichotomous variables.
16. Bloch DA, Kraemer HC: 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989, 45:269-287.
17. Powell KA, Obouchowski NA, Chilote WA, Barry MM, Ganocik SN, Cardenso G: Film-screen versus digitized mammography: assessment of clinical equivalence. American Journal of Roentgenology 1999, 173:889-894.
18. Donner A, Eliasziw M: A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance testing and sample size estimation. Statistics in Medicine 1992, 11:1511-1519.
19. Gardner M, Altman D: Confidence intervals rather than P-values: estimation rather than hypothesis testing. British Medical Journal 1986, 292:746-750.
20. Goodman S, Berlin J: The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine 1994, 121:200-206.