Abstract
Background
Multicentre randomized controlled trials (RCTs) routinely use randomization and analysis stratified by centre to control for differences between centres and to improve precision. No consensus has been reached on how best to analyze correlated continuous outcomes in such settings. Our objective was to investigate the properties of commonly used statistical models at various levels of clustering in the context of multicentre RCTs.
Methods
Assuming no treatment by centre interaction, we compared six methods (ignoring centre effects, including centres as fixed effects, including centres as random effects, generalized estimating equations (GEE), and fixed- and random-effects centre-level analyses) for analyzing continuous outcomes in multicentre RCTs, using simulations over a wide spectrum of intraclass correlation (ICC) values and varying numbers of centres and centre sizes. The performance of the models was evaluated in terms of bias, precision, mean squared error of the point estimator of treatment effect, empirical coverage of the 95% confidence interval, and statistical power of the procedure.
Results
While all methods yielded unbiased estimates of treatment effect, ignoring centres led to inflation of the standard error and loss of statistical power when within-centre correlation was present. The mixed-effects model was most efficient and attained nominal coverage of 95% and 90% power in almost all scenarios. The fixed-effects model was less precise when the number of centres was large and treatment allocation was subject to chance imbalance within centre. The GEE approach underestimated the standard error of the treatment effect when the number of centres was small. The two centre-level models led to more variable point estimates and relatively low interval coverage or statistical power, depending on whether heterogeneity of treatment contrasts was considered in the analysis.
Conclusions
All six models produced unbiased estimates of treatment effect in the context of multicentre trials. Adjusting for centre as a random intercept led to the most efficient treatment effect estimation across all simulations under the normality assumption, when there was no treatment by centre interaction.
Background
A multicentre randomized controlled trial (RCT) is an experimental study "conducted according to a single protocol but at more than one site and, therefore, carried out by more than one investigator" [1]. Multicentre RCTs are usually carried out for two main reasons. First, they provide a feasible way to accrue sufficient participants to achieve reasonable statistical power to detect the effect of an experimental treatment compared with some control treatment. Second, by enrolling participants of more diverse demographics from a broader spectrum of geographical locations and various clinical settings, multicentre RCTs increase the generalizability of the experimental treatment for future use [1].
Randomization is the most important feature of RCTs, for on average it balances known and unknown baseline prognostic factors between treatment groups, in addition to minimizing selection bias. Nevertheless, randomization does not guarantee complete balance of participant characteristics, especially when the sample size is moderate or small. Stratification is a useful technique to guard against potential bias introduced by imbalance in key prognostic factors. In multicentre RCTs, investigators often use a stratified randomization design to achieve balance over key differences in study population (e.g. environmental, socioeconomic or demographic factors) and management team (e.g. patient administration and management) at the centre level, to improve the precision of statistical analysis [2]. Regulatory agencies recommend that stratification variables used in the design should usually be accounted for in the analysis, unless the potential value of adjustment is questionable (e.g. very few subjects per centre) [1].
The current study was motivated by the COMPETE II trial, which was designed to determine whether an integrated computerized decision support system shared by primary care providers and patients could improve management of diabetes [3]. A total of 511 patients were recruited from 46 family physician practices. Individual patients were randomized to one of the two intervention groups, stratified by physician practice, using permuted blocks of size 6. The number of patients treated by one physician varied from 1 to 26 (quartiles = 7.25, 11, 15; mean = 11; standard deviation [SD] = 6). The primary outcome was a continuous variable representing the change in a 10-point process composite score, based on eight diabetes-related component variables, from baseline to a mean of 5.9 months' follow-up. A positive change indicated a favourable result. During the study, the possibility of clustering within physician practice and its consequences for statistical analysis was a concern to the investigators. The phenomenon of clustering emerges when outcomes observed from patients managed by the same centre, practice or physician are more similar than outcomes from different centres, practices or physicians. Clustering often arises in situations where patients are selective about which centre they belong to, patients in a centre or practice are managed according to the same clinical care paths, or patients influence each other within the same cluster [4]. The intraclass (or intracentre) correlation (ICC) is often used to quantify the average correlation between any two outcomes within the same cluster [5]. It is a number between zero and one. A large value indicates that within-cluster observations are similar relative to observations from other clusters, and that each observation within a cluster contains less unique information. This implies a violation of the independence assumption on which many standard statistical models are based.
An ICC of zero indicates that individual observations within the same clusters are uncorrelated and different clusters on average have similar observations.
Through a literature review, we identified six statistical methods that are sometimes employed to analyze continuous outcomes in multicentre RCTs: A. simple linear regression (two-sample t-test), B. fixed-effects regression, C. mixed-effects regression, D. generalized estimating equations (GEE), E1. fixed-effects centre-level analysis, and E2. random-effects centre-level analysis. The first four methods use the patient as the unit of analysis, yet address centre effects differently [6-8]. Simple linear regression completely ignores centre effects, which are likely to arise from two sources: (1) possible differences in environmental, socioeconomic or treatment factors between centres, and (2) potential correlation among patients within centres. Although stratified randomization attempts to minimize the impact of centre on the standard error of the treatment effect by ensuring that the treated and control groups are largely balanced with respect to centre, failure to control for stratification in the analysis will likely inflate the variance of the effect estimate. The fixed-effects model treats each participating centre as a fixed intercept to control for possible population or environmental differences among centres. This model assumes that study subjects from the same centre have independent outcomes, i.e. the intraclass correlation is fixed at zero. The mixed-effects model incorporates dependence of outcomes within a centre and treats centres as random intercepts. Proposed by Liang and Zeger [9], the generalized estimating equation (GEE) model extends generalized linear regression with continuous, categorical or count outcomes to correlated observations within clusters. Under a commonly used and perhaps oversimplified assumption that the degree of similarity between any two outcomes from a centre is equal, an exchangeable correlation structure can be used to assess the treatment effect in Models C and D.
The within- and between-centre variances (σ_{e}^{2 } and σ_{b}^{2}) are, however, estimated differently in these two models. Methods E1 and E2 are routinely employed to combine information from different studies in meta-analysis [10]. One can also apply them to aggregate treatment effects over multiple centres [11-13]. The overall effect is obtained as the inverse-variance weighted average of the within-centre effect differences across centres.
To date, only a few studies have compared the performance of statistical models for analyzing multicentre RCTs using Monte Carlo simulation [6,7,14], whereas many studies have assessed the impact of ICC in cluster randomization trials. Moerbeek et al [6] compared the simple linear regression model, fixed-effects regression and fixed-effects centre-level analysis with equal centre sizes. Pickering et al [7] examined the bias, precision and power of three methods: simple regression, fixed-effects and mixed-effects regression, assuming block randomization of size 2 or 4 on a continuous outcome. In the presence of imbalance and non-orthogonality, they found that ignoring centres or incorporating them as random effects led to greater precision and smaller type II error compared with treating centres as fixed effects. The performance of the GEE approach and the centre-level methods was not investigated in that work. Jones et al [14] compared the fixed-effects and random-effects regression models to a two-step frequentist procedure as well as a Bayesian model, in the presence of treatment by centre interaction, and recommended the fixed-effects weighted method for future analyses of multicentre trials. The investigation was further expanded to assess correlated survival outcomes from large multicentre cancer trials, where a series of random-effects approaches were proposed to account for centre or treatment by centre heterogeneity in proportional hazards models [15,16].
A lack of definitive evidence on which models perform best in various situations led to this comprehensive simulation study examining the performance of all six commonly used models with continuous outcomes. The objective was to assess their comparative performance in terms of bias, precision (simulation standard deviation (SD) and average estimated SE), and mean squared error (MSE) of the point estimator of the treatment effect, empirical coverage of the 95% confidence interval (CI), and empirical statistical power, over a wide spectrum of ICC values and centre sizes. We did not consider treatment by centre interaction in this study, partly because clinicians and trialists have been making efforts to standardize the conduct and management of multicentre trials via, for instance, uniform patient selection criteria, staff training, and trial monitoring and auditing, to reduce heterogeneity of treatment effects among centres. Furthermore, it is uncommon to find clinical trials designed with sufficient power to detect treatment by covariate interactions.
In this paper, we survey six methods for investigating the effect of a treatment in multicentre RCTs in detail. We outline the design and analysis of an extensive simulation study, and report how model performance varies with ICC, centre size and the number of centres. We also present the estimated effect of the computer-aided decision support system on the management of diabetes using the different methods.
Methods
Approaches assessing treatment effects
We investigated six statistical approaches to evaluating the effect of an experimental treatment on a continuous outcome compared with a control, for multicentre RCTs. Assuming baseline prognostic characteristics are approximately balanced between the treatment and control groups via randomization, we do not consider covariates other than centre effects in the models. The first four approaches use the individual patient as the unit of analysis, while the centre is the unit of analysis in the last two approaches.
Simple linear regression (Model A)
This approach models the impact of treatment (X) on outcome (Y) via a regression technique (Equation 1). In the context of a two-arm trial, this approach is the same as a two-sample t-test [6].

Y_{ij }= β_{0 }+ β_{1}X_{ij }+ e_{ij } (1)

where Y_{ij } is the outcome of the ith patient in the jth centre, X_{ij } stands for the treatment assignment (X_{ij }= 1 for the treatment, X_{ij }= 0 for the control), and e_{ij } is the random error assumed to follow a normal distribution with mean 0 and variance σ_{e}^{2}. The intercept, β_{0}, represents the mean outcome for the control group in all participating centres, and the slope β_{1 } represents the effect of the treatment on the mean outcome.
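The equivalence between Model A and a two-sample t-test can be made concrete with a small sketch. The paper fitted this model with 'lm( )' in R; the Python function below (the name `model_a_fit` is ours, not from the paper) computes β_{1 } as the difference in arm means, with its standard error from the pooled residual variance:

```python
from statistics import mean

def model_a_fit(y, x):
    """Simple linear regression of outcome y on binary treatment x.
    Equivalent to a two-sample t-test: beta1 is the difference in arm
    means; the SE uses the pooled residual variance and ignores any
    centre effects, as Model A does."""
    y1 = [yi for yi, xi in zip(y, x) if xi == 1]  # treatment arm
    y0 = [yi for yi, xi in zip(y, x) if xi == 0]  # control arm
    beta1 = mean(y1) - mean(y0)
    n1, n0 = len(y1), len(y0)
    # pooled residual variance across both arms
    s2 = (sum((v - mean(y1)) ** 2 for v in y1)
          + sum((v - mean(y0)) ** 2 for v in y0)) / (n1 + n0 - 2)
    se = (s2 * (1 / n1 + 1 / n0)) ** 0.5
    return beta1, se
```

For example, `model_a_fit([1, 2, 3, 4], [0, 0, 1, 1])` returns a treatment effect of 2.0 with the corresponding pooled-variance standard error.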
Fixedeffects regression (Model B)
This model (Equation 2) allows a separate intercept for each centre (β_{0j}) as a fixed effect, restricting the scope of statistical inference to the sample of participating centres in an RCT:

Y_{ij }= β_{0j }+ β_{1}X_{ij }+ e_{ij } (2)

The interpretation of β_{1 } remains the same as in Model A. Models A and B were fitted using the linear model procedure 'lm( )' in R.
Mixedeffects regression (Model C)
Similar to Model B, the mixed-effects regression model assumes that the intercept β_{0j }= β_{0 }+ b_{0j } follows a normal distribution N(β_{0}, σ_{b}^{2}), and is thus a random effect. In Equation 3, b_{0j } is the random deviation from the mean intercept β_{0}, specific to each centre:

Y_{ij }= β_{0 }+ b_{0j }+ β_{1}X_{ij }+ e_{ij } (3)
Similar to the previous models, the within-centre variability is reflected by σ_{e}^{2}. The between-centre variability of the outcome is captured by σ_{b}^{2 } in Model C. The variance and covariance of outcomes in the same or different centres can be expressed as: Var(Y_{ij}) = σ_{b}^{2 }+ σ_{e}^{2}; Cov(Y_{ij}, Y_{i'j}) = σ_{b}^{2 } for two patients in the same centre; and Cov(Y_{ij}, Y_{i'j'}) = 0 for patients in different centres. The intraclass correlation that measures the correlation among outcomes within centre is given by ICC = σ_{b}^{2}/(σ_{b}^{2 }+ σ_{e}^{2}), assumed equal across all centres. We fitted Model C in R via the linear mixed-effects procedure 'lme( )' using the restricted maximum likelihood (REML) method [17,18].
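As a small numerical illustration of the variance decomposition above (our helper name `icc`, not from the paper): the ICC is simply the between-centre variance as a fraction of the total outcome variance.

```python
def icc(sigma_b2, sigma_e2):
    """Intraclass correlation under Model C: the between-centre
    variance sigma_b2 as a fraction of the total variance
    sigma_b2 + sigma_e2 (the covariance of two outcomes in the same
    centre divided by the variance of a single outcome)."""
    return sigma_b2 / (sigma_b2 + sigma_e2)
```

With σ_{b}^{2 }= 1/19 and σ_{e}^{2 }= 1, for instance, `icc(1/19, 1.0)` gives 0.05, one of the values used later in the simulations; σ_{b}^{2 }= 0 gives an ICC of zero, i.e. independent outcomes within centre.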
Generalized estimating equations (Model D)
The GEE method has gained increasing popularity among health science researchers for its availability in most statistical software. As opposed to the mixed-effects method, which estimates the treatment difference between arms together with individual centre effects, the GEE approach models the marginal, population-averaged treatment effect in two steps: 1) it fits a naïve linear regression assuming independence between observations within and across centres, and 2) it estimates the parameters of the working correlation matrix using residuals from the naïve model and refits the regression model, adjusting the standard error and confidence interval for within-centre dependence [19]. As a result, the estimated impact of treatment on the outcome in the GEE model reflects the "combined" within- and between-centre relationship. GEE employs quasi-likelihood to estimate regression coefficients iteratively, and a working correlation structure needs to be supplied to approximate the within-centre correlation. When the working correlation is misspecified, the sandwich-based covariance estimator still yields a robust, though less efficient, estimate of the treatment effect [9]. Statisticians have found that the variance of the estimated treatment effect can be underestimated when the number of centres is small [20]. We therefore assessed the efficiency of GEE models using the 'gee( )' procedure from the 'gee' library in R. As in the mixed-effects model, an exchangeable correlation structure was assumed in fitting Model D.
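To make the "exchangeable" assumption concrete, the working correlation matrix supplied to GEE for a centre of size n has 1 on the diagonal and a common correlation ρ everywhere else. A minimal sketch (the helper name `exchangeable_corr` is ours):

```python
def exchangeable_corr(n, rho):
    """Exchangeable working correlation matrix for a centre of size n:
    every pair of distinct outcomes within the centre is assumed to
    share the same correlation rho, as in Models C and D."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]
```

For a centre of three patients with ρ = 0.2, this produces a 3 x 3 matrix with unit diagonal and 0.2 off the diagonal; the GEE fitting routine estimates ρ from the residuals of the naïve fit rather than taking it as known.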
Centrelevel fixedeffects model (Model E  1)
The centre-level model is a stratified analysis performed on the mean difference in outcome between the treatment and control arms within each centre. The overall treatment effect is estimated by a weighted average of the individual mean differences across all centres, typically using inverse-variance weighting (Figure 1). This model is essentially a centre-level inverse-variance weighted paired t-test (i.e. the treatment arm is paired to the control arm in the same centre) to account for within-centre correlation [10]. In the absence of intraclass correlation and under the assumption of equal sampling variation at the patient level, the inverse-variance weight for the jth centre is proportional to n_{tj}n_{cj}/(n_{tj }+ n_{cj}), which can be further simplified to the size of the centre, n_{j }= n_{tj }+ n_{cj}, given equal numbers of patients in the two arms. Here n_{tj } and n_{cj } represent the number of patients in the treatment and control group, respectively, in the jth centre. This form of weighted analysis (without adjustment for covariates) has been discussed extensively by many researchers [21-23]. We implemented Model E-1 using the fixed-effects method for meta-analysis provided by the 'metacont( )' procedure in R.
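The inverse-variance pooling step can be sketched in a few lines. The paper used 'metacont( )' in R; this Python helper (`fixed_effects_pool`, our name) takes the per-centre mean differences and their sampling variances and returns the fixed-effects pooled estimate and its standard error:

```python
def fixed_effects_pool(diffs, variances):
    """Fixed-effects (inverse-variance) pooling of per-centre mean
    differences, as in Model E-1: each centre is weighted by the
    reciprocal of the variance of its mean difference."""
    w = [1.0 / v for v in variances]
    est = sum(wi * di for wi, di in zip(w, diffs)) / sum(w)
    se = (1.0 / sum(w)) ** 0.5  # SE of the weighted average
    return est, se
```

With two centres of equal precision and mean differences 1.0 and 3.0, the pooled estimate is simply their average, 2.0, and pooling halves the variance of a single centre's estimate.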
Figure 1. A schematic of fixed and randomeffects centrelevel models.
Centrelevel randomeffects model (Model E  2)
A random-effects approach is used to aggregate mean effect differences not only across all participating centres but also across the population of centres represented by the sample. This model factors heterogeneity of the treatment effect among centres (i.e. random treatment by centre interaction) into its weighting scheme and captures both within- and between-centre variation of the outcome. One should not confuse this method with the mixed-effects model using patient-level data (Model C). In Model E-2, the underlying true treatment effects are not a single fixed value for all centres; rather, they are considered random effects, normally distributed around a mean treatment effect with between-centre variation. Model C, on the other hand, treats centres as random intercepts and postulates the same treatment effect across all centres. Model E-2 does not serve as a fair comparator to the alternatives listed here, which assume no treatment by centre interaction; however, preliminary investigation suggested E-2 could outperform E-1 in some situations, so we included it to advance understanding of these models. Details of the centre-level models are provided in Figure 1. Model E-2 was carried out using the DerSimonian-Laird random-effects method [24] via the 'metacont( )' procedure in R. The confidence interval for Model E-2 was constructed based on the within- and between-centre variances.
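The key quantity distinguishing E-2 from E-1 is the DerSimonian-Laird method-of-moments estimate of the between-centre variance of the treatment effect, τ^{2}, which is then added to each centre's variance before inverse-variance weighting. A sketch of that estimator (our helper name `dl_tau2`):

```python
def dl_tau2(diffs, variances):
    """DerSimonian-Laird moment estimator of the between-centre
    variance tau^2 of the treatment effect: based on Cochran's Q
    heterogeneity statistic, truncated at zero."""
    w = [1.0 / v for v in variances]
    ybar = sum(wi * di for wi, di in zip(w, diffs)) / sum(w)  # fixed-effects pooled mean
    q = sum(wi * (di - ybar) ** 2 for wi, di in zip(w, diffs))  # Cochran's Q
    k = len(diffs)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (k - 1)) / c)
```

When the centre-specific differences are identical, τ^{2 } is estimated as 0 and E-2 reduces to E-1, which is exactly what was observed in the COMPETE II analysis reported later (τ^{2 }= 0, identical estimates from E1 and E2).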
Study data simulation
We used Monte Carlo simulation to assess the performance of statistical models for analyzing parallel-group multicentre RCTs with a continuous outcome. We simulated the outcome, Y, using the mixed-effects linear regression model (Model C): Y_{ij }= β_{0 }+ b_{0j }+ β_{1}X_{ij }+ e_{ij } for the ith patient in the jth centre, where X_{ij }(= 0, 1) is the dummy variable for treatment allocation (i = 1...m_{j}, j = 1...J). We generated the random error, e, from a normal distribution with mean 0 and variance σ_{e}^{2 }= 1. We set the true treatment effect (β_{1}) to be 0.5 residual standard deviations (σ_{e}), an effect size suggested by the COMPETE II trial; this corresponds to a medium effect size according to Cohen's criterion [25]. To simulate centre effects, we employed the relationship between the ICC and σ_{b}^{2}: σ_{b}^{2 }= ICC/(1 − ICC) × σ_{e}^{2}. To fully study the behaviour of candidate models at various ICC levels, we considered the following ICC values for completeness: 0.00, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50 and 0.75. This in turn set the corresponding σ_{b}^{2 } values to 0, 1/99, 1/19, 1/9, 3/17, 1/4, 1/3, 3/7, 7/13, 2/3, 9/11, 1 and 3. However, we focused interpretation of the results on the lower ICC values, as they are more likely to occur in practice [26-28].
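The data-generating mechanism above can be sketched as follows. This is an illustrative Python reimplementation under stated assumptions (the original simulations were run in R; the function name `simulate_trial` and the balanced 1:1 within-centre allocation correspond to design (a)):

```python
import random

def simulate_trial(icc_val, n_centres, n_per_centre,
                   beta0=0.0, beta1=0.5, sigma_e=1.0, seed=1):
    """One simulated dataset under Model C:
    Y_ij = beta0 + b0j + beta1 * X_ij + e_ij,
    with sigma_b^2 = ICC/(1 - ICC) * sigma_e^2 and equal numbers of
    patients per arm within each centre. Returns (centre, x, y) tuples."""
    rng = random.Random(seed)
    sigma_b = (icc_val / (1.0 - icc_val)) ** 0.5 * sigma_e
    data = []
    for j in range(n_centres):
        b0j = rng.gauss(0.0, sigma_b)          # centre random intercept
        arms = [0, 1] * (n_per_centre // 2)    # balanced 1:1 allocation
        rng.shuffle(arms)
        for xij in arms:
            yij = beta0 + b0j + beta1 * xij + rng.gauss(0.0, sigma_e)
            data.append((j, xij, yij))
    return data
```

For instance, `simulate_trial(0.05, 18, 10)` reproduces scenario 2's layout: 18 centres of 10 patients each, 90 patients per arm.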
The original sample size was determined to be 84 per arm using a two-sided two-sample t-test (Model A) to ensure 90% power to detect a standardized effect size of 0.5 at a 5% type I error rate. We increased the final sample size to 90 per arm (power increases to 91.8%) to accommodate more combinations of the number and size of participating centres. We assumed patients were randomly allocated to the two groups in a 1:1 ratio, the most common and efficient choice. We generated data in nine scenarios (Table 1) to assess model performance under three designs: (a) balanced studies where equal numbers of patients are enrolled from all centres and the numbers of patients in the two arms are the same (fixed by design); (b) unbalanced studies where equal numbers of patients are enrolled from all centres but the numbers of patients in the two arms within a centre may differ by chance while maintaining a 1:1 allocation ratio; and (c) unbalanced studies where the numbers of patients enrolled vary among centres, and block randomization of size 2 or 4 is used to reduce chance imbalance. For designs (a) and (b), we considered three combinations of centre size and number of centres: J = 45 centres, 4 patients per centre; J = 18 centres, 10 patients per centre; and J = 6 centres, 30 patients per centre. Design (c) mimicked a more realistic scenario for multicentre RCTs. For the first setup of design (c), we grouped 180 patients into 17 centres, constructed so that the centre composition and degree of allocation imbalance were analogous to the COMPETE II trial but at a smaller sample size: the number of patients per centre varied from 1 to 28; quartiles = 5, 10, 15; mean = 11; SD = 8; and the percentage of unbalanced centres was between 47% and 70% depending on block size.
Table 1. Catalogue of simulation designs
To compare results from various models in analyzing the COMPETE II trial, and to assess the accuracy and precision of the effect estimates, we included an additional scenario in design (c) to imitate this motivating example more closely with respect to sample size and centre composition (scenario 9). We generated treatment allocation (X) and outcome (Y) for 511 patients in 46 centres, where the number of patients per centre was set exactly as observed in the COMPETE II trial (Table 2). In particular, three centres recruiting only one patient each were simulated. Analogously to COMPETE II, a fixed block size of 6 was used to assign patients to treatments. The same simulation model was employed as in the previous scenarios, but with a separate set of parameters based on results of the COMPETE II trial (Table 3): β_{0 }= 1.34, β_{1 }= 1.26, with the within- and between-centre variances set to their trial-based estimates, giving ICC = 0.125.
Table 2. Centre composition of the COMPETE II trial
Table 3. Estimates of intervention effects in COMPETE II trial
We generated 1000 simulations for each of the 13 ICC values under each of the first eight scenarios, and 1000 simulations at the specified ICC value for the ninth scenario. Separate sets of centre effects were simulated for each scenario and each of the 1000 simulations. We chose 1000 replicates so that the simulation standard deviation for the empirical power at a nominal level of 90%, in the absence of clustering, was controlled at 1%. This also ensured that the standard deviations of the coverage of the confidence interval and the empirical power did not exceed 1.6%.
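The choice of 1000 replicates follows directly from the binomial standard error of an estimated proportion; a one-line check (our helper name `mc_se`):

```python
def mc_se(p, n_sims):
    """Monte Carlo standard error of an estimated proportion (coverage
    or power) from n_sims independent replicates: sqrt(p(1-p)/n)."""
    return (p * (1.0 - p) / n_sims) ** 0.5
```

At the nominal 90% power with 1000 replicates, `mc_se(0.90, 1000)` is about 0.0095, i.e. roughly 1%; the worst case over any proportion, `mc_se(0.50, 1000)`, is about 0.0158, which matches the stated 1.6% bound.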
Comparison of analytic models
We applied the six statistical models to each simulated dataset. For each model, we calculated the bias, simulation standard deviation (SD), average estimated standard error (SE) and mean squared error (MSE) of the point estimator of the treatment effect (i.e. β_{1}), the empirical coverage of the 95% confidence interval around β_{1}, and the empirical statistical power. We constructed confidence intervals based on the t-test for Models A-C, and Wald intervals based on the normal approximation for Models D and E. We estimated bias as the difference between the average estimate of β_{1 } over the 1000 simulated datasets and the true effect. The simulation or empirical SD was calculated as the standard deviation of the estimated β_{1}s across simulations, indicating the precision of the estimator. We also obtained the average of the estimated SEs over the 1000 simulations to assess the accuracy of the variance estimator from each model. The overall error rate of the point estimator was captured by the estimated MSE, computed as the average squared difference between the estimated β_{1 } and the true value across the 1000 datasets. Furthermore, we reported the performance of the interval estimators in each model. The empirical coverage was estimated as the proportion of 95% confidence intervals that covered the true β_{1}, and the empirical power as the proportion of confidence intervals that rejected a false null hypothesis, i.e. zero lay outside the CI. All datasets were simulated and analyzed in R version 2.4.1 [29].
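The interval-based performance measures above reduce to simple counting over the replicates. A minimal sketch, assuming Wald-type intervals (the function name `coverage_and_power` is ours; for Models A-C the paper used t-based rather than normal critical values):

```python
def coverage_and_power(estimates, ses, true_beta, z=1.96):
    """Empirical coverage: fraction of CIs (est +/- z*SE) containing
    true_beta. Empirical power: fraction of CIs excluding zero."""
    cover = rejections = 0
    for est, se in zip(estimates, ses):
        lo, hi = est - z * se, est + z * se
        if lo <= true_beta <= hi:
            cover += 1
        if lo > 0 or hi < 0:   # zero outside the CI -> reject H0
            rejections += 1
    n = len(estimates)
    return cover / n, rejections / n
```

For example, two replicates with estimates 0.5 and SEs 0.1 and 1.0 against a true effect of 0.5 give coverage 1.0 (both intervals contain 0.5) but power 0.5 (only the tighter interval excludes zero).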
Results
Analysis of COMPETE II trial data
We applied all six models to the COMPETE II data and report the results in Table 3. Approximately equal numbers of patients were randomized to the intervention and control groups within each physician practice, leading to 253 and 258 patients in the intervention and control groups, respectively. Among the 46 family physicians, 11 (24%) treated equal numbers of patients in the two arms, 24 (52%) treated one more patient in the intervention or control arm, 10 (22%) managed 2 more patients in either arm, and one (2%) managed 3 more patients in one arm than the other.
All baseline characteristics were roughly balanced between arms [3]. The analyses using patient-level data produced similar estimates for β_{1}, and the effect size was around 0.5 times the corresponding residual standard deviation. The standard error of the estimated β_{1 } reduced from 0.25 (Model A) to 0.23 (Models B, C) and then 0.19 (Model D) when centre effects were adjusted for, leading to narrower CIs around the estimated β_{1 } in Models B-D. The intraclass correlation was estimated as 0.138 in Model C and 0.124 in Model D. The two centre-level analyses returned slightly larger estimates of β_{1 } than the individual patient-level models. In fact, the minimal variance between physicians indicated no noticeable heterogeneity between physicians (τ^{2 }= 0, I^{2 }= 0), resulting in identical estimates from E1 and E2. Zero was not contained in any of the 95% confidence intervals; therefore all models led to the conclusion that the experimental intervention significantly improved patient management over usual care, based on the change in the composite process score.
Balanced design with equal centre size
Properties of point estimates
Table 4 summarizes descriptive statistics of the point estimator of the treatment effect in Models A-E for three values in the lower range of the ICC spectrum, in the balanced design. The point estimates of β_{1 } were unbiased in all six models for all ICC values. At first glance it may seem surprising that the point estimates in Model A, ignoring stratification and clustering, were invariant to ICC, and that the same estimates were returned by all four patient-level models for each simulation. In fact, when treatments are allocated in the same proportion in all centres, centre has no association with treatment allocation, hence adjusting or not adjusting for centre effects has little impact on the point estimate of the treatment-response relationship given a continuous response variable. For this reason, the different ways of incorporating between-centre information (Models B-D) led to the same estimates of the treatment contrast in a balanced design. Identical point estimates led to identical empirical SD and overall error rate (measured by MSE) of the estimator in Models A-D regardless of ICC. Across different ICC values and scenarios 1-3, Models B and C yielded accurate estimates of the standard error of the estimated β_{1 } that approximated the empirical SD and the true standard deviation, 0.149, calculated using the best linear unbiased estimator of the simulation model, i.e. Model C [18]. From Table 4 we found that the standard error of the estimated β_{1 } in Model A increased with ICC in each scenario, deviating from the corresponding empirical SD. The standard error could be slightly underestimated in Model D when the number of centres was small (Table 4, scenarios 2 and 3, comparing empirical SD and average SE). This agreed with previous work concerning small-sample properties of the GEE model [20].
Table 4. Properties of point estimates of the treatment effect from Models A  E in scenarios 1 to 3
The centre-level analyses produced larger empirical SE and MSE for the estimated β_{1 } compared with the patient-level analyses given small or moderate centre sizes (Table 4). The difference diminished as centre size increased. When only a few patients were enrolled per centre, the fixed-effects centre-level point estimator in Model E-1 had large sampling variation that was severely underestimated at all ICC values. The random-effects model (E-2) based on the DerSimonian-Laird method, on the other hand, seemed to yield valid SEs for the estimated β_{1 } that were on average greater than the SEs from the patient-level models. The average estimated SE over all simulations in Model E-2 was always larger than the SE estimates in Models B and C, followed by the SE estimated in Model E-1, across different combinations of centre size and number of centres. In this study, although datasets were generated so that the treatment effects were homogeneous among centres (i.e. no treatment by centre interaction), the random-effects analysis using centre-level data outperformed the fixed-effects analysis when the centre size was small, because Model E-2 took into account the observed "heterogeneity" due to imprecise estimation of the centre mean differences and the associated standard errors.
Properties of interval estimates
The empirical coverage of the confidence intervals (CIs) and the statistical power in balanced studies are displayed in Table 5. Models B and C produced similar coverage, close to the nominal value of 95%, over different ICC values and centre compositions. Model A provided conservatively high coverage increasing with ICC, illustrating that for moderate to large ICC values, CIs in Model A were abnormally wide due to an overestimated SE for the estimated β_{1}. The empirical coverage of CIs from Models D and E-1 was on average further below 95% compared with Models B and C. This is likely caused by underestimation of the standard error in Models D and E-1, and is associated with an apparent increase in power in the first three scenarios. For Model D, the coverage dropped below 90% when the number of centres was reduced to six in scenario 3. The coverage of Model E-1 was too low to be useful when studies were conducted at many smaller centres (scenario 1); however, coverage increased gradually with centre size and approached 95% when there were 30 patients per centre (scenario 3). Model E-2 presented a similar coverage pattern to E-1, although its coverage was closer to 95%. Models B and C largely maintained the nominal power of 91.8% regardless of ICC. The power of Model A decreased dramatically as ICC departed from 0, indicating that the model failed to adjust for between-centre variation or within-centre correlation in the outcome measure. The nominal type II error rate (8%) was maintained in Models D and E-1 in scenarios 1-3. Model E-2 generally had lower power to detect the true treatment effect, due to a larger standard error reflecting both the within-centre variability and treatment by centre interaction. Interestingly, this power rose as the number of centres was reduced, approaching 88% in scenario 3.
Table 5. Coverage of the 95% interval estimate of the treatment effect and statistical power of Models A  E in scenarios 1 to 3
Overall, Models B and C performed very similarly and outperformed the other models in the balanced design. Models C and D converged to a solution in all simulations.
Design with equal centre size and chance imbalance
Properties of point estimates
The performance of the different models in multicentre studies with equal centre sizes, 1:1 allocation ratio and chance imbalance is displayed in Tables 6 and 7. Results were similar to the balanced design, though a few differences emerged. The unbalanced allocation of patients to treatment arms due to pure within-centre variation introduced chance imbalance (in both directions) into the treatment-response relationship; hence, completely ignoring centre effects (as in Model A) led to unbiased yet less efficient estimates for large ICC values. Model B could be less precise than Model A given small to moderate ICC values, a phenomenon previously reported by Pickering and Weatherall [7]. As in the balanced design, the fixed- and random-effects models performed comparably for various ICC values, largely because the fixed and random intercepts for study centres cancelled out in estimating the effect contrast when fitting Models B and C, and had little impact on the estimation of the fixed effect contrast across centres. However, the fixed-effects model produced a larger empirical standard deviation and average standard error in scenario 4, a study composed of many centres each managing a few patients. Adjusting for between-centre variation through random effects in Model C, or using the population-averaged analysis in Model D, allowed information to be borrowed across centres and resulted in greater precision.
Table 6. Properties of point estimates of the treatment effect from Models A to E in scenarios 4 to 6
Table 7. Coverage of the 95% interval estimate of the treatment effect and statistical power of Models A to E in scenarios 4 to 6
Properties of interval estimates
Results were similar to those in the balanced design. The patient-level Models A to C guaranteed nominal coverage of confidence intervals at different ICC values, whereas the other models were likely to produce lower coverage under certain conditions. Among all models, Models C and D achieved the best empirical power, closest to the nominal value of 91.8%, across different centre sizes. When centre size was small and the number of centres was large (scenario 4), power for Models C and D also decreased with ICC, a pattern that was less obvious in scenarios 5 and 6. Models C and D achieved convergence in analyzing all simulated datasets.
Design with unequal centre sizes and chance imbalance
The properties of point and interval estimates in scenarios 7 and 8 (with unequal centre sizes and chance imbalance) were close to the results in the previous two designs. In particular, the comparative performance of the six models lay in the middle ground between scenarios 2 and 5, as the level of imbalance between the two treatments was no more than half of the block size within centres. As similar results were observed for block sizes 2 and 4, summary statistics based on block size 4 are plotted in Figures 2, 3, 4 and 5. The results suggest that unequal centre size had little impact on model performance, although it was associated with a slight enlargement of the empirical variance of the treatment effect estimate in Model E1. To summarize, although all six models produced unbiased point estimates, the fixed- and mixed-effects models using patient-level data provided the most accurate estimates of the standard error of the treatment effect given large ICC values, and hence should be used in the analysis of multicentre trials to control type I and type II error rates when the ICC is non-trivial or unknown. For studies consisting of a large number of centres with only a few patients per centre, adjusting for centre as a random effect produced the most precise point estimate of the treatment effect and is therefore preferable. The information sandwich method appeared to slightly underestimate the actual variance when patients were recruited from 17 centres in scenarios 7 or 8. Due to the varying centre sizes, Model D did not converge for all simulated datasets (the number varied between 1 and 93 out of 1000 simulations) after 2000 iterations, when the ICC was less than or equal to 0.1 or greater than 0.4 for block sizes of 2 or 4. Such datasets were excluded for all models, and extra data were simulated to attain a total of 1000 simulations for each ICC value. In most cases, the non-convergence of GEE occurred due to a non-positive definite working correlation matrix.
Figure 2. Empirical standard deviation (SD) across 1000 simulations by ICC for scenario 8 (block size = 4).
Figure 3. Average of standard error (SE) across 1000 simulations by ICC for scenario 8 (block size = 4).
Figure 4. Coverage of 95% CI by ICC for scenario 8 (block size = 4).
Figure 5. Empirical power by ICC for scenario 8 (block size = 4).
In scenario 9, as a result of mimicking the particular centre composition of the COMPETE II trial, on average three centres out of 46 contained no patients in one of the treatment groups per simulation. These centres were removed from the fixed-effects model (Model B), as no comparison patients were available in the same centre. About six centres out of 46 recruited fewer than two patients per treatment arm in each simulation. These centres were dropped from the centre-level analyses, as the standard error of the treatment difference per centre could not be obtained as an input for 'metacont()'. The performance of the six models in scenario 9 was similar to that in scenarios 7 and 8, although point estimates from all models appeared to be marginally biased toward the null (Table 8). Estimates from the patient-level models were more precise and closer to 0.230, the best linear unbiased estimate of the standard error based on the simulation model. Once again, the standard error was slightly biased upward in Model A and marginally biased downward in Model D. This resulted in wider, conservative interval estimates from Model A and slightly narrower intervals from Model D. Models B and C performed comparably, probably because on average only three centres, each containing one patient, were dropped from Model B, which did not affect the variance estimation much. Models C and D achieved convergence in all 1000 simulations in this scenario.
Table 8. Properties of point and 95% interval estimates calculated from Models A to E based on 1000 simulated datasets in scenario 9 (unbalanced, 46 centres, same centre composition as the COMPETE II trial).
Discussion
In this paper, we investigated six modelling strategies in a frequentist framework to study the effect of an experimental treatment relative to a control in the context of multicentre RCTs with a continuous outcome. We focused on three designs with equal or varying centre sizes and a treatment allocation ratio of 1:1, in the absence of treatment by centre interaction. The results of this simulation study showed that, when the proportion of patients allocated to the experimental treatment was the same in each centre or subject to chance imbalance only, models using patient-level and centre-level data yielded unbiased point estimates of the treatment effect across a wide spectrum of ICC values. Ignoring stratification by centre or within-centre correlation did not bias the estimated treatment effects even when the ICC was large. In fact, Parzen et al showed mathematically that the usual two-sample t-test, which naively assumes independent observations of the response within centre, is asymptotically unbiased in this context [30].
The simulation study also indicated that these models produced different standard errors of the treatment effect estimate, and that the properties of interval estimates were affected by several factors: whether and how centre effects were incorporated in the analysis, the combination of centre size and number of participating centres, and the level of non-orthogonality of the observed data. Treating centre as a random intercept resulted in the most precise estimate, and nominal values of coverage and power were attained in all circumstances. The fixed-effects model performed almost identically to the mixed-effects model in the balanced design, but was slightly less efficient when the number of centres was large (J > 20) in an unbalanced design. Pickering and Weatherall observed the same pattern in their simulation study comparing three patient-level models with small ICC values [7]. The GEE model using the information sandwich covariance method tended to underestimate the standard error of the treatment effect when the sample of centres was small, a property noticed by other researchers [20,31]. This resulted in higher statistical power: the treatment effect estimate was more likely to be significant with a smaller standard error, but was associated with lower coverage of the confidence interval. Murray et al suggested that at least 40 centres should be used to ensure a reliable estimate of the standard error in the context of cluster randomized trials [32]. Our simulation results suggest that this cut-off is also applicable to multicentre RCTs. Failure to control for centre effects in any form resulted in inflation of the standard error, falsely high interval coverage and a sizable drop in power as the ICC increased. Parzen et al quantified the impact of correlation among observations within centre on the variance of the treatment effect estimate in Model A as a factor of 1/(1-ICC) [30].
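This variance relationship can be checked with a minimal Monte Carlo sketch (a hedged illustration, not the paper's simulation protocol; the design values below are hypothetical). Under a balanced design with a random centre intercept, each centre's intercept appears in both arms and cancels in the treatment contrast, so the variance the naive two-sample t-test assumes overestimates the true sampling variance by roughly 1/(1 - ICC):

```python
import numpy as np

rng = np.random.default_rng(42)
n_centres, per_arm, icc, sigma2 = 40, 10, 0.3, 1.0   # hypothetical design values
sigma_b = np.sqrt(icc * sigma2)        # between-centre SD
sigma_w = np.sqrt((1 - icc) * sigma2)  # within-centre SD

diffs = []
for _ in range(2000):
    u = rng.normal(0.0, sigma_b, n_centres)  # random centre intercepts
    # both arms share each centre's intercept, so it cancels in the contrast
    trt = u[:, None] + rng.normal(0.0, sigma_w, (n_centres, per_arm))
    ctl = u[:, None] + rng.normal(0.0, sigma_w, (n_centres, per_arm))
    diffs.append(trt.mean() - ctl.mean())

emp_var = np.var(diffs)                         # empirical sampling variance
naive_var = 2 * sigma2 / (n_centres * per_arm)  # variance the naive t-test assumes
print(naive_var / emp_var)                      # close to 1/(1 - icc) ~= 1.43
```

The ratio printed at the end approximates the Parzen et al factor for the chosen ICC of 0.3.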
Alternatively, one may consider a variant of robust variance estimation, or a GEE model with an independent working correlation, to control for the impact of the ICC on variance estimation when using the t-test. Centre-level models generally produced larger standard errors and lower coverage or power than the patient-level models. The centre-level random-effects model incorporated variability of the treatment effect over centres, and was therefore not a fair comparator to the other models. Interestingly, this model seemed to fare better than the centre-level fixed-effects model in terms of precision and coverage, even though the simulated datasets contained no treatment by centre interaction. Although the random-effects centre-level model may be a reasonable alternative to patient-level models when the number of patients per centre is large (≥30), centre-level models cannot adjust for patient-level covariates, a potentially fatal drawback in the presence of patient prognostic imbalance.
Statisticians hold different viewpoints on treating centre effects and treatment by centre interaction as fixed or random effects when analyzing multicentre RCTs [12,13,21,33]. Our simulation results demonstrated the advantage of treating centres as random intercepts in the absence of treatment by centre interaction. When many centres enrol few patients and allocation is unbalanced, the random intercept models can give more precise estimates of the treatment effect than the fixed intercept models, because they recover inter-centre information in unbalanced situations. For instance, in a multicentre RCT consisting of 45 centres each recruiting 4 patients, the empirical variance of the estimator of the treatment effect from the fixed-effects model was 24.8% and 26.0% greater than that from the random-effects model when the ICC was 0.01 and 0.05, respectively (comparing an empirical variance of 0.162² with 0.145² for ICC = 0.01, and 0.174² with 0.155² for ICC = 0.05; Table 6, scenario 4). We therefore take the same position as Grizzle [33] and Agresti and Hartzel [12] that, "Although the clinics are not randomly chosen, the assumption of random clinic effects will result in tests and confidence intervals that better capture the variability inherent in the system, more realistically than when clinic effects are considered fixed".
Our results have some implications for the design of multicentre RCTs in the absence of treatment by centre interaction. First, regardless of the predetermined allocation ratio, permuted block randomization (with relatively small block sizes) should be used to maintain approximate balance or orthogonality (i.e. the same treatment allocation proportion across centres [7]) between treatments and centres, so that their individual effects can be evaluated independently. Variable block sizes can be used to strengthen allocation concealment. Second, for a given sample size, the number of patients randomized in the majority of centres should be sufficiently large to ensure a reliable estimate of the within-centre variation. Third, it is essential for investigators to obtain a rough estimate of the ICC for within-centre responses, through a literature review or a pilot study. To reach a nominal power of 80% or 90%, centre effects should be taken into consideration in the sample size assessment. When centre effects are included in the analysis without treatment by centre interaction, the analysis is more powerful than a two-sample t-test. One method of assessing sample size is to start with a two-sample t-test calculation for continuous outcomes (ignoring centre effects) and then multiply the original estimated error variance by a variance inflation factor of 1/(1-ICC); this factor increases the required sample size. Ignoring centre effects thus results in the larger sample size in the absence of interaction. Sample size determined using the information sandwich covariance of the GEE model could lead to a slight loss of power when the number of centres is small (<40) and no proper adjustment is made. Lastly, there is no particular reason to require equal numbers of patients to be enrolled in all participating centres, and this is seldom the case in practice. Throughout the simulations, we observed similar results for studies of equal and varying centre sizes.
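To make the sample size method concrete, here is a hedged sketch under hypothetical planning values (the effect size, variances and ICC below are illustrative, not taken from the paper): the error variance for the unadjusted t-test analysis is the within-centre variance inflated by 1/(1 - ICC), which yields the larger required sample size.

```python
import math

def n_per_arm(delta, sigma2):
    """Per-arm n for a two-sided 5% test with 90% power (normal approximation)."""
    z = 1.959964 + 1.281552          # z_{0.975} + z_{0.90}
    return math.ceil(2 * sigma2 * z ** 2 / delta ** 2)

delta = 0.30           # hypothetical target difference in means
sigma2_within = 0.90   # hypothetical within-centre error variance from a pilot
icc = 0.10

n_adjusted = n_per_arm(delta, sigma2_within)           # centre-adjusted analysis
n_naive = n_per_arm(delta, sigma2_within / (1 - icc))  # t-test ignoring centre
print(n_adjusted, n_naive)   # prints 211 234: ignoring centre needs more patients
```

With these illustrative numbers, planning for the unadjusted analysis requires 234 rather than 211 patients per arm, matching the point that ignoring centre effects gives the larger sample size.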
In this study, we considered three scenarios reflecting the particular centre composition of the COMPETE II trial. For discussion of the potential impact of enrolment patterns on the point and interval estimates of the treatment effect, readers can refer to the publications on random versus predetermined enrolment, and on the relative efficiency of equal versus unequal cluster sizes, in the reference list [34,35].
The current ICH E9 guideline recommends that researchers investigate the treatment effect using a model that allows for centre differences, in the absence of treatment by centre interaction [1]. However, it is implausible or impractical to include centre effects in statistical modelling, or to stratify randomization by centre, when it is anticipated from the start that the trial may have very few subjects per centre. As acknowledged in the document, these recommendations are based on fixed-effects models; mixed-effects models may also be used to explore the centre and treatment by centre interaction effects, especially when the number of centres is large [1]. Our simulation results indicated that when a considerable number of centres contain only a few patients, adjusting for centre as a fixed effect may lead to reduced precision (depending on the distribution of patients between arms) compared with the naïve unadjusted analysis. Our work complements the ICH E9 guideline by studying the impact of intraclass correlation on the assessment of treatment effects, a challenge that is seldom discussed although routinely faced by investigators. Our investigation suggests that (1) ignoring centre effects completely may cause substantial overestimation of the standard error, a spurious increase in the coverage of the confidence interval, and a reduction in power; and (2) mixed-effects and GEE models, if employed appropriately, can produce accurate and precise effect estimates regardless of the degree of clustering. We recommend considering these methods in the development of future guidelines.
When the number of patients per centre is very small, it is not practical to include centre as a fixed effect in the analysis of patient-level data, as centre effects cannot be reliably estimated and the precision of the treatment effect will be compromised. In fact, in extremely small centres all patients may be allocated to the same treatment group, and such centres are ignored by the fixed-effects model [36-39]. The alternatives include collapsing all centres to perform a two-sample t-test, collapsing smaller centres into an artificial centre treated as a fixed effect, and exploring the other models discussed above. The mixed-effects model uses small centres more efficiently by "borrowing" information from larger centres. The GEE approach models the average treatment difference across all centres and adjusts for centre effects through a uniform correlation structure. This is an intuitively more efficient model, which unfortunately did not always converge when the number of patients per centre was highly variable (simulation scenarios 7 and 8). In the current study, non-convergence problems were more likely to arise for very small or large ICC values (less than 0.1 or greater than 0.4 for block sizes 2 or 4) due to non-positive definite working correlation matrices, and the frequency could be as high as 10% after 2000 iterations. Conversely, convergence problems did not occur for the mixed-effects models in any scenario. Our results show that analyzing trials consisting of very small centres (i.e. those containing fewer than 2 patients per arm) using centre-level models may not be an optimal strategy, because the within-centre standard deviation of the treatment difference cannot be estimated for such centres, and consequently these very small centres are excluded from the analysis.
Results of two large empirical studies and one systematic review of cluster RCTs in primary care clinics suggested that most ICC values for physical, functional and social measures were less than 0.10 [26-28]. The ICC estimated in the COMPETE II trial using the GEE and linear mixed-effects models, on the other hand, was 0.124 and 0.138, respectively. We chose to include rare yet possible large ICC values (0 to 0.75) in this simulation to examine the overall trend of model performance by ICC, and for the purposes of completeness and generalizability. Readers should anticipate the ICC values likely to emerge from their own studies when interpreting these results. Throughout this work, we quantified the correlation among subjects within a centre using the ICC, the most commonly used measure of clustering in the biomedical literature. As indicated in previous sections, the ICC reflects the interplay of two variance components in multicentre data: the between-centre variance and the within-centre variance. These variance components are relatively easy to interpret when continuous outcomes are analyzed with linear models. For the analysis of binary or time-to-event data from multicentre trials using generalized mixed and frailty models, the interpretation of centre heterogeneity can present challenges because the random effects are linked to the outcome via non-linear functions [40]. Reparameterization of the probability density function may be used to assess the impact of the within- and between-centre variances; interested readers can refer to Duchateau and Janssen [40] for more details.
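The variance-components definition of the ICC referred to above can be written down directly. The numbers here are illustrative only: the between-centre variance is chosen to give a value near the COMPETE II estimates, and is not computed from trial data.

```python
def icc(sigma2_between, sigma2_within):
    """Intraclass correlation: the share of total outcome variance between centres."""
    return sigma2_between / (sigma2_between + sigma2_within)

# an illustrative between/within split of 0.15 vs 1.0 gives an ICC of about 0.13
print(round(icc(0.15, 1.0), 3))   # prints 0.13
```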
A major limitation of this study is that it did not address model performance when treatment by centre interaction exists. Such interactions may be due to different patient populations or a variable standard of care. Interested readers may consult Moerbeek et al [6] for formulas for the variance of the treatment effect estimate in different models, and Jones et al [14] for simulation results. Future studies addressing interaction effects in multicentre RCTs are needed. The datasets in the current paper were generated based on a moderate treatment effect, reflected by the standardized mean difference between the treatment and control groups. More or less prominent treatment effects are also likely to occur in clinical studies, and similar findings would be expected. The current study investigated continuous outcomes in two groups from a frequentist perspective. The models discussed above can be naturally extended to compare three or more treatments. Agresti and Hartzel [12] surveyed different methods for evaluating treatments on binary outcomes in multicentre RCTs. Non-parametric approaches and Bayesian methods are also available to obtain treatment contrasts; interested readers can refer to Aitkin [41], Gould [11], Smith et al [42], Legrand et al [16], and Louis [43], to name a few.
Conclusions
We used simulations to investigate the performance of six statistical approaches that have been advocated for analyzing continuous outcomes in multicentre RCTs. Our simulation study showed that all six models produced unbiased estimates of the treatment effect in multicentre trials with individual patient randomization. Adjusting for centre as a random effect resulted in more efficient effect estimates in all scenarios, over a wide spectrum of ICC values and various centre compositions. The fixed-effects model performed comparably to the mixed-effects model under most circumstances, but lost efficiency when many centres contained a relatively small number of patients. The GEE model underestimated the standard error of the effect estimates when a small number of centres was involved, and did not always converge for very large or small ICC values when centre size was variable. The two-sample t-test severely overestimated the standard error given moderate to large ICC values. The relative efficiency of statistical modelling of treatment contrasts was also affected by the ICC, the distribution of patient enrolment, centre size and the number of centres.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
RC participated in the design of the study, simulation, analysis and interpretation of data, and drafting and revision of the manuscript. LT contributed to the conception and design of the study, interpretation of data and revision of the manuscript. JM contributed to the design of the study and revision of the manuscript. AH contributed to acquisition of data and critical revision of the manuscript. EP and PJD advised on critical revision of the manuscript for important intellectual content. All authors have read and approved the final manuscript.
Acknowledgements
We thank the reviewers and Dr. Werner Vach for their constructive comments that have helped us to improve this work. We thank Ms. Ruthanne Cameron for proofreading the manuscript. Dr. Thabane is a clinical trials mentor for the Canadian Institutes of Health Research. Rong Chu was supported in part by funding from the CANNeCTIN (Canadian Network and Centre for Trials Internationally) training award, Father Sean O'Sullivan Research Center (FSORC) Studentship award and the Canadian Institute of Health Research (CIHR) doctoral research award.
References

1. International Conference on Harmonisation E9 Expert Working Group: ICH Harmonised Tripartite Guideline: Statistical principles for clinical trials.
2. Lachin JM: Biostatistical methods: the assessment of relative risks. 1st edition. New York: John Wiley and Sons; 2000.
3. Holbrook A, Thabane L, Keshavjee K, Dolovich L, Bernstein B, Chan D, Troyan S, Foster G, Gerstein H, COMPETE II Investigators: Individualized electronic decision support and reminders to improve diabetes care in the community: COMPETE II randomized trial. CMAJ 2009, 181(1-2):37-44.
4. Donner A, Klar N: Design and analysis of cluster randomization trials in health research. 1st edition. London: Arnold; 2000.
5. Kerry SM, Bland JM: The intracluster correlation coefficient in cluster randomisation. BMJ 1998, 316(7142):1455.
6. Moerbeek M, van Breukelen GJ, Berger MP: A comparison between traditional methods and multilevel regression for the analysis of multicenter intervention studies. J Clin Epidemiol 2003, 56(4):341-350.
7. Pickering RM, Weatherall M: The analysis of continuous outcomes in multicentre trials with small centre sizes. Stat Med 2007, 26(30):5445-5456.
8. Worthington H: Methods for pooling results from multicenter studies. J Dent Res 2004, 83(Spec No C):C119-C121.
9. Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika 1986, 73:13-22.
10. Whitehead A: Meta-analysis of controlled clinical trials. 1st edition. Chichester: John Wiley and Sons; 2002.
11. Gould AL: Multicentre trial analysis revisited. Stat Med 1998, 17(15-16):1779-1797; discussion 1799-1800.
12. Agresti A, Hartzel J: Strategies for comparing treatments on a binary response with multicentre data. Stat Med 2000, 19(8):1115-1139.
13. Fleiss JL: Analysis of data from multiclinic trials. Control Clin Trials 1986, 7(4):267-275.
14. Jones B, Teather D, Wang J, Lewis JA: A comparison of various estimators of a treatment difference for a multicentre clinical trial. Stat Med 1998, 17(15-16):1767-1777; discussion 1799-1800.
15. Glidden DV, Vittinghoff E: Modelling clustered survival data from multicentre clinical trials. Stat Med 2004, 23(3):369-388.
16. Legrand C, Ducrocq V, Janssen P, Sylvester R, Duchateau L: A Bayesian approach to jointly estimate centre and treatment by centre heterogeneity in a proportional hazards model. Stat Med 2005, 24(24):3789-3804.
17. Brown HK, Kempton RA: The application of REML in clinical trials. Stat Med 1994, 13(16):1601-1617.
18. McLean RA, Sanders WL: Approximating the degrees of freedom for SE's in mixed linear models. In Proceedings of the Statistical Computing Section of the American Statistical Association. New Orleans, Louisiana; 1988.
19. Twisk J: Applied longitudinal data analysis for epidemiology: a practical guide. Cambridge: Cambridge University Press; 2003.
20. Ukoumunne OC, Carlin JB, Gulliford MC: A simulation study of odds ratio estimation for binary outcomes from cluster randomized trials. Stat Med 2007, 26(18):3415-3428.
21. Senn S: Some controversies in planning and analysing multicentre trials. Stat Med 1998, 17(15-16):1753-1765; discussion 1799-1800.
22. Lin Z: An issue of statistical analysis in controlled multicentre studies: how shall we weight the centres? Stat Med 1999, 18(4):365-373.
23. Kallen A: Treatment-by-centre interaction: what is the issue?
24. DerSimonian R, Laird N: Meta-analysis in clinical trials. Control Clin Trials 1986, 7(3):177-188.
25. Cohen J: Statistical Power Analysis for the Behavioral Sciences. 2nd edition. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
26. Adams G, Gulliford MC, Ukoumunne OC, Eldridge S, Chinn S, Campbell MJ: Patterns of intracluster correlation from primary care research to inform study design and analysis. J Clin Epidemiol 2004, 57(8):785-794.
27. Parker DR, Evangelou E, Eaton CB: Intraclass correlation coefficients for cluster randomized trials in primary care: the cholesterol education and research trial (CEART). Contemp Clin Trials 2005, 26(2):260-267.
28. Smeeth L, Ng ES: Intraclass correlation coefficients for cluster randomized trials in primary care: data from the MRC Trial of the Assessment and Management of Older People in the Community. Control Clin Trials 2002, 23(4):409-421.
29. R Core Development Team: R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2005.
30. Parzen M, Lipsitz S, Dear K: Does clustering affect the usual test statistics of no treatment effect in a randomized clinical trial? Biometrical Journal 1998, 40:385-402.
31. Mancl LA, DeRouen TA: A covariance estimator for GEE with improved small-sample properties. Biometrics 2001, 57(1):126-134.
32. Murray DM, Varnell SP, Blitstein JL: Design and analysis of group-randomized trials: a review of recent methodological developments. Am J Public Health 2004, 94(3):423-432.
33. Grizzle JE: Letter to the editor. Control Clin Trials 1987, 8(4):392-393.
34. van Breukelen GJ, Candel MJ, Berger MP: Relative efficiency of unequal cluster sizes for variance component estimation in cluster randomized and multicentre trials. Stat Methods Med Res 2008, 17(4):439-458.
35. Fedorov V, Jones B: The design of multicentre trials. Stat Methods Med Res 2005, 14(3):205-248.
36. Localio AR, Berlin JA, Ten Have TR, Kimmel SE: Adjustments for center in multicenter studies: an overview. Ann Intern Med 2001, 135(2):112-123.
37. Breslow NE, Day NE: Statistical Methods in Cancer Research. Volume I: The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer; 1980.
38. Cox DR, Hinkley DV: Theoretical Statistics. London: Chapman & Hall; 1974.
39. Graubard BI, Korn EL: Regression analysis with clustered data. Stat Med 1994, 13(5-7):509-522.
40. Duchateau L, Janssen P: Understanding heterogeneity in generalized mixed and frailty models. The American Statistician 2005, 59(2):143-146.
41. Aitkin M: A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 1999, 55(1):117-128.
42. Smith TC, Spiegelhalter DJ, Thomas A: Bayesian approaches to random-effects meta-analysis: a comparative study. Stat Med 1995, 14(24):2685-2699.
43. Louis TA: Using empirical Bayes methods in biopharmaceutical research. Stat Med 1991, 10(6):811-827; discussion 828-829.