In many randomized and non-randomized comparative trials, researchers measure a continuous endpoint repeatedly in order to decrease intra-patient variability and thus increase statistical power. There has been little guidance in the literature as to selecting the optimal number of repeated measures.
The degree to which adding a further measure increases statistical power can be derived from simple formulae. This "marginal benefit" can be used to inform the optimal number of repeat assessments.
Although repeating assessments can have dramatic effects on power, the marginal benefit of an additional measure decreases rapidly as the number of measures rises. There is little value in increasing the number of either baseline or post-treatment assessments beyond four or, where baseline assessments are taken, beyond seven. An exception is when correlations between measures are low, as in episodic conditions such as headache.
The proposed method offers a rational basis for determining the number of repeated measures in repeated measures designs.
Many studies measure a continuous endpoint repeatedly over time. In some cases, this is because researchers wish to judge the time course of a symptom or to evaluate how the effect of a treatment changes over time. For example, in one study patients were evaluated every three months after thoracic surgery to determine the incidence and duration of chronic postoperative pain. The researchers found that the incidence of pain at one year was high and only slightly lower than at three months, showing that post-thoracotomy pain is common and persistent. In such studies, the number and timing of repeated measures need to be decided on a study-by-study basis depending on the scientific interests of the investigators.
Measures may also be repeated in order to obtain a more precise estimate of an endpoint. In simple terms, measure a patient once and they may be having a particularly good or bad day; measure them several times and you are more likely to get a fair picture of how they are doing in general. Repeat assessment reduces intra-patient variability and thus increases study power. This is of particular relevance to comparative studies. For instance, in a randomized trial of soy and placebo for cancer-related hot flashes, patients recorded the number of hot flashes they experienced each day during a baseline assessment period and then during treatment. In this case, the researchers were interested in the change between baseline and follow-up in each group so as to determine drug effect; the time course of symptoms was not at issue. The researchers therefore took the mean of each patient's hot flash score during the baseline period and subtracted the mean of the final four treatment weeks to create a change score. Change scores were compared between groups using a t-test. In addition to means, post-randomization measures may also be summarized by area-under-the-curve or slope scores, which are particularly relevant if treatment effects diverge over time.
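As a concrete illustration of this summary-score approach, the following sketch computes a change score for a single hypothetical patient; the counts and week structure are invented for illustration, not taken from the trial.

```python
from statistics import mean

# Hypothetical daily hot flash counts for one patient (invented data):
# a one-week baseline period, then 12 treatment weeks of 7 days each.
baseline = [8, 6, 7, 9, 7, 8, 6]
treatment_weeks = [[7, 6, 6, 5, 6, 5, 5]] * 8 + [[4, 3, 4, 3, 3, 4, 3]] * 4

# Summarize as in the text: mean over the baseline period minus the
# mean over the final four treatment weeks gives one change score per
# patient; positive values indicate fewer hot flashes on treatment.
final_four = [count for week in treatment_weeks[-4:] for count in week]
change_score = mean(baseline) - mean(final_four)
print(round(change_score, 2))  # 3.86
```

Each patient's data thus reduces to a single number, so the between-group comparison becomes an ordinary two-sample t-test on the change scores.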
There has been little guidance in the methodologic literature as to how researchers should select the number of repeated measures for repeated measures designs. In the few papers that have discussed power and repeat measurement (for example, Frison and Pocock), the number of measures is seen as a fixed design characteristic, with sample size derived accordingly. Perhaps as a corollary, randomized and other comparative trials involving repeated measures almost invariably lack a statistical rationale for the number of measures taken. Measures are most commonly taken at particular temporal "landmarks", such as the beginning of each chemotherapy cycle, or each day during treatment. Apparently little consideration is given to how increasing or reducing the number of measures affects power.
Consequently, it is not difficult to find studies that appear to have either too few or too many repeat assessments. In a trial of acupuncture for back pain, for example, pain was measured on a visual analog scale (VAS) once at baseline and once following treatment. The standard deviations were very high: mean post-treatment score was 38 mm with a standard deviation of 28 mm (recalculated using raw data from the authors). Part of this variability in pain scores reflects intra-patient variability that would have been reduced had the VAS been repeated several times. This would surely have been feasible in this population. There are also numerous studies where extremely large numbers of measures were taken, far beyond the point where additional measures would have improved precision to an important degree. For example, in a trial of a topical treatment for HIV-related peripheral neuropathy, patients were required to record pain four times a day for four weeks at baseline and at follow-up, a total of 224 data points. No rationale was provided for such extensive data collection and there was clearly a cost: 46% of patients dropped out before the end of the trial. In the hot flashes example given above, symptoms were measured every day for four weeks at baseline and for 12 weeks following randomization, a total of 112 data points. The authors do not explain why such extensive data collection was required to answer the study question.
In this paper I argue that the number of repeat measures should not be seen as a fixed design characteristic; rather, it is a design choice that can be informed by statistical considerations. I then outline a method for guiding decisions concerning the number of repeat measures and deduce several rules of thumb that can be applied in trial design.
To determine an optimal number of repeat measures, I start from the premise that the ideal number from a statistical viewpoint is infinite, as this would maximally reduce intra-patient variance; in terms of researcher and patient time and effort, however, it would be best if only a single assessment were made. Increasing the number of repeat assessments thus has a benefit in statistical terms that is offset by cost. Whereas cost can be estimated only in general terms by researchers (would patients put up with another questionnaire? how much time would an additional range of motion assessment take?), the statistical benefits can be quantified. In the following, I describe the formulae for determining the relative benefit of additional repeat assessments for statistical power and deduce some general design principles.
The key question is the degree to which adding a further measure – for example, assessing pain five times rather than four – increases statistical power. This is known as the "marginal benefit" of repeat measurement. We will start with the situation where data are recorded only after intervention. This is typical in trials of acute sequelae of a predictable event, for example, post-operative pain, chemotherapy nausea or muscle soreness following exercise. It can be shown (see Figure 1) that the required sample size per group (n) is proportional to a function of the number of measurements (r) and the mean correlation between measurements (ρ):

n ∝ (1 + (r − 1)ρ)/r

The marginal change in sample size for r + 1 compared to r assessments is therefore:

n(r+1)/n(r) = r(1 + rρ) / [(r + 1)(1 + (r − 1)ρ)]
This equation does not require that measurements be equally spaced or that correlations between measurements be constant.
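A minimal sketch of these quantities in Python (the function names are mine, and rho stands for the single mean correlation between measurements):

```python
# Relative sample size when the endpoint is the mean of r post-treatment
# measures with mean inter-measure correlation rho: (1 + (r - 1)rho) / r.

def design_effect(r: int, rho: float) -> float:
    """Sample size multiplier relative to a single measurement."""
    return (1 + (r - 1) * rho) / r

def marginal_ratio(r: int, rho: float) -> float:
    """Sample size with r + 1 measures as a fraction of that with r."""
    return design_effect(r + 1, rho) / design_effect(r, rho)

# With rho = 0.65, two measures need 82.5% of the single-measure sample
# size, and a third measure cuts the requirement to ~93% of that.
print(round(design_effect(2, 0.65), 3))   # 0.825
print(round(marginal_ratio(2, 0.65), 3))  # 0.929
```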
It is common that trials investigate an endpoint that can be informatively measured before treatment. In trials of back pain, hypertension or obesity, for example, researchers want to test whether an intervention reduces patients' pain scores, blood pressure or weight from a baseline value. Typically the endpoint in such trials is measured one or more times at baseline and again following treatment. Baseline and post-treatment scores are summarized separately and change analyzed.
Analysis of covariance (ANCOVA) has been repeatedly demonstrated to be the most powerful method of analysis for this type of trial [5,8,9]. The following discussion will thus refer only to ANCOVA (rather than, say, a t-test of change between baseline and follow-up). Frison and Pocock have derived a generalized sample size equation that can be used to assess power for ANCOVA where baseline measures are taken before treatment:

n ∝ (1 + (r − 1)ρpost)/r − p·ρmix² / (1 + (p − 1)ρpre)

where p is the number of baseline measures, r is the number of follow-up measures, and the subscripts pre, post and mix refer, respectively, to the mean correlations within baseline measurements, within follow-up measurements and between baseline and follow-up measures (Figure 2).
As is the case for trials without baseline measures, there is no requirement that correlations be equal or that assessments be equally spaced.
Frison and Pocock report that typical figures for ρpre, ρpost and ρmix are 0.7, 0.7 and 0.5. Some figures from my own studies are given in Table 1. In general, these data support Frison and Pocock's generalization. Exceptions include episodic conditions, such as headache, in which case correlations are lower, and where the study outcome is measured immediately before and after a single treatment session, in which case correlations are higher. The correlations in Table 1 can be used to determine the marginal benefit of additional measures for typical trials.
Table 1. Empirical estimates of correlations from a variety of studies
Trials without baseline measures
Table 2 shows the marginal relative decreases in sample size given various numbers of assessments and correlations. For example, if the correlation between measures is 0.65, increasing the number of measures from two to three decreases sample size requirements by about 6%. As correlation is inversely related to intra-patient variability, additional measures are of greatest value when correlation is low. It is also clear that repeating measurements more than a few times has little effect on power. For example, for a correlation of 0.65, taking four repeated measures reduces sample size by only a further 3% compared to three assessments, a negligible value in the context of power calculation.
Table 2. Marginal decrease in sample size for increasing the number of measures given various correlations between measures. The table refers to the case where no baseline measures are taken.
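The kind of figures shown in Table 2 can be reproduced from the sample size relation for r repeated measures. A sketch, assuming (consistently with the 6% and 3% figures quoted in the text) that each marginal decrease is expressed in percentage points of the single-measure sample size:

```python
def design_effect(r: int, rho: float) -> float:
    # Relative sample size when averaging r measures, mean correlation rho.
    return (1 + (r - 1) * rho) / r

def marginal_decrease(r: int, rho: float) -> float:
    # Percentage-point saving from taking r + 1 rather than r measures,
    # relative to the sample size needed with a single measure.
    return 100 * (design_effect(r, rho) - design_effect(r + 1, rho))

print(round(marginal_decrease(2, 0.65)))  # 6: from two to three measures
print(round(marginal_decrease(3, 0.65)))  # 3: from three to four measures
```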
Trials with baseline measures
Tables 3, 4 and 5 show the effect on sample size of increasing the number of follow-up and baseline assessments given different values of ρpre, ρpost and ρmix. It is assumed for Tables 1, 2, 3, 4 and 5 that neither the mean of the measures nor the mean correlation between measures depends on the number of measures. This will generally be the case where, for example, a decision needs to be made whether to measure the severity of a chronic condition for one or two weeks at baseline. However, care should be taken with possible exceptions. An example might be if an endpoint was measured twice a day instead of just once: correlations between measurements taken 12 hours apart might be higher than those taken 24 hours apart. A second possible exception is acute conditions of limited duration: measuring post-surgical pain for seven days rather than four will not improve precision if few or no patients are in pain after day four.
Table 3. Sample sizes for various combinations of baseline (p) and follow-up (r) measures. Correlations ρpre, ρpost and ρmix are 0.7, 0.7 and 0.5. Results given relative to a trial with a single baseline and follow-up measure.
Table 4. Sample sizes for various combinations of baseline (p) and follow-up (r) measures. Correlations ρpre, ρpost and ρmix are 0.5, 0.5 and 0.5. Results given relative to a trial with a single baseline and follow-up measure.
Table 5. Sample sizes for various combinations of baseline (p) and follow-up (r) measures. Correlations ρpre, ρpost and ρmix are 0.9, 0.9 and 0.8. Results given relative to a trial with a single baseline and follow-up measure.
Table 3 gives the most common situation of moderate correlation between baseline and follow-up measures and high correlation within measures. Table 4 shows moderate correlation for within and between measures, typical in an episodic condition. Table 5 shows very high correlations for studies where assessments are taken close together, or in the case of measures with low intra-patient variability such as laboratory data.
As an example, given the most common case of ρpre, ρpost and ρmix at 0.7, 0.7 and 0.5, a trial with four baseline and four follow-up measurements would require 60% of the number of patients of a trial with just one baseline and one follow-up measure; a trial with seven assessments at baseline and follow-up would require 54% as many patients. The same figures are shown in a different format in Table 6, which gives the relative decrease in sample size for a number of different combinations of follow-up and/or baseline assessments. For example, a trial with seven baseline and follow-up measures would require 10% fewer patients than a trial with four of each type of measure where ρpre, ρpost and ρmix are 0.7, 0.7 and 0.5.
Table 6. Relative decrease in sample size given various scenarios for increasing the number of baseline (p) or follow-up (r) measures.
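A sketch of Frison and Pocock's ANCOVA variance factor, which reproduces the 60% and 54% figures quoted in the text (the function names are mine):

```python
# Frison and Pocock's variance factor for ANCOVA with p baseline and
# r follow-up measures (mean correlations rho_pre within baseline,
# rho_post within follow-up, rho_mix between the two).

def ancova_design_effect(p, r, rho_pre, rho_post, rho_mix):
    return (1 + (r - 1) * rho_post) / r - p * rho_mix**2 / (1 + (p - 1) * rho_pre)

def relative_n(p, r, rho_pre=0.7, rho_post=0.7, rho_mix=0.5):
    # Sample size relative to a trial with one baseline and one follow-up.
    return (ancova_design_effect(p, r, rho_pre, rho_post, rho_mix)
            / ancova_design_effect(1, 1, rho_pre, rho_post, rho_mix))

print(round(relative_n(4, 4), 2))  # 0.6  -> 60% of the patients
print(round(relative_n(7, 7), 2))  # 0.54 -> 54% of the patients
```

Dividing the two figures gives the roughly 10% saving for seven versus four measures of each type.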
Some general patterns emerge:
1. Repeating measures can have dramatic effects on power. Increasing the number of follow-up and/or baseline measures from one to three or four can reduce sample sizes by 35–70%. However, the increase in power from each additional measure decreases rapidly as the number of assessments rises.
2. Under the assumption that ρpre and ρpost are similar, it is more valuable to increase the number of follow-up than baseline assessments. This makes intuitive sense: we should be more concerned about the precision of an endpoint than that of a covariate.
3. The marginal value of additional follow-up assessments is higher where baseline measurements are taken. Take the case where ρpre and ρpost are 0.7 and ρmix is 0.5. For a trial without baseline measures, increasing the number of post-treatment assessments from four to seven decreases sample size by about 4%. The corresponding figures for trials with one or four baseline measures are 6% and 7%. Nonetheless, with the exception of the scenario described in point 4 below, there is little value in increasing the number of either baseline or post-treatment assessments beyond six or seven.
4. The only situation where it is worthwhile to make more than six or seven assessments is when correlation is moderate and similar between all time periods. This is most likely to be the case for episodic conditions such as headache, where scores at any one time will be poorly correlated with scores at any other time.
Investigators may measure a continuous endpoint repeatedly because they wish to judge the time course of a symptom. In such cases, the number of repeat measures will depend upon the scientific interests of the investigators. Alternatively, investigators may use repeat measurement to increase the precision of an estimate. Though this is a particular concern for randomized or non-randomized comparative studies, it is also pertinent to a variety of other research designs: for example, epidemiologic cohort studies may take a measure such as blood pressure, prostate specific antigen or serum micronutrient levels at baseline and then determine whether this predicts development of disease; repeating baselines will improve the precision of such predictions.
Where measures are repeated to improve precision, decisions about the number of repeated measures, that is, the number of within-patient observations, mirror those of standard power calculation, which concerns the number of separate patients observed. In both cases, statistical concerns to minimize variance are balanced by logistical concerns to minimize the number of assessments. Whilst an extensive literature has developed on methods for selecting the number of patients to study, the number of assessments per patient has received little attention, perhaps because it has tended to be seen as a fixed characteristic of any particular trial design. Here I have shown that simple statistical considerations can guide the number of repeated measures in repeated measures designs. Given the most common correlation structure, taking four baseline and seven follow-up measures dramatically improves power compared to a single baseline and follow-up measure; where no baseline is taken, four follow-up measures improve power importantly; however, the marginal value of additional measures rapidly diminishes.
Van Patten CL, Olivotto IA, Chambers GK, Gelmon KA, Hislop TG, Templeton E, Wattie A, Prior JC: Effect of Soy Phytoestrogens on Hot Flashes in Postmenopausal Women With Breast Cancer: A Randomized, Controlled Clinical Trial.
Matthews JNS, Altman DG, Campbell MJ, Royston P: Analysis of serial measurements in medical research. BMJ 1990, 300:230-235.
Frison L, Pocock SJ: Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Stat Med 1992, 11:1685-1704.
Grant DJ, Bishop-Miller J, Winchester DM, Anderson M, Faulkner S: A randomized comparative trial of acupuncture versus transcutaneous electrical nerve stimulation for chronic back pain in the elderly.
Stat Med 1994, 13:197-198.
Irnich D, Behrens N, Molzen H, Konig A, Gleditsch J, Krauss M, Natalis M, Senn E, Beyer A, Schops P: Randomised trial of acupuncture compared with conventional massage and "sham" laser acupuncture for treatment of chronic neck pain.
Irnich D, Behrens N, Gleditsch J, Stor W, Schreiber MA, Schops P, Vickers AJ, Beyer A: Immediate effects of dry needling and acupuncture at distant points in chronic neck pain: results of a randomized, double-blind, sham-controlled crossover trial.
Kleinhenz J, Streitberger K, Windeler J, Gussbacher A, Mavridis G, Martin E: Randomised clinical trial comparing the effects of acupuncture and a newly designed placebo needle in rotator cuff tendinitis.