Open Access Research article

A comparison of a new multinomial stopping rule with stopping rules of Fleming and Gehan in single arm phase II cancer clinical trials

John R Goffin1*, Greg R Pond2 and Dongsheng Tu3

Author Affiliations

1 McMaster University, Juravinski Cancer Centre, 699 Concession St., Hamilton, Ontario, L8V 5C2, Canada

2 McMaster University, Ontario Clinical Oncology Group (OCOG), Juravinski Hospital G(60) Wing. 1st Floor, 711 Concession Street, Hamilton, Ontario, L8V 1C3, Canada

3 Dongsheng Tu, NCIC Clinical Trials Group, Queen's University, 10 Stuart Street, Kingston, Ontario, K7L 3N6, Canada

BMC Medical Research Methodology 2011, 11:95  doi:10.1186/1471-2288-11-95

Received:5 November 2010
Accepted:21 June 2011
Published:21 June 2011

© 2011 Goffin et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Response rate (RR) alone may be insensitive to drug activity in phase II trials. Early progressive disease (EPD) could improve sensitivity as well as increase stage I stopping rates. This study compares the previously developed dual endpoint stopping rule (DESR), which incorporates both RR and EPD into a two-stage, phase II trial, with rules using only RR.


Stopping rules according to the DESR were compared with studies conducted under the Fleming (16 trials) or Gehan (23 trials) designs. The RR hypothesis for the DESR was consistent with the comparison studies (ralt = 0.2, rnul = 0.05). Two parameter sets were used for EPD rates of interest and disinterest respectively (epdalt, epdnul): (0.4, 0.6) and (0.3, 0.5).


Compared with Fleming, the DESR was more likely to allow stage II of accrual and to reject the null hypothesis (Hnul) after stage II, with rejection being more common with EPD parameters (0.4, 0.6) than (0.3, 0.5). Compared with Gehan, both DESR parameter sets accepted Hnul in 15 trials after stage I compared with 8 trials by Gehan, with consistent conclusions in all 23 trials after stage II.


The DESR may reject Hnul when EPD rates alone are low, and thereby may improve phase II trial sensitivity to active, cytostatic drugs having limited response rates. Conversely, the DESR may invoke early stopping when response rates are low and EPD rates are high, thus shortening trials when drug activity is unlikely. EPD parameters should be chosen specific to each trial.


The increase in drugs available for study along with the human and resource costs for the conduct of clinical trials requires investigators to revisit trial design [1,2]. Nowhere is this more evident than in oncology, which must contend with more first-in-class drugs, longer development times, more drugs entering large phase III studies, and generally greater costs than other therapeutic areas [3]. In addition, the development of targeted drugs, which may induce limited tumour response, demands phase II trial designs which both minimize resource use and are sensitive and specific to signals of drug activity [4].

When response rate (RR) is used as a single primary endpoint, two sets of stopping rules have served as the basis for many prior two-stage phase II trials. The stopping rules of Gehan stop a trial at the first stage when no response is observed [5]. The sample size for the first stage is based on a specified RR of interest and a beta error rate. If at least one response is observed, the second stage accrues using a sample size based on the desired standard error for the RR estimate and the number of responses observed in stage one. For the stopping rules of Fleming, the investigator specifies RRs of interest and disinterest as well as desired alpha and beta error rates [6]. Calculations determine the sample size in each stage and the minimum number of responses in stage one required to proceed to the second stage. The trial may be stopped after stage I of accrual to accept or reject the null hypothesis. Variations of the two-stage rules, such as those of Simon [7], have been designed to minimize the expected number of enrolled patients when the drug is inactive. Despite the introduction of new study methods, the designs of Gehan, Fleming, and Simon are still in common use [8,9].
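
As an illustration of how the first-stage sample size of the Gehan rule follows from the RR of interest and the beta error rate, the sketch below finds the smallest n1 for which the chance of seeing no responses among n1 patients is at most beta when the true RR equals the RR of interest. The function name is hypothetical and the code is a minimal sketch of this single step, not a full implementation of the Gehan design (the second-stage size is not computed).

```python
def gehan_stage1_size(rr_interest: float, beta: float) -> int:
    """Smallest n1 such that P(0 responses in n1 patients) <= beta
    when the true response rate equals rr_interest.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < rr_interest < 1.0:
        raise ValueError("rr_interest must lie strictly between 0 and 1")
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    n = 1
    # P(no response in n patients) = (1 - rr_interest) ** n
    while (1.0 - rr_interest) ** n > beta:
        n += 1
    return n


if __name__ == "__main__":
    # RR of interest 20%, beta 0.05, as in the EORTC trials in the Methods
    print(gehan_stage1_size(0.20, 0.05))  # 14
```

With an RR of interest of 20% and a beta error rate of 0.05, this returns 14, in agreement with the first-stage size reported for the EORTC trials in the Methods.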

Although RR remains the most common primary endpoint in phase II trials [8], disease stabilization may be a more appropriate endpoint for some agents and has also been associated with improved survival [10,11]. Similarly, a high rate of early progressive disease (EPD), defined here as progression at the first tumour measurement after initiation of treatment, correlates with poor survival [12,13]. Conversely, a low EPD rate may suggest drug activity, and could serve as a warning against early discard of a new agent. A combination of response and EPD as a multinomial endpoint would identify an active drug which produces a high response rate or low EPD rate.

Zee et al first derived stopping rules for a two-stage clinical trial with a multinomial endpoint of RR and EPD [14]. However, it was found that these stopping rules only achieved the desired power for an alternate hypothesis requiring sufficiently high RR and sufficiently low EPD, whereas the study had sought power for an alternate hypothesis allowing for either a favourable RR or a favourable EPD [15]. Recently, a new rule set [16], the Dual Endpoint Stopping Rule (DESR), was derived to address this problem. The new stopping rules offer the desired power as well as high rates of early stopping for drugs meeting the null hypothesis, but have not been applied to real data from phase II clinical trials. The objective of this paper is to compare the DESR with the stopping rules of Fleming and Gehan in a series of phase II trials as summarized by Dent et al [14,17].


The Dual Endpoint Stopping Rule (DESR) for phase II trials with endpoints of response and early progressive disease (EPD) rates is described briefly here; a detailed description, including variations on the rules and sensitivity testing, has been published previously [16]. Specifically, the DESR is based on testing the following hypotheses:

Hnul: r ≤ rnul and epd ≥ epdnul versus Halt: r ≥ ralt or epd ≤ epdalt

where the response rates (rnul, ralt) and early progressive disease rates (epdnul, epdalt) of interest are prespecified. These hypotheses imply that a new drug would be considered of interest for further study if either the response rate, r, was sufficiently high or the early progressive disease rate, epd, was sufficiently low; it is not necessary that both outcomes occur.

After additional study parameters, including the sample sizes for stage I (n1) and stage II (n2) of the trial and the desired alpha error rate and power, are provided, stopping rules are generated by simulations performed using TreeAge Pro Healthcare software (Williamstown, Massachusetts) with the Borderline Value Method [16], which assumes that response and EPD rates of the desirable drugs are not better than r = ralt or epd = epdalt. With the DESR, the trial would be stopped at the first stage after n1 subjects are entered if n1r ≤ n1r-nul and n1p ≥ n1p-nul, where n1r and n1p are respectively the numbers of stage I patients who responded and who had early progression, and n1r-nul and n1p-nul are thresholds of the DESR. Barring stopping, n2 more patients are recruited into the second stage. The null hypothesis will be rejected at the end of the second stage if n1r + n2r ≥ n1r-alt + n2r-alt or n1p + n2p ≤ n1p-alt + n2p-alt, where n2r and n2p are respectively the numbers of stage II patients who responded and who had early progression, n1r-alt + n2r-alt represents the threshold number of responders required after stage II to conclude Halt, and n1p-alt + n2p-alt is similarly the threshold for the stage I and stage II subjects with early progression to conclude Halt.
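
To make the first-stage stopping condition concrete, the sketch below computes the exact probability that a trial is stopped at stage I for a single assumed pair of true rates (r, epd). It assumes that each patient falls into exactly one of three categories (response, early progression, neither), so that the stage I counts are trinomial; this assumption, the function names, and the argument names are introduced here for illustration. It is not the simulation-based Borderline Value Method used to derive the DESR thresholds [16], and its output should not be expected to reproduce the operating characteristics reported for the DESR.

```python
from math import comb


def desr_stage1_stop_prob(r: float, epd: float, n1: int,
                          n1r_nul: int, n1p_nul: int) -> float:
    """P(responses <= n1r_nul and early progressions >= n1p_nul) among
    n1 stage I patients, assuming trinomial outcomes with true rates
    r (response) and epd (early progression). Illustrative only."""
    other = 1.0 - r - epd
    if r < 0.0 or epd < 0.0 or other < -1e-12:
        raise ValueError("r and epd must be non-negative and sum to at most 1")
    other = max(other, 0.0)
    total = 0.0
    for a in range(0, min(n1r_nul, n1) + 1):      # a = number of responders
        for b in range(n1p_nul, n1 - a + 1):      # b = number with EPD
            c = n1 - a - b                        # c = neither outcome
            coef = comb(n1, a) * comb(n1 - a, b)  # trinomial coefficient
            total += coef * r**a * epd**b * other**c
    return total


def expected_sample_size(p_stop: float, n1: int, n2: int) -> float:
    """Expected accrual when stage II (n2 patients) runs only if not stopped."""
    return n1 + (1.0 - p_stop) * n2
```

Evaluating the function at rates satisfying Hnul indicates how often an uninteresting drug would be discarded early under the stated trinomial assumption, and `expected_sample_size` converts that probability into an expected accrual.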

Data from two sets of phase II trials previously studied by Dent et al [17] were used to evaluate the DESR and compare it with the stopping rules of Fleming and Gehan. The first set of these phase II trials was undertaken by the National Cancer Institute of Canada Clinical Trials Group, using the two-stage stopping rule of Fleming. Trials were designed based on testing of hypotheses Hnul: r ≤ 5% and Halt: r ≥ 20%, which allows for continuation to the second stage of accrual (with n2 = 15) if one or more responses are observed among the first n1 = 15 patients. At the second stage, Hnul is rejected if four or more responses are found. The second set of phase II trials was performed by the EORTC using the stopping rule of Gehan. The response rate of interest and beta error rate for the first stage were prespecified respectively as 20% and 0.05, which led to the sample size n1 = 14. Recruitment to the second stage occurs if at least one response is seen, with the size of n2 varying with the number of responses seen in the first stage in conjunction with the desired standard error of the RR estimate. For comparison purposes, (rnul, ralt) was selected as (0.05, 0.2) to derive DESR thresholds. Based on the work of Zee et al and others [12,14], two plausible parameter sets were selected for EPD, (epdnul, epdalt) = (0.6, 0.4) or (0.5, 0.3), to assess the impact of EPD on early stopping.
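
The response-only rule described above for the Fleming trials (continue if at least one response among the first 15 patients; reject Hnul if four or more responses among all 30) can be evaluated exactly with binomial probabilities. The sketch below is an illustration under the assumption that responses are independent with a common true rate r; the function names are hypothetical, and the code implements only the rule as stated in this section rather than the general Fleming procedure [6].

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def rr_two_stage_characteristics(r: float, n1: int = 15, n2: int = 15,
                                 min_stage1: int = 1, min_total: int = 4):
    """Return (P(stop after stage I), P(reject Hnul)) for true response rate r.

    Stage II accrues only if at least min_stage1 responses are seen in
    stage I; Hnul is rejected if total responses reach min_total.
    """
    # Stop early when fewer than min_stage1 responses occur in stage I
    p_stop = sum(binom_pmf(x, n1, r) for x in range(min_stage1))
    p_reject = 0.0
    for x1 in range(min_stage1, n1 + 1):
        needed = max(min_total - x1, 0)  # responses still required in stage II
        tail = sum(binom_pmf(x2, n2, r) for x2 in range(needed, n2 + 1))
        p_reject += binom_pmf(x1, n1, r) * tail
    return p_stop, p_reject


if __name__ == "__main__":
    for rate in (0.05, 0.20):
        stop, reject = rr_two_stage_characteristics(rate)
        print(f"r={rate:.2f}  P(stop at stage I)={stop:.3f}  P(reject Hnul)={reject:.3f}")
```

Calling the function at r = 0.05 and r = 0.20 gives, respectively, the type I error and the power of this particular response-only rule, which can be set beside the values reported for the DESR.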

The alpha error rate and power used to derive DESR thresholds were respectively 0.05 and 0.8, although actual error rates vary from these according to the final thresholds selected by the program [16]. The sample sizes for both stages were set to match those in the Fleming rules, or the actual recruitment to the various EORTC studies, when comparisons were made with the Fleming and Gehan stopping rules respectively.


Table 1 shows the thresholds of the DESR for the null and alternate hypothesis corresponding with the studies utilizing the rules of Fleming. The table is read along the first row of results as follows: With desired study parameters of rnul = 0.05, ralt = 0.2, epdnul = 0.6, epdalt = 0.4, alpha error 0.05, power 0.8, and two stages of accrual of 15 patients each, the trial would be stopped at the first stage to reject the drug (accept the null hypothesis) if there were 1 or fewer responding patients and 8 or more patients with early progressive disease. Otherwise, the second stage would accrue, at the end of which the drug would be accepted (null hypothesis rejected) if 4 or more patients had responded to the drug or 14 or fewer had progressed early. This stopping rule would have an actual power of 0.796, alpha error of 0.025, and an expected number of 16.4 patients accrued if the drug under study was uninteresting (i.e. drug meeting Hnul). Two pairs of null and alternate values for epd are shown.
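
The reading of the first row of Table 1 given above can be written as a pair of decision functions. The sketch below is illustrative only: the class and function names are hypothetical, and the four threshold values are those quoted in the text for the first row (rnul = 0.05, ralt = 0.2, epdnul = 0.6, epdalt = 0.4, n1 = n2 = 15).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesrThresholds:
    stop_max_responses: int    # n1r-nul: stop needs this many responses or fewer
    stop_min_epd: int          # n1p-nul: stop needs this many EPD cases or more
    reject_min_responses: int  # n1r-alt + n2r-alt: responders needed to reject Hnul
    reject_max_epd: int        # n1p-alt + n2p-alt: EPD cases allowed to reject Hnul


# First row of Table 1 as described in the text
TABLE1_ROW1 = DesrThresholds(stop_max_responses=1, stop_min_epd=8,
                             reject_min_responses=4, reject_max_epd=14)


def stop_at_stage1(t: DesrThresholds, n1r: int, n1p: int) -> bool:
    """Stop (accept Hnul) only if responses are few AND early progression is common."""
    return n1r <= t.stop_max_responses and n1p >= t.stop_min_epd


def reject_null_after_stage2(t: DesrThresholds, total_r: int, total_p: int) -> bool:
    """Reject Hnul if responses are sufficient OR early progression is rare enough."""
    return total_r >= t.reject_min_responses or total_p <= t.reject_max_epd


if __name__ == "__main__":
    # One response and nine early progressions among the first 15 patients
    print(stop_at_stage1(TABLE1_ROW1, n1r=1, n1p=9))                 # True
    # No responses but only five early progressions: accrual continues
    print(stop_at_stage1(TABLE1_ROW1, n1r=0, n1p=5))                 # False
    # Two responses and twelve early progressions among all 30 patients
    print(reject_null_after_stage2(TABLE1_ROW1, total_r=2, total_p=12))  # True
```

The "and" in the stage I condition and the "or" in the stage II condition mirror the hypotheses under test: a drug is discarded early only when both endpoints look unfavourable, and is carried forward when either endpoint looks favourable.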

Table 1. Thresholds by DESR to compare with rules of Fleming (n1 = 15, n2 = 15, power = 0.8, alpha = 0.05)

Thresholds for DESR trials sized to match the studies conducted under the rules of Gehan are shown in Tables 2 and 3. Table 2 gives values for epdalt = 0.4, epdnul = 0.6, while Table 3 gives values for epdalt = 0.3, epdnul = 0.5.

Table 2. Thresholds by DESR to compare with the rules of Gehan (ralt = 0.2, rnul = 0.05, epdalt = 0.4, epdnul = 0.6, power = 0.8, alpha = 0.05)

Table 3. Thresholds by DESR to compare with rules of Gehan (ralt = 0.2, rnul = 0.05, epdalt = 0.3, epdnul = 0.5, power = 0.8, alpha = 0.05)

Comparison with the Stopping Rules of Fleming

The comparison of the DESR and Fleming stopping rules for first stage stopping and second stage rejection of the null hypothesis is shown in Table 4. The DESR was more permissive at the first stage. For the EPD parameters epdalt = 0.4, epdnul = 0.6, the DESR allowed 6 of the 10 studies stopped by the Fleming rule to continue to the second stage of accrual, all on the basis of an acceptably low EPD rate. Using the EPD parameters epdalt = 0.3, epdnul = 0.5, the DESR allowed only 2 of these same 10 studies to continue to the second stage. In all cases where the DESR allowed accrual to the second stage but the rules of Fleming did not, the final conclusions about activity of the drugs from the DESR are unknown: there were no data from the second stage of the trials, and we could find no published phase III trial and no U.S. Food and Drug Administration (FDA) indication for the drugs and diseases under study in these phase II trials.

Table 4. Comparison of the DESR and Fleming for Early Stopping and Rejection of Hnul

While six studies (Trials 11 through 16) were permitted to accrue to the second stage according to the Fleming rule, one study (Trial 11) was stopped by the investigators, and this same study would have been stopped at stage I by the DESR. In the remaining five studies, Hnul was rejected at the end of the study by the Fleming rule in two (Trials 12 and 16). By comparison, for the EPD parameters epdalt = 0.4, epdnul = 0.6, the DESR rejected Hnul in all five trials at the end of stage II as a result of acceptable rates of EPD. Conversely, for the EPD parameters epdalt = 0.3, epdnul = 0.5, the DESR stopped three of the five trials at stage I, and rejected Hnul after stage II in two trials (Trials 12 and 15), with one consistent with the conclusion from the Fleming rule (Trial 12). The differences again lay in the threshold for epd in the hypotheses under testing, with the EPD parameter set (epdalt = 0.3, epdnul = 0.5) requiring a lower observed rate of EPD for rejection of Hnul than the EPD parameter set (epdalt = 0.4, epdnul = 0.6). In all cases where the DESR rejected Hnul but Fleming did not, we found no phase III trial to confirm or deny drug activity, and no disease-specific FDA indication was found. The same lack of confirmation was found for Trial 16, in which Hnul was rejected by the Fleming rule but not by the DESR with EPD parameters epdalt = 0.3, epdnul = 0.5.

Comparison with the Stopping Rules of Gehan

Comparing the DESR rules based on two sets of EPD parameters in the cohort of phase II trials conducted under the Gehan design, the choice of null and alternate values for epd did not alter the likelihood of early stopping or rejection of the null hypothesis by the DESR, in part as a result of consistently high rates of EPD in trials 1-15 (see Table 5).

Table 5. Comparison of the DESR and Gehan for Early Stopping and Rejection of Hnul

Of the 23 trials conducted using the Gehan stopping rules, eight would have been stopped at stage I for acceptance of Hnul by both Gehan and the DESR. In actuality, investigators continued seven of those trials (studies 1-7) through the second stage, although in all cases the studies were ultimately negative.

In the other 15 trials (9 to 23), accrual to the second stage was permitted under the stopping rules of Gehan. Of these, seven trials would have been stopped at the first stage by the DESR as a result of high epd rates in conjunction with only a single responding subject in each trial, and in all seven of these trials the rules of Gehan found the same results after accrual of the second stage (i.e., Hnul accepted). In the final eight trials, Hnul was rejected after the second stage by both the Gehan stopping rule and the DESR.


The DESR uses the signal provided by the rate of early progressive disease in an attempt to better discern drug effectiveness compared with response alone [16]. It has been demonstrated that rules can be generated that meet the specified alpha error rate and power; this study assesses the relevance of the DESR when applied to actual patient data from phase II clinical trials [17].

Compared with the stopping rules of Fleming, the DESR was more likely to allow accrual of the second stage. This was more common with the rules specifying epdnul = 0.6 than epdnul = 0.5, as a higher EPD rate was tolerated without early drug rejection in the former case. At the second stage, the DESR with design parameters epdalt = 0.4, epdnul = 0.6 rejected Hnul more frequently than either the Fleming stopping rules or the DESR with parameters epdalt = 0.3, epdnul = 0.5.

A somewhat different result was seen when comparing the DESR and the stopping rules of Gehan. In this instance, 15 studies were stopped at the first stage by the DESR (using both epd design parameter pairs), while only 8 were stopped by Gehan at the first stage, with high rates of EPD triggering the more frequent early stopping by the DESR. The discrepant seven studies ultimately accepted Hnul at the end of the second stage under Gehan stopping rules. For the remaining eight studies allowed to continue to the second stage by the Gehan stopping rules and the DESR, conclusions on Hnul were consistent between the rules.

The DESR is designed to find drugs that have either a desirable rate of response or a desirably low level of early progression. However, because it is designed to find the 'good' drugs among a mixed (50/50) population of drugs having either good response or early progression rates, it appears to require a higher response rate at the end of stage one to allow recruitment of stage two than that required if response is considered in isolation. For this reason, compared with the Gehan stopping rule, the DESR was more likely to stop trials after the first stage of accrual despite a single response being observed in stage I. Conversely, as noted above, the DESR was less likely than the Fleming rules to stop a study at stage I despite a lack of any response, as EPD rates were low enough that the drugs under study might have met the specified level for an interesting agent.

For trials in which response is the clear priority, a set of rules devoted to response only may be more appropriate. However, in the present age of molecularly targeted anti-cancer agents, the likelihood of an investigational agent inducing tumour shrinkage or preventing tumour growth is often unclear prior to initiating phase II studies.

In the absence of suitable rules, examples are readily found of investigators setting a primary endpoint of response, a drug failing to meet that response, but the drug being declared interesting for further study based on other desirable characteristics [18,19].

Other authors have investigated the use of multiple endpoints in phase II trials. Zee et al generated a set of stopping rules similar to the DESR, but later found that the rules generated had poorer power than intended [14,15]. However, when each was compared with Gehan's stopping rules in the same data set, the DESR and the stopping rules of Zee et al gave very similar results [17]. Although only the design parameter pair epdalt = 0.4, epdnul = 0.6 was considered in the paper which applied their rules [17], both the DESR and the stopping rules of Zee et al stop the first 15 trials at stage I and reject Hnul after stage II in the remaining trials, with high EPD rates being the common reason for early stopping. Conversely, considering drugs studied under the Fleming stopping rules, the DESR was less likely to accept Hnul at the end of stage I, and so more likely to recruit to stage II. The conclusions at the end of stage II were more difficult to compare, as many of the actual trials did not recruit to the second stage. While the DESR remained more likely to reject Hnul for the design parameter pair epdalt = 0.4, epdnul = 0.6, it may have been less likely to reject Hnul with the pair epdalt = 0.3, epdnul = 0.5, suggesting that the results are sensitive to changes in the design EPD parameters.

In an analogous paper, Panageas et al consider a rule set where response is divided into complete and partial response, and levels of interest and disinterest are again specified for the null and alternate hypothesis [20]. This rule set is potentially attractive for highly responsive cancers such as germ cell tumours, where complete responses are more frequent. However, it may be less applicable in the setting of most phase II trials involving previously treated malignancies and targeted drugs with uncertain tumour effects. In this setting, complete responses may be infrequent, and modest response rates or non-progression may suggest drug activity and lead to drug approval [8]. A slight modification to this design can be made which substitutes response and stable disease for complete response and partial response, similar to the DESR design. However, the study power calculated when using the Panageas design may actually be overestimated, thus underestimating the number of patients needed. This is because power is calculated assuming ralt and epdalt are simultaneously at the exact minimum response rate and maximum early progressive disease rate of interest for further study for the novel agent. The DESR design using the borderline method varies ralt and epdalt while maintaining power. Both endpoints do not have to be simultaneously at the boundary of interest, potentially giving a more accurate estimate of statistical power.

One limitation to the present study is that it applies arbitrary epdalt and epdnul pairs to existing data. Individualized epd rates may be more relevant to a given drug and give different results, although the pairs chosen were felt to be commonly plausible. Additionally, although the results presented are only for trials in which the Hnul for response rate is 0.05, the DESR method can be implemented for trials with higher null response rates. This comparison was not performed due to a critical lack of published phase II trials which present response and EPD rates at both stage I and II. It is also unknown whether actual efficacy might have been seen when the DESR rejected Hnul but the Fleming rule did not, as subsequent phase III studies were not conducted.


In conclusion, while the number of trials in our study is small, different patterns of early stopping and final rejection of Hnul are evident with the addition of EPD as an endpoint. With limited follow-up in terms of phase III studies, the final benefit in terms of drug development is not certain. However, the DESR may shorten studies where response rates are low but high EPD rates suggest the ultimate efficacy will be poor. Conversely, the DESR will allow accrual to the second stage in the absence of response when there are few patients with EPD, and this may allow more sensitive detection of drug activity. Based on the comparisons in this paper, the epdalt = 0.3, epdnul = 0.5 pair appears to offer the better balance of these outcomes, but the design parameters for a particular trial should be individualized.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JRG designed the study, programmed simulations, analyzed data, and drafted the manuscript. GRP designed the study, analyzed data, and drafted the manuscript. DT designed the study, analyzed data, and drafted the manuscript. All authors read and approved the final manuscript.


No funding source.


  1. DiMasi JA, Hansen RW, Grabowski HG: The price of innovation: new estimates of drug development costs.

    J Health Econ 2003, 22:151-185.

  2. Booth B, Glassman R, Ma P: Oncology's trials.

    Nat Rev Drug Discov 2003, 2:609-610.

  3. DiMasi JA, Grabowski HG: Economics of new oncology drug development.

    J Clin Oncol 2007, 25:209-216.

  4. Dhani N, Tu D, Sargent DJ, Seymour L, Moore MJ: Alternate endpoints for screening phase II studies.

    Clin Cancer Res 2009, 15:1873-1882.

  5. Gehan EA: The determination of the number of patients required in a preliminary and a follow-up trial of a new chemotherapeutic agent.

    J Chronic Dis 1961, 13:346-353.

  6. Fleming TR: One-sample multiple testing procedure for phase II clinical trials.

    Biometrics 1982, 38:143-151.

  7. Simon R: Optimal two-stage designs for phase II clinical trials.

    Control Clin Trials 1989, 10:1-10.

  8. El-Maraghi RH, Eisenhauer EA: Review of phase II trial designs used in studies of molecular targeted agents: outcomes and predictors of success in phase III.

    J Clin Oncol 2008, 26:1346-1354.

  9. Thezenas S, Duffour J, Culine S, Kramar A: Five-year change in statistical designs of phase II trials published in leading cancer journals.

    Eur J Cancer 2004, 40:1244-1249.

  10. Cesano A, Lane SR, Poulin R, Ross G, Fields SZ: Stabilization of disease as a useful predictor of survival following second-line chemotherapy in small cell lung cancer and ovarian cancer patients.

    Int J Oncol 1999, 15:1233-1238.

  11. Rapp E, Pater JL, Willan A, Cormier Y, Murray N, Evans WK, Hodson DI, Clark DA, Feld R, Arnold AM, et al.: Chemotherapy can prolong survival in patients with advanced non-small-cell lung cancer--report of a Canadian multicenter randomized trial.

    J Clin Oncol 1988, 6:633-641.

  12. Sekine I, Tamura T, Kunitoh H, Kubota K, Shinkai T, Kamiya Y, Saijo N: Progressive disease rate as a surrogate endpoint of phase II trials for non-small-cell lung cancer.

    Ann Oncol 1999, 10:731-733.

  13. Lara PN Jr, Redman MW, Kelly K, Edelman MJ, Williamson SK, Crowley JJ, Gandara DR: Disease control rate at 8 weeks predicts clinical benefit in advanced non-small-cell lung cancer: results from Southwest Oncology Group randomized trials.

    J Clin Oncol 2008, 26:463-467.

  14. Zee B, Melnychuk D, Dancey J, Eisenhauer E: Multinomial phase II cancer trials incorporating response and early progression.

    J Biopharm Stat 1999, 9:351-363.

  15. Freidlin B, Dancey J, Korn EL, Zee B, Eisenhauer E: Multinomial phase II trial designs.

    J Clin Oncol 2002, 20:599.

  16. Goffin JR, Tu D: Phase II stopping rules that employ response rates and early progression.

    J Clin Oncol 2008, 26:3715-3720.

  17. Dent S, Zee B, Dancey J, Hanauske A, Wanders J, Eisenhauer E: Application of a new multinomial phase II stopping rule using response and early progression.

    J Clin Oncol 2001, 19:785-791.

  18. Gallagher DJ, Milowsky MI, Gerst SR, Ishill N, Riches J, Regazzi A, Boyle MG, Trout A, Flaherty AM, Bajorin DF: Phase II study of sunitinib in patients with metastatic urothelial cancer.

    J Clin Oncol 2010, 28:1373-1379.

  19. Schiller JH, Larson T, Ou SH, Limentani S, Sandler A, Vokes E, Kim S, Liau K, Bycott P, Olszanski AJ, et al.: Efficacy and safety of axitinib in patients with advanced non-small-cell lung cancer: results from a phase II study.

    J Clin Oncol 2009, 27:3836-3841.

  20. Panageas KS, Smith A, Gonen M, Chapman PB: An optimal two-stage phase II design utilizing complete and partial response information separately.

    Control Clin Trials 2002, 23:367-379.
