J Biopharm Stat. Author manuscript; available in PMC Apr 24, 2009.

PMCID: PMC2673021

NIHMSID: NIHMS89741

Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, 27710, U.S.A


SUMMARY

In a phase II trial, we may randomize patients to multiple arms of experimental therapies and evaluate their efficacy to determine whether any of them is worthy of a large-scale phase III trial. Usually the primary objective of such a study is to identify experimental therapies that are efficacious compared to a historical control. Each arm is independently evaluated using a standard design for a single-arm phase II trial, e.g. Simon's optimal or minimax design. When more than one arm is accepted through such a randomized trial, we may want to select the winner(s) among them. There are methods for between-arm comparisons in the literature, but most of them have drawbacks: they have a large false selection probability (type I error) when the competing arms have a small difference in efficacy, or the statistical tests used in the selection procedure do not properly reflect the small sample sizes and multi-stage designs of the trials. In this paper, we propose between-arm comparison methods for selection in randomized phase II trials that address these issues.

1 Introduction

Phase II clinical trials are designed to screen out experimental therapies with low efficacy before they proceed to a large-scale phase III trial. Often, we have multiple experimental therapies for efficacy screening with respect to the same patient population. Usually, the resources for clinical trials are limited, so we may want to choose only a small number of therapies, ideally one therapy, to be compared with a standard therapy through a phase III trial. In this setting, we may take one of two approaches: (i) conduct multiple separate phase II trials, one for each experimental therapy, and evaluate them independently using a standard design method for a single-arm phase II trial; or (ii) conduct a single phase II trial with multiple arms, randomize patients to the arms, and choose the best arm(s) using a selection method. The former approach requires more research resources because of the multiplicity of the studies. Also, the individual phase II trials may enroll patients with different characteristics, so comparisons among the therapies can be biased.

To avoid these issues, the second approach is attractive. However, the statistical approaches for analyzing randomized phase II trials are limited. Simon, Wittes and Ellenberg (1985) consider randomizing *n* patients to each of *K* treatment arms in a single stage and picking the winner, i.e. the arm with the largest estimated response rate. This approach is based on the statistical methods of ranking and selection, the basic concepts of which were introduced over 50 years ago by Bechhofer (1954), with a substantial literature since that time. They show that, depending on the design setting, *n* = 16 to 70 patients are required for a 0.9 correct selection probability when there exists a difference of 0.15 in response rate among the *K* arms. Liu, LeBlanc and Desai (1999) point out that this approach has a high selection probability even when the treatment arms have the same response rates. Sargent and Goldberg (2001) consider a similar approach that allows selection based on other factors when the difference in observed response rates is small.
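The pick-the-winner rule is easy to examine numerically. The sketch below is our own illustration, not code from any of the cited papers; the function name and the specific settings (two arms, *n* = 50 per arm, response rates 0.20 vs. 0.35) are our assumptions. It estimates the correct selection probability by Monte Carlo, breaking ties at random.

```python
# Monte Carlo sketch of the single-stage pick-the-winner rule:
# randomize n patients to each arm, select the arm with the highest
# observed response rate (ties broken at random).
import random

def correct_selection_prob(n, p_best, p_others, n_arms, n_sim=20000, seed=1):
    """Estimate P(select the truly best arm) for a single-stage trial."""
    rng = random.Random(seed)
    correct = 0.0
    for _ in range(n_sim):
        # responders on the truly best arm and on each inferior arm
        best = sum(rng.random() < p_best for _ in range(n))
        others = [sum(rng.random() < p_others for _ in range(n))
                  for _ in range(n_arms - 1)]
        m = max(others)
        if best > m:
            correct += 1
        elif best == m:
            # random tie-break among the tied arms
            tied = 1 + sum(o == m for o in others)
            if rng.random() < 1.0 / tied:
                correct += 1
    return correct / n_sim

# two arms with a 0.15 difference in response rate, 50 patients per arm
prob = correct_selection_prob(n=50, p_best=0.35, p_others=0.20, n_arms=2)
```

Varying `n` in this sketch reproduces the qualitative point above: moderate per-arm sample sizes already give a high correct selection probability for a 0.15 difference, but the rule picks some arm even when all arms are equivalent.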

Thall, Simon and Ellenberg (1989) consider studies with one control and *K* experimental arms. In the first stage, *n*_{1} patients are randomized to each of the *K* experimental arms, and the winner is chosen for the second stage if its observed efficacy is larger than that of the historical control by 10%. The trial is stopped early if the winner does not satisfy this condition. In the second stage, *n*_{2} patients are randomized to each of the control arm and the experimental arm selected in stage 1, and a one-sided test is conducted to see if the experimental arm is better than the control. They require *n*_{1} = 30 to 80 patients and *n*_{2} = 90 to 140 patients under different design settings.

Palmer (1991) proposes a two-stage design for selection of the best of three treatments. In stage 1, cohorts of three patients are randomized to Arms A, B and C, and a decision is made to continue to accrue the next cohort or to stop and choose the better two arms. In stage 2, cohorts of two patients are randomized to the two arms chosen at stage 1, and a decision is made to continue to accrue the next cohort or to stop and choose the winner. Given the maximum number of patients available for the study, the stopping time for each stage is chosen to minimize the number of future failures using a Bayesian approach. This method requires rapid determination of responses to be able to apply the sequential tests.

Steinberg and Venzon (2002) propose two-stage designs for a phase II trial with two experimental arms. In stage 1, *n*_{1} patients are randomized to each arm. The trial is stopped after stage 1 if the difference in the number of responders between the two arms is larger than *d*, which is chosen so that, when the two arms have a difference of 0.15 in response rate, the probability of selecting the inferior arm is controlled at a specified level. Otherwise, the trial proceeds to stage 2 to randomize an additional *n*_{2} patients to each arm. After stage 2, the winner is chosen based on the cumulative responses through the two stages. Given *n* = *n*_{1} + *n*_{2}, one can choose *n*_{1} = *n*_{2} = *n*/2, or choose (*n*_{1}, *n*_{2}) to minimize the expected sample size under the specified response rates with a difference of 0.15. This approach does not control the overall error probabilities through the two stages.

Most of these existing methods do not accurately control the type I error and the power for the whole selection procedure. Furthermore, they do not allow unequal designs among different arms. We propose exact and efficient between-arm comparison methods for analyzing randomized phase II trials designed for independent evaluation of each arm. The proposed methods can also be used for comparing the response data from multiple single-arm trials of competing therapies with similar patient populations. We use the uniformly minimum variance unbiased estimator (UMVUE) since, as shown by Jung and Kim (2004), the maximum likelihood estimator (MLE) can be seriously biased in two-stage phase II trial designs, while the efficiency of the UMVUE is comparable to that of the MLE. In Section 2, we briefly review the UMVUE for multi-stage designs. We derive between-arm comparison methods under various conditions in Section 3. Some numerical studies are conducted in Section 4.

2 UMVUE - Review

We consider a two-stage design for a single-arm phase II trial in this section. For an experimental cancer therapy, let *p*_{0} denote the maximum unacceptable response rate and *p*_{1} the minimum acceptable response rate (*p*_{0} < *p*_{1}). Also, let *p* denote the true response rate of the therapy. A typical two-stage phase II trial is conducted as follows. During stage 1, *n*_{1} patients are enrolled and treated. If the number of responders is less than or equal to *a*_{1}, the trial is terminated for lack of efficacy and it is concluded that the treatment does not warrant further investigation (i.e., accept *H*_{0}: *p* = *p*_{0}). Otherwise, the study continues to stage 2, during which an additional *n*_{2} patients are enrolled and treated. If the cumulative number of responders after stage 2 does not exceed *a*, it is concluded that the treatment lacks sufficient efficacy (i.e., accept *H*_{0}: *p* = *p*_{0}). Otherwise, it is concluded that the treatment has sufficient efficacy, and the treatment will be considered for further investigation in subsequent trials (i.e., accept *H*_{1}: *p* = *p*_{1}).

Refer to Simon (1989), Jung, Carey and Kim (2001), and Jung et al. (2004) for the search for optimal two-stage designs. One may employ an upper boundary to stop the trial early when significantly high efficacy is observed at stage 1 (Chang et al., 1987; Spiegelhalter, Freedman and Blackburn, 1986). However, since there is no compelling ethical argument for stopping early for high efficacy and such boundaries are rarely used in practice, we consider early stopping only for lack of efficacy in this paper.

A two-stage design is defined by the numbers of patients to be accrued during stages 1 and 2, *n*_{1} and *n*_{2}, and the boundary values *a*_{1} and *a* (*a*_{1} < *a*). So we specify a two-stage design by (*a*_{1}/*n*_{1}, *a*/*n*), where *n* = *n*_{1} + *n*_{2} is called the maximum sample size. Let *M* denote the stopping stage and *S* = *S*_{M} the total number of responders accumulated up to the stopping stage.
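As a concrete illustration of this decision rule, the following sketch (our own; the function names are assumptions, not the authors' code) computes the exact probability of accepting the therapy, *P*(*S*_{1} > *a*_{1} and *S* > *a*), by summing binomial probabilities. For concreteness it uses the design (10/57, 19/90) with *p*_{0} = 0.15 that appears later in Example 1.

```python
# Exact operating characteristics of a two-stage design (a1/n1, a/n):
# the therapy is accepted iff S1 > a1 (stage 1) and S1 + S2 > a (overall).
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def reject_prob(a1, n1, a, n, p):
    """P(S1 > a1 and S > a), i.e. P(accept the therapy), at true rate p."""
    n2 = n - n1
    total = 0.0
    for s1 in range(a1 + 1, n1 + 1):
        # stage 2 must push the cumulative responder count above a
        tail2 = sum(binom_pmf(s2, n2, p)
                    for s2 in range(max(0, a - s1 + 1), n2 + 1))
        total += binom_pmf(s1, n1, p) * tail2
    return total

alpha = reject_prob(10, 57, 19, 90, 0.15)   # type I error at p0 = 0.15
power = reject_prob(10, 57, 19, 90, 0.30)   # power at p1 = 0.30
```

Evaluating the same function at the design's *p*_{0} and *p*_{1} recovers the nominal type I error and power quoted for each arm in Example 1.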

The UMVUE *p̂* = *p̂*(*m*, *s*) of *p* is given as

$$
\hat{p}(m,s) = \begin{cases}
s/n_1 & \text{if } m = 1 \\[1ex]
\dfrac{\sum_{x_1=(a_1+1)\vee(s-n_2)}^{s\wedge n_1}\binom{n_1-1}{x_1-1}\binom{n_2}{s-x_1}}
      {\sum_{x_1=(a_1+1)\vee(s-n_2)}^{s\wedge n_1}\binom{n_1}{x_1}\binom{n_2}{s-x_1}} & \text{if } m = 2,
\end{cases}
$$

where *u* ∧ *v* = min(*u*, *v*) and *u* ∨ *v* = max(*u*, *v*). Refer to Jung and Kim (2004) for details. The distribution of the UMVUE is derived using the probability mass function of (*M*, *S*), which in a two-stage design with lower stopping boundaries only is given as

$$
f(m,s \mid p) = \begin{cases}
\binom{n_1}{s} p^{s} (1-p)^{n_1-s} & \text{if } m = 1,\ 0 \le s \le a_1 \\[1ex]
\left\{\sum_{x_1=(a_1+1)\vee(s-n_2)}^{s\wedge n_1}\binom{n_1}{x_1}\binom{n_2}{s-x_1}\right\} p^{s} (1-p)^{n-s} & \text{if } m = 2,\ a_1+1 \le s \le n.
\end{cases}
$$
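A direct transcription of these formulas can be sketched as follows (our illustration; function names are ours). A useful self-check is that the estimator is exactly unbiased: summing *p̂*(*m*, *s*) *f*(*m*, *s*|*p*) over the sample space recovers *p*.

```python
# UMVUE and probability mass function of (M, S) for a two-stage design
# with lower stopping boundary a1, stage sizes n1 and n2 (Jung-Kim review).
from math import comb

def umvue(m, s, a1, n1, n2):
    """UMVUE of p for outcome (m, s)."""
    if m == 1:
        return s / n1
    # range of feasible stage-1 counts: (a1+1) v (s-n2) <= x1 <= s ^ n1
    lo, hi = max(a1 + 1, s - n2), min(s, n1)
    num = sum(comb(n1 - 1, x1 - 1) * comb(n2, s - x1) for x1 in range(lo, hi + 1))
    den = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in range(lo, hi + 1))
    return num / den

def pmf(m, s, p, a1, n1, n2):
    """P(M = m, S = s) at true response rate p (s assumed in the support)."""
    if m == 1:
        return comb(n1, s) * p**s * (1 - p)**(n1 - s)
    lo, hi = max(a1 + 1, s - n2), min(s, n1)
    c = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in range(lo, hi + 1))
    return c * p**s * (1 - p)**(n1 + n2 - s)
```

For any small design, e.g. (*a*_{1}/*n*_{1}, *a*/*n*) = (1/5, ·/10), the pmf sums to one over the support {(1, *s*): *s* ≤ *a*_{1}} ∪ {(2, *s*): *a*_{1} < *s* ≤ *n*}, and the UMVUE's expectation equals *p* exactly.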

3 Comparison and Selection for Randomized Phase II Designs

We consider two-arm randomized phase II studies, each arm with a two-stage design for independent evaluation as the primary objective. The following example study is used throughout this section.

Suppose that we randomize non-Hodgkin lymphoma patients who relapsed after a rituximab-containing combination regimen to rituximab alone (Arm R, *n* = 90) or rituximab+lenalidomide (Arm R+L, *n* = 45) with 2-to-1 probability. The two arms have the following two-stage designs:

Arm R: (*a*_{1}/*n*_{1}*, a*/*n*) = (10/57, 19/90) for 4% type I error at *p*_{0} = 0.15 and 95% power at *p*_{1} = 0.30.

Arm R+L: (*a*_{1}/*n*_{1}*, a*/*n*) = (4/21, 10/45) for 5% type I error at *p*_{0} = 0.15 and 89% power at *p*_{1} = 0.35.

Arm R is a potential control arm for a future phase III trial in case Arm R+L is accepted in this trial, but it is included in this phase II trial because there are not enough historical data on the regimen. Twice as many patients will be accrued to Arm R as to Arm R+L to allow more precise estimation of the clinical parameters to be used in designing a future phase III trial. Arm R+L may not be investigated further if it does not seem to be more efficacious than Arm R. We want to compare the two arms while accounting for the two-stage design of each arm.

In general, we call the two arms *x* and *y*, respectively. For an outcome (*m*_{k}, *s*_{k}*), let *p̂*_{k} = *p̂*(*m*_{k}, *s*_{k}) denote the UMVUE of the true response rate *p*_{k} of arm *k* (*k* = *x*, *y*).

In this subsection, we assume that the two arms have the same two-stage design (*a*_{1}/*n*_{1}*, a*/*n*) for independent evaluation.

One may want to conduct a randomized phase II trial to evaluate an experimental therapy comparing with a prospective control. This may happen when the control therapy has been used as a standard without formal evaluation through a prospective study, or needs more testing in an extended patient population. In this case, the prospective control arm may also be evaluated using a standard two-stage design for phase II trials.

When such a trial is completed, we may want to test whether the experimental arm (arm *y*) is better than the control (arm *x*). The hypotheses associated with this type of comparison are

*H*_{0}: *p*_{y} ≤ *p*_{x} against *H*_{1}: *p*_{y} > *p*_{x}.

This is a one-sided test, so we usually would not want to accept the experimental arm *y* if it is not accepted in the independent evaluation. Thus, we accept the experimental arm (i.e., reject *H*_{0}) if it is accepted in the independent evaluation, i.e. *m*_{y} = 2 and *s*_{y} > *a*, and the difference in UMVUEs satisfies *p̂*_{y} − *p̂*_{x} ≥ *c* for a critical value *c*. When *p*_{x} = *p*_{y} = *p*, the probability of accepting arm *y* is

$$
h(c \mid p) = \sum_{(m_x,s_x)\in\mathcal{X}}\ \sum_{s_y=a+1}^{n} I\{\hat{p}(2,s_y) - \hat{p}(m_x,s_x) \ge c\}\, f(m_x,s_x \mid p)\, f(2,s_y \mid p), \tag{1}
$$

where *I*(·) is the indicator function, *f*(*m*, *s*|*p*) denotes the probability mass function of (*M*, *S*) under the common two-stage design, and 𝒳 denotes the sample space of (*M*, *S*). More generally, the probability of an event *A* in 𝒳^{2} is calculated as

$$
P(A) = \sum_{(m_x,s_x)\in\mathcal{X}}\ \sum_{(m_y,s_y)\in\mathcal{X}} I\{((m_x,s_x),(m_y,s_y)) \in A\}\, f(m_x,s_x \mid p_x)\, f(m_y,s_y \mid p_y).
$$

In contrast to common asymptotic tests, such as the two-sample t-test, the operating characteristics of our exact test depend on the null response probability *p*, an unknown nuisance parameter. To remove the nuisance parameter, we control the type I error by maximizing the probability in (1) over the whole parameter space *p* ∈ [0, 1], or over a subset of interest Ω ⊂ [0, 1]. See Berger and Boos (1994) for the rationale for such an approach. Given *α*, we want to choose a critical value *c* = *c*_{α} so that the probability of accepting arm *y* does not exceed *α* for any *p* ∈ Ω, that is,

$$
\max_{p\in\Omega} h(c_\alpha \mid p) \le \alpha. \tag{2}
$$

We will refer to probability (2) as the type I error. Let *p*_{0} denote the response rate of a historical control. Then we may choose a small interval such as Ω = [*p*_{0} − 0.2, *p*_{0} + 0.2]. In our experience, the maximum type I error usually occurs within this range. Of course, if we want type I error control under any possible situation, we have to choose Ω = [0, 1]. We use the latter in this paper.

Let *H*(*c*) = max_{*p*∈Ω} *h*(*c*|*p*). Obviously, *h*(*c*|*p*) is monotone in *c*. Given *c*, however, *h*(*c*|*p*) can have local maxima over *p* ∈ Ω. For example, when both arms have the same design as that of Arm R+L in Example 1, (*a*_{1}/*n*_{1}, *a*/*n*) = (4/21, 10/45), Figure 1 displays *h*(*c* = 0.1|*p*) over *p* ∈ [0, 1]. Note that there are two maxima. So, given *α*, calculation of the critical value *c*_{α} requires a two-stage numerical search procedure: for a given critical value *c*, we first maximize *h*(*c*|*p*) over a fine grid of *p* ∈ Ω to obtain *H*(*c*), and we then search for the value of *c* satisfying *H*(*c*) ≤ *α*.
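The inner maximization over the nuisance parameter can be sketched as follows (our illustration for the common design (4/21, 10/45); the grid resolution and variable names are our assumptions). The sketch enumerates the sample space once, then evaluates *h*(*c*|*p*) over a grid of *p* to obtain *H*(*c*).

```python
# h(c|p) = P(M_y = 2, S_y > a, p_hat_y - p_hat_x >= c) when p_x = p_y = p,
# for two arms sharing the design (a1/n1, a/n) = (4/21, 10/45).
from math import comb

A1, N1, A, N = 4, 21, 10, 45
N2 = N - N1

def umvue(m, s):
    if m == 1:
        return s / N1
    lo, hi = max(A1 + 1, s - N2), min(s, N1)
    num = sum(comb(N1 - 1, x - 1) * comb(N2, s - x) for x in range(lo, hi + 1))
    den = sum(comb(N1, x) * comb(N2, s - x) for x in range(lo, hi + 1))
    return num / den

def pmf(m, s, p):
    if m == 1:
        return comb(N1, s) * p**s * (1 - p)**(N1 - s)
    lo, hi = max(A1 + 1, s - N2), min(s, N1)
    c = sum(comb(N1, x) * comb(N2, s - x) for x in range(lo, hi + 1))
    return c * p**s * (1 - p)**(N - s)

OUTCOMES = [(1, s) for s in range(A1 + 1)] + [(2, s) for s in range(A1 + 1, N + 1)]
EST = {o: umvue(*o) for o in OUTCOMES}
ACCEPT = [o for o in OUTCOMES if o[0] == 2 and o[1] > A]  # arm y accepted

def h(c, p):
    """Probability of accepting arm y with UMVUE difference >= c."""
    f = {o: pmf(o[0], o[1], p) for o in OUTCOMES}
    return sum(f[ox] * f[oy]
               for ox in OUTCOMES for oy in ACCEPT
               if EST[oy] - EST[ox] >= c)

# crude grid maximization of h(c|p) over p for c = 0.1
H = max(h(0.1, i / 50) for i in range(51))
```

The outer search for *c*_{α} would repeat this grid maximization while varying *c* until *H*(*c*) ≤ *α*; since *h*(*c*|*p*) is nonincreasing in *c*, a bisection over *c* suffices.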

Given *p*_{x} and *p*_{y} (= *p*_{x} + Δ for some Δ > 0), the power of the test, i.e. the probability of accepting arm *y*, can be calculated using the probability formula for events in 𝒳^{2} given above.

Suppose that arm *y* is accepted in the independent evaluation, and let *ĉ* = *p̂*_{y} − *p̂*_{x} denote the observed difference in UMVUEs. Then the p-value for the observed outcome is calculated as *H*(*ĉ*) = max_{*p*∈Ω} *h*(*ĉ*|*p*).

Suppose that arm *x* is a control and arm *y* is an experimental therapy, both with the same two-stage design (*a*_{1}/*n*_{1}*, a*/*n*) = (4/21, 10/45) as in Arm R+L of Example 1. With *α* = 0.1, we have *c _{α}* = 0.1520 and the Type I error is maximized at

Note that the above comparison rule controls the type I error of selecting the experimental arm when both arms have an equal response rate. This rule may be considered too strict. A phase II trial is designed not to show the superiority of an experimental therapy over the control, but to screen out ineffective therapies. If an experimental therapy is shown to have efficacy no worse than the control's, its superiority may be investigated in a phase III trial using a more definitive endpoint, such as overall survival. In this sense, one may want to loosen the control of the type I error somewhat in the phase II design. Let *δ* (> 0) denote the maximum clinically insignificant difference in response rate, e.g. *δ* = 0.05. Suppose that we do not care about falsely accepting arm *y* as long as *p*_{y} is within *δ* of *p*_{x}.

We choose a critical value *c* = *c*_{α} satisfying

$$
\max_{p\in\Omega} P(M_y = 2,\ S_y > a,\ \hat{p}_y - \hat{p}_x \ge c \mid p_x = p,\ p_y = p - \delta) \le \alpha.
$$

Given *p*_{x} and *p*_{y} (= *p*_{x} + Δ for some Δ > 0), the power can be calculated as before.

For an observed difference *ĉ* = *p̂*_{y} − *p̂*_{x}, the p-value is calculated by replacing *c* with *ĉ* in the type I error formula and maximizing over *p* ∈ Ω.

We will allow a maximum clinically insignificant difference *δ* in the remainder of this paper unless stated otherwise.

Consider Example 2 with *δ* = 0.05. With *δ* = 0.05 and *α* = 0.1, we have *c _{α}* = 0.0925 and the Type I error is maximized at (

Suppose now that there are two experimental therapies, *x* and *y*, under investigation. The primary objective is to evaluate each therapy compared to a historical control. As a secondary analysis, we want to compare the two experimental arms and choose the one that will be investigated further in a phase III trial. Given the maximum clinically negligible difference *δ*, the hypotheses may be expressed as

*H*_{0}: |*p*_{x} − *p*_{y}| ≤ *δ* against *H*_{1}: |*p*_{x} − *p*_{y}| > *δ*.

In this case, the associated testing is two-sided. As in the one-sided case, we do not want to select an experimental arm if it is not accepted in the independent evaluation. That is, we want to select an experimental arm if it is accepted in the independent evaluation and the UMVUE is significantly larger than that of the other arm.

For a chosen critical value *c*, we select arm *x* if

*A*_{x} = {*m*_{x} = 2, *s*_{x} > *a*, *p̂*_{x} − *p̂*_{y} ≥ *c*}

is true, and arm *y* if

*A*_{y} = {*m*_{y} = 2, *s*_{y} > *a*, *p̂*_{y} − *p̂*_{x} ≥ *c*}

is true. Since the two arms have the identical design (*a*_{1}/*n*_{1}, *a*/*n*), the error probabilities *P*(*A*_{x}|*p*_{x} = *p*, *p*_{y} = *p* + *δ*) and *P*(*A*_{y}|*p*_{y} = *p*, *p*_{x} = *p* + *δ*) are equal, so a common critical value controls both misselection probabilities.

Note that the events *A*_{x} and *A*_{y} are mutually exclusive for *c* > 0, so the two-sided type I error is the sum of the two misselection probabilities *P*(*A*_{x}) and *P*(*A*_{y}).

Given *p*_{x} and *p*_{y} with |*p*_{x} − *p*_{y}| > *δ*, the power, i.e. the probability of selecting the truly superior arm, can be calculated similarly.

Suppose that arm *y* is accepted in the independent evaluation and *ĉ* = *p̂*_{y} − *p̂*_{x} > 0 is the observed difference. Then the p-value is calculated by replacing *c* with *ĉ* in the two-sided error probability and maximizing over *p* ∈ Ω.

We select neither arm if both arms are rejected in the independent evaluation, and select both arms if both arms are accepted in the independent evaluation and |*p̂*_{y} − *p̂*_{x}| < *c*_{α}.

Consider Example 3, but with both arms considered as experimental. With *δ*= 0.05 and *α* = 0.1, we have *c _{α}* = 0.1520 and the Type I error is maximized at (

If we choose *δ* = 0.1, then we have *c _{α}* = 0.0925 for

In a randomized phase II trial, we may want to use different designs for different arms. For example, we may want to have more patients in the control arm to allow more efficient estimation of parameters in patient subgroups to be used in designing a phase III trial, or we may want to use a less strict early stopping rule in the control arm. If we want to compare two experimental therapies evaluated by separate single-arm phase II trials, it is very likely that the two trials will have different designs. In this section, we consider selection problems when the two arms have different two-stage designs.

In Section 3.1, we considered phase II trials randomizing patients to two arms with exactly the same two-stage design for independent evaluation, and we did not want to select an arm that was rejected in the independent evaluation. When the two arms have different two-stage designs, however, the selection rules in this section are based only on the comparison of the estimators of the response rates.

As before, let *x* be the control arm and *y* the experimental arm, and, for a maximum clinically negligible difference *δ*, we want to test

*H*_{0}: *p*_{y} ≤ *p*_{x} − *δ* against *H*_{1}: *p*_{y} > *p*_{x}.

We choose a critical value *c* = *c*_{α} satisfying

$$
\max_{p\in\Omega} P(\hat{p}_y - \hat{p}_x \ge c \mid p_x = p,\ p_y = p - \delta) \le \alpha,
$$

where *p̂*_{k} denotes the UMVUE of *p*_{k} under arm *k*'s own two-stage design (*a*_{1k}/*n*_{1k}, *a*_{k}/*n*_{k}) for *k* = *x*, *y*.
The power for Δ and *p*_{x} (with *p*_{y} = *p*_{x} + Δ),

$$
P(\hat{p}_y - \hat{p}_x \ge c_\alpha \mid p_x,\ p_y = p_x + \Delta),
$$

can be calculated similarly to the type I error. For an observed difference *ĉ* = *p̂*_{y} − *p̂*_{x}, the p-value is calculated by replacing *c* with *ĉ* in the type I error formula.

Consider *δ* = 0.05 in Example 1. Then with *α* = 0.1, we have *c _{α}* = 0.0717 and the Type I error is maximized at (

Suppose that both arms are experimental with different designs. For a maximum clinically negligible difference *δ*, we want to test

*H*_{0}: |*p*_{x} − *p*_{y}| ≤ *δ* against *H*_{1}: |*p*_{x} − *p*_{y}| > *δ*.

We choose a critical value *c* = *c*_{α} satisfying

$$
\max_{p\in\Omega}\bigl[ P(\hat{p}_x - \hat{p}_y \ge c \mid p_x = p - \delta,\ p_y = p) + P(\hat{p}_y - \hat{p}_x \ge c \mid p_x = p,\ p_y = p - \delta) \bigr] \le \alpha. \tag{3}
$$

Note that the two misselection errors on the left-hand side of (3) are not the same if the two arms have different designs. We fail to select one arm against the other if |*p̂*_{x} − *p̂*_{y}| < *c*_{α}.

The power for Δ and *p*_{x} (with *p*_{y} = *p*_{x} + Δ) can be calculated similarly.

For an observed difference *ĉ* = |*p̂*_{x} − *p̂*_{y}|, the p-value is calculated by replacing *c* with *ĉ* on the left-hand side of (3).

Suppose that both arms in Example 1 are experimental. Then with *δ* = 0.05 and *α* = 0.1, we have *c _{α}* = 0.1174 and the Type I error is maximized at (

In this section we consider two situations, one where all arms are experimental and another where one of them is a control. Each arm is independently compared to a historical control through a two-stage design. For arm *k* = 0, 1, …, *K*, let *p̂*_{k} denote the UMVUE of the true response rate *p*_{k} under arm *k*'s two-stage design (*a*_{1k}/*n*_{1k}, *a*_{k}/*n*_{k}).

Suppose that patients are randomized to a control (Arm 0) and *K* experimental arms (Arms 1, …, *K*). We want to identify experimental arms that are significantly efficacious compared to the control arm. When *K* ≥ 2, we have to control the familywise error rate (FWER) to adjust for the multiplicity of the testing; the marginal type I error control applied in the previous sections would inflate the misselection probability. For a maximum clinically negligible difference *δ*, we want to test

*H*_{0}: *p*_{k} ≤ *p*_{0} − *δ* for all *k* = 1, …, *K*

against

*H*_{a}: *p*_{k} > *p*_{0} for some *k* = 1, …, *K*.

Given a FWER level *α*, such as 0.1, we accept Arm *k* (= 1, …, *K*) if *p̂*_{k} − *p̂*_{0} ≥ *c*_{α}, where the common critical value *c*_{α} is chosen so that the probability of accepting any arm under *H*_{0} is at most *α*.

If more than one arm is accepted, we may conduct pairwise comparisons among the accepted arms, as described in Sections 3.1.2 and 3.2.2, to identify a smaller number of arms for a phase III trial as a secondary analysis.
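To see why marginal testing inflates the error rate, the following crude Monte Carlo sketch (ours, not from the paper) compares a control with *K* arms under a global null. For simplicity it uses the raw response proportion in place of the UMVUE and ignores the two-stage stopping rule, so it illustrates only the FWER phenomenon, not the exact procedure; all settings below are our assumptions.

```python
# Monte Carlo sketch of the familywise error rate when K experimental arms
# are each compared with a shared randomized control using a common cutoff c.
import random

def fwer(K, n, p, c, n_sim=20000, seed=7):
    """P(accept at least one arm) when all K+1 arms share response rate p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        control = sum(rng.random() < p for _ in range(n)) / n
        arms = [sum(rng.random() < p for _ in range(n)) / n for _ in range(K)]
        if any(a - control >= c for a in arms):
            hits += 1
    return hits / n_sim

rate_one = fwer(K=1, n=45, p=0.15, c=0.15)
rate_three = fwer(K=3, n=45, p=0.15, c=0.15)
```

With a cutoff calibrated for a single comparison, the error rate roughly triples when three arms are tested marginally, which is why the common critical value above must be chosen for the joint event.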

Under a specific alternative hypothesis

*H*_{a}: *p*_{k} = *p*_{0} + Δ_{k} for *k* = 1, …, *K*

with Δ_{k} > 0, the power is obtained as the probability of accepting at least one experimental arm, *P*(*p̂*_{k} − *p̂*_{0} ≥ *c*_{α} for some *k*).

Suppose that patients are randomized to *K* experimental arms. In this case, we want to test

*H*_{0}: |*p*_{k} − *p*_{l}| ≤ *δ* for all 1 ≤ *k* < *l* ≤ *K*

against

*H*_{a}: |*p*_{k} − *p*_{l}| > *δ* for some 1 ≤ *k* < *l* ≤ *K*.

We reject *H*_{0} if max_{1≤k≤K} *p̂*_{k} − min_{1≤k≤K} *p̂*_{k} ≥ *c*_{α}, where *c*_{α} is chosen to control the FWER at *α* over the null space.

Under a specific alternative hypothesis, *H _{a}: p*

4 Discussion

We have proposed between-arm comparison methods for randomized phase II trials. If one wants to compare two therapies evaluated through two separate single-arm phase II trials, our methods can be used as long as the two trials were conducted on similar populations. Each arm to be compared may have a two-stage design for independent evaluation of the therapy, so statistical procedures based on single-stage designs, such as the two-sample t-test, may yield biased results. Our methods compare two arms exactly, reflecting the design aspects and the small sample sizes. We have considered two-stage designs, but the extension to multi-stage designs is straightforward. The between-arm comparisons proposed in this paper apply when competing experimental therapies are independently evaluated against a historical control; Jung (2007) proposed design methods for the case where patients are randomized between a prospective control and experimental therapies, with each experimental arm compared with the control through multiple stages.

References

- Bechhofer RE. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics. 1954;25:16–39.
- Berger RL, Boos DD. P values maximized over a confidence set for the nuisance parameter. Journal of the American Statistical Association. 1994;89:1012–1016.
- Chang MN, Therneau TM, Wieand HS, Cha SS. Designs for group sequential phase II clinical trials. Biometrics. 1987;43:865–874.
- Jung SH. Randomized phase II trials with a prospective control. Statistics in Medicine. 2007 (in press).
- Jung SH, Carey M, Kim KM. Graphical search for two-stage designs for phase II clinical trials. Controlled Clinical Trials. 2001;22:367–372.
- Jung SH, Kim KM. On the estimation of the binomial probability in multistage clinical trials. Statistics in Medicine. 2004;23:881–896.
- Jung SH, Lee TY, Kim KM, George SL. Admissible two-stage designs for phase II cancer clinical trials. Statistics in Medicine. 2004;23:561–569.
- Liu PY, LeBlanc M, Desai M. False positive rates of randomized phase II designs. Controlled Clinical Trials. 1999;20:343–352.
- Palmer CR. A comparative phase II clinical trials procedure for choosing the best of three treatments. Statistics in Medicine. 1991;10:1327–1340.
- Sargent DJ, Goldberg RM. A flexible design for multiple armed screening trials. Statistics in Medicine. 2001;20:1051–1060.
- Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials. 1989;10:1–10.
- Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treatment Reports. 1985;69:1375–1381.
- Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Controlled Clinical Trials. 1986;7:8–17.
- Steinberg SM, Venzon DJ. Early selection in a randomized phase II clinical trial. Statistics in Medicine. 2002;21:1711–1726.
- Thall PF, Simon R, Ellenberg SS. A two-stage design for choosing among several experimental treatments and a control in clinical trials. Biometrics. 1989;45:537–547.
