Clin Trials. Author manuscript; available in PMC 2011 August 1.
Published in final edited form as:
Published online 2010 June 23.
PMCID: PMC3085081
NIHMSID: NIHMS288455

Calculating Sample Size in Trials Using Historical Controls

Song Zhang
Department of Clinical Sciences UT Southwestern Medical Center Dallas, TX
Jing Cao
Department of Statistical Science Southern Methodist University Dallas, TX

Abstract

Background

Makuch and Simon [1] developed a sample size formula for historical control trials. When assessing power, they assumed the true control treatment effect to be equal to the observed effect from the historical control group. Many researchers have pointed out that the M-S approach does not preserve the nominal power and type I error when considering the uncertainty in the true historical control treatment effect.

Purpose

To develop a sample size formula that properly accounts for the underlying randomness in the observations from the historical control group.

Methods

We reveal the extremely skewed nature in the distributions of power and type I error, obtained over all the random realizations of the historical control data. The skewness motivates us to derive a sample size formula that controls the percentiles, instead of the means, of the power and type I error.

Results

A closed-form sample size formula is developed to control arbitrary percentiles of power and type I error for historical control trials. A simulation study further demonstrates that this approach preserves the operational characteristics in a more realistic scenario where the population variances are unknown and replaced by sample variances.

Limitations

The closed-form sample size formula is derived for continuous outcomes. The formula is more complicated for binary or survival time outcomes.

Conclusions

We have derived a closed-form sample size formula that controls the percentiles instead of means of power and type I error in historical control trials, which have extremely skewed distributions over all the possible realizations of historical control data.

Keywords: clinical trial design, sample size, historical controls, percentiles of type I error and power

1 Introduction

Randomized clinical trials (RCTs) have become the gold standard in comparing the effects between treatments. Despite their rigorous scientific basis, there are situations where RCTs are infeasible due to concerns of ethics, patient preference, cost, and regulatory acceptability. For example, the resources required by an RCT might be prohibitive for some phase II trials which are intended to obtain preliminary data on the effectiveness of a new treatment [2]. Another example is that when evidence already exists showing the superiority of a new treatment over the standard one, it might be unethical for an RCT to assign patients to a potentially inferior treatment. One solution is to use a historical control trial (HCT), where the experimental therapy is compared with a control therapy (referred to as historical control or HC) that has been evaluated in a previously conducted trial. Because an HCT can be smaller in size and easier to conduct, it has been widely applied in clinical research [3, 4, 5, 6, 7, 8, 9].

Makuch and Simon [1] developed a sample size formula for HCTs with a binary outcome. In power calculation, they assumed that the observed response rate from the HC group was the true control response rate. Their formula was based on the two-sample test statistic employed in RCTs but the power calculation only accounted for the sampling variability in the experimental group. The sample size solution was obtained through a numerical search. Using a similar idea, Dixon and Simon [10] provided a sample size formula for HCTs with exponential survival outcomes. Chang et al. [11] presented a two-stage design for phase II clinical trials with HC and continuous outcomes. More discussions about the HCT sample size calculation can be found in [12] and [13].

The estimated sample size for an HCT is usually much smaller than that required by an RCT. Lee and Tseng [14] pointed out that the sample size reduction in HCT is largely unjustified due to the strong assumption that the observed historical control response rate is equal to the true control response rate. They proposed a uniform power method to control the expected power, taking into account the uncertainty in the HC response rate. The resulting sample size is closer to the RCT sample size than the one based on Makuch and Simon's method (M-S). Korn and Freidlin [15] compared three approaches to HCT design: the M-S approach, the RCT approach, and the one-sample approach (based on a one-sample test that the experimental treatment effect is greater than the observed HC treatment effect). The authors suggested adopting the RCT approach because it preserves the unconditional power over the random HC observations.

In this study, we investigate the sample size calculation for HCT with continuous outcomes, accounting for uncertainty caused by the unknown true HC treatment effect. We provide a unified framework for the M-S, RCT and one-sample approaches, where they are shown to control either the mean or the median of the random power and type I error, obtained over all the possible realizations of the HC data given the true HC effect. We further demonstrate through simulation that the distributions of power and type I error are extremely skewed. This extreme skewness leads to undesirable properties of sample sizes calculated to control the means of power and type I error. One revealing example in our simulation is that with the mean power controlled at 0.8, a slight decrease in the mean type I error from 0.06 to 0.05 leads to a drastic increase in sample size from 286 to 487. This observation motivates us to develop a sample size formula that controls the percentiles, instead of the means, of the random power and type I error. To our knowledge, it is the first study in HCT design to demonstrate the extreme skewness in the distributions of power and type I error, and to estimate sample size based on the percentiles of power and type I error. It provides researchers a sensible way to assess the risk in an HCT. The proposed formula has a closed form, which can be easily computed using a scientific calculator.

The rest of this paper is organized as follows. In Section 2 we review the three different approaches (M-S, RCT, and one-sample) to sample size calculation in HCT under a unified framework. A simulation study is conducted to demonstrate the extreme skewness in the distributions of power and type I error. In Section 3 we present a sample size formula to control arbitrary percentiles of power and type I error. We evaluate its performance through simulation in two scenarios: an ideal scenario (population variances known) and a more realistic scenario (population variances unknown). In Section 4 we provide a real application of the proposed method. The final section is devoted to discussion.

2 A Unified Framework

We briefly review the M-S, RCT, and one-sample approaches to HCT sample size calculation. Suppose in a clinical trial we compare the outcomes between an experimental group and an HC group. The outcome variable is continuous and follows a normal distribution. Let $Y_1,\cdots,Y_m \overset{iid}{\sim} N(\theta_0,\sigma_0^2)$ be the $m$ observations from the HC group, and $X_1,\cdots,X_n \overset{iid}{\sim} N(\theta_1,\sigma_1^2)$ be the $n$ observations from the experimental group. We define $Y = \{Y_1, \cdots, Y_m\}$ and $X = \{X_1, \cdots, X_n\}$. The variances $\sigma_0^2$ and $\sigma_1^2$ are assumed to be known. With null hypothesis $H_0: \theta_1 = \theta_0$ and alternative hypothesis $H_1: \theta_1 > \theta_0$, the standard test statistic is

$Z(X,Y)=\frac{\bar{X}-\bar{Y}}{\sqrt{\sigma_1^2/n+\sigma_0^2/m}},$
(1)

where $\bar{X}=\sum_{i=1}^{n}X_i/n$ and $\bar{Y}=\sum_{j=1}^{m}Y_j/m$ are the sample means from the two groups. Given type I error $\alpha$, power $1-\beta$, $m$, $\sigma_0^2$, $\sigma_1^2$, and difference in treatment effects $\Delta = \theta_1 - \theta_0$, the sample size $n$ is obtained by solving

$P\left(\frac{\bar{X}-\bar{Y}}{\sqrt{\sigma_1^2/n+\sigma_0^2/m}}>z_{1-\alpha}\right)=1-\beta,$
(2)

where $z_{1-\alpha}$ is the $100(1-\alpha)$th percentile of the standard normal distribution. Here and in the rest of the paper we do not distinguish between the estimated sample size and the solution to the sample size equations, with the understanding that the final sample size is the smallest integer greater than or equal to the solution.

In the M-S approach, the $Y_j$'s from the HC group are considered not subject to sampling variability because they have been observed before the clinical trial. This consideration leads to the following manipulation of (2),

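$P\left(\frac{(\bar{X}-\bar{Y})-(\Delta+\theta_0-\bar{Y})}{\sqrt{\sigma_1^2/n}}>z_{1-\alpha}\frac{\sqrt{\sigma_1^2/n+\sigma_0^2/m}}{\sqrt{\sigma_1^2/n}}-\frac{\Delta+\theta_0-\bar{Y}}{\sqrt{\sigma_1^2/n}}\right)=1-\Phi\left(z_{1-\alpha}\frac{\sqrt{\sigma_1^2/n+\sigma_0^2/m}}{\sqrt{\sigma_1^2/n}}-\frac{\Delta+\theta_0-\bar{Y}}{\sqrt{\sigma_1^2/n}}\right)=1-\beta,$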
(3)

where Φ(·) is the cumulative distribution function of the standard normal distribution. Thus we find n by solving

$z_{1-\alpha}\frac{\sqrt{\sigma_1^2/n+\sigma_0^2/m}}{\sqrt{\sigma_1^2/n}}-\frac{\Delta+\theta_0-\bar{Y}}{\sqrt{\sigma_1^2/n}}=z_{\beta}.$
(4)

Since the true HC effect ($\theta_0$) is usually unknown, it is cancelled out in the equation by assuming $\theta_0=\bar{Y}$. This is a strong assumption, especially when the number of HC observations is limited. Traditionally Equation (4) has been solved through a numerical search [11]. Here we present a closed-form solution and a necessary and sufficient condition for its existence.

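Theorem 1. Define $a=z_{1-\alpha}^2\sigma_0^2/(m\sigma_1^2)-\Delta^2/\sigma_1^2$, $b=2\Delta z_{1-\beta}/\sqrt{\sigma_1^2}$, and $c=z_{1-\alpha}^2-z_{1-\beta}^2$. In clinical trials we usually specify $\alpha$ and $\beta$ such that $\alpha \le \beta$. Equation (4) has a unique sample size solution if and only if $\Delta>z_{1-\alpha}\sqrt{\sigma_0^2/m}$, and the solution is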

$n_0=\left(\frac{-b-\sqrt{b^2-4ac}}{2a}\right)^2.$
(5)

Proof. See Appendix A.1.

Theorem 1 helps researchers avoid a time-consuming numerical search. It also points out a potential pitfall in the M-S approach: too small an assumed difference in the treatment effects ($\Delta\le z_{1-\alpha}\sqrt{\sigma_0^2/m}$) leads to no solution for the sample size.
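As an illustration only, the closed form in Theorem 1 can be sketched in Python. The function name `ms_sample_size` and its argument names are hypothetical (they do not appear in the paper); the coefficients follow the definitions of $a$, $b$ and $c$ in Theorem 1, with $b$ taken as $2\Delta z_{1-\beta}/\sqrt{\sigma_1^2}$ as in Equation (7).

```python
import math
from statistics import NormalDist


def ms_sample_size(delta, m, var0, var1, alpha=0.05, beta=0.2):
    """Closed-form M-S solution n0 of Theorem 1, before rounding up.

    Hypothetical helper: delta is the assumed treatment difference,
    m the HC sample size, var0 and var1 the (known) variances.
    """
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    # Existence condition of Theorem 1: delta > z_{1-alpha} * sqrt(var0 / m).
    if delta <= z_a * math.sqrt(var0 / m):
        raise ValueError("no sample size solution: delta is too small")
    # Coefficients of the quadratic in sqrt(n).
    a = z_a ** 2 * var0 / (m * var1) - delta ** 2 / var1
    b = 2 * delta * z_b / math.sqrt(var1)
    c = z_a ** 2 - z_b ** 2
    root = (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return root ** 2


if __name__ == "__main__":
    # Setting used later in the paper: m = 80, unit variances, delta = 0.3.
    n0 = ms_sample_size(delta=0.3, m=80, var0=1.0, var1=1.0)
    print(math.ceil(n0))  # 144
```

With a smaller assumed difference such as `delta=0.1` in the same setting, the existence condition fails and the sketch raises an error instead of returning a sample size.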

In the one-sample approach, the hypotheses are specified as $H_0:\theta_1=\bar{Y}$ and $H_1:\theta_1>\bar{Y}$ based on the assumption that $\theta_0=\bar{Y}$. The one-sample test statistic $Z_1(X,Y)=(\bar{X}-\bar{Y})/\sqrt{\sigma_1^2/n}$ is employed. The sample size estimate is $n_1=\sigma_1^2(z_{1-\alpha}+z_{1-\beta})^2/\Delta^2$.

In the RCT approach, the HC group is treated as a regular control arm in an RCT. The sample size is estimated by $n_2=\sigma_1^2(z_{1-\alpha}+z_{1-\beta})^2m/[m\Delta^2-(z_{1-\alpha}+z_{1-\beta})^2\sigma_0^2]$, which is based on a two-sample test.

The three approaches produce drastically different sample size estimates. For example, given $m = 80$, $\sigma_0^2=\sigma_1^2=1$, $\Delta = 0.3$, $\alpha = 0.05$, and $1-\beta = 0.8$, the estimated sample sizes are $n_0 = 144$, $n_1 = 69$, and $n_2 = 487$, respectively. The formulas of $n_0$, $n_1$ and $n_2$ do not depend on the specific values of the HC sample mean ($\bar{Y}$) or true mean ($\theta_0$). The test statistics, however, are calculated based on the HC sample mean. In a particular study, the unknown difference between $\bar{Y}$ and $\theta_0$ has a great impact on the realized power and type I error. We conduct Simulation 1 to compare the performance of $n_0$, $n_1$ and $n_2$. Details of the simulation algorithm are presented in Appendix A.2.
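The one-sample and RCT sample sizes quoted in this example can be reproduced with a short sketch. The function names are hypothetical, and the guard against a non-positive denominator in the RCT formula is an addition of this sketch, not a condition stated in the paper.

```python
import math
from statistics import NormalDist


def _z_sum(alpha, beta):
    """z_{1-alpha} + z_{1-beta} for the standard normal distribution."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha) + nd.inv_cdf(1 - beta)


def one_sample_size(delta, var1, alpha=0.05, beta=0.2):
    """n1 = var1 * (z_{1-alpha} + z_{1-beta})^2 / delta^2, before rounding up."""
    return var1 * _z_sum(alpha, beta) ** 2 / delta ** 2


def rct_sample_size(delta, m, var0, var1, alpha=0.05, beta=0.2):
    """n2 from the two-sample (RCT) formula, before rounding up."""
    z2 = _z_sum(alpha, beta) ** 2
    denom = m * delta ** 2 - z2 * var0
    if denom <= 0:
        # Assumption of this sketch: report failure when the HC group is
        # too small for any n to reach the requested power.
        raise ValueError("no finite n2: HC sample size m is too small")
    return var1 * z2 * m / denom


if __name__ == "__main__":
    print(math.ceil(one_sample_size(0.3, 1.0)))          # 69
    print(math.ceil(rct_sample_size(0.3, 80, 1.0, 1.0)))  # 487
```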

The realization of a particular HC data set ($Y^{(k)}$) leads to a conditional power ($q_v^{(k)}$) and a conditional type I error ($h_v^{(k)}$) under sample size $n_v$. Over the random realizations of $Y$, we have a random power, $q_v$, and a random type I error, $h_v$. The distributions of $q_v$ and $h_v$ provide a global view of the variability in power and type I error for HCTs given an unknown HC treatment effect.

Without loss of generality, we set the true HC effect at $\theta_0 = 0$. We also assume $m = 80$, $\sigma_0^2=\sigma_1^2=1$, $\Delta = 0.3$, $\alpha = 0.05$, and $1-\beta = 0.8$. Figure 1 shows the results of Simulation 1. The graphs in the first column indicate that both the conditional power and type I error decrease monotonically as the difference between the observed and true HC treatment effect, $\bar{Y}^{(k)}-\theta_0$, increases. Table 1 lists the conditional power and type I errors given $Y^{(k)}$, with $\bar{Y}^{(k)}$ ranging from two standard errors below to two standard errors above $\theta_0$. For example, under $n_0$, when the observed HC effect ($\bar{Y}^{(k)}$) is one standard error ($\pm\sqrt{\sigma_0^2/m}$) away from the true effect ($\theta_0$), the conditional power changes from 0.313 to 0.987 and the conditional type I error from 0 to 0.088, which deviate far from the nominal levels of $1-\beta = 0.8$ and $\alpha = 0.05$. Note that such a deviation is not a rare event because for a particular HC data set, there is a 32% chance that the sample mean is one standard error or further away from its true mean. The second and third columns of Figure 1 show that the distributions of type I error and power are extremely skewed, which is also revealed by the difference between their means and medians (achieved at $\bar{Y}^{(k)}=\theta_0$) in Table 1.

Figure 1. The type I errors ($h_v^{(k)}$) and powers ($q_v^{(k)}$) under $n_0$, $n_1$, and $n_2$. The first column plots $h_v^{(k)}$ and $q_v^{(k)}$ versus the difference between the observed and true HC effects, $\bar{Y}^{(k)}-\theta_0$, with black for $h_v^{(k)}$ and red for $q_v^{(k)}$. The second ...

Table 1. Simulation 1, Conditional Type I Errors and Powers Given $Y^{(k)}$

We briefly explain why the power and type I error have skewed distributions over the random realizations of the HC data. Taking the one-sample approach ($n_1$) as an example, for a particular HC data set $Y^{(k)}$, it can be shown that the conditional type I error is

$h_1^{(k)}=1-\Phi\left(z_{1-\alpha}-\frac{\theta_0-\bar{Y}^{(k)}}{\sqrt{\sigma_1^2/n_1}}\right).$

In the parentheses, $z_{1-\alpha}$ is usually the dominant term and shifts the probability computation to the tail area of the normal distribution. As a result, although the sample mean ($\bar{Y}^{(k)}$) is symmetric around the true mean ($\theta_0$), the impact of the sample mean being greater or smaller than the true mean is different. For $\bar{Y}^{(k)}>\theta_0$, the conditional type I errors have a range of $(0, \alpha)$, which is narrow under commonly specified significance levels. On the other hand, for $\bar{Y}^{(k)}<\theta_0$, the conditional type I errors have a much wider range, $(\alpha, 1)$. In summary, it is because researchers usually set power and significance level in the tail area (i.e., $\alpha$ close to 0 and $1-\beta$ close to 1) that the random power and type I error have skewed distributions.
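The asymmetry described above can be seen numerically by evaluating the conditional type I error of the one-sample approach at HC sample means one standard error above and below the true mean. This is a minimal sketch; the function name and the choice of evaluation points are illustrative and not taken from the paper.

```python
import math
from statistics import NormalDist


def conditional_type1_one_sample(d, n1, var1=1.0, alpha=0.05):
    """Conditional type I error h1 given d = ybar - theta0.

    Implements h1 = 1 - Phi(z_{1-alpha} - (theta0 - ybar) / sqrt(var1 / n1)).
    """
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha)
    return 1 - nd.cdf(z_a + d / math.sqrt(var1 / n1))


if __name__ == "__main__":
    se = math.sqrt(1.0 / 80)  # standard error of the HC mean, m = 80
    for d in (-se, 0.0, se):
        print(round(d, 4), conditional_type1_one_sample(d, n1=69))
```

At `d = 0` the value equals $\alpha$; an HC mean one standard error below $\theta_0$ inflates the conditional type I error by far more than an HC mean one standard error above $\theta_0$ deflates it.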

Finally, Table 1 provides empirical evidence for the theory presented in Theorem 2, which states a unified framework for $n_v$ ($v = 0, 1, 2$).

Theorem 2. The sample sizes ($n_0$, $n_1$, $n_2$) control the random power and type I error in such a way that

1. The M-S approach ($n_0$) controls the mean of type I error at $\alpha$ and the median of power at $1-\beta$;
2. The one-sample approach ($n_1$) controls the medians of type I error and power at $\alpha$ and $1-\beta$, respectively;
3. The RCT approach ($n_2$) controls the means of type I error and power at $\alpha$ and $1-\beta$, respectively.

Proof. See Appendix A.3.

Theorem 2 suggests that the M-S approach tries to reach a compromise between the one-sample and RCT approaches by controlling the mean type I error at $\alpha$ and the median power at $1-\beta$.

3 Sample Size Controlling the Percentiles

Simulation 1 shows that the distributions of power and type I error, observed over all the random realizations of HC data, are extremely skewed. For random variables with extremely skewed distributions, making decisions based on a location parameter such as a percentile is usually more desirable than based on the mean. We propose a sample size formula to control arbitrary percentiles of the random power and type I error. It provides a more sensible way to assess the risk in HCTs.

Theorem 3. Suppose in an HCT the goal is to control the $(1-p_q)$th percentile of the power at $1-\beta$, and the $p_h$th percentile of the type I error at $\alpha$. Then the required sample size is

$n^*=\frac{(z_{1-\alpha}+z_{1-\beta})^2\sigma_1^2}{\left[\Delta-(z_{p_q}+z_{p_h})\sqrt{\sigma_0^2/m}\right]^2},$
(6)

and the null hypothesis is rejected if

$Z^*(X,Y)=\frac{\bar{X}-\bar{Y}-z_{p_h}\sqrt{\sigma_0^2/m}}{\sqrt{\sigma_1^2/n^*}}>z_{1-\alpha}.$

The parameters $p_q$ and $p_h$ can be specified arbitrarily as long as the condition $\Delta>(z_{p_q}+z_{p_h})\sqrt{\sigma_0^2/m}$ holds.

Proof. See Appendix A.4.

According to Theorem 3, let $q^*$ and $h^*$ be the random power and type I error under sample size $n^*$. Then we have $P(q^* > 1-\beta) = p_q$ and $P(h^* < \alpha) = p_h$. Suppose an HCT is conducted to assess the effectiveness of a new drug. Given a particular HC data set, the realized power and type I error depend on the random difference between the HC sample mean ($\bar{Y}$) and its true mean ($\theta_0$). However, if the researchers enroll $n^*$ subjects, they can be confident that, over all the possible HC realizations, the power of the clinical trial would be greater than $1-\beta$ with probability $p_q$, and the type I error would be smaller than $\alpha$ with probability $p_h$. It is easy to check that the sample size under the one-sample approach ($n_1$) is a special case of $n^*$ at $p_q = p_h = 0.5$.
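A minimal Python sketch of Theorem 3 follows, covering both the sample size formula (6) and the rejection rule based on $Z^*(X, Y)$. The function and argument names are hypothetical; the numerical setting ($m = 80$, unit variances, $\Delta = 0.3$) is the one used in the simulations of this paper.

```python
import math
from statistics import NormalDist

ND = NormalDist()


def percentile_sample_size(delta, m, var0, var1, p_q, p_h, alpha=0.05, beta=0.2):
    """n* of Equation (6), before rounding up to an integer."""
    margin = (ND.inv_cdf(p_q) + ND.inv_cdf(p_h)) * math.sqrt(var0 / m)
    # Condition of Theorem 3: delta must exceed the percentile margin.
    if delta <= margin:
        raise ValueError("delta must exceed (z_pq + z_ph) * sqrt(var0 / m)")
    z = ND.inv_cdf(1 - alpha) + ND.inv_cdf(1 - beta)
    return z ** 2 * var1 / (delta - margin) ** 2


def reject_null(xbar, ybar, n_star, m, var0, var1, p_h, alpha=0.05):
    """Rejection rule of Theorem 3: Z*(X, Y) > z_{1-alpha}."""
    shift = ND.inv_cdf(p_h) * math.sqrt(var0 / m)
    z_star = (xbar - ybar - shift) / math.sqrt(var1 / n_star)
    return z_star > ND.inv_cdf(1 - alpha)


if __name__ == "__main__":
    # p_q = p_h = 0.5 recovers the one-sample size n1.
    print(math.ceil(percentile_sample_size(0.3, 80, 1.0, 1.0, 0.5, 0.5)))  # 69
    print(math.ceil(percentile_sample_size(0.3, 80, 1.0, 1.0, 0.7, 0.8)))  # 286
```

Because $z_{p_q}$ and $z_{p_h}$ enter Equation (6) only through their sum, swapping the two probabilities leaves the sample size unchanged, although the rejection rule itself depends on $p_h$ alone.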

In other words, we propose sample size $n^*$ to achieve the goal that the operational characteristics (realized power and type I error) of an HCT are more desirable than the nominal levels with certain pre-specified probabilities ($p_q$ and $p_h$). Controlling the power and type I error by percentiles instead of means (the RCT approach) is more effective when their distributions are extremely skewed. Furthermore, the arbitrarily specified $p_q$ and $p_h$ provide flexibility in accommodating researchers' preference for risk control.

Based on the same setting as in Simulation 1, we conduct Simulation 2 to explore the properties of $n^*$. We consider different combinations of $p_q$ and $p_h$, ranging from 0.5 to 0.9. Table 2 lists the estimated sample size $n^*$, the empirical means of type I error and power, $\bar{h}^*=\sum_{k=1}^{K}h^{*(k)}/K$ and $\bar{q}^*=\sum_{k=1}^{K}q^{*(k)}/K$, and the empirical percentiles, $\hat{p}_h=\sum_{k=1}^{K}I(h^{*(k)}<\alpha)/K$ and $\hat{p}_q=\sum_{k=1}^{K}I(q^{*(k)}>1-\beta)/K$. From Table 2 we have two observations. First, for sample size calculation, $p_q$ and $p_h$ are exchangeable in the sense that switching their values leads to the identical sample size, which is also clear from Equation (6). Second, when the distributions of type I error and power are extremely skewed, it is more sensible to control the percentiles instead of the means. For example, under ($p_h = 0.8$, $p_q = 0.7$), we are confident that by enrolling 286 patients, the type I error would be smaller than 0.05 with probability 0.8, and the power would be greater than 0.8 with probability 0.7. This provides high assurance for researchers. Note that in this case the mean power is 0.8 but the mean type I error is 0.06, slightly higher than the nominal $\alpha = 0.05$. As demonstrated in Table 1, the required sample size is $n_2 = 487$ (under the RCT approach) if we control the mean power and mean type I error at 0.8 and 0.05, respectively. That is, in order to reduce the mean type I error from 0.06 to 0.05, researchers need to enroll 201 additional patients, due to the skewness in the type I error distribution.

Table 2. Simulation 2, Type I Errors and Powers under $n^*$

In Simulations 1 and 2, we have assumed the population variances of the HC and experimental groups ($\sigma_0^2$ and $\sigma_1^2$) to be known, which is usually unrealistic in practice. We conduct Simulation 3 to further assess the performance of $n^*$ in a more realistic scenario. It proceeds as follows: a) To compute the required sample size $n^*$, the assumed $\Delta$ and $\sigma_1^2$ are plugged into Equation (6). However, $\sigma_0^2$ is replaced by $s_0^2$, the HC sample variance. b) In hypothesis testing, we compute the test statistic $Z^*(X, Y)$ with $\sigma_0^2$ replaced by $s_0^2$, and $\sigma_1^2$ replaced by $s_1^2$, the sample variance in the experimental group. Thus the estimated sample size includes additional uncertainty from $s_0^2$, and the test statistic includes additional uncertainty from $s_0^2$ and $s_1^2$. The detailed algorithm of Simulation 3 is presented in Appendix A.5.

Table 3 lists the results of Simulation 3. The sample size $n^*$ becomes random when we replace the HC population variance ($\sigma_0^2$) with a random sample variance ($s_0^2$). When $p_q = p_h = 0.5$, the impact of the additional randomness is negligible. That is, after applying the integer restriction on sample size, the calculated $n^*$ is constant at 69, the same as its counterpart in Table 2. As $p_q$ or $p_h$ increases, the standard deviation of $n^*$ increases, and the mean of $n^*$ deviates from the fixed sample size (in Table 2) computed under the population variance. This is because under large $p_q$ and $p_h$, the distribution of $n^*$ is heavily skewed to the right. Such skewness also arises from the tail behavior of the normal distribution. For example, under ($p_q = 0.9$, $p_h = 0.8$), the random sample size has a mean of 3065.59 and a standard deviation of 29348.36. Its distribution has an extremely long tail (the 99th percentile is greater than 200000). The skewness is much more severe under ($p_q = p_h = 0.9$), where we omit the simulation due to computer overflow. On the other hand, the operational characteristics of the clinical trial remain unchanged with additional randomness from the sample variances. Specifically, the realized controlling percentiles, $\hat{p}_h$ and $\hat{p}_q$, agree with the nominal levels. Furthermore, the means of power and type I error are close to those (in Table 2) obtained under the population variances. Taken together, the proposed sample size $n^*$ successfully controls the percentiles of power and type I error even when the population variances are unknown.

Table 3. Simulation 3, Type I Errors and Powers under $n^*$

4 Example

The safety and efficacy of laparoscopic rectopexy for rectal prolapse will be compared with those of the open rectopexy procedure, which was conducted several months ago by the same group of surgeons at the same institution [16]. Data will be collected prospectively for the laparoscopic rectopexy group and by hospital chart review for the HC group. The HC group includes 24 consecutive patients who had undergone conventional open rectopexy without having concomitant gynecologic procedures. These patients required an average of 71.5 milligrams of morphine during the first 48 hours after the procedure with a standard deviation of 45.9 milligrams. It is expected that the average amount of morphine needed during the first 48 hours after laparoscopic rectopexy will be 41.5 milligrams with a standard deviation of 35.0 milligrams. We estimate the number of patients needed to detect the difference in morphine requirement during the first 48 hours between open and laparoscopic procedures, controlling the 70th percentile ($p_h = 0.7$) of type I error at 5%, and the 30th percentile ($p_q = 0.7$) of power at 80%. The numbers of patients needed in the laparoscopic group are $n_0 = 14$ (the M-S approach), $n_1 = 9$ (the one-sample approach), $n_2 = 22$ (the RCT approach), and $n^* = 19$ (the proposed approach), respectively. Note that the above sample sizes are obtained by replacing the unknown HC population variance with the sample variance.
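The four sample sizes in this example can be reproduced with the following sketch. It assumes that $\Delta$ is taken as the absolute difference between the two mean morphine requirements ($71.5 - 41.5 = 30$ milligrams) and that $\alpha = 0.05$ and $1-\beta = 0.8$; the helper names are hypothetical.

```python
import math
from statistics import NormalDist

ND = NormalDist()
Z_A = ND.inv_cdf(0.95)  # z_{1-alpha}, alpha = 0.05
Z_B = ND.inv_cdf(0.80)  # z_{1-beta}, 1 - beta = 0.8

M = 24                # HC patients
VAR0 = 45.9 ** 2      # HC sample variance, used in place of sigma_0^2
VAR1 = 35.0 ** 2      # assumed variance in the laparoscopic group
DELTA = 71.5 - 41.5   # assumption: absolute difference in mean morphine use


def n_ms():
    """M-S approach, Theorem 1."""
    a = Z_A ** 2 * VAR0 / (M * VAR1) - DELTA ** 2 / VAR1
    b = 2 * DELTA * Z_B / math.sqrt(VAR1)
    c = Z_A ** 2 - Z_B ** 2
    return ((-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)) ** 2


def n_one_sample():
    """One-sample approach."""
    return VAR1 * (Z_A + Z_B) ** 2 / DELTA ** 2


def n_rct():
    """RCT (two-sample) approach."""
    z2 = (Z_A + Z_B) ** 2
    return VAR1 * z2 * M / (M * DELTA ** 2 - z2 * VAR0)


def n_percentile(p_q=0.7, p_h=0.7):
    """Proposed approach, Equation (6)."""
    margin = (ND.inv_cdf(p_q) + ND.inv_cdf(p_h)) * math.sqrt(VAR0 / M)
    return (Z_A + Z_B) ** 2 * VAR1 / (DELTA - margin) ** 2


if __name__ == "__main__":
    for name, n in [("n0", n_ms()), ("n1", n_one_sample()),
                    ("n2", n_rct()), ("n*", n_percentile())]:
        print(name, math.ceil(n))  # 14, 9, 22, 19
```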

5 Discussion

We have provided a unified framework for three existing approaches (M-S, one-sample, and RCT) in HCTs, by showing that they either control the mean or the median of power and type I error. We further developed a closed-form sample size formula to control arbitrary percentiles of the random power and type I error. It provides more flexibility in assessing the risk in HCTs and accommodates the extreme skewness in the distributions of power and type I error. We limited our discussion to the HCTs with continuous outcomes. In the future we will extend it to HCTs with binary and survival time outcomes.

Similar to the existing approaches, the proposed sample size formula ($n^*$) requires the population variances of the HC and the experimental group to be known. Through a simulation study, we demonstrated that the proposed approach successfully controls the percentiles of power and type I error in a more realistic scenario, where the true variances are unknown and are replaced with observed sample variances. One reviewer kindly pointed out that in situations where the measurements are continuous with bounded support, say on $(a, b)$, a sample size formula can be derived without requiring the population variances. Specifically, we can define $\hat{P}=(\bar{X}-a)/(b-a)$ with $0<\hat{P}<1$. Applying the arcsin transformation on $\hat{P}$, we can calculate sample size based on $\sin^{-1}(\hat{P})$, whose variance is free of the sampled population's true variance.

Lee and Tseng [14] presented sample size calculation for HCTs with binary outcomes controlling the means of power and type I error. Theorem 2 states that the same goal is achieved by $n_2$ for HCTs with continuous outcomes. The computation in [14] is more complicated due to the transformation performed on binary data. For continuous outcomes, when the HC variance is assumed to be known, the sample size formula does not depend on observations from the HC group. Thus one pair of null and alternative hypotheses leads to one unique sample size estimate. For binary outcomes, the sample size formula computed under the arcsin transformation depends on the observations from the HC group. Thus one pair of hypotheses leads to many possible sample size estimates, each determined by a random realization of the HC data. In [14], the authors had to deal with the expectation of sample sizes.

The term ($\bar{Y}+z_{p_h}\sqrt{\sigma_0^2/m}$) in the numerator of $Z^*(X, Y)$ is the $p_h$th percentile of the posterior distribution $[\theta_0 \mid Y]$ under a flat prior, which suggests a potential connection of the proposed approach to a Bayesian sample size calculation. Nonetheless, in Appendix A.4, the derivation of $n^*$ is strictly in the frequentist paradigm, where the randomness of type I error and power comes from the uncertainty in the HC data $Y$, not from a random variable $\theta_0$ (as in a Bayesian method). For example, we set type I error $h^* = \alpha$ at $\bar{Y}=\theta_0-z_{p_h}\sqrt{\sigma_0^2/m}$, the $(1-p_h)$th percentile of $\bar{Y}\sim N(\theta_0,\sigma_0^2/m)$. Because $h^*$ is monotonically decreasing in $\bar{Y}$, the $p_h$th percentile of $h^*$ is controlled at $\alpha$.

6 Acknowledgments

This study is supported in part by NIH grants UL1 RR024982 and P50 CA70907. The authors thank the two reviewers and associate editor for their constructive comments and suggestions.

Appendix A.1. Proof of Theorem 1

Proof. Assuming $\theta_0-\bar{Y}=0$ and applying some simple algebra, we transform (4) to

$z_{1-\alpha}\sqrt{1+\frac{n\sigma_0^2}{m\sigma_1^2}}=\frac{\Delta\sqrt{n}}{\sqrt{\sigma_1^2}}-z_{1-\beta}.$

Squaring on both sides and rearranging, we have

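$\left(\frac{z_{1-\alpha}^2\sigma_0^2}{m\sigma_1^2}-\frac{\Delta^2}{\sigma_1^2}\right)n+\frac{2\Delta z_{1-\beta}}{\sqrt{\sigma_1^2}}\sqrt{n}+(z_{1-\alpha}^2-z_{1-\beta}^2)=0.$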
(7)

From (7) we can find a closed-form solution for $\sqrt{n}$ subject to the constraint that $\sqrt{n}>0$.

First we need $b^2-4ac\ge 0$, where $a$, $b$ and $c$ are defined in Theorem 1. This condition implies $\Delta\ge z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}$, and two possible roots

$r_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\quad\text{and}\quad r_2=\frac{-b-\sqrt{b^2-4ac}}{2a}.$

Fact 1. No plausible solution exists under $z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}\le\Delta\le z_{1-\alpha}\sqrt{\sigma_0^2/m}$.

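• If $\Delta=z_{1-\alpha}\sqrt{\sigma_0^2/m}$ then $a = 0$, and the solution to (7) is $r = -c/b$. Because $c > 0$ when $\alpha < \beta$ and $b > 0$ by definition, we eliminate $r$ due to the positive constraint on $\sqrt{n}$.
• If $z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}\le\Delta<z_{1-\alpha}\sqrt{\sigma_0^2/m}$ then $a > 0$ and $4ac > 0$. It is easy to show that $r_1 < 0$ and $r_2 < 0$.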

Sufficiency: We demonstrate that the condition $\Delta>z_{1-\alpha}\sqrt{\sigma_0^2/m}$ implies (5) being the unique sample size solution. From the condition we have $a < 0$ and $4ac < 0$. Together with $b > 0$, it is easy to show that $r_1 < 0$ and $r_2 > 0$. Thus (5) is the unique sample size solution.

Necessity: We demonstrate that (5) being the unique sample size solution implies $\Delta>z_{1-\alpha}\sqrt{\sigma_0^2/m}$. (5) being the unique solution is equivalent to $r_2$ being the unique solution for $\sqrt{n}$. There are two scenarios:

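1. $b^2-4ac=0$, that is, $\Delta=z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}$. It is eliminated due to Fact 1.
2. $b^2-4ac>0$ and $r_1<0$ and $r_2>0$. Note that the condition $b^2-4ac>0$ implies $\Delta>z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}$. We eliminate $z_{1-\alpha}\sqrt{\sigma_0^2/m}\sqrt{1-z_{1-\beta}^2/z_{1-\alpha}^2}<\Delta\le z_{1-\alpha}\sqrt{\sigma_0^2/m}$ based on Fact 1. The validity of $\Delta>z_{1-\alpha}\sqrt{\sigma_0^2/m}$ is established by Sufficiency.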

Thus we complete the proof.

Appendix A.2. Algorithm of Simulation 1

Simulation 1. First we compute sample sizes $n_v$ for a given set of ($m$, $\sigma_0^2$, $\sigma_1^2$, $\Delta$, $\alpha$, $\beta$), where $v = 0, 1, 2$ denote the M-S, one-sample, and RCT approach, respectively. Then we generate null experimental data sets $X_v^{0(l)}=\{X_1^{0(l)},\cdots,X_{n_v}^{0(l)}\}$ from $N(\theta_0,\sigma_1^2)$, and alternative experimental data sets $X_v^{(l)}=\{X_1^{(l)},\cdots,X_{n_v}^{(l)}\}$ from $N(\theta_0+\Delta,\sigma_1^2)$, for $l = 1, \ldots, L$ and $v = 0, 1, 2$. The superscript 0 indicates that the null distribution is true, and the superscript $(l)$ indicates the $l$th experimental data set generated. For iteration $k = 1, \ldots, K$,

1. Simulate HC data $Y^{(k)}=\{Y_1^{(k)},\cdots,Y_m^{(k)}\}$ from $N(\theta_0,\sigma_0^2)$;
2. Estimate the conditional type I error given $Y^{(k)}$ by $h_v^{(k)}\approx\sum_{l=1}^{L}I(Z_v(X_v^{0(l)},Y^{(k)})>z_{1-\alpha})/L$. Note that $Z_v(X, Y) = Z(X, Y)$ for $v = 0$ and 2;
3. Estimate the conditional power given $Y^{(k)}$ by $q_v^{(k)}\approx\sum_{l=1}^{L}I(Z_v(X_v^{(l)},Y^{(k)})>z_{1-\alpha})/L$.

The superscript $(k)$ of $h_v^{(k)}$ and $q_v^{(k)}$ indicates that they are computed given the $k$th simulated HC data set. We set $K = L = 5000$.
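The algorithm above can be sketched in Python under one simplification that is not part of the original algorithm: instead of estimating each conditional type I error and power from $L$ simulated experimental data sets, the sketch evaluates them exactly from the normal distribution of the experimental sample mean, and it draws the HC sample mean directly rather than the full HC data set. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, median

ND = NormalDist()


def conditional_rates(d, n, one_sample, m=80, var0=1.0, var1=1.0,
                      delta=0.3, alpha=0.05):
    """Conditional type I error h and power q given d = ybar - theta0."""
    se_x = math.sqrt(var1 / n)
    # Standard error in the denominator of the test statistic:
    # one-sample test (Z1) or two-sample test (Z).
    se_test = se_x if one_sample else math.sqrt(var1 / n + var0 / m)
    crit = ND.inv_cdf(1 - alpha) * se_test
    # Reject when xbar - ybar > crit, i.e. xbar - theta0 > crit + d.
    h = 1 - ND.cdf((crit + d) / se_x)
    q = 1 - ND.cdf((crit + d - delta) / se_x)
    return h, q


def simulate(n, one_sample, K=20000, m=80, var0=1.0, seed=2010):
    """Draw K realizations of ybar - theta0 and return lists of h and q."""
    rng = random.Random(seed)
    sd_ybar = math.sqrt(var0 / m)
    draws = [conditional_rates(rng.gauss(0.0, sd_ybar), n, one_sample,
                               m=m, var0=var0) for _ in range(K)]
    return [h for h, _ in draws], [q for _, q in draws]


if __name__ == "__main__":
    designs = {"M-S": (144, False), "one-sample": (69, True), "RCT": (487, False)}
    for name, (n, one_sample) in designs.items():
        hs, qs = simulate(n, one_sample)
        print(name, "mean h", round(mean(hs), 3), "median h", round(median(hs), 3),
              "mean q", round(mean(qs), 3), "median q", round(median(qs), 3))
```

Comparing the printed means and medians across the three designs gives an empirical check of which quantity each approach holds at its nominal level.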

Appendix A.3. Proof of Theorem 2

Proof. We first state the fact that $\bar{Y}\sim N(\theta_0,\sigma_0^2/m)$, $\bar{X}_v\sim N(\theta_0+\Delta,\sigma_1^2/n_v)$, $\bar{X}_v^0\sim N(\theta_0,\sigma_1^2/n_v)$, and $\mathrm{median}(\bar{Y})=\theta_0$. For $n_0$ and $n_2$, the null hypothesis is rejected if $Z(X, Y) > z_{1-\alpha}$. Marginalizing with respect to $Y$ is equivalent to marginalizing with respect to $\bar{Y}$. Thus for $v = 0, 2$,

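$E(h_v)=\int P\left(\frac{\bar{X}_v^0-\bar{Y}}{\sqrt{\sigma_1^2/n_v+\sigma_0^2/m}}>z_{1-\alpha}\,\middle|\,\bar{Y}\right)f(\bar{Y})\,d\bar{Y}=\iint I\left(\frac{\bar{X}_v^0-\bar{Y}}{\sqrt{\sigma_1^2/n_v+\sigma_0^2/m}}>z_{1-\alpha}\right)f(\bar{X}_v^0)f(\bar{Y})\,d\bar{X}_v^0\,d\bar{Y}=\int_{z_{1-\alpha}}^{+\infty}f(U_v)\,dU_v=P(U_v>z_{1-\alpha})=\alpha.$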

We have the third equality through random variable transformation, where $U_v=(\bar{X}_v^0-\bar{Y})/\sqrt{\sigma_1^2/n_v+\sigma_0^2/m}$ and it is easy to show that $U_v\sim N(0, 1)$.

In a similar fashion, we can show that $E(q_2) = 1-\beta$,

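$E(q_2)=\int P\left(\frac{\bar{X}_2-\bar{Y}}{\sqrt{\sigma_1^2/n_2+\sigma_0^2/m}}>z_{1-\alpha}\,\middle|\,\bar{Y}\right)f(\bar{Y})\,d\bar{Y}=P\left(U>z_{1-\alpha}-\frac{\Delta}{\sqrt{\sigma_1^2/n_2+\sigma_0^2/m}}\right)=1-\beta.$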

We have the second equality by defining $U=(\bar{X}_2-\bar{Y}-\Delta)/\sqrt{\sigma_1^2/n_2+\sigma_0^2/m}$. The third equality is obtained by plugging in the expression of $n_2$ and recognizing $U\sim N(0, 1)$.

We then show that $\mathrm{median}(q_0) = 1-\beta$. From (3) we have

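$q_0^{(k)}=1-\Phi\left(z_{1-\alpha}\frac{\sqrt{\sigma_1^2/n_0+\sigma_0^2/m}}{\sqrt{\sigma_1^2/n_0}}-\frac{\Delta+\theta_0-\bar{Y}^{(k)}}{\sqrt{\sigma_1^2/n_0}}\right).$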

First we recognize that $q_0^{(k)}$ is a decreasing function of $\bar{Y}^{(k)}$. Second, $n_0$ is the solution to $q_0^{(k)}=1-\beta$ by setting $\bar{Y}^{(k)}=\theta_0=\mathrm{median}(\bar{Y})$. These two points lead to the conclusion that $\mathrm{median}(q_0) = 1-\beta$. Note that $E(q_2)$ and $q_0^{(k)}$ have different expressions because the former marginalizes with respect to the random $\bar{Y}$, while the latter is defined conditional on a particular $Y^{(k)}$.

Now we show that $\mathrm{median}(h_1) = \alpha$,

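$h_1^{(k)}=P\left(\frac{\bar{X}_1^0-\bar{Y}^{(k)}}{\sqrt{\sigma_1^2/n_1}}>z_{1-\alpha}\right)=1-\Phi\left(z_{1-\alpha}-\frac{\theta_0-\bar{Y}^{(k)}}{\sqrt{\sigma_1^2/n_1}}\right).$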

Thus $h_1^{(k)}$ is a decreasing function of $\bar{Y}^{(k)}$ and $h_1^{(k)}=\alpha$ at $\bar{Y}^{(k)}=\theta_0=\mathrm{median}(\bar{Y})$. Thus we conclude $\mathrm{median}(h_1) = \alpha$. A similar argument leads to the conclusion that $\mathrm{median}(q_1) = 1-\beta$.

Appendix A.4. Proof of Theorem 3

Proof. First we demonstrate that based on $Z^*(X, Y)$, the $p_h$th percentile of type I error is controlled at $\alpha$: For a given $\bar{Y}$, the type I error is

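$h^*=P\left(\frac{\bar{X}^0-\bar{Y}-z_{p_h}\sqrt{\sigma_0^2/m}}{\sqrt{\sigma_1^2/n^*}}>z_{1-\alpha}\right)=1-\Phi\left(z_{1-\alpha}+\frac{\bar{Y}-(\theta_0-z_{p_h}\sqrt{\sigma_0^2/m})}{\sqrt{\sigma_1^2/n^*}}\right).$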

Thus $h^* = \alpha$ when $\bar{Y}=\theta_0-z_{p_h}\sqrt{\sigma_0^2/m}$, which is the $(1-p_h)$th percentile of $\bar{Y}\sim N(\theta_0,\sigma_0^2/m)$. Together with the fact that $h^*$ is a monotonically decreasing function of $\bar{Y}$, we have $P(h^*<\alpha)=P(\bar{Y}>\theta_0-z_{p_h}\sqrt{\sigma_0^2/m})=p_h$. Note that this statement holds for any $n^*$.

Then we solve for $n^*$ which controls the $(1-p_q)$th percentile of power at $1-\beta$: The conditional power given $\bar{Y}$ is

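$q^*=P\left(\frac{\bar{X}-\bar{Y}-z_{p_h}\sqrt{\sigma_0^2/m}}{\sqrt{\sigma_1^2/n^*}}>z_{1-\alpha}\right)=1-\Phi\left(z_{1-\alpha}+\frac{\bar{Y}-\theta_0-\Delta+z_{p_h}\sqrt{\sigma_0^2/m}}{\sqrt{\sigma_1^2/n^*}}\right).$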
(8)

It is obvious that $q^*$ is monotonically decreasing in $\bar{Y}$. Using this property, if we set $q^* = 1-\beta$ at $\bar{Y}=\theta_0+z_{p_q}\sqrt{\sigma_0^2/m}$, the $p_q$th percentile of $\bar{Y}$, we can achieve the goal of controlling the $(1-p_q)$th percentile of power at $1-\beta$, because

$P(q^*>1-\beta)=P(\bar{Y}<\theta_0+z_{p_q}\sqrt{\sigma_0^2/m})=p_q.$

Thus by plugging $q^* = 1-\beta$ and $\bar{Y}=\theta_0+z_{p_q}\sqrt{\sigma_0^2/m}$ into (8), we can solve for $n^*$ from the following equation,

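$1-\beta=1-\Phi\left(z_{1-\alpha}+\frac{(\theta_0+z_{p_q}\sqrt{\sigma_0^2/m})-\theta_0-\Delta+z_{p_h}\sqrt{\sigma_0^2/m}}{\sqrt{\sigma_1^2/n^*}}\right).$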

The solution for $n^*$, Equation (6), can be obtained after some algebra. The condition $\Delta>(z_{p_q}+z_{p_h})\sqrt{\sigma_0^2/m}$ is due to the positive constraint on $\sqrt{n^*}$.
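For readers who want the intermediate steps, the algebra can be spelled out as follows. The shorthand $s=\sqrt{\sigma_0^2/m}$ is introduced here for brevity and is not part of the original notation; the steps use only $\Phi(z_\beta)=\beta$ and $z_\beta=-z_{1-\beta}$.

```latex
% Shorthand (not in the original): s = \sqrt{\sigma_0^2/m}.
\begin{align*}
1-\beta &= 1-\Phi\!\left(z_{1-\alpha}+\frac{(z_{p_q}+z_{p_h})\,s-\Delta}{\sqrt{\sigma_1^2/n^*}}\right) \\
\Longrightarrow\quad
z_{1-\alpha}+\frac{(z_{p_q}+z_{p_h})\,s-\Delta}{\sqrt{\sigma_1^2/n^*}} &= z_{\beta} = -z_{1-\beta} \\
\Longrightarrow\quad
\sqrt{n^*}\,\bigl[\Delta-(z_{p_q}+z_{p_h})\,s\bigr] &= (z_{1-\alpha}+z_{1-\beta})\sqrt{\sigma_1^2} \\
\Longrightarrow\quad
n^* &= \frac{(z_{1-\alpha}+z_{1-\beta})^2\,\sigma_1^2}{\bigl[\Delta-(z_{p_q}+z_{p_h})\,s\bigr]^2}.
\end{align*}
```

The third line shows why the bracketed term must be positive: the right-hand side is positive for the usual choices of $\alpha$ and $\beta$, so a positive $\sqrt{n^*}$ exists only when $\Delta>(z_{p_q}+z_{p_h})\,s$.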

Appendix A.5. Algorithm of Simulation 3

Simulation 3. For iteration k = 1, …, K,

1. Generate HC data $Y^{(k)}=\{Y_1^{(k)},\cdots,Y_m^{(k)}\}$ from $N(\theta_0,\sigma_0^2)$. Compute the sample variance $s_0^{2(k)}$;
2. Estimate the required sample size $n^{*(k)}$ based on Formula (6), with $\sigma_0^2$ replaced by $s_0^{2(k)}$;
3. Given sample size $n^{*(k)}$, generate null experimental data sets $X^{0(k,l)}=\{X_1^{0(l)},\cdots,X_{n^{*(k)}}^{0(l)}\}$ from $N(\theta_0,\sigma_1^2)$, and alternative experimental data sets $X^{(k,l)}=\{X_1^{(l)},\cdots,X_{n^{*(k)}}^{(l)}\}$ from $N(\theta_0+\Delta,\sigma_1^2)$, for $l = 1, \ldots, L$;
4. Compute the empirical type I error by $h^{*(k)}\approx\sum_{l=1}^{L}I(Z^*(X^{0(k,l)},Y^{(k)})>z_{1-\alpha})/L$. Note that we replace the population variances ($\sigma_0^2$ and $\sigma_1^2$) in $Z^*(X^{0(k,l)},Y^{(k)})$ by sample variances ($s_0^{2(k)}$ and $s_1^{2(k,l)}$). Here $s_1^{2(k,l)}$ is the sample variance of $X^{0(k,l)}$. Similarly, we compute the empirical power $q^{*(k)}$.
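Steps 1 and 2 of this algorithm, which make the sample size itself random, can be sketched as follows. The sketch covers only the sample size part (not the hypothesis tests of steps 3 and 4), the function names are hypothetical, and raising an error when the condition of Theorem 3 fails for a simulated variance is a choice of this sketch rather than something specified in the paper.

```python
import math
import random
from statistics import NormalDist, variance

ND = NormalDist()


def n_star(delta, m, var0, var1, p_q, p_h, alpha=0.05, beta=0.2):
    """Equation (6), before rounding up; var0 may be a sample variance."""
    margin = (ND.inv_cdf(p_q) + ND.inv_cdf(p_h)) * math.sqrt(var0 / m)
    if delta <= margin:
        raise ValueError("condition of Theorem 3 fails for this variance")
    z = ND.inv_cdf(1 - alpha) + ND.inv_cdf(1 - beta)
    return z ** 2 * var1 / (delta - margin) ** 2


def simulate_sample_sizes(K, p_q, p_h, m=80, delta=0.3, theta0=0.0,
                          var0=1.0, var1=1.0, seed=1):
    """Steps 1-2: simulate HC data, plug its sample variance into (6)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(K):
        y = [rng.gauss(theta0, math.sqrt(var0)) for _ in range(m)]
        s0_sq = variance(y)  # HC sample variance (divisor m - 1)
        sizes.append(math.ceil(n_star(delta, m, s0_sq, var1, p_q, p_h)))
    return sizes


if __name__ == "__main__":
    print(set(simulate_sample_sizes(200, 0.5, 0.5)))  # {69}
    sizes = simulate_sample_sizes(200, 0.7, 0.8)
    print(min(sizes), max(sizes))
```

With $p_q = p_h = 0.5$ the percentile margin vanishes, so the HC sample variance drops out of Equation (6) and every replicate returns the same sample size; for larger $p_q$ and $p_h$ the simulated sample sizes vary with $s_0^2$.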

References

[1] Makuch RW, Simon RM. Sample size considerations for non-randomized comparative studies. Journal of Chronic Diseases. 1980;33(3):175–181.
[2] Vickers AJ, Ballen V, Scher HI. Setting the bar in phase II trials: The use of historical data for determining "go/no go" decision for definitive phase III testing. Clinical Cancer Research. 2007;13(3):972–976.
[3] Cho SD, Krishnaswami S, Mckee JC, Zallen G, Silen ML, Bliss DW. Analysis of 29 consecutive thoracoscopic repairs of congenital diaphragmatic hernia in neonates compared to historical controls. Journal of Pediatric Surgery. 2009;44(1):80–86.
[4] Abe T, Kakemura T, Fujinuma S, Maetani I. Successful outcomes of EMR-L with 3D-EUS for rectal carcinoids compared with historical controls. World Journal of Gastroenterology. 2008;14(25):4054–4058.
[5] Storm C, Steffen I, Schefold JC, Krueger A, Oppert M, Jorres A, Hasper D. Mild therapeutic hypothermia shortens intensive care unit stay of survivors after out-of-hospital cardiac arrest compared to historical controls. Critical Care. 2008;12(3).
[6] Van Rooij WJ, De Gast AN, Sluzewski M. Results of 101 aneurysms treated with polyglycolic/polylactic acid microfilament nexus coils compared with historical controls treated with standard coils. American Journal of Neuroradiology. 2008;29(5):991–996.
[7] Ando R, Nakamura A, Nagatani M, Yamakawa S, Ohira T, Takagi M, Matsushima K, Aoki A, Fujita Y, Tamura K. Comparison of past and recent historical control data in relation to spontaneous tumors during carcinogenicity testing in Fischer 344 rats. Journal of Toxicologic Pathology. 2008;21(1):53–60.
[8] Song JY, Chung BS, Choi KC, Shin BS. A 5-year period clinical observation on herpes zoster and the incidence of postherpetic neuralgia (2002–2006); a comparative analysis with the historical control group of a previous study (1995–1999). Korean Journal of Dermatology. 2008;46(4):431–436.
[9] Loudon I. The use of historical controls and concurrent controls to assess the effects of sulphonamides, 1936–1945. Journal of the Royal Society of Medicine. 2008;101(3):148–155.
[10] Dixon DO, Simon R. Sample size considerations for studies comparing survival curves using historical controls. Journal of Clinical Epidemiology. 1988;41(12):1209–1213.
[11] Chang MN, Shuster JJ, Kepner JL. Group sequential designs for phase II trials with historical controls. Controlled Clinical Trials. 1999;20(4):353–364.
[12] Kepner J, Wackerly D. Some observations on the Makuch/Simon approach to sample size determination in clinical trials with historical controls. Communications in Statistics Part B: Simulation and Computation. 2001;30(3):611–621.
[13] Chang MN, Shuster JJ, Kepner JL. Sample sizes based on exact unconditional tests for phase II clinical trials with historical controls. Journal of Biopharmaceutical Statistics. 2004;14(1):189–200.
[14] Lee JJ, Tseng C. Uniform power method for sample size calculation in historical control studies with binary response. Controlled Clinical Trials. 2001;22(4):390–400.
[15] Korn EL, Freidlin B. Conditional power calculations for clinical trials with historical controls. Statistics in Medicine. 2006;25(17):2922–2931.
[16] Abraham NS, DuraiRaj R, Young JM, Young CJ, Solomon MJ. How does an historic control study of a surgical procedure compare with the "gold standard"? Diseases of the Colon and Rectum. 2006;49(8):1141–1148.
