We make three recommendations concerning the design and analysis of a randomized trial of cancer screening.
(1) Use death from cancer as the primary endpoint, but review death records carefully and report all causes of death
The primary endpoint of most cancer screening trials is death from cancer. Recently Black [3] identified two types of bias that can affect the assessment of the cancer death endpoint. Sticky-diagnosis bias arises when deaths from an uncertain cause are more likely to be attributed to cancer if there was a previous diagnosis of cancer, especially if the diagnosis was relatively recent. If there were overdiagnosis, sticky-diagnosis bias would induce a higher cancer death rate in the intervention group than is actually the case. Slippery-linkage bias, the second type, occurs because deaths that are caused or triggered by screening, work-up, or a subsequent therapy (e.g., perforation of the colon and perhaps cardiovascular deaths) are not attributed to screening.
Using all deaths as an endpoint avoids these biases but leads to prohibitive sample sizes as shown in the following calculations based on a power of 80% and a one-sided type I error of 2.5%.
First consider the design of a randomized trial with a cancer death endpoint. Under the null hypothesis, the probability of cancer death in each group is p. Under the alternative hypothesis the probability of cancer death is p in the control group and p-d in the study group, where d is the probability of cancer death in the control group minus the probability of cancer death in the screened group. For computing sample size, we assume d is positive. Assuming a Poisson distribution for the number of cancer deaths, the sample size (for both groups combined) for a cancer death endpoint is
$$N_{\text{cancer}} = \frac{2\left(1.96\sqrt{2\,v_{\text{cancer},H_0}} + 0.84\sqrt{v_{\text{cancer},H_0} + v_{\text{cancer},H_A}}\right)^2}{d^2},$$
where $v_{\text{cancer},H_0} = p$ and $v_{\text{cancer},H_A} = p - d$ are the variances for one subject under the null and alternative hypotheses, respectively.
Now consider the design of a randomized trial with an all death endpoint. Let k denote the probability of death from causes unrelated to either cancer or screening. Under the null hypothesis the probability of death from all causes is p + k in each group. Under the alternative hypothesis the probability of death from all causes is p + k in the control group and (p + k)-(d-e) in the screened group, where e is the additional probability of non-cancer deaths due to screening. Therefore d-e is the probability of death from all causes in the control group minus the probability of death from all causes in the screened group. Assuming a binomial distribution for the number of deaths from all causes, the sample size (for both groups combined) for an all death endpoint is
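$$N_{\text{all}} = \frac{2\left(1.96\sqrt{2\,v_{\text{all},H_0}} + 0.84\sqrt{v_{\text{all},H_0} + v_{\text{all},H_A}}\right)^2}{(d-e)^2},$$
where $v_{\text{all},H_0} = (p + k)(1 - p - k)$ and $v_{\text{all},H_A} = (p + k - d + e)(1 - p - k + d - e)$ are the variances for one subject under the null and alternative hypotheses, respectively.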
For purposes of illustration, suppose that p = 0.005, k = 0.15 (these values are based roughly on data from a colorectal cancer screening trial [4]), and d = 0.001. To minimize Nall, we set e = 0. With these specifications, a study with a cancer death endpoint would require Ncancer = 150,000 participants while a study with an all death endpoint would require Nall = 4.1 million participants.
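The two sample size formulas can be checked numerically. The following is a minimal Python sketch, not code from the trial literature; the function names are illustrative, and the constants 1.96 and 0.84 correspond to the one-sided type I error of 2.5% and power of 80% stated above.

```python
from math import sqrt

Z_ALPHA = 1.96  # one-sided type I error of 2.5%
Z_BETA = 0.84   # power of 80%


def total_sample_size(v_h0, v_ha, effect):
    """Sample size for both groups combined, given per-subject variances."""
    numerator = (Z_ALPHA * sqrt(2 * v_h0) + Z_BETA * sqrt(v_h0 + v_ha)) ** 2
    return 2 * numerator / effect ** 2


def n_cancer(p, d):
    # Poisson assumption: the per-subject variance equals the death probability.
    return total_sample_size(p, p - d, d)


def n_all(p, k, d, e=0.0):
    # Binomial assumption: the per-subject variance is q(1 - q).
    q0 = p + k
    q1 = p + k - d + e
    return total_sample_size(q0 * (1 - q0), q1 * (1 - q1), d - e)


print(round(n_cancer(0.005, 0.001)))     # cancer death endpoint
print(round(n_all(0.005, 0.15, 0.001)))  # all death endpoint, e = 0
```

With p = 0.005, k = 0.15, d = 0.001, and e = 0, the two results round to the figures of about 150,000 and 4.1 million quoted in the text.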
For practical considerations, we recommend using cancer death as an endpoint with careful review of the death records to minimize sticky-diagnosis and slippery linkage bias. We also recommend that "cancer" deaths include any non-cancer deaths attributable to screening or treatment for the cancer.
We also recommend that all deaths and their causes be reported. If, after adjusting for multiple comparisons, there is a statistically significant difference between groups in the estimated probability of a particular non-cancer cause of death, the investigators should reexamine the death records to check for potential biases. If there are no potential biases, the investigators will need to consider the possibility that screening or treatment was responsible for the difference.
(2) Use a simple "causal" estimate to adjust for nonattendance and contamination occurring immediately after randomization
Two complications in the analysis of many randomized trials for cancer screening are (a) non-attendance, whereby some subjects randomized to a screening invitation do not attend the screening, and (b) contamination, whereby some subjects randomized to no screening invitation receive screening outside the trial. The standard approach for handling these complications is to fold them into the interpretation of an intent-to-treat estimate. Let p0 (p1) denote the cumulative fraction of subjects in the control (intervention) group who died from cancer. The intent-to-treat estimate, dITT= p1-p0, is the estimated effect of randomization to a screening invitation versus no screening invitation. However, in the presence of non-attendance and contamination, the intent-to-treat estimate is a biased estimate of the efficacy of screening, which is the effect of receiving screening.
If some reasonable assumptions hold (to be discussed) there is a simple, but not well-known, method for obtaining unbiased estimates of the effect of receiving screening in the presence of non-attendance and contamination. Let f0 (f1) denote the fraction of subjects in the control (intervention) group who receive screening, where f1 > f0. As discussed below, the "causal" estimate is
$$d_{\text{causal}} = \frac{p_1 - p_0}{f_1 - f_0},$$
which is the estimated effect (change in the probability of cancer death) of receiving screening among subjects who would receive screening if randomized to the intervention group but not if randomized to the control group. This estimate is not unique to screening but applies to any trial in which nonattendance or contamination occurs soon after randomization.
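As a small numerical illustration, the intent-to-treat and "causal" estimates can be computed side by side. This is a sketch only; the input values below are hypothetical and are not taken from any trial.

```python
def itt_estimate(p0, p1):
    """Effect of randomization to a screening invitation (p1 - p0)."""
    return p1 - p0


def causal_estimate(p0, p1, f0, f1):
    """Effect of receiving screening among compliers: (p1 - p0) / (f1 - f0)."""
    if f1 <= f0:
        raise ValueError("the estimate requires f1 > f0")
    return (p1 - p0) / (f1 - f0)


# Hypothetical inputs: cumulative cancer death fractions and the fractions
# receiving screening immediately after randomization in each group.
p0, p1 = 0.0050, 0.0042
f0, f1 = 0.10, 0.70

print(itt_estimate(p0, p1))
print(causal_estimate(p0, p1, f0, f1))
```

Because f1 - f0 is less than one whenever there is non-attendance or contamination, the "causal" estimate is larger in magnitude than the intent-to-treat estimate.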
Glaziou et al. [5] proposed using dcausal to estimate the effect of receiving screening. Baker and Lindeman [6] and Angrist [7] independently proposed a "causal" model for all-or-none compliance in comparative studies that gives rise to this type of estimate and sharpens the interpretation. By "causal" we mean a formulation based on potential outcomes, for example, whether or not a subject would receive screening if randomized to a particular group. See [7] for a more precise definition. For related models applied to cancer screening see also Baker [8], Cuzick [9], and McIntosh [10]. The "causal" model relies on the following two assumptions if estimates are to be unbiased.
Assumption 1. There are three types of subjects: always-takers who would receive screening if randomized to either group, never-takers who would not receive screening if randomized to either group, and compliers who would receive screening if randomized to the intervention group but not the control group. (In other words, no subjects would receive screening if randomized to the control group but not if randomized to the intervention group.)
Assumption 2. For always-takers and never-takers the probability of cancer death is the same for each treatment group. (In other words, when a control subject switches to screening immediately after randomization, the screening regime is identical to that in the intervention group, and when an intervention subject immediately refuses screening, the lack of screening is identical to that in the control group.)
Unfortunately neither of the assumptions is verifiable, but they are reasonable, and therefore have "face" validity. Although the analysis is not by intent-to-treat, it makes use of the randomization to avoid selection bias.
When computing f0 and f1, it is important to count only subjects who switch treatment immediately after randomization, so as not to violate Assumption 2. With this modification dcausal is unbiased even if additional subjects switch treatment later in the study, for example, if some subjects are screened initially but refuse subsequent screenings. The effect of later switching is folded into the interpretation. Thus dcausal is the estimated effect of immediately receiving screening, with the understanding that the effect is likely attenuated by later switching of treatments.
In designing a randomized trial of cancer screening one should adjust the sample size for anticipated non-attendance and contamination. Suppose the anticipated fractions receiving immediate screening are f0 and f1 for the control and intervention groups, respectively. As derived by Zelen [11], the adjusted sample size is the sample size if there were full attendance and no contamination divided by $(f_1 - f_0)^2$.
(3) Use a simple adaptive estimate to adjust for dilution following the last screen
In a typical randomized trial of cancer screening, screening is offered for a limited time and subjects are followed after screening has stopped. This leads to a dilution of treatment effect, as will be explained. Consider a special baseline variable B such that B = 1 if (i) the subject would not be detected with cancer if screened, (ii) the subject would become a cancer case after the time of the last screen, and (iii) the subject would die from the cancer during the follow-up period. Otherwise B = 0. In other words B = 1 indicates a set of cancer deaths that could not have benefited from screening. We can identify subjects with B = 1 in the screened group but not in the control group. Let D denote the number of subjects with B = 1 in the screened group. By virtue of the randomization, there will be approximately D subjects with B = 1 in the control group. As the length of follow-up after the last screening increases, the amount of dilution D increases, which increases the variance of the estimated treatment difference.
In estimating the relative risk of randomization to screening or no screening, the value of D affects the point estimate because D is added to both the numerator and denominator. But when estimating a difference in treatment effect between the groups, the value of D cancels. Nevertheless, the point estimate of a difference in treatment effect will likely change systematically during follow-up. The reason is that as follow-up increases, the point estimate includes longer-term effects of screening on cancer mortality. For example, suppose that screening reduces cancer mortality up to five years after the last screening. If one used the estimated difference in cancer mortality at the end of a 3-year follow-up period, this estimate would likely be biased relative to the true difference at 5 years. Thus, the longer the follow-up period (up to some point), the less chance for bias due to excluding long-term effects of screening. But as mentioned previously, the longer the follow-up period, the greater the dilution. Thus with longer follow-up, there is a variance-bias trade-off for estimating the difference in cancer mortality.
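The way D enters the relative risk but cancels from the difference can be seen in a short numerical sketch. The death counts and group size below are hypothetical, and the standard error uses a Poisson approximation for the death counts.

```python
from math import sqrt


def summarize(deaths_screened, deaths_control, dilution, n):
    """Relative risk, risk difference, and Poisson standard error of the
    difference when `dilution` deaths (subjects with B = 1) are added to
    both groups of size n."""
    s = deaths_screened + dilution
    c = deaths_control + dilution
    relative_risk = s / c
    difference = (s - c) / n
    std_error = sqrt(s + c) / n
    return relative_risk, difference, std_error


# Hypothetical counts: 80 and 100 cancer deaths that screening could have
# affected, in groups of 30,000, with increasing dilution D.
for dilution in (0, 50, 100):
    rr, diff, se = summarize(80, 100, dilution, 30_000)
    print(f"D={dilution:3d}  RR={rr:.3f}  diff={diff:.6f}  SE={se:.6f}")
```

As D grows, the relative risk moves toward one and the standard error of the difference increases, while the point estimate of the difference is unchanged.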
Because of this variance-bias trade-off, the results of a randomized screening trial vary with the length of follow-up after the last screening. For example, consider data from the Health Insurance Plan of Greater New York (HIP) Study [12] in which approximately 62,000 women were randomized to either no screening or an invitation for four annual breast cancer screenings. We estimated the reduction in the probability of cancer death among compliers at years 5, 10, and 15 since randomization, pretending each of these times was fixed in advance of the study (Figure 1). At 5 and 10 years after randomization, the lower bound of the 95% confidence interval was above zero; however, this was not the case for 15 years after randomization. A major problem is how best to analyze these data.
Figure 1. Effect of Follow-up on Estimated Reduction in Breast Cancer Deaths. Data are from the HIP Study of breast cancer screening. The plot shows point estimates and 95% confidence intervals for estimated reduction in breast cancer deaths, per 10,000 compliers.
One approach is a limited mortality analysis [2] that counts cancer deaths over the entire follow-up period but only among participants with cancer up to time tcatch-up after randomization. The time tcatch-up is the time when the number of cases in the control group first equals or surpasses (catches up to) the number of cases in the intervention group. The presumption is that cases surfacing after tcatch-up only dilute the estimated effect. One problem is that tcatch-up does not occur if there is overdiagnosis. A related problem is that tcatch-up might not occur for a very long time, making its calculation impractical. Another problem is that equal numbers of cases in both groups do not guarantee an unbiased test [13].
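To make the definition of tcatch-up concrete, here is a minimal sketch that finds it from yearly counts of new cancer cases. The function name and the counts are hypothetical. The sketch also assumes that "catching up" only counts once the intervention group has been ahead, which is an interpretation of the definition rather than something the text specifies.

```python
from itertools import accumulate


def catch_up_time(times, new_cases_control, new_cases_intervention):
    """First time at which cumulative control cases equal or surpass
    cumulative intervention cases, after the intervention group has been
    ahead. Returns None if that never happens (e.g., with overdiagnosis)."""
    cum_control = accumulate(new_cases_control)
    cum_intervention = accumulate(new_cases_intervention)
    intervention_was_ahead = False
    for t, c, i in zip(times, cum_control, cum_intervention):
        if i > c:
            intervention_was_ahead = True
        elif intervention_was_ahead:
            return t
    return None


years = list(range(1, 9))
control = [30, 35, 40, 45, 50, 55, 60, 60]        # hypothetical new cases
intervention = [50, 45, 42, 40, 40, 45, 50, 55]   # hypothetical new cases
print(catch_up_time(years, control, intervention))
```

A return value of None corresponds to the situation described above in which tcatch-up does not occur within the available follow-up.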
A second approach is to test if screening reduces cancer mortality rates using a special weighted logrank statistic for survival data [14].
A third approach is to select follow-up times based on maximum power given parameter estimates from previous trials and the effect size that one would like to detect [16].
As a fourth approach, we propose a simple adaptive method to compute estimates and confidence intervals for the effect of screening when there is follow-up after the last screen. To the best of our knowledge this method is new to the screening literature. In this analysis, "adaptive" refers to using the data to select the follow-up time, with appropriate adjustment in computing confidence intervals. Let p0(t) and p1(t) denote the cumulative fraction of subjects who die from cancer up to time t in the control and intervention groups, respectively. Letting n denote the number of subjects in each group, we define
$$z(t) = \frac{p_0(t) - p_1(t)}{\sqrt{\left[p_0(t) + p_1(t)\right]/n}},$$
which is the difference between p0(t) and p1(t) divided by its standard error, i.e., the z-value associated with a normally distributed random variable. If screening reduces the probability of cancer death, z(t) will generally increase over the time t that screening is offered and perhaps a little longer. However, at some point after screening has stopped, z(t) will generally decrease over time because p0(t) and p1(t) will each increase by roughly the same amount from cases that arose after screening had stopped (i.e., the effect of dilution). See also [16] for a justification of this behavior of z(t) based on modeling natural history in breast cancer screening. We assume that screening does not cause cancer deaths; otherwise it would be possible for z(t) to decrease for reasons other than dilution. This motivates selecting as the follow-up time the time t* that maximizes z(t) with an estimated effect of
$$d_{\text{causal}}(t^*) = \frac{p_1(t^*) - p_0(t^*)}{f_1 - f_0}.$$
We interpret dcausal(t*) as the effect of receiving screening in compliers before dilution attenuates any effects. For dcausal(t*) to be correctly interpretable as an effect of receiving screening, we assume that, after perhaps some initial fluctuations, p1(t) - p0(t) is generally increasing or constant over time until dilution reduces z(t). In other words, although there may be a brief increase in cancer deaths due to screening soon after the start of the trial, we assume that after screening stops, screening does not start causing more cancer deaths than in the control group. Otherwise we might incorrectly attribute a small difference between p1(t) and p0(t) to the effect of dilution when it is due to delayed harms of screening and early treatment.
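The selection of t* can be sketched in a few lines of Python. The yearly death counts, group size, and screening fractions below are hypothetical. The standard error is taken as the square root of (p0(t) + p1(t))/n, the Poisson form for a difference of two fractions; this choice is an assumption of the sketch, and any factor that depends only on n does not change which t maximizes z(t).

```python
from itertools import accumulate
from math import sqrt


def adaptive_follow_up(deaths_control, deaths_intervention, n, f0, f1):
    """Return (t*, z(t*), dcausal(t*)) from yearly cancer death counts,
    with n subjects per group and year 1 as the first year of follow-up."""
    cum0 = accumulate(deaths_control)
    cum1 = accumulate(deaths_intervention)
    best = None
    for year, (c0, c1) in enumerate(zip(cum0, cum1), start=1):
        if c0 + c1 == 0:
            continue  # z(t) is undefined before any deaths occur
        p0, p1 = c0 / n, c1 / n
        z = (p0 - p1) / sqrt((p0 + p1) / n)
        if best is None or z > best[1]:
            best = (year, z, (p1 - p0) / (f1 - f0))
    return best


# Hypothetical yearly cancer deaths in groups of 30,000 subjects.
control = [2, 4, 6, 8, 8, 8, 8, 8]
intervention = [2, 3, 4, 5, 6, 8, 8, 8]
t_star, z_star, d_star = adaptive_follow_up(control, intervention, 30_000, 0.0, 0.65)
print(t_star, z_star, d_star)
```

In this made-up example the yearly counts become equal after year 5, so later years add the same number of deaths to both groups and z(t) declines, which is the dilution pattern described above.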
Computing confidence intervals by ignoring the fact that t* was based on the data represents "cutpoint optimization" [17] and is thus inappropriate. To compute a confidence interval for dcausal(t*) that accounts for the adaptive choice of t*, we use the following bootstrap [18].
For purposes of illustration we applied this method to data in [2] on breast cancer screening from the Health Insurance Plan of Greater New York (HIP) Study. For each year after randomization we randomly generated a number of cancer deaths in each group based on a Poisson distribution with mean value equal to the observed number of deaths in that year and group. From these randomly generated data we computed t* and dcausal(t*). We repeated this calculation 10,000 times to obtain distributions for t* and dcausal(t*). The mean value of each distribution is the estimate, and the 2.5% and 97.5% quantiles give the 95% confidence interval. For t* we obtained an estimate of 7.3 years with a 95% confidence interval of 4 to 13 years. For dcausal(t*), the estimate and 95% confidence interval are shown in Figure 1.
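The bootstrap just described can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the HIP analysis: the yearly death counts are hypothetical, the Poisson sampler is Knuth's multiplication method (adequate only for modest means), and z(t) is computed with the Poisson standard error, under which it reduces to a function of the cumulative counts.

```python
import random
from itertools import accumulate
from math import exp, sqrt


def poisson_draw(mean, rng):
    """Poisson variate via Knuth's multiplication method."""
    if mean <= 0:
        return 0
    limit, k, prod = exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def t_star_and_effect(deaths0, deaths1, n, f0, f1):
    """Year maximizing z(t) and the corresponding dcausal(t*)."""
    best_year, best_z, best_effect = None, float("-inf"), None
    cumulative = zip(accumulate(deaths0), accumulate(deaths1))
    for year, (c0, c1) in enumerate(cumulative, start=1):
        if c0 + c1 == 0:
            continue
        z = (c0 - c1) / sqrt(c0 + c1)  # equals (p0 - p1) / sqrt((p0 + p1) / n)
        if z > best_z:
            best_year, best_z = year, z
            best_effect = (c1 - c0) / n / (f1 - f0)
    return best_year, best_effect


def bootstrap(deaths0, deaths1, n, f0, f1, reps=10_000, seed=0):
    """Mean t*, mean dcausal(t*), and 2.5%/97.5% quantiles of dcausal(t*)."""
    rng = random.Random(seed)
    years, effects = [], []
    for _ in range(reps):
        sim0 = [poisson_draw(m, rng) for m in deaths0]
        sim1 = [poisson_draw(m, rng) for m in deaths1]
        year, effect = t_star_and_effect(sim0, sim1, n, f0, f1)
        if year is not None:
            years.append(year)
            effects.append(effect)
    effects.sort()
    lower = effects[int(0.025 * len(effects))]
    upper = effects[int(0.975 * len(effects)) - 1]
    return sum(years) / len(years), sum(effects) / len(effects), (lower, upper)


# Hypothetical observed yearly cancer deaths, 30,000 subjects per group.
control = [10, 20, 30, 40, 40, 40, 40, 40]
intervention = [10, 15, 20, 25, 30, 40, 40, 40]
print(bootstrap(control, intervention, 30_000, 0.0, 0.65, reps=2_000))
```

Because t* is recomputed within every replicate, the resulting interval for dcausal(t*) reflects the data-driven choice of follow-up time rather than treating t* as fixed.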
To compute sample size for a randomized trial with follow-up after the last screening, we propose the following approach to account for the adaptive nature of the test statistic. The first step is to create anticipated data with m subjects per group under the null and alternative hypotheses. The second step is to treat the anticipated data as observed data and compute bootstrap estimates of the variance. Let $v_{\text{adaptive},H_0}$ and $v_{\text{adaptive},H_A}$ denote the bootstrap estimate of the variance divided by m under the null and alternative hypotheses, respectively. In other words, $v_{\text{adaptive},H_0}$ and $v_{\text{adaptive},H_A}$ are the bootstrap estimates of variance for one subject. The sample size with a cancer death endpoint and adjustment for non-attendance and contamination is
$$N_{\text{adaptive}} = \frac{2\left(1.96\sqrt{2\,v_{\text{adaptive},H_0}} + 0.84\sqrt{v_{\text{adaptive},H_0} + v_{\text{adaptive},H_A}}\right)^2}{d^2\,(f_1 - f_0)^2}.$$
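Once the two per-subject variances are available, the final formula is a one-line calculation. The sketch below takes them as inputs; the numerical values are placeholders chosen for illustration, not bootstrap results.

```python
from math import sqrt


def n_adaptive(v_h0, v_ha, d, f0, f1, z_alpha=1.96, z_beta=0.84):
    """Total sample size from per-subject variances, inflated for
    non-attendance and contamination by dividing by (f1 - f0)^2."""
    base = 2 * (z_alpha * sqrt(2 * v_h0) + z_beta * sqrt(v_h0 + v_ha)) ** 2 / d ** 2
    return base / (f1 - f0) ** 2


# Placeholder variances; in practice these would come from the bootstrap
# applied to anticipated data under each hypothesis.
print(round(n_adaptive(0.005, 0.004, 0.001, f0=0.10, f1=0.70)))
```

With f0 = 0.10 and f1 = 0.70, the divisor is 0.36, so the required sample size is nearly three times what full attendance and no contamination would need.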
One other issue in design is the duration of screening. It should be sufficiently long so that any reduction in cancer mortality would be apparent before dilution has an effect.