We conducted several simulation studies to evaluate the proposed auxiliary variable estimator and the two-stage auditing procedure. Although PFS is a combination of progression and death, we focus on the progression time, which was taken to have an underlying exponential distribution. Because progression is measured at intervals, we discretize progression times according to the following process. We assume that for the first year, patients undergo evaluations every six weeks, followed by visits every three months. We further assume that patients are not evaluated exactly at their scheduled times, but arrive within +/− 2 weeks of the scheduled visit, with the arrival time drawn from a uniform distribution. The true event is not observed until the next evaluation time. The time of this “true” discretized event was then subjected to the errors associated with radiologists’ evaluations. We assume that the time at which the LE or BICR calls progression follows a multinomial distribution around the true discretized time. The multinomial probabilities, multi(p−3, p−2, p−1, p0, p1, p2), describe the likelihood of calling a progression around the true event time as follows: p0 is the probability of calling progression at the true time; p−3, p−2, and p−1 are the probabilities of calling a progression at each of the three evaluation times prior to the event; and p1 and p2 are the probabilities of calling progression at each of the two evaluation times after the true event.
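The discretization and reader-error steps described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code. Several details are assumptions made only for the example: "three months" is read as 12 weeks, the six-week schedule is taken to run through week 48, shifted calls are clamped to the available visits, and all entries of `NO_BIAS_PROBS` other than p−3 = 0.025 are hypothetical placeholders rather than the values in the paper's table.

```python
import bisect
import math
import random

OFFSETS = (-3, -2, -1, 0, 1, 2)  # visit-index shifts for p-3, ..., p2
# Hypothetical probabilities; only p-3 = 0.025 is taken from the text.
NO_BIAS_PROBS = (0.025, 0.05, 0.10, 0.70, 0.10, 0.025)

def visit_times(horizon_weeks, rng, jitter=2.0):
    """Actual visit times: scheduled every 6 weeks through week 48, then
    every 12 weeks (assumed), each shifted by a Uniform(-2, 2) draw."""
    scheduled = []
    t = 6.0
    while t <= horizon_weeks:
        scheduled.append(t)
        t += 6.0 if t < 48.0 else 12.0
    return [s + rng.uniform(-jitter, jitter) for s in scheduled]

def discretize(true_time, visits):
    """Index of the first visit at or after the true progression time,
    or None if progression occurs after the last visit."""
    i = bisect.bisect_left(visits, true_time)
    return i if i < len(visits) else None

def reader_call(true_index, visits, probs, rng):
    """Shift the true visit index by a multinomial offset and return the
    called time; out-of-range shifts are clamped (an assumption)."""
    shift = rng.choices(OFFSETS, weights=probs, k=1)[0]
    j = min(max(true_index + shift, 0), len(visits) - 1)
    return visits[j]

rng = random.Random(2024)
visits = visit_times(130.0, rng)
true_time = rng.expovariate(math.log(2) / 24.0)  # median of 24 weeks
idx = discretize(true_time, visits)
if idx is not None:
    le_time = reader_call(idx, visits, NO_BIAS_PROBS, rng)
    print(f"true={true_time:.1f}, discretized={visits[idx]:.1f}, LE call={le_time:.1f}")
```

Reader-evaluation bias in one arm would correspond to passing a different probability vector to `reader_call` for that arm's local evaluations.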
We generate data for 360 patients in each treatment arm, assuming patients are enrolled uniformly over 2 years, with an additional follow-up of 26 weeks after enrollment. The underlying median PFS in the control arm was set to 24 weeks. We consider four different treatment-effect sizes: none, small, moderate and large, corresponding to log-hazard ratios of 0, −0.288, −0.511, and −0.773 (hazard ratios: 1, 0.75, 0.6, 0.46). For each of these, we consider settings with no, small and large reader-evaluation bias. Under the model with no reader-evaluation bias, the multinomial probabilities are the same in both treatment arms, for both the local evaluation and the BICR. Under scenarios with small and large reader-evaluation bias, the probabilities in the control arm change (for the local evaluations only) so that progression is more likely to be called earlier. Table 1 describes the assigned probabilities for each of the simulations presented here. For example, under the scenario with no bias, the probability of calling an event three evaluation times before the true time, p−3, is 2.5% in both treatment arms. Under “large” bias, this probability increases to 10% in the control arm, but remains 2.5% in the experimental arm. In all simulations, the probabilities given to the BICR are those specified in the “None” row of Table 1. For each scenario, we generated 10,000 data sets. The initial BICR sample used to estimate ρ was chosen to ensure about 100 events in total. The corresponding initial sample was 19%, 20%, 21% and 22% for the null, small, moderate and large effect sizes, respectively.
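As a rough sketch of the design parameters above, the following converts the control-arm median into an exponential rate, recovers the hazard ratios from the stated log-hazard ratios, and generates continuous event times with administrative censoring. The function name `simulate_arm`, the reading of 2 years as 104 weeks, and the assumption that follow-up ends 26 weeks after accrual closes are illustrative choices, not details confirmed by the text.

```python
import math
import random

MEDIAN_CONTROL = 24.0  # weeks
LOG_HRS = {"none": 0.0, "small": -0.288, "moderate": -0.511, "large": -0.773}

# Exponential hazard rate implied by the median: rate = log(2) / median.
control_rate = math.log(2) / MEDIAN_CONTROL

def simulate_arm(n, rate, rng, accrual=104.0, extra=26.0):
    """Return (observed_time, is_event) pairs for one arm.
    Entry times are uniform over the accrual period (assumed 104 weeks);
    the study is assumed to end `extra` weeks after accrual closes."""
    out = []
    for _ in range(n):
        entry = rng.uniform(0.0, accrual)
        admin_censor = accrual + extra - entry  # maximum follow-up for this patient
        t = rng.expovariate(rate)               # continuous progression time
        out.append((min(t, admin_censor), t <= admin_censor))
    return out

rng = random.Random(1)
for label, log_hr in LOG_HRS.items():
    hr = math.exp(log_hr)
    arm = simulate_arm(360, control_rate * hr, rng)
    events = sum(is_event for _, is_event in arm)
    print(f"{label}: HR={hr:.2f}, events={events}/360")
```

Smaller hazard ratios lengthen progression times in the experimental arm, so more patients in that arm reach the administrative censoring time before progressing.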
Table 1. Multinomial probabilities describing reader error associated with determining time of progression. Observed event times are shifted according to these probabilities. Local evaluations follow the probabilities described here, while the BICR follow the ...
The accompanying table reports the disagreement rates in the time of progression or censoring between the LE and the BICR. If the times agreed within a +/− 6 week interval, this was considered an agreement. Without reader bias, the disagreement rate was 32% in the control arm, but ranged from 32% down to 23% in the experimental arm, depending on the effect size. Observe that the disagreement rates differ by treatment arm even without reader-evaluation bias. The disagreement rate decreases in the experimental arm with increasing effect size because more subjects undergo administrative censoring. For the control arm, the disagreement rate was 46% with small reader bias and 56% with large reader bias. The correlation between the LE and BICR log-hazard ratios is given in the last column of the same table, ranging from 0.70 to 0.84.
Summary of average disagreement rates and correlations for simulation scenarios. Agreement is defined as the same date of progression or censoring within a +/− 6 week interval.
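The agreement rule above can be illustrated with a short sketch. It assumes that only the recorded times (of progression or censoring) are compared, that a difference of exactly 6 weeks counts as agreement, and that the inputs are paired per patient; the function name and the example values are hypothetical.

```python
def disagreement_rate(le_times, bicr_times, window=6.0):
    """Fraction of patients whose LE and BICR times differ by more than
    `window` weeks. Times are progression or censoring times, paired by patient."""
    if len(le_times) != len(bicr_times):
        raise ValueError("LE and BICR must have one time per patient")
    disagreements = sum(abs(a - b) > window for a, b in zip(le_times, bicr_times))
    return disagreements / len(le_times)

# Four hypothetical patients: differences of 0, 12, 4 and 0 weeks.
le = [12.0, 24.0, 36.0, 48.0]
bicr = [12.0, 36.0, 40.0, 48.0]
print(disagreement_rate(le, bicr))  # one of four pairs differs by more than 6 weeks
```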
To evaluate the properties of the proposed estimator in (3), the accompanying table summarizes simulations irrespective of whether the LE results were significant. Results are summarized for the estimators of the log-hazard ratio based on the LE, the full BICR, the sample-audited BICR data (without the auxiliary variable incorporated) and the proposed auxiliary variable-based estimator. The true log-hazard ratio that generated the continuous event times is presented, along with the approximate asymptotic BICR log-hazard ratio. The true BICR log-hazard ratio was computed empirically by generating 1 million observations per treatment group. Observe that with no reader bias, the estimators of the effect size for all methods are approximately unbiased. Under the audit, however, the auxiliary-variable estimator is approximately twice as efficient as the simple audit estimator (last column). With reader-evaluation bias, the LE results are biased in favor of the experimental therapy. However, because of the impact of informative censoring, the BICR-based log-hazard ratios are biased towards superiority of the control treatment as compared to the true effect size. The proposed variance estimator for the auxiliary-variable estimator performs well, as is seen by the agreement between the simulated mean of its estimated standard error and the empirical SD of the estimator. Finally, note that the standard deviations of the BICR and LE estimators differ slightly, even without reader-evaluation bias. This is because, under the BICR, events are censored when an LE event time is called before the BICR event time. As can be seen in the simulations, this does not cause bias in the estimates of the log-hazard ratios (when there is no reader bias).
The accompanying table summarizes results from the two-stage audit approach, comparing it to a full BICR. We set δ1=0.7. The proportion of LE results that reject the null hypothesis H0: θ ≥ 0 is based on a one-sided 0.025-level test. Because the BICR is not prompted unless the LE result is significant, for the full BICR and the proposed two-stage BICR strategy, we report the proportion of times both the LE and the BICR results reject H0: θ ≥ 0. For a full BICR, α = 0.05, one-sided. We applied the Hochberg procedure as described in section 2.3, with α = 0.05. As expected, under the null hypothesis (θ = 0), with “small” and “large” amounts of reader bias, the type I error for the LE is inflated considerably; rejection rates are 31% and 65% for these two scenarios, respectively. However, under a full BICR and the two-stage audit strategy, the overall rejection rates are conservative. Indeed, the full BICR and the BICR audit never reject the null hypothesis in these cases. This is because informative censoring, which occurs when an early progression call by the LE cannot be confirmed by the BICR, biases BICR estimators towards superiority of the control treatment. The resulting reduction in power from this bias towards the null hypothesis (under BICR) is also observed. Finally, note the slight loss in power of the BICR audit relative to the full BICR in two scenarios (the small bias and moderate effect size setting, and the large bias and large effect size setting). Under the large reader bias and large effect size setting, the full BICR rejected 86% of the time, while the audit procedure rejected 81% of the time. This is the cost of the two-stage procedure selected for these simulations. A more stringent α-level may be preferred for the initial audit, which would reduce power loss, but increase the audit size.
Summary of simulation studies of two-stage auditing strategy. LE is local evaluation. Full BICR is a complete-case blinded independent central review. BICR two-stage audit is the proposed audit strategy.
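To make the reported rejection proportions concrete, the sketch below counts how often both the LE and the full BICR reject H0: θ ≥ 0 across simulated data sets. It assumes a Wald-type z-statistic (estimate divided by its standard error), which the text does not specify, and it does not implement the Hochberg-based two-stage audit of section 2.3. The function names and the example estimates are hypothetical.

```python
from statistics import NormalDist

def rejects_one_sided(theta_hat, se, alpha):
    """Reject H0: theta >= 0 when the z-statistic falls below the lower
    alpha quantile of the standard normal (assumed Wald-type test)."""
    return theta_hat / se < NormalDist().inv_cdf(alpha)

def joint_rejection_rate(le_results, bicr_results, alpha_le=0.025, alpha_bicr=0.05):
    """Proportion of data sets in which both the LE and BICR tests reject.
    Each result is an (estimate, standard_error) pair for one data set."""
    both = sum(
        rejects_one_sided(*le, alpha_le) and rejects_one_sided(*bicr, alpha_bicr)
        for le, bicr in zip(le_results, bicr_results)
    )
    return both / len(le_results)

# Three hypothetical data sets: (log-hazard ratio estimate, standard error).
le_results = [(-0.5, 0.1), (-0.3, 0.1), (-0.1, 0.1)]
bicr_results = [(-0.4, 0.1), (-0.1, 0.1), (-0.4, 0.1)]
print(joint_rejection_rate(le_results, bicr_results))  # only the first data set rejects on both
```

Requiring both tests to reject mirrors the reporting convention in the text, where the BICR is only triggered after a significant LE result.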