


Biometrics. Author manuscript; available in PMC 2013 March 1.


Published online 2011 August 13. doi: 10.1111/j.1541-0420.2011.01646.x

PMCID: PMC3218246

NIHMSID: NIHMS306269

Stuart G. Baker, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA.




Using multiple historical trials with surrogate and true endpoints, we consider various models to predict the effect of treatment on a true endpoint in a target trial in which only a surrogate endpoint is observed. This predicted result is computed using (1) a prediction model (mixture, linear, or principal stratification) estimated from historical trials and the surrogate endpoint of the target trial and (2) a random extrapolation error estimated from successively leaving out each trial among the historical trials. The method applies to either binary outcomes or survival to a particular time that is computed from censored survival data. We compute a 95% confidence interval for the predicted result and validate its coverage using simulation. To summarize the additional uncertainty from using a predicted instead of true result for the estimated treatment effect, we compute its multiplier of standard error. Software is available for download.

In recent years, the medical and biostatistical literature has devoted considerable attention to surrogate versus true endpoints, where a true endpoint is the health outcome of interest and a surrogate endpoint is an outcome observed before the true endpoint that is used to draw conclusions about the effect of intervention on the true endpoint (Weir and Walley, 2006; Lassere, 2008). By using surrogate instead of true endpoints, clinicians can draw conclusions sooner, thereby potentially helping more future patients. However, the use of surrogate endpoints to draw these conclusions involves the additional uncertainty associated with extrapolating estimates to an unobserved true endpoint.

To clearly and concisely discuss this extrapolation, we introduce the following terminology. A *target trial* is a randomized trial of the intervention of interest in which data are available on surrogate but not true endpoints. An *historical* trial is a previously conducted randomized trial with the same surrogate endpoint as in the target trial and the same true endpoint that would be observed with sufficiently long follow-up in the target trial, but typically comparing different interventions than in the target trial. A *true result* is the estimated effect of an intervention on the true endpoint in the target trial based on a comparison of the true endpoints from each arm of the target trial that would be observed with sufficiently long follow-up. A *surrogate result* is the estimated effect of an intervention on the surrogate endpoint in the target trial based on a comparison of surrogate endpoints from each arm of the target trial. A *predicted result* is the estimated effect of an intervention on the true endpoint in the target trial based on a comparison of predicted true endpoints from each arm of the target trial that are derived from the surrogate endpoints in the target trial and a model relating surrogate and true endpoints that is fit to data from the historical trials.

Analyzing trial data with surrogate endpoints requires an *extrapolation procedure,* which is a method to draw conclusions about the true result in the target trial based on either the surrogate or predicted result. When substituting a surrogate result for a true result in the target trial, the extrapolation procedure is testing a null hypothesis of no effect of treatment on the surrogate endpoint and drawing conclusions about a null hypothesis of no effect of treatment on true endpoint. When the surrogate endpoint is binary, this extrapolation procedure yields correct conclusions if the following criteria hold in the target trial: (*i*) the Prentice Criterion, namely that the true endpoint is conditionally independent of the randomization group given the surrogate endpoint, (*ii*) treatment is prognostic for the surrogate and true endpoints, and (*iii*) the surrogate endpoint is prognostic for the true endpoint (Prentice, 1989; Buyse and Molenberghs, 1998). When substituting a predicted result for the true result, an extrapolation procedure consists of computing a 95% confidence interval for the predicted result (Daniels and Hughes, 1997; Gail et al., 2002; Korn, Albert, and McShane, 2005).

Before using an extrapolation procedure to draw conclusions from a surrogate endpoint, it is important to compute a *validation measure,* which is a statistic based on historical trials to determine if the extrapolation procedure will likely yield correct conclusions. For a surrogate result, one validation measure is a test of the Prentice Criterion in historical trials. However, failure to reject the Prentice Criterion does not ensure it holds. Another validation measure for a surrogate result is the proportion of treatment effect explained in a historical trial (Freedman, Graubard, and Schatzkin, 1992). However, it is difficult to select an acceptable value of this measure. For a predicted result, a commonly used validation measure is ${\text{R}}_{\text{trial}}^{2}$, which equals one minus the ratio of the estimated variances of the true result adjusting, versus not adjusting, for the surrogate result in the historical trials (Buyse et al., 2000). However, selecting a threshold for ${\text{R}}_{\text{trial}}^{2}$ that suffices for validation is difficult (Burzykowski and Buyse, 2006).

We consider surrogate and true endpoints involving probabilities of binary outcomes or probabilities of survival to a particular time computed from censored survival data. In this context, we make the following three contributions to the evaluation of an intervention in the target trial using the predicted result. First, we propose, as an extrapolation procedure, a 95% confidence interval for the predicted result which is derived from a prediction model and an estimated random extrapolation error. Second, we propose, as a validation measure, the coverage of the 95% confidence interval for the predicted result using a simulation based on the historical trials. Third, we propose, as a summary of the additional uncertainty when using predicted instead of true results, a standard error multiplier based on the estimated variances of predicted and true results.

In this section we discuss the basic formulation, a key assumption, the extrapolation procedure, the validation measure, and the standard error multiplier.

The basic formulation involves probabilities of binary surrogate and true outcomes that are later extended to survival data. Let subscript “*M*” denote the prediction model, which is discussed in detail in Section 3. The key idea is to write the predicted result for target trial *i*, denoted (predicted result)*_{Mi}*, as the sum of a model-based predicted result, denoted (model result)*_{Mi}*, and an estimated random extrapolation error, denoted ${\widehat{\epsilon}}_{Mi}$,

$${(\text{predicted}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}={(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}+{\widehat{\epsilon}}_{Mi},$$

(1)

where ${\widehat{\epsilon}}_{Mi}$ has mean ${\widehat{\mu}}_{M}$ and estimated variance ${\widehat{\sigma}}_{M}^{2}$ given in Equation (6). The model-based predicted result is

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}={\widehat{\pi}}_{Mi1}-{\widehat{\pi}}_{Mi0},$$

(2)

where ${\widehat{\pi}}_{Miz}$ is the estimated model-based probability of true endpoint in arm *z* = 0, 1 of target trial *i*. To estimate the random extrapolation error, we successively leave out each historical trial *j* and treat it as a target trial, writing

$${(\text{true}\phantom{\rule{0.16667em}{0ex}}\text{result}\phantom{\rule{0.16667em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\text{left}-\text{out}\phantom{\rule{0.16667em}{0ex}}\text{trial})}_{j}={(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result}\phantom{\rule{0.16667em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\text{left}-\text{out}\phantom{\rule{0.16667em}{0ex}}\text{trial})}_{Mj}+{\epsilon}_{Mj},$$

(3)

where ${\epsilon}_{Mj}$ is the extrapolation error for left-out trial *j*. The true result for left-out trial *j* is

$${(\text{true}\phantom{\rule{0.16667em}{0ex}}\text{result}\phantom{\rule{0.16667em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\text{left}-\text{out}\phantom{\rule{0.16667em}{0ex}}\text{trial})}_{j}={\widehat{\pi}}_{Tj1}-{\widehat{\pi}}_{Tj0},$$

(4)

where ${\widehat{\pi}}_{Tjz}$ is the estimated probability of true endpoint in arm *z* of left-out historical trial *j*. The estimated extrapolation error for left-out trial *j* is

$${\widehat{\epsilon}}_{Mj}={(\text{true}\phantom{\rule{0.16667em}{0ex}}\text{result}\phantom{\rule{0.16667em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\text{left}-\text{out}\phantom{\rule{0.16667em}{0ex}}\text{trial})}_{j}-{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result}\phantom{\rule{0.16667em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\text{left}-\text{out}\phantom{\rule{0.16667em}{0ex}}\text{trial})}_{Mj},$$

(5)

which has mean and estimated variance

$${\widehat{\mu}}_{M}=\sum _{j=1}^{k}\frac{{\widehat{\epsilon}}_{Mj}}{k}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{\widehat{\sigma}}_{M}^{2}=\sum _{j=1}^{k}\frac{{({\widehat{\epsilon}}_{Mj}-{\widehat{\mu}}_{M})}^{2}}{k-1},\text{respectively}.$$

(6)

The extrapolation procedure consists of computing a 95% confidence interval (*CI*) for (predicted result)*_{Mi}* using a binomial estimate of the variance of (model result)*_{Mi}*,

$${CI}_{Mi}={(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}+{\widehat{\mu}}_{M}\pm 1.96\sqrt{\widehat{\mathit{var}}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}+{\widehat{\sigma}}_{M}^{2}},$$

(7)

where
$\widehat{\mathit{var}}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mi}={\sum}_{z=0}^{1}{\widehat{\pi}}_{\mathit{Miz}}(1-{\widehat{\pi}}_{\mathit{Miz}})/{n}_{iz}$ and *n _{iz}* is the number of persons in arm *z* = 0, 1 of target trial *i*.

To draw conclusions about the effect of treatment on the true endpoint in target trial *i*, we assume that (predicted result)*_{Mi}* in Equation (1) is an unbiased estimate of the effect of treatment on the unobserved true endpoint in target trial *i*.
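As a concrete illustration, the leave-one-out error statistics in Equations (5)–(6) and the confidence interval in Equation (7) can be sketched in a few lines of Python. The numeric inputs below are hypothetical, not taken from the data sets analyzed here:

```python
import numpy as np

def extrapolation_error_stats(true_results, model_results):
    """Leave-one-out extrapolation errors (Eq. 5) and their mean and
    estimated variance (Eq. 6) over the k historical trials."""
    eps = np.asarray(true_results) - np.asarray(model_results)
    return eps.mean(), eps.var(ddof=1)   # ddof=1 gives the k - 1 divisor

def predicted_result_ci(pi_m1, pi_m0, n1, n0, mu, sigma2):
    """95% CI for the predicted result (Eq. 7), with a binomial estimate
    of the variance of the model result."""
    model_result = pi_m1 - pi_m0
    var_model = pi_m1 * (1 - pi_m1) / n1 + pi_m0 * (1 - pi_m0) / n0
    center = model_result + mu
    half = 1.96 * np.sqrt(var_model + sigma2)
    return center - half, center + half

# Hypothetical leave-one-out results from k = 4 historical trials
mu, sigma2 = extrapolation_error_stats(
    true_results=[0.10, 0.04, 0.07, 0.12],
    model_results=[0.08, 0.05, 0.09, 0.10])
lo, hi = predicted_result_ci(pi_m1=0.75, pi_m0=0.70,
                             n1=300, n0=300, mu=mu, sigma2=sigma2)
```

Note that the interval is centered at the model result plus the mean extrapolation error, and its width reflects both binomial sampling variability and the extrapolation variance.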

The validation measure is the coverage of the 95% confidence interval for the predicted result as computed in a simulation of *k* historical trials and 1 target trial based on the data from the *k* historical trials. Ideally this coverage should be close to 95%. Each iteration of the simulation consists of the following steps.

- Step 1. Compute the mean and variance of the parameter estimates under the prediction model for each of *k* historical trials.
- Step 2. Randomly generate “population” parameters for *k* simulated historical trials and 1 simulated target trial based on the mean and variance in Step 1.
- Step 3. Compute “population” counts from the “population” parameters generated in Step 2.
- Step 4. Randomly generate “sample” counts from the “population” counts in Step 3 using a multinomial model for the data from each trial arm.
- Step 5. From the “sample” counts in Step 4, use Equation (7) to compute the 95% confidence interval for the predicted result in the simulated target trial.
- Step 6. From the “population” counts in Step 3, compute the “population” true result for the simulated target trial.
- Step 7. Set an indicator to 1 if the 95% confidence interval for the predicted result in Step 5 covers the “population” true result for the simulated target trial in Step 6, and to 0 otherwise.

These steps are repeated for 1000 iterations. In Step 3, we use the sample sizes of the historical trials as sample sizes for the *k* simulated historical trials. For the simulated target trial, which corresponds to a yet unknown target trial, we use an average of the sample sizes over the *k* historical trials. If the target trial were known, its sample size would be used for the simulated target trial. In Step 4, to avoid numerical problems we added 0.1 to zero counts; adding 0.2 had little impact (not shown). The simulation coverage of each confidence interval of the predicted result is the fraction of iterations with indicator equal to 1 in Step 7. Details are given in Web Appendix A.
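A stripped-down sketch of Steps 2–7 follows; it uses a normal approximation in place of the multinomial count generation of Steps 3–4, so it illustrates the structure of the coverage check rather than reproducing the implementation in Web Appendix A:

```python
import numpy as np

def toy_coverage(effect_mean, effect_se, var_model, sigma2, mu=0.0,
                 iters=2000, seed=0):
    """Toy analogue of Steps 2-7: draw a 'population' treatment effect
    (Step 2), add sampling-plus-extrapolation noise to obtain a predicted
    result (Steps 3-5), form the Eq. (7)-style interval, and record
    whether it covers the population effect (Steps 6-7)."""
    rng = np.random.default_rng(seed)
    half = 1.96 * np.sqrt(var_model + sigma2)   # Eq. (7) half-width
    covered = 0
    for _ in range(iters):
        pop_effect = rng.normal(effect_mean, effect_se)            # Step 2
        predicted = pop_effect + rng.normal(0.0, np.sqrt(var_model + sigma2))
        covered += int(abs(predicted + mu - pop_effect) <= half)   # Steps 6-7
    return covered / iters

# With mu = 0, the interval should cover close to 95% of the time
print(toy_coverage(effect_mean=0.05, effect_se=0.02,
                   var_model=0.001, sigma2=0.0005))
```

In this idealized setting the coverage is close to 95% by construction; in the actual simulation, departures from 95% reflect how well the prediction model and the normal approximation fit the historical trials.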

Reasonably good coverage in the simulation is a necessary condition for the surrogate endpoint to be useful in computing predicted results. However, it is not sufficient. There is a “price” for good coverage, namely wider confidence intervals that account for the variability of extrapolation. We quantify the increased width of the confidence intervals using the standard error multiplier, which is the average, over the *k* historical trials, of the ratio of the standard error of the predicted result for each left-out trial to the standard error of the true result for each left-out trial,

$${(\text{standard}\phantom{\rule{0.16667em}{0ex}}\text{error}\phantom{\rule{0.16667em}{0ex}}\text{multiplier})}_{M}=\frac{1}{k}\sum _{j=1}^{k}\frac{\sqrt{\widehat{\mathit{var}}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{Mj}+{\widehat{\sigma}}_{Mj}^{2}}}{\sqrt{\widehat{\mathit{var}}{(\text{true}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{j}}},$$

(8)

where
$\widehat{\mathit{var}}{(\text{true}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{j}={\sum}_{z=0}^{1}{\widehat{\pi}}_{\mathit{Tjz}}(1-{\widehat{\pi}}_{\mathit{Tjz}})/{n}_{jz}$ and ${\widehat{\sigma}}_{Mj}^{2}$ is computed using only the left-in historical trials. By modifying Equation (7), we also computed the 95% confidence interval for the predicted result of each left-out target trial *j*.
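The average in Equation (8) is a one-line computation once the variance components are in hand; in this sketch the inputs are hypothetical arrays indexed by left-out trial *j*:

```python
import numpy as np

def se_multiplier(var_model, sigma2_loo, var_true):
    """Eq. (8): average, over left-out trials j, of the ratio of the
    standard error of the predicted result (model variance plus
    leave-one-out extrapolation variance) to that of the true result."""
    num = np.sqrt(np.asarray(var_model) + np.asarray(sigma2_loo))
    den = np.sqrt(np.asarray(var_true))
    return float(np.mean(num / den))

# Hypothetical variance components for k = 3 left-out trials
m = se_multiplier(var_model=[0.001, 0.002, 0.001],
                  sigma2_loo=[0.003, 0.002, 0.003],
                  var_true=[0.001, 0.001, 0.001])
```

A multiplier of, say, 2 means the predicted result carries roughly the same information as a true result from a trial with one quarter of the effective sample size.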

We discuss the computation of (model result)* _{Mi}* for three types of models: mixture, linear, and principal stratification. Before discussing these models, we introduce basic parameters, an extension to survival data, weighted averages, and a method for labeling trial arms.

Let *S* = 0, 1 denote the binary surrogate endpoint, and *T* = 0, 1 denote the binary true endpoint, where *S* = 1 and *T* = 1 correspond to favorable outcomes. Let *j* index historical trials in a set *H* of historical trials, and let *i* index a target trial. The basic parameters are

$$\begin{array}{l}{\pi}_{\mathit{Sjz}}=pr(S=1\mid \text{historical}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}j,\text{arm}\phantom{\rule{0.16667em}{0ex}}z),\phantom{\rule{0.38889em}{0ex}}{\pi}_{\mathit{Tjz}}=pr(T=1\mid \text{historical}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}j,\text{arm}\phantom{\rule{0.16667em}{0ex}}z),\\ {\theta}_{T\mathit{jzs}}=pr(T=1\mid \text{historical}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}j,\text{arm}\phantom{\rule{0.16667em}{0ex}}z,S=s),\phantom{\rule{0.38889em}{0ex}}{\pi}_{\mathit{Siz}}^{\ast}=pr(S=1\mid \text{target}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}i,\text{arm}\phantom{\rule{0.16667em}{0ex}}z).\end{array}$$

(9)

Let *x _{jzst}* denote the number of participants in historical trial *j*, arm *z*, with surrogate endpoint *s* and true endpoint *t*. The estimates of the basic parameters are

$${\widehat{\pi}}_{\mathit{Sjz}}=\frac{{x}_{jz1+}}{{x}_{jz++}},\phantom{\rule{0.38889em}{0ex}}{\widehat{\pi}}_{\mathit{Tjz}}=\frac{{x}_{jz+1}}{{x}_{jz++}},\phantom{\rule{0.38889em}{0ex}}{\widehat{\theta}}_{T\mathit{jzs}}=\frac{{x}_{jzs1}}{{x}_{jzs+}},$$

(10)

where the subscript “+” denotes summation over the indicated index.

With survival data, we compute the basic parameters by first transforming the survival data into counts for binary outcomes. Let *S* = 0 if an unfavorable surrogate event (e.g. cancer recurrence) occurs before time *τ _{S}*, and 1 otherwise. Let *T* = 0 if an unfavorable true event (e.g. death) occurs before time *τ _{T}*, and 1 otherwise. Let ${\stackrel{~}{\pi}}_{\mathit{Sjz}}$ and ${\stackrel{~}{\theta}}_{T\mathit{jzs}}$ denote survival-analytic estimates of the corresponding probabilities at times *τ _{S}* and *τ _{T}* computed from the censored data, and let *n _{jz}* denote the number of participants in arm *z* of trial *j*. The transformed counts are

$$\begin{array}{ll}{\stackrel{~}{x}}_{jz00}={n}_{jz}(1-{\stackrel{~}{\pi}}_{\mathit{Sjz}})(1-{\stackrel{~}{\theta}}_{Tjz0}),\phantom{\rule{0.38889em}{0ex}}&{\stackrel{~}{x}}_{jz01}={n}_{jz}(1-{\stackrel{~}{\pi}}_{\mathit{Sjz}})\phantom{\rule{0.16667em}{0ex}}{\stackrel{~}{\theta}}_{Tjz0},\\ {\stackrel{~}{x}}_{jz10}={n}_{jz}\phantom{\rule{0.16667em}{0ex}}{\stackrel{~}{\pi}}_{\mathit{Sjz}}(1-{\stackrel{~}{\theta}}_{Tjz1}),\phantom{\rule{0.38889em}{0ex}}&{\stackrel{~}{x}}_{jz11}={n}_{jz}\phantom{\rule{0.16667em}{0ex}}{\stackrel{~}{\pi}}_{\mathit{Sjz}}\phantom{\rule{0.16667em}{0ex}}{\stackrel{~}{\theta}}_{Tjz1}.\end{array}$$

(11)

As shown in Appendix A, substituting ${\stackrel{~}{x}}_{jzst}$ from Equation (11) into Equation (10) yields the same estimates and similar variances to those obtained directly from individual-level survival data. The counts in Equation (11) can be made publicly available (Web Tables 1, 2, 3) even when individual-level survival data cannot be made available due to confidentiality issues. Public availability of data is important so that other researchers can reproduce the results.
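A sketch of the transformation in Equation (11), followed by the count-based estimates of Equation (10) for a single trial arm, with toy survival-based inputs:

```python
import numpy as np

def counts_from_survival(n, pi_S, theta_T0, theta_T1):
    """Eq. (11): convert arm-level survival-based estimates into a 2x2
    table of pseudo-counts x[s, t] for (surrogate, true) = (s, t)."""
    return np.array([
        [n * (1 - pi_S) * (1 - theta_T0), n * (1 - pi_S) * theta_T0],
        [n * pi_S * (1 - theta_T1),       n * pi_S * theta_T1],
    ])

def basic_estimates(x):
    """Eq. (10): basic parameter estimates from the 2x2 count table."""
    pi_S = x[1, :].sum() / x.sum()      # pr(S = 1)
    pi_T = x[:, 1].sum() / x.sum()      # pr(T = 1)
    theta = x[:, 1] / x.sum(axis=1)     # theta[s] = pr(T = 1 | S = s)
    return pi_S, pi_T, theta

# Toy inputs: 100 subjects, pr(S=1) = 0.6, pr(T=1|S=0) = 0.2, pr(T=1|S=1) = 0.8
x = counts_from_survival(n=100, pi_S=0.6, theta_T0=0.2, theta_T1=0.8)
pi_S, pi_T, theta = basic_estimates(x)
```

The round trip recovers the inputs exactly, which is the point of the construction: the pseudo-counts carry the same information as the arm-level survival estimates.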

When computing model-based predicted results, we use a weighted average of model-based predicted results from each historical trial, where the weights are proportional to sample size. We use the notation ${\mathit{avg}}_{j\in H}\{\cdot \}$ to denote this weighted average over the historical trials *j* in the set *H*.

Sometimes it is not clear how to label treatment arms as *Z* = 0 or *Z* = 1, particularly when a treatment is a control in one trial and experimental in another trial. To make labeling objective, we assign *Z* = 1 to the arm of the trial in which the estimate of *pr*(*S* = 1) is largest and *Z* = 0 to the arm of the trial in which the estimate of *pr*(*S* = 1) is smallest (Baker, 2008). Recall that *pr*(*S* = 1) is either the probability of survival or the probability of the more favorable outcome.
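The labeling rule can be stated in a few lines; the arm probabilities below are hypothetical:

```python
def label_arms(pi_S_by_arm):
    """Assign Z = 1 to the arm with the largest estimated pr(S = 1) and
    Z = 0 to the arm with the smallest; in trials with more than two
    arms, the middle arms are dropped from the comparison."""
    order = sorted(range(len(pi_S_by_arm)), key=lambda a: pi_S_by_arm[a])
    return {"Z=0": order[0], "Z=1": order[-1]}

# A three-arm trial: arm 1 has the largest pr(S=1), arm 0 the smallest
labels = label_arms([0.40, 0.70, 0.55])
```

This also illustrates how the rule selects two arms from a trial with more than two arms, as discussed in Section 5.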

The mixture model (Baker, 2008) is based on the mathematical identity

$${\pi}_{\mathit{Tjz}}={\theta}_{Tjz0}(1-{\pi}_{\mathit{Sjz}})+{\theta}_{Tjz1}\phantom{\rule{0.16667em}{0ex}}{\pi}_{\mathit{Sjz}}.$$

(12)

Equation (12) motivates the following model (MIX1) for the predicted probability of the true endpoint in target trial *i* based on an estimate of ${\theta}_{T\mathit{jzs}}$ from historical trial *j*,

$${\widehat{\pi}}_{\text{MIX}1\mathit{izj}}={\widehat{\theta}}_{Tjz0}(1-{\widehat{\pi}}_{\mathit{Siz}}^{\ast})+{\widehat{\theta}}_{Tjz1}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\mathit{Siz}}^{\ast}.$$

(13)

No additional assumptions are needed for estimation. The model-based predicted result for target trial *i* is the average difference in predicted probabilities,

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{MIX}1i}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{MIX}1i1j}-{\widehat{\pi}}_{\text{MIX}1i0j}\}={\widehat{\pi}}_{\text{MIX}1i1}-{\widehat{\pi}}_{\text{MIX}1i0},\phantom{\rule{0.16667em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\text{MIX}1iz}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{MIX}1izj}\}.$$

(14)
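Under MIX1, Equations (13)–(14) amount to a sample-size-weighted average of per-trial predictions; a sketch with hypothetical estimates:

```python
import numpy as np

def mix1_model_result(theta, pi_S_target, n):
    """Eqs. (13)-(14): theta[j, z, s] estimates pr(T=1 | arm z, S=s) in
    historical trial j; pi_S_target[z] estimates pr(S=1) in arm z of the
    target trial; n[j] are the sample-size weights."""
    theta = np.asarray(theta, float)
    pi_S = np.asarray(pi_S_target, float)
    w = np.asarray(n, float) / np.sum(n)
    # Eq. (13): per-trial predicted pr(T=1) in each target arm, shape (k, 2)
    pi_pred = theta[:, :, 0] * (1 - pi_S) + theta[:, :, 1] * pi_S
    pi_z = w @ pi_pred              # Eq. (14): weighted average over trials j
    return pi_z[1] - pi_z[0]

# One historical trial with pr(T=1|S=0) = 0.2 and pr(T=1|S=1) = 0.8 in both
# arms; the target trial has pr(S=1) of 0.3 and 0.5 in arms 0 and 1
res = mix1_model_result(theta=[[[0.2, 0.8], [0.2, 0.8]]],
                        pi_S_target=[0.3, 0.5], n=[100])
```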

The additional assumption of the Prentice Criterion, namely that ${\theta}_{T\mathit{jzs}}={\theta}_{Tjs}$ does not depend on arm *z*, leads to mixture model MIX2,

$${\widehat{\pi}}_{\text{MIX}2\mathit{izj}}={\widehat{\theta}}_{Tj0}(1-{\widehat{\pi}}_{\mathit{Siz}}^{\ast})+{\widehat{\theta}}_{Tj1}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\mathit{Siz}}^{\ast}.$$

(15)

Based on Equation (15), the model-based predicted result under MIX2 is

$$\begin{array}{l}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{MIX}2i}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{MIX}2i1j}-{\widehat{\pi}}_{\text{MIX}2i0j}\}={\widehat{\pi}}_{\text{MIX}2i1}-{\widehat{\pi}}_{\text{MIX}2i0}\\ =({\widehat{\pi}}_{Si1}^{\ast}-{\widehat{\pi}}_{Si0}^{\ast})\phantom{\rule{0.16667em}{0ex}}{\mathit{avg}}_{j\in H}\{{\widehat{\theta}}_{Tj1}-{\widehat{\theta}}_{Tj0}\},\phantom{\rule{0.16667em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\text{MIX}2iz}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{MIX}2izj}\},\end{array}$$

(16)

which is proportional to the surrogate result, namely ${\widehat{\pi}}_{Si1}^{\ast}-{\widehat{\pi}}_{Si0}^{\ast}$.

A random effects linear model for predicted results can be derived by assuming a joint multivariate normal distribution for parameters modeling surrogate and true endpoints in the historical trials (Buyse et al., 2000; Gail et al., 2000). The following model (LIN1) is a variation of this model involving binary surrogate and true endpoints,

$$\begin{array}{l}{\mathit{\pi}}_{j}=({\pi}_{Sj0},{\pi}_{Tj0},{\pi}_{Sj1},{\pi}_{Tj1})\sim \text{Normal}(\mathit{\pi},\mathbf{\Sigma}),\phantom{\rule{0.38889em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}\\ \mathit{\pi}=({\pi}_{S0},{\pi}_{T0},{\pi}_{S1},{\pi}_{T1})\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}\mathbf{\Sigma}=\left(\begin{array}{cccc}{d}_{S0S0}& {d}_{S0T0}& {d}_{S0S1}& {d}_{S0T1}\\ {d}_{S0T0}& {d}_{T0T0}& {d}_{S1T0}& {d}_{T0T1}\\ {d}_{S0S1}& {d}_{S1T0}& {d}_{S1S1}& {d}_{S1T1}\\ {d}_{S0T1}& {d}_{T0T1}& {d}_{S1T1}& {d}_{T1T1}\end{array}\right).\end{array}$$

(17)

A standard formula for a conditional multivariate normal distribution transforms Equation (17) into the multivariate linear regression,

$$\begin{array}{l}E\left\{\left(\begin{array}{c}{\pi}_{Tj0}\\ {\pi}_{Tj1}\end{array}\right)\mid \left(\begin{array}{c}{\pi}_{Sj0}\\ {\pi}_{Sj1}\end{array}\right)\right\}=\left(\begin{array}{c}{\pi}_{T0}\\ {\pi}_{T1}\end{array}\right)+\mathit{M}\left\{\left(\begin{array}{c}{\pi}_{Sj0}\\ {\pi}_{Sj1}\end{array}\right)-\left(\begin{array}{c}{\pi}_{S0}\\ {\pi}_{S1}\end{array}\right)\right\},\phantom{\rule{0.16667em}{0ex}}\text{where}\\ \mathit{M}=\left(\begin{array}{cc}{m}_{00}& {m}_{01}\\ {m}_{10}& {m}_{11}\end{array}\right)=\left(\begin{array}{cc}{d}_{S0T0}& {d}_{S0T1}\\ {d}_{S1T0}& {d}_{S1T1}\end{array}\right)\phantom{\rule{0.16667em}{0ex}}{\left(\begin{array}{cc}{d}_{S0S0}& {d}_{S0S1}\\ {d}_{S0S1}& {d}_{S1S1}\end{array}\right)}^{-1}.\end{array}$$

(18)

The model-based predicted result under LIN1 for trial *i* is

$$\begin{array}{l}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}1i}={\widehat{\pi}}_{\text{LIN}1i1}-{\widehat{\pi}}_{\text{LIN}1i0},\phantom{\rule{0.16667em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\text{LIN}1iz}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{LIN}1izj}\}\phantom{\rule{0.16667em}{0ex}}\text{and}\\ \left(\begin{array}{c}{\widehat{\pi}}_{\text{LIN}1i0j}\\ {\widehat{\pi}}_{\text{LIN}1i1j}\end{array}\right)=\left(\begin{array}{c}{\widehat{\pi}}_{T0j}\\ {\widehat{\pi}}_{T1j}\end{array}\right)+{\widehat{\mathit{M}}}_{\text{LIN}1}\left\{\left(\begin{array}{c}{\widehat{\pi}}_{Si0}^{\ast}\\ {\widehat{\pi}}_{Si1}^{\ast}\end{array}\right)-\left(\begin{array}{c}{\widehat{\pi}}_{S0j}\\ {\widehat{\pi}}_{S1j}\end{array}\right)\right\},\end{array}$$

(19)

and ${\widehat{\mathit{M}}}_{\text{LIN}1}$ is computed from an estimate of **Σ** (Web Appendix B).
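The matrix **M** in Equation (18) is the standard conditional-normal regression coefficient. A sketch of its computation from a given 4×4 covariance matrix ordered (S0, T0, S1, T1), with a toy **Σ** (not an estimate from the data sets here):

```python
import numpy as np

def regression_matrix(Sigma):
    """Standard conditional-normal coefficient M = Cov(T,S) Cov(S,S)^{-1},
    with Sigma ordered (S0, T0, S1, T1)."""
    Sigma = np.asarray(Sigma, float)
    S_idx, T_idx = [0, 2], [1, 3]
    cov_TS = Sigma[np.ix_(T_idx, S_idx)]   # cov(true, surrogate) block
    cov_SS = Sigma[np.ix_(S_idx, S_idx)]   # cov(surrogate, surrogate) block
    return cov_TS @ np.linalg.inv(cov_SS)

# Toy Sigma in which each true endpoint covaries only with its own arm's
# surrogate: M should be (0.5 / 1.0) times the identity
Sigma = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])
M = regression_matrix(Sigma)
```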

The model LIN2 is similar to LIN1 but with a logit transformation for the model-based predicted result,

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}2i}={\widehat{\pi}}_{\text{LIN}2i1}-{\widehat{\pi}}_{\text{LIN}2i0},\phantom{\rule{0.16667em}{0ex}}\text{where}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\text{LIN}2iz}={\mathit{avg}}_{j\in H}\{{\widehat{\pi}}_{\text{LIN}2izj}\},$$

(20)

where ${\widehat{\mathit{M}}}_{\text{LIN}2}$ is computed using the delta method (Web Appendix C).

The model LIN3 is a special case of LIN1 under assumptions on the elements of **Σ** that reduce the model-based predicted result to a linear regression on the surrogate result,

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}3i}={\widehat{\alpha}}_{\text{LIN}3}+{\widehat{\beta}}_{\text{LIN}3}{(\text{surrogate}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{i}.$$

(21)

We compute ${\widehat{\alpha}}_{\text{LIN}3}$ and ${\widehat{\beta}}_{\text{LIN}3}$ using weighted least squares with weights proportional to the sample size. To put the model-based predicted result for LIN3 into our standard form of a difference in probabilities, we rewrite Equation (21) as

$$\begin{array}{l}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}3i}={\widehat{\pi}}_{\text{LIN}3i1}-{\widehat{\pi}}_{\text{LIN}3i0},\phantom{\rule{0.16667em}{0ex}}\text{where}\\ {\widehat{\pi}}_{\text{LIN}3i1}={\widehat{\alpha}}_{\text{LIN}3}/2+{\widehat{\beta}}_{\text{LIN}3}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{Si1}^{\ast}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\text{LIN}3i0}=-{\widehat{\alpha}}_{\text{LIN}3}/2+{\widehat{\beta}}_{\text{LIN}3}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{Si0}^{\ast},\end{array}$$

(22)

arbitrarily splitting ${\widehat{\alpha}}_{\text{LIN}3}$ in half.

Model LIN4 is a special case of LIN3 that computes ${\widehat{\beta}}_{\text{LIN}4}$ from a weighted least squares model with a zero intercept,

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}4i}={\widehat{\pi}}_{\text{LIN}4i1}-{\widehat{\pi}}_{\text{LIN}4i0},\phantom{\rule{0.38889em}{0ex}}\text{where}\phantom{\rule{0.38889em}{0ex}}{\widehat{\pi}}_{\text{LIN}4iz}={\widehat{\beta}}_{\text{LIN}4}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{\mathit{Siz}}^{\ast}.$$

(23)
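A sketch of the weighted least squares fits behind LIN3 (Equation 21) and LIN4 (Equation 23), with weights proportional to sample size; the trial-level results below are toy values:

```python
import numpy as np

def lin3_fit(surrogate, true, n):
    """LIN3: weighted least squares of true results on surrogate results
    across historical trials; returns (alpha, beta) of Eq. (21)."""
    x, y = np.asarray(surrogate, float), np.asarray(true, float)
    w = np.asarray(n, float) / np.sum(n)
    X = np.column_stack([np.ones_like(x), x])
    # Solve the weighted normal equations (X'WX) b = X'Wy
    alpha, beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return alpha, beta

def lin4_fit(surrogate, true, n):
    """LIN4: the same regression with a zero intercept; returns beta."""
    x, y = np.asarray(surrogate, float), np.asarray(true, float)
    w = np.asarray(n, float) / np.sum(n)
    return np.sum(w * x * y) / np.sum(w * x * x)

# Toy trial-level results lying exactly on the line y = x - 0.01
alpha, beta = lin3_fit([0.02, 0.05, 0.08], [0.01, 0.04, 0.07],
                       [100, 120, 80])
beta4 = lin4_fit([0.10, 0.20], [0.20, 0.40], [50, 50])   # exactly y = 2x
```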

Permutt and Hebel (1989), Baker and Lindeman (1994), Imbens and Angrist (1994) followed by Angrist, Imbens, and Rubin (1996), and Cuzick, Edwards, and Segnan (1997) independently proposed a potential outcomes model involving a group 0 in which all subjects would ideally receive treatment T0 and a group 1 in which all subjects would ideally receive treatment T1 with the following two key features. First, it involved four potential outcome categories: (*i*) receive treatment T1 regardless of group, (*ii*) receive treatment T0 if in group 0 and T1 if in group 1, (*iii*) receive treatment T1 if in group 0 and T0 if in group 1, and (*iv*) receive treatment T0 regardless of group. Second, under plausible assumptions of no person in category (*iii*) and the effect of treatment not depending on group in categories (*i*) and (*iv*), the model yields an unbiased estimate of the effect of receipt of treatment in category (*ii*), which is then generalized to all subjects.

Frangakis and Rubin (2002) extended the four potential outcome categories to any post-randomization variable, including a binary surrogate endpoint, and called the four categories principal strata. An appealing feature of this model is that the principal strata are baseline variables. Frangakis and Rubin (2002) proposed using principal stratification to predict a true endpoint using data from surrogate endpoints in a target trial and a single historical trial, but did not discuss estimation. We extend their proposal to multiple historical trials and discuss estimation. Let ${S}_{PS}=({S}_{z=0},{S}_{z=1})$ denote the principal stratum, where ${S}_{z}$ is the surrogate endpoint that would be observed under randomization to arm *z*. The basic parameters are

$$\begin{array}{l}{\pi}_{Sj(ab)}=pr({S}_{PS}=(a,b)\mid \text{historical}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}j),\phantom{\rule{0.38889em}{0ex}}{\theta}_{Tjz(ab)}=pr(T=1\mid \text{historical}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}j,\text{arm}\phantom{\rule{0.16667em}{0ex}}z,{S}_{PS}=(a,b)),\\ {\pi}_{Si(ab)}^{\ast}=pr({S}_{PS}=(a,b)\mid \text{target}\phantom{\rule{0.16667em}{0ex}}\text{trial}\phantom{\rule{0.16667em}{0ex}}i).\end{array}$$

(24)

Without assumptions, the general principal stratification model (PSM),

$${\pi}_{\mathit{Tjz}}=\sum _{a=0}^{1}\sum _{b=0}^{1}{\theta}_{Tjz(ab)}\phantom{\rule{0.16667em}{0ex}}{\pi}_{Sj(ab)},$$

(25)

is not identifiable because the number of parameters exceeds the number of independent cell counts. To obtain an identifiable model, we make the following assumptions, which are analogous to the aforementioned assumptions in the potential outcomes model involving treatments T0 and T1.

*Assumption PSM-1*. No subject would have a surrogate endpoint at level *s* = 1 if randomized to arm *z* = 0 and a surrogate endpoint at level *s* = 0 if randomized to arm *z* = 1, implying ${\pi}_{Sj(10)}=0$ and ${\pi}_{Si(10)}^{\ast}=0$.

*Assumption PSM-2*. The probability of true endpoint does not depend on arm *z* among subjects whose level of surrogate endpoint *s* would remain unchanged with a different arm *z*, implying ${\theta}_{Tj0(aa)}={\theta}_{Tj1(aa)}={\theta}_{Tj(aa)}$ for *a* = 0, 1.

*Assumption PSM-1* is plausible because of the labeling scheme that defines *z* = 1 as the arm with the highest estimated probability that *s* = 1. *Assumption PSM-2* is plausible if treatment affects only recurrence of the primary cancer; however it could be violated if treatment affects microscopic metastases that could lead to a secondary tumor. Incorporating these assumptions gives the principal stratification model

$${\pi}_{\mathit{Tjz}}=\sum _{a=0}^{1}{\theta}_{Tj(aa)}\phantom{\rule{0.16667em}{0ex}}{\pi}_{Sj(aa)}+{\theta}_{Tjz(01)}\phantom{\rule{0.16667em}{0ex}}{\pi}_{Sj(01)}.$$

(26)

Based on Equation (26), the model-based predicted result under PSM is

$$\begin{array}{l}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{PSM}i}={\widehat{\pi}}_{\text{PSM}i1}-{\widehat{\pi}}_{\text{PSM}i0},\phantom{\rule{0.16667em}{0ex}}\text{where}\\ {\widehat{\pi}}_{\text{PSM}iz}=\sum _{a=0}^{1}{\mathit{avg}}_{j\in H}\{{\widehat{\theta}}_{Tj(aa)}\}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{Si(aa)}^{\ast}+{\mathit{avg}}_{j\in H}\{{\widehat{\theta}}_{Tjz(01)}\}\phantom{\rule{0.16667em}{0ex}}{\widehat{\pi}}_{Si(01)}^{\ast}.\end{array}$$

(27)

Maximum likelihood estimates of parameters from the historical trials are computed as a perfect fit estimate, if admissible, or otherwise via an EM algorithm (Baker, 2011) with ${\widehat{\theta}}_{Tjz(01)}$ restricted between 0.01 and 0.99. For the target trial, the perfect fit estimates are ${\widehat{\pi}}_{Si(00)}^{\ast}=1-{\widehat{\pi}}_{Si1}^{\ast}$, ${\widehat{\pi}}_{Si(11)}^{\ast}={\widehat{\pi}}_{Si0}^{\ast}$, and ${\widehat{\pi}}_{Si(01)}^{\ast}={\widehat{\pi}}_{Si1}^{\ast}-{\widehat{\pi}}_{Si0}^{\ast}$, which implies that the model-based predicted result is proportional to the surrogate result,

$${(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{PSM}i}={\mathit{avg}}_{j\in H}\{{\widehat{\theta}}_{Tj1(01)}-{\widehat{\theta}}_{Tj0(01)}\}\phantom{\rule{0.16667em}{0ex}}({\widehat{\pi}}_{Si1}^{\ast}-{\widehat{\pi}}_{Si0}^{\ast}).$$

(28)

If the perfect fit estimate for ${\pi}_{Si(01)}^{\ast}$ is less than 0, we instead use constrained estimates of ${\widehat{\pi}}_{Si(11)}^{\ast}$ and ${\widehat{\pi}}_{Si(01)}^{\ast}$.
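Under the perfect fit estimates, the PSM prediction of Equation (28) reduces to a slope times the surrogate result; a sketch with hypothetical stratum parameters:

```python
import numpy as np

def psm_target_strata(pi_S0, pi_S1):
    """Perfect-fit principal-stratum probabilities for the target trial,
    assuming the labeling gives pi_S1 >= pi_S0 and Assumption PSM-1."""
    return {"00": 1 - pi_S1, "01": pi_S1 - pi_S0, "11": pi_S0}

def psm_model_result(theta01_z0, theta01_z1, n, pi_S0, pi_S1):
    """Eq. (28): the (01)-stratum effect averaged over historical trials
    (weights n), multiplied by the surrogate result."""
    w = np.asarray(n, float) / np.sum(n)
    slope = float(np.sum(w * (np.asarray(theta01_z1) - np.asarray(theta01_z0))))
    return slope * (pi_S1 - pi_S0)

# Hypothetical target-trial surrogate probabilities and one historical
# trial with a (01)-stratum effect of 0.6 - 0.2 = 0.4
strata = psm_target_strata(pi_S0=0.30, pi_S1=0.50)
res = psm_model_result(theta01_z0=[0.20], theta01_z1=[0.60],
                       n=[100], pi_S0=0.30, pi_S1=0.50)
```

The (00) and (11) strata drop out of the treatment-effect difference because, under Assumption PSM-2, their true-endpoint probabilities do not depend on the arm.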

We applied our method to three data sets of historical trials (Web Tables 1, 2, and 3) and an artificial data set that combined subsets of these three data sets. Before looking at the data, we specified the survival times for surrogate and true events, *τ*_{S} and *τ*_{T}, that are needed for classification into surrogate and true endpoints.

Data Set 1 consists of 10 randomized trials for early colon cancer (Sargent et al., 2005) with *S* = 0 if cancer recurred before 3 years, and *S* = 1 otherwise. Also *T* = 0 if overall mortality occurred before 5 years, and *T* = 1 otherwise. Eight trials had two arms, one trial had three arms, and one trial had four arms. All of the trials compared different sets of treatments, with some treatments given in multiple trials. To estimate ${\theta}_{T\mathit{jzs}}$ and the other basic parameters from the censored survival data, we used the transformed counts in Equation (11).

Data Set 2 consists of 10 randomized trials for advanced colorectal cancer (Meta-Analysis Group in Cancer, 2004; Buyse et al., 2007) with *S* = 0 if cancer progressed before 6 months, and *S* = 1 otherwise. Also *T* = 0 if overall mortality occurred before 12 months, and *T* = 1 otherwise. All of the trials had two arms. In 7 of the 10 trials, the intervention was FU + LV versus FU, with FU + LV labeled as the control arm in 6 of those trials and the experimental arm in the other trial. In the remaining 3 trials, the intervention was FU + LV versus RA. This is not the ideal situation in which all the interventions differ. To estimate ${\theta}_{T\mathit{jzs}}$ and the other basic parameters from the censored survival data, we again used the transformed counts in Equation (11).

Data Set 3 consists of 27 randomized trials for advanced colorectal cancer (Burzykowski et al., 2004) with *S* = 0 if the cancer was stable or progressive at 3–6 months, and *S* = 1 if complete or partial response of the tumor occurred at 3–6 months. Also *T* = 0 if overall mortality occurred before 12 months, and *T* = 1 otherwise. All of the trials had two arms. A few trials in Data Set 3 were the same as in Data Set 2. Because the surrogate endpoint was binary, all subjects contributed to the estimation of ${\theta}_{T\mathit{jzs}}$.

Data Set 4 is an artificial data set of 10 randomized trials that consists of data from the first three trials in Data Sets 1 and 2 and the first four trials in Data Set 3. The surrogate and true endpoints differ among the three data sets, and the relationship between surrogate and true endpoints likely differs substantially as well. Therefore a prediction rule based on a combination of historical trials from the three data sets is unlikely to perform well for a target trial from only one of the data sets. As will be discussed, the standard error multipliers for Data Set 4 were larger than for the other data sets, which is important for showing that the methodology can identify a poor surrogate endpoint.

Applying Equation (7) to the data sets, we computed 95% confidence intervals for predicted result for left-out trials and plotted these confidence intervals along with the 95% confidence intervals for the true result for left-out trials (Figure 1 and Web Figures 1–4). To illustrate the computations, we present estimates for model LIN4 in Data Set 1 (Table 1). Because the 95% confidence intervals for the predicted results include the estimated variance of the estimated extrapolation error, they are wider than the corresponding 95% confidence intervals for the true results. Due to random variability, the confidence interval for the true result for the left-out trial is not a perfect gold standard for the confidence interval for the predicted result for the left-out trial. The validation measure provides better information on the appropriateness of the confidence interval for the predicted result.

We computed the validation measure of coverages of 95% confidence intervals for each model using separate simulations under mixture, linear, and principal stratification models derived from the data (Table 2). For Data Sets 1, 2, and 3, most of the coverages were close to the target of 95%, and for the artificial Data Set 4 the coverages were slightly worse. In our framework, this validation measure does not distinguish between good and poor surrogate endpoints because reasonable coverage can be obtained with a poor surrogate endpoint due to its compensating large standard error multiplier. Instead the validation measure checks a necessary condition for drawing conclusions from the extrapolation procedure.

We evaluated the quality of the surrogate endpoint using the standard error multiplier (Table 3). The smallest standard error multipliers generally corresponded to models MIX1, LIN3, LIN4, and PSM. Not surprisingly, the standard error multipliers were largest for Data Set 4, which was created to reflect a poor surrogate endpoint. It is possible to decrease the standard error multiplier by increasing the sample size of a target trial. However, for our data sets, an increase in sample size of 20% in the left-out target trials yielded only a small reduction in the standard error multiplier (Table 3) because the extrapolation error accounts for a large fraction of the variance.

With survival data, our focus is on probabilities of survival at specific times, as opposed to median survival times or hazard functions. Mathematically, an advantage of using the probability of survival at a specific time rather than the median survival time is the ability to use a binomial distribution to compute a variance for the model-based predicted result. The probability of survival at a specific time is also of direct clinical interest. In oncology, cancer survival rates, the percentage of people who survive a certain type of cancer for a specific amount of time, are often used to inform patients (Mayo Clinic staff, 2009). For the true endpoint, the specified time for survival should be the survival time used in standard evaluations of treatment for the disease. For a surrogate endpoint, the specified time for survival could be based on expert judgment that weighs the benefit of reaching conclusions sooner against the loss of information. Alternatively, for surrogate endpoints the specified time for survival could be chosen to give an overall (combining data over trial arms) survival probability for the surrogate event similar to the overall survival probability for the true event at its specified time.
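As a sketch, the probability of surviving beyond a specified time can be computed from right-censored data with the product-limit method; the helper and data below are illustrative and are not the paper's Mathematica code:

```python
def km_survival_at(times, events, t):
    # Kaplan-Meier (product-limit) estimate of the probability of
    # surviving beyond time t from right-censored data:
    # times[i] = observed time, events[i] = 1 for a true-endpoint event,
    # 0 for censoring
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data) and data[i][0] <= t:
        t_i = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t_i:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
        n_at_risk -= removed
    return surv

# Made-up data: events at times 1, 2, 4 and a censored subject at time 3
s = km_survival_at([1, 2, 3, 4], [1, 1, 0, 1], 2.5)
```

Treating n·s and n·(1 − s) as expected counts then permits a binomial variance for the survival probability, as in the Appendix.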

The labeling of trial arms deserves additional discussion. One way to reduce sensitivity to labeling is to model the predicted result as proportional to the surrogate result; however, labeling can still affect estimation of the extrapolation error. Our labeling method, which uses the highest and lowest probabilities of surrogate endpoints, has the desirable feature of not depending on a standard of care that can change over time. Nevertheless, for a sensitivity analysis we also computed standard error multipliers using labels based on the standard of care at the time of the trial (Web Table 4). The standard error multipliers under the two labeling methods were similar, particularly for Data Sets 1, 2, and 3. For a trial with three arms, reporting two pairwise comparisons may not be desirable because one arm would contribute twice to the calculations. Our labeling method avoids this problem by selecting two arms from a trial with more than two arms.
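The labeling rule described above, selecting the arms with the highest and lowest estimated surrogate-endpoint probabilities, can be sketched as follows; arm names and probabilities are hypothetical:

```python
def select_and_label_arms(surrogate_probs):
    # From a (possibly multi-arm) trial, select the arm with the highest
    # estimated surrogate-endpoint probability and the arm with the
    # lowest; only these two arms enter the calculations
    hi_arm = max(surrogate_probs, key=surrogate_probs.get)
    lo_arm = min(surrogate_probs, key=surrogate_probs.get)
    return hi_arm, lo_arm

# Hypothetical three-arm trial: only arms "B" and "A" would be used
arms = {"A": 0.42, "B": 0.55, "C": 0.48}
high, low = select_and_label_arms(arms)
```

This implements the selection of two arms per trial; how ties would be broken is not specified here.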

The standard error multipliers provide information on the usefulness of the predicted result and hence on the usefulness of the underlying surrogate endpoint. Under most of the models, the predicted results in Data Sets 1 and 3 had smaller standard error multipliers than the predicted results in Data Set 2, indicating better surrogacy in Data Sets 1 and 3 than in Data Set 2. This differs from previous findings of acceptable surrogate endpoints in Data Sets 1 and 2 and a poor surrogate endpoint in Data Set 3 (Burzykowski et al., 2004; Buyse et al., 2008). A possible explanation is that the previous methods estimated the effect of treatment on a hazard function for the true endpoint, while our method estimates the effect of treatment on a difference in probabilities of a true endpoint, which may be more closely related to the difference in probabilities of the binary surrogate endpoint.

A comparison of the standard error multipliers by model is also of interest. It is not clear why the simplest linear model, LIN4, performed best among the linear models. One possible explanation is that the model-based predicted result under LIN4 is proportional to the surrogate result, which can reduce bias from incorrect labeling if the mean of the extrapolation error is small. Another possible explanation is less overfitting than with a more complicated linear model. In contrast, the mixture model MIX1 performed substantially better than the simpler model MIX2, in which the predicted result is proportional to the surrogate result; apparently MIX2 is too restrictive. The principal stratification model PSM performed similarly to MIX1, perhaps because both represent saturated models. Among the best-fitting models, MIX1 and LIN4 are the simplest to compute and are hence the ones we recommend based on this investigation.
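For intuition about a proportional prediction model of the kind just described, a slope can be fit by least squares through the origin; this is only a sketch with made-up numbers and is not the paper's estimation procedure for LIN4 or MIX2:

```python
def fit_proportional(surrogate_results, true_results):
    # Least-squares slope through the origin, so that
    # predicted true-endpoint effect = beta * surrogate effect
    num = sum(s * t for s, t in zip(surrogate_results, true_results))
    den = sum(s * s for s in surrogate_results)
    return num / den

# Hypothetical historical trials: (surrogate result, true result) pairs
beta = fit_proportional([0.10, 0.05, 0.20], [0.06, 0.02, 0.11])

# Predicted result for a target trial whose surrogate result is 0.15
predicted = beta * 0.15
```

A proportional form has no intercept, so a mislabeled arm changes only the sign of both results and leaves the fitted slope less distorted than a model with a free intercept.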

The key to drawing correct conclusions from the predicted result is the appropriateness of the extrapolation assumption that the predicted result derived from historical trials correctly applies to the target trial. Unfortunately this assumption is not testable unless there is follow-up to observe the true endpoints, in which case the use of surrogate endpoints would be superfluous. To buttress the extrapolation assumption, investigators should strive to satisfy the following desiderata:

- all trials should apply to the same disease,
- the assigned treatments in all trials should affect surrogate and true endpoints via the same general mechanism,
- any intervention administered after the surrogate endpoint and before the true endpoint should be similar among all trials.

To further strengthen conclusions, investigators can use the scheme of Lassere (2008), which also considers the quality of historical trials, the clinical characteristics of the true endpoint, and evidence from animal experiments and epidemiological studies. Investigators also need to decide whether there is sufficient time to observe any detrimental side effects in the target trial when there is no follow-up to observe the true endpoint (Ellenberg, 1993).

Interested readers can reproduce results or evaluate other data sets using software written in Mathematica 8.0 (Wolfram Research, 2010), which is available for download at http://prevention.cancer.gov/programs-resources/groups/b/software/endpoint.

Web Appendices and Tables referenced in Sections 2, 3, and 4 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

This work was supported by the National Cancer Institute. The authors are grateful to the MAGIC (Meta-Analysis Group in Cancer) and ACCENT (Adjuvant Colon Cancer Endpoints) collaborators, listed in Burzykowski (2008), for providing the data. The authors thank the reviewers for helpful comments.

Without loss of generality, consider two cells representing survival or not. First we show that estimates of survival are the same whether computed from expected counts or from the individual-level survival data. Let *S _{y}* denote the estimated probability of surviving through time interval *y* computed from the individual-level survival data. For a sample of size *n*, the expected counts of not surviving and surviving are

$${x}_{0}=n(1-{S}_{y})\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{x}_{1}=n\phantom{\rule{0.16667em}{0ex}}{S}_{y},$$

(A.1)

Let *S _{BIN}* denote the estimated survival probability based on the estimated counts, where *x*_{+} = *x*_{0} + *x*_{1} = *n*,

$${S}_{\mathit{BIN}}={x}_{1}/{x}_{+}.$$

(A.2)

Substituting (A.1) into (A.2) gives *S _{BIN}* = *n S _{y}*/*n* = *S _{y}*, so the two estimates are identical.

Second we show that estimates of the variances of the estimated survival probability are similar when computed from expected counts or from the individual-level survival data. The estimated variance of *S _{y}* based on the individual survival data is

$${\widehat{\mathit{var}}}_{\mathit{SURV}}({S}_{y})={S}_{y}^{2}\sum _{u=1}^{y}\frac{{h}_{u}}{(1-{h}_{u}){r}_{u}},$$

(A.3)

where *r _{u}* is the number at risk and *h _{u}* is the estimated hazard in time interval *u*. With no censoring, the number at risk in interval *u* is *n* times the estimated probability of surviving the first *u* − 1 intervals; substituting this into (A.3) gives

$${\widehat{\mathit{var}}}_{\mathit{SURV}}({S}_{y})=\frac{{S}_{y}}{n}[{h}_{y}+{h}_{y-1}(1-{h}_{y})+{h}_{y-2}(1-{h}_{y-1})(1-{h}_{y})+\dots ].$$

(A.4)

Because the quantity in brackets in (A.4) is the estimated probability of failure in *y* intervals, it equals 1 − *S _{y}*, so that

$${\widehat{\mathit{var}}}_{\mathit{SURV}}({S}_{y})={S}_{y}(1-{S}_{y})/{x}_{+}.$$

(A.5)

The estimated variance of *S _{y}* based on the estimated counts is

$${\widehat{\mathit{var}}}_{\mathit{BIN}}({S}_{\mathit{BIN}})={S}_{\mathit{BIN}}(1-{S}_{\mathit{BIN}})/{x}_{+}.$$

(A.6)

Substituting (A.2) into (A.6) gives ${\widehat{\mathit{var}}}_{\mathit{BIN}}({S}_{\mathit{BIN}})={\widehat{\mathit{var}}}_{\mathit{SURV}}({S}_{y})$.
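The equivalence of (A.3) and (A.6) under no censoring can be checked numerically; the hazards and sample size below are arbitrary:

```python
def greenwood_var(hazards, n):
    # Greenwood-type variance (A.3) with no censoring, so the number at
    # risk in interval u is n times the probability of surviving the
    # first u - 1 intervals
    surv, at_risk, total = 1.0, float(n), 0.0
    for h in hazards:
        total += h / ((1 - h) * at_risk)
        at_risk *= 1 - h   # number at risk in the next interval
        surv *= 1 - h      # survival probability through this interval
    return surv * surv * total

def binomial_var(hazards, n):
    # Binomial variance (A.6) based on the estimated counts
    surv = 1.0
    for h in hazards:
        surv *= 1 - h
    return surv * (1 - surv) / n

g = greenwood_var([0.1, 0.2, 0.05], 100)
b = binomial_var([0.1, 0.2, 0.05], 100)
```

Up to floating-point rounding, the two variances agree, as the derivation shows.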

Equation (19) can be written as

$$\begin{array}{l}{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}1i}=({\overline{\pi}}_{T1j}-{\overline{\pi}}_{T0j})+({m}_{11}-{m}_{01})({\widehat{\pi}}_{Si1}-{\overline{\pi}}_{S1j})+({m}_{10}-{m}_{00})({\widehat{\pi}}_{Si0}-{\overline{\pi}}_{S0j})\\ \phantom{{(\text{model}\phantom{\rule{0.16667em}{0ex}}\text{result})}_{\text{LIN}1i}}=({\overline{\pi}}_{T1j}-{\overline{\pi}}_{T0j})+({m}_{11}-{m}_{01})\{({\widehat{\pi}}_{Si1}-{\overline{\pi}}_{S1j})-({\widehat{\pi}}_{Si0}-{\overline{\pi}}_{S0j})\}+\{({m}_{10}-{m}_{00})+({m}_{11}-{m}_{01})\}({\widehat{\pi}}_{Si0}-{\overline{\pi}}_{S0j}).\end{array}$$

Under the assumptions for LIN3, {(*m*_{10} − *m*_{00}) + (*m*_{11} − *m*_{01})} = 0, giving the general form of Equation (21) in which the model-based predicted result for LIN3 equals a constant plus a multiple of the surrogate result.

Stuart G. Baker, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, National Cancer Institute, Bethesda, MD 20892-7354, USA.

Daniel J. Sargent, Mayo Clinic, Rochester USA.

Marc Buyse, IDDI, Louvain-la-Neuve, and Hasselt University, Diepenbeek, Belgium.

Tomasz Burzykowski, Hasselt University, Diepenbeek, Belgium.

- Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. Journal of the American Statistical Association. 1996;91:444–455.
- Baker SG, Lindeman KS. The paired availability design: a proposal for evaluating epidural analgesia during labor. Statistics in Medicine. 1994;13:2269–2278. [PubMed]
- Baker SG. A simple meta-analytic approach for using a binary surrogate endpoint to predict the effect of intervention on true endpoint. Biostatistics. 2006;7:57–70. [PubMed]
- Baker SG. Two simple approaches for validating a binary surrogate endpoint using data from multiple trials. Statistical Methods in Medical Research. 2008;17:505–514. [PubMed]
- Baker SG. Estimation and inference for the causal effect of receiving treatment on a multinomial outcome: an alternative approach. Biometrics. 2011;67:319–325. [PMC free article] [PubMed]
- Burzykowski T, Buyse M. Surrogate threshold effect: An alternative for meta-analytic surrogate endpoint validation. Pharmaceutical Statistics. 2006;5:173–186. [PubMed]
- Burzykowski T, Molenberghs G, Buyse M. The validation of surrogate end points by using data from randomized clinical trials: a case-study in advanced colorectal cancer. Journal of the Royal Statistical Society Series A. 2004;167:103–124.
- Burzykowski T, Buyse M, Yothers G, Sakamoto J, Sargent D. Exploring and validating surrogate endpoints in colorectal cancer. Lifetime Data Analysis. 2008;14:54–64. [PubMed]
- Buyse M, Burzykowski T, Carroll K, Michiels S, Sargent D, Miller LL, Elfring GL, Pignon JP, Piedbois P. Progression-free survival is a surrogate for survival in advanced colorectal cancer. Journal of Clinical Oncology. 2007;25:5218–5224. [PubMed]
- Buyse M, Burzykowski T, Michiels S, Sargent D, Carroll K. Individual- and trial-level surrogacy in colorectal cancer. Statistical Methods in Medical Research. 2008;17:467–475. [PubMed]
- Buyse M, Molenberghs G. The validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029. [PubMed]
- Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–67. [PubMed]
- Cuzick J, Edwards R, Segnan N. Adjusting for non-compliance and contamination in randomized clinical trials. Statistics in Medicine. 1997;16:1017–1029. [PubMed]
- Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine. 1997;16:1965–1982. [PubMed]
- Ellenberg SS. Surrogate endpoints. British Journal of Cancer. 1993;68:457–459. [PMC free article] [PubMed]
- Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58:21–29. [PubMed]
- Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic disease. Statistics in Medicine. 1992;11:167–178. [PubMed]
- Gail MH, Pfeiffer R, van Houwelingen HC, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics. 2000;1:231–246. [PubMed]
- Imbens GW, Angrist JD. Identification and estimation of local average treatment effects. Econometrica. 1994;62:467–475.
- Jackson D, White IR, Thompson SG. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Statistics in Medicine. 2010;29:1282–1297. [PubMed]
- Korn EL, Albert PS, McShane LM. Assessing surrogates as trial endpoints using mixed models. Statistics in Medicine. 2005;24:163–182. [PubMed]
- Lassere MN. The Biomarker-Surrogacy Evaluation Schema: a review of the biomarker-surrogate literature and a proposal for a criterion-based, quantitative, multidimensional hierarchical levels of evidence schema for evaluating the status of biomarkers as surrogate endpoints. Statistical Methods in Medical Research. 2008;17:303–340. [PubMed]
- Mayo Clinic staff. Cancer survival rate: What it means for your prognosis. 2009 http://www.mayoclinic.com/health/cancer/CA00049.
- Meta-Analysis Group in Cancer. Modulation of fluorouracil by leucovorin in patients with advanced colorectal cancer: an updated meta-analysis. Journal of Clinical Oncology. 2004;22:3766–3775. [PubMed]
- Permutt T, Hebel R. Simultaneous-equation estimation in a clinical trial of the effect of smoking on birth weight. Biometrics. 1989;45:619–622. [PubMed]
- Prentice RL. Surrogate endpoints in clinical trials: Definitions and operational criteria. Statistics in Medicine. 1989;8:431–440. [PubMed]
- Sargent DJ, Wieand S, Haller DG, Gray R, Benedetti J, Buyse M, Labianca R, Seitz JF, Callaghan CJO, Francini G, Grothey A, O’Connell M, Catalano PJ, Blanke CD, Kerr D, Green E, Wolmark N, Andre T, Goldberg RM, De Gramont A. Disease-free survival (DFS) vs. overall survival (OS) as a primary endpoint for adjuvant colon cancer studies: Individual patient data from 20,898 patients on 18 randomized trials. Journal of Clinical Oncology. 2005;23:8664–8670. [PubMed]
- Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Statistics in Medicine. 2006;25:183–203. [PubMed]
- Wolfram Research, Inc. Mathematica, Version 8.0. Champaign, IL: 2010.
