
Biometrics. Author manuscript; available in PMC 2009 August 13.

Published in final edited form as:

Published online 2008 March 24. doi: 10.1111/j.1541-0420.2008.01014.x

PMCID: PMC2726718

NIHMSID: NIHMS92233

The publisher's final edited version of this article is available at Biometrics


Frangakis and Rubin (2002, *Biometrics* **58,** 21–29) proposed a new definition of a surrogate endpoint (a “principal” surrogate) based on causal effects. We introduce an estimand for evaluating a principal surrogate, the *causal effect predictiveness (CEP) surface*, which quantifies how well causal treatment effects on the biomarker predict causal treatment effects on the clinical endpoint. Although the CEP surface is not identifiable due to missing potential outcomes, it can be identified by incorporating a baseline covariate(s) that predicts the biomarker. Given case–cohort sampling of such a baseline predictor and the biomarker in a large blinded randomized clinical trial, we develop an estimated likelihood method for estimating the CEP surface. This estimation assesses the “surrogate value” of the biomarker for reliably predicting clinical treatment effects for the same or similar setting as the trial. A CEP surface plot provides a way to compare the surrogate value of multiple biomarkers. The approach is illustrated by the problem of assessing an immune response to a vaccine as a surrogate endpoint for infection.

Identifying biomarkers that can be used as approximate surrogates for clinical endpoints in randomized trials is useful for many reasons including shortening studies, reducing costs, sparing study participants discomfort, and elucidating treatment effect mechanisms. As a motivating example, a central objective of placebo-controlled preventive HIV vaccine efficacy trials is the evaluation of vaccine-induced immune responses as surrogate endpoints for HIV infection. An immunological surrogate would be useful for several purposes including guiding iterative development of immunogens between basic and clinical research, informing regulatory decisions and immunization policies, and bridging efficacy of a vaccine observed in a trial to a new setting.

The surrogate evaluation field was catalyzed by Prentice’s (1989) definition of a surrogate endpoint as a replacement endpoint that provides a valid test of the null hypothesis of no treatment effect on the clinical endpoint. The two main criteria for checking this definition are: (i) the distribution of the clinical endpoint conditional on the surrogate is the same as the distribution of the clinical endpoint conditional on the surrogate and treatment (i.e., all of the clinical treatment effect is “mediated” through the surrogate); and (ii) the surrogate and clinical endpoints are correlated. Frangakis and Rubin (2002) (henceforth FR) noted that this definition is based on observable random variables, and named a biomarker satisfying criterion (i) a “statistical surrogate.” Since 1989, many surrogate-evaluation methods have been designed to check if a biomarker is a statistical surrogate, including methods for estimating the proportion of the treatment effect explained (Freedman, Graubard, and Schatzkin, 1992). Notably some approaches have not been based on (i); for example, the adjusted association estimand is designed for evaluating the correlation criterion (ii), and the relative effect estimand is based on average causal effects (Buyse and Molenberghs, 1998).

Treatment effects adjusted for a variable measured after randomization (called *net effects*) are susceptible to postrandomization selection bias. Because candidate surrogates are measured after randomization, criterion (i) defining a statistical surrogate is based on net effects. FR pointed out that this definition does not have a causal interpretation, and proposed a new surrogate definition based on principal causal effects. FR’s definition of a “principal surrogate” is based on the potential outcomes framework for causal inference, which Robins (1995) also considered for studying treatment effects subject to postrandomization selection bias. To date, statistical methods for evaluating principal surrogates have not been elaborated. A recent review paper noted that FR “present a convincing case for the principal surrogate definition” and called for such elaborations (Weir and Walley, 2006).

The literature on statistical methods for evaluating surrogate endpoints contains approaches based on a single large clinical trial and on metaanalysis. Here we develop an approach for evaluating a principal surrogate within the former setting. Following Follmann (2006), our approach uses a baseline covariate(s) to predict missing potential biomarker outcomes. After defining statistical and principal surrogates in Section 2, in Section 3 we introduce the *causal effect predictiveness (CEP) surface* and the *marginal CEP curve*, plus associated summary causal estimands, which quantify how well a biomarker predicts population-level causal effects of treatment. In Section 4, we develop an estimated-likelihood approach for estimating the causal estimands based on case–cohort sampling of the biomarker, and parametric or nonparametric marginal structural mean models. In Section 5, we evaluate the nonparametric method in simulations based on an HIV vaccine trial, and in Section 6 we conclude with discussion.

Throughout we consider a randomized trial with treatment assignment *Z* (*Z* = 1 or 0), a discrete or continuous biomarker *S* measured at a fixed time *t*_{0} after treatment assignment, and a binary clinical endpoint *Y* (*Y* = 1 for disease, 0 otherwise) measured after *t*_{0}. Because *S* must be measured prior to disease to evaluate it as a candidate surrogate, the analysis is restricted to subjects disease free at *t*_{0}; denote this evaluability criterion by the indicator *V* = 1. The biomarker *S* is only measured in those with *V* = 1, and otherwise is undefined (denoted by *S* = *). We consider two-phase outcome-dependent case–cohort sampling, wherein baseline covariates *X* are measured for everyone (phase 1) and in the second phase a baseline covariate(s) *W* is measured for all or almost all cases (those with *Y* = 1) and for a random “subcohort” of controls (those with *Y* = 0). Let δ indicate whether *W* is measured. For subjects with *V* = 1, *S* is measured for those with *W* measured. Case–cohort sampling is efficient when *W* or *S* is expensive (Prentice, 1986). For vaccine trials, *W* and *S* can be measured after the trial using stored specimens (Gilbert et al., 2005).
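
The two-phase sampling scheme can be sketched in a few lines of Python. This is an illustration only: the function name `case_cohort_indicator`, the independent (Bernoulli) selection of controls, the 10% subcohort fraction, and the toy endpoint vector are all assumptions made for the sketch, not quantities taken from the trial.

```python
import random

def case_cohort_indicator(y, subcohort_fraction, rng):
    """Return delta_i = 1 if W is measured for subject i, else 0.

    All cases (y_i == 1) are sampled; each control (y_i == 0) enters the
    subcohort independently with probability `subcohort_fraction`
    (an assumed sampling mechanism for this sketch).
    """
    return [1 if yi == 1 or rng.random() < subcohort_fraction else 0 for yi in y]

rng = random.Random(0)
# Hypothetical binary clinical endpoint for 5000 subjects.
y = [1 if rng.random() < 0.07 else 0 for _ in range(5000)]
delta = case_cohort_indicator(y, subcohort_fraction=0.10, rng=rng)

n_cases = sum(y)
n_sampled_controls = sum(d for d, yi in zip(delta, y) if yi == 0)
print(n_cases, n_sampled_controls)
```

Only the sampled subjects (δ = 1) would need the expensive measurements of *W* and *S*, which is the source of the design's efficiency.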

Following FR, methods for evaluating statistical surrogates are based on comparing the risk distributions

$$\mathrm{risk}(s \mid Z = z) \equiv \Pr(Y = 1 \mid Z = z, V = 1, S = s), \quad z = 0, 1.$$

If *S* is continuous then these definitions abuse notation; however, to avoid the distraction of technical details the formal definitions are placed in Web Appendix A. FR defined *S* to be a *statistical surrogate* if, for all values *s* of *S*, risk(*s | Z* = 1) = risk(*s | Z* = 0).

Because *S* and *V* are measured after randomization, a comparison of risk(*s | Z* = 1) and risk(*s | Z* = 0) measures the net effect of treatment, i.e., differences due to a mixture of the causal treatment effect and any differences in characteristics between treatment 1 subjects who have response level *s*, {*Z* = 1, *V* = 1, *S* = *s*}, and treatment 0 subjects who have response level *s*, {*Z* = 0, *V* = 1, *S* = *s*}. Consequently, application of a method that evaluates a statistical surrogate may mislead about the capacity of a biomarker to reliably predict causal clinical treatment effects.

Let *Y*(*Z*) be the potential clinical endpoint after time *t*_{0} under assignment to treatment *Z*. Similarly define potential outcomes *S*(*Z*) for the biomarker endpoint measured at *t*_{0}, and let *V*(*Z*) be the potential indicators of whether the subject is disease free at *t*_{0}, for *Z* = 0,1. Note that *S*(*Z*) and *Y*(*Z*) are undefined if *V*(*Z*) = 0; in this case *S*(*Z*) = *Y*(*Z*) = *. We suppose that (*Z _{i}*,

A1. *Stable unit treatment value assumption (SUTVA)*

A2. *Ignorable treatment assignments (Rubin, 1986)*: Conditional on *X*, *Z* is independent of (*W*, *V*(1), *V*(0), *S*(1), *S*(0), *Y*(1), *Y*(0))

A3. *Equal individual clinical risk up to time t*_{0}: *V*(1) = 1 if and only if *V*(0) = 1

A1 implies that the potential outcomes (*V _{i}*(1),

With these preliminaries, we now define a principal surrogate endpoint. FR defined the *basic principal stratification P*_{0} with respect to the postrandomization variable *S* as the partition of units *i* = 1, … , *n* such that within any set of *P*_{0}, all units have the same vector (*S _{i}*(1),

results in equality for all *s*_{1} = *s*_{0}. FR did not explicitly condition on *V* (1) = *V* (0) = 1 in their definition; however, implicitly they must have, because (*S*(1), *S*(0)) is only defined if *V* (1) = *V* (0) = 1. For notational simplicity henceforth all probability statements involving *S*(*Z*) implicitly condition on *V* (*Z*) = 1. A contrast in risk_{(1)}(*s*_{1}, *s*_{0}) and risk_{(0)}(*s*_{1}, *s*_{0}) measures a population-level causal treatment effect on *Y* for subjects with {*S _{i}*(1) =

risk_{(1)}(*s*_{1}, *s*_{0}) = risk_{(0)}(*s*_{1}, *s*_{0}) for all *s*_{1} = *s*_{0}.

Biomarkers with the greatest utility for predicting clinical treatment effects will not only be necessary for a clinical effect, but also sufficient. For example, knowing that an antibody titer > 1000 is sufficient for a vaccine to protect individuals against HIV infection is exactly the information needed to use titer as a reliable predictor of protection. We define average causal sufficiency as

There exists a constant *C* ≥ 0 such that risk_{(1)}(*s*_{1}, *s*_{0}) ≠ risk_{(0)}(*s*_{1}, *s*_{0}) for all |*s*_{1} − *s*_{0}| > *C*.

For the one-sided situation where interest is in assessing if higher treatment 1 biomarker responses (*S*(1) > *S*(0)) predict clinical benefit of treatment 1 (*Y* (1) = 0 and *Y* (0) = 1) (e.g., a placebo-controlled trial), a one-sided version of average causal sufficiency may be more appropriate, defined as above with ≠ replaced with < and |*s*_{1} − *s*_{0}| replaced with *s*_{1} − *s*_{0}. In either case, we suggest a refined definition of a principal surrogate endpoint as a biomarker that satisfies both average causal necessity and average causal sufficiency. Henceforth we use this definition of a principal surrogate endpoint.

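FR suggested that the quality of a surrogate be measured by its “associative effects” relative to its “dissociative effects.” As defined in equation 5.3 of FR, an *associative effect* is a comparison between the ordered sets

$$\{Y_i(1) : S_i(1) \neq S_i(0)\} \quad \text{and} \quad \{Y_i(0) : S_i(1) \neq S_i(0)\},$$

and a *dissociative effect* is a comparison between the ordered sets

$$\{Y_i(1) : S_i(1) = S_i(0)\} \quad \text{and} \quad \{Y_i(0) : S_i(1) = S_i(0)\}.$$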

For the purpose of quantifying these effects, we introduce a *causal effect predictiveness (CEP) surface*. Let *CE* ≡ *h*(Pr(*Y*(1) = 1), Pr(*Y*(0) = 1)) be the overall causal effect of treatment on the clinical endpoint, where *h*(·, ·) is a known contrast function satisfying *h*(*x*, *y*) = 0 if and only if *x* = *y*, for example *h*(*x*, *y*) = *x* − *y* or log(*x/y*). Let

$$\mathrm{CEP}^{\mathrm{risk}}(s_1, s_0) \equiv h(\mathrm{risk}_{(1)}(s_1, s_0), \mathrm{risk}_{(0)}(s_1, s_0))$$

be this contrast conditional on {*S*(1) = *s*_{1}, *S*(0) = *s*_{0}}. Note that CEP^{risk}(*s*, *s*) = 0 for all *s* is equivalent to average causal necessity, whereas CEP^{risk}(*s*_{1}, *s*_{0}) ≠ 0 for all |*s*_{1} − *s*_{0}| > *C* (or the one-sided analog) is equivalent to average causal sufficiency. Therefore, the criteria for a principal surrogate can be checked through inference on the CEP surface. Moreover, biomarkers with capacity to predict clinical treatment effects will often have |CEP^{risk}(*s*_{1}, *s*_{0})| increasing in |*s*_{1} − *s*_{0}|, reflecting the situation that on average groups of persons with a greater causal effect on the marker have a greater causal effect on the clinical endpoint. We refer to the capacity of a biomarker to reliably predict the population-level causal effect of treatment on the clinical endpoint as the biomarker’s *surrogate value*. This value can be quantified both by the nearness of |CEP^{risk}(*s*_{1}, *s*_{0})| to 0 for *s*_{1} near *s*_{0} and by the extent to which |CEP^{risk}(*s*_{1}, *s*_{0})| increases with |*s*_{1} − *s*_{0}|, with a greater increase reflecting greater associative effects. Note that even if one or both of average causal necessity or sufficiency fail, a biomarker can still have surrogate value if |CEP^{risk}(*s*_{1}, *s*_{0})| increases with |*s*_{1} − *s*_{0}|; Figure 2 (dashed line) illustrates this. Moreover, two principal surrogates can have different surrogate values as reflected by different CEP surfaces.
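
As a minimal sketch of how the two criteria might be checked on a known CEP surface, the Python snippet below evaluates the surface on a small grid and tests average causal necessity and the one-sided version of average causal sufficiency (with *C* = 0). The principal-stratum risks are made-up numbers, and the function names are hypothetical; in practice these risks are not observed and must be estimated.

```python
import math

def cep_risk(risk1, risk0, h):
    """CEP^risk(s1, s0) = h(risk_(1)(s1, s0), risk_(0)(s1, s0))."""
    return h(risk1, risk0)

def h_log_ratio(x, y):
    """Contrast h(x, y) = log(x / y); equals 0 if and only if x == y."""
    return math.log(x / y)

# Hypothetical principal-stratum risks, keyed by (s1, s0).
risk_1 = {(0, 0): 0.10, (1, 0): 0.06, (2, 0): 0.02}
risk_0 = {(0, 0): 0.10, (1, 0): 0.10, (2, 0): 0.10}

surface = {k: cep_risk(risk_1[k], risk_0[k], h_log_ratio) for k in risk_1}

# Average causal necessity: CEP^risk(s, s) = 0 on the diagonal.
necessity = all(abs(v) < 1e-12 for (s1, s0), v in surface.items() if s1 == s0)

# One-sided average causal sufficiency with C = 0:
# risk_(1) < risk_(0), i.e. CEP^risk(s1, s0) < 0, whenever s1 - s0 > 0.
sufficiency = all(v < 0 for (s1, s0), v in surface.items() if s1 - s0 > 0)

print(surface, necessity, sufficiency)
```

In this toy surface |CEP^{risk}| also grows with *s*_{1} − *s*_{0}, which is the pattern described above for a biomarker with surrogate value.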

Figure 2. For case CB with *S*_{i}(0) = *c* for all *i* with *c* = *L* the lower bound of *S*, biomarkers *S* that have no (horizontal solid line), modest (dashed line), moderate (dotted line), and high (hatched line) surrogate value. Here CEP^{risk}(*s*_{1}, *c*) = *h*(risk_{(1)}(*s*_{1}, *c*), risk **...**

If *S* is continuous, then the CEP surface can alternatively be defined in terms of percentiles of the marker *S*. To formulate this, consider Huang, Pepe, and Feng’s (2007) proposal to judge the value of a continuous marker *S* for predicting disease *Y* by the *predictiveness curve*, *R*(υ) ≡ Pr(*Y* = 1 | *S* = *F*^{−1}(υ)), υ ∈ [0, 1], where *F* is the cumulative distribution function (cdf) of *S*. Note that *R*(υ) = risk(*S* = *F*^{−1}(υ)), i.e., *R*(υ) is risk as a function of the quantiles of *S*, which provides a common scale for comparing multiple markers. The predictiveness curve *R*(υ) usefully informs about both absolute risks at different marker quantiles and the frequency of these risks in the population. A predictive marker is one with *R*(υ) monotone (or approximately so) in υ with large |*R*(1) − *R*(0)|.
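
To make the predictiveness curve concrete, the following Python sketch computes a crude binned estimate of *R*(υ) from simulated data. The quantile-binning estimator, the function name `binned_predictiveness_curve`, and the logistic data-generating model are assumptions chosen for illustration; they are not the estimator of Huang, Pepe, and Feng (2007).

```python
import math
import random

def binned_predictiveness_curve(s, y, n_bins):
    """Crude estimate of R(v) = Pr(Y = 1 | S = F^{-1}(v)) on quantile bins.

    Subjects are sorted by marker value and split into `n_bins` groups of
    (nearly) equal size. Returns (v_mid, risk) pairs, where v_mid is the
    midpoint of the quantile bin and risk is the event proportion in it.
    """
    order = sorted(range(len(s)), key=lambda i: s[i])
    n = len(order)
    curve = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        risk = sum(y[i] for i in idx) / len(idx)
        curve.append(((b + 0.5) / n_bins, risk))
    return curve

rng = random.Random(1)
s = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Hypothetical model: disease risk decreases as the marker increases.
y = [1 if rng.random() < 1.0 / (1.0 + math.exp(1.0 + si)) else 0 for si in s]

curve = binned_predictiveness_curve(s, y, n_bins=5)
for v_mid, risk in curve:
    print(round(v_mid, 2), round(risk, 3))
```

Because the horizontal axis is the quantile υ rather than the raw marker value, curves for markers measured on different scales can be overlaid directly.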

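Applying these ideas, we propose a scale-independent version of the causal effect predictiveness surface, CEP^{R}(υ_{1}, υ_{0}) ≡ *h*(*R*_{(1)}(υ_{1}, υ_{0}), *R*_{(0)}(υ_{1}, υ_{0})), where *R*_{(Z)}(υ_{1}, υ_{0}) ≡ Pr(*Y*(*Z*) = 1 | *S*(1) = *F*_{(1)}^{−1}(υ_{1}), *S*(0) = *F*_{(1)}^{−1}(υ_{0})).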

In this definition, *S*(1) and *S*(0) are standardized relative to the distribution *F*_{(1)} of *S*(1). Figure 1 illustrates two CEP surfaces for the one-sided setting where interest is in predicting clinical benefit of treatment 1 from higher treatment 1 biomarker responses.

Figure 1. Example CEP^{R}(υ_{1}, υ_{0}) = *h*(*R*_{(1)}(υ_{1}, υ_{0}), *R*_{(0)}(υ_{1}, υ_{0})) surfaces, with *h*(*x*, *y*) = 1 − *x/y*. The surface in (i) reflects a biomarker with no surrogate value, wherein the clinical treatment effect is **...**

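For some studies, the *marginal CEP curve* is a related causal estimand of interest:

$$m\mathrm{CEP}^{\mathrm{risk}}(s_1) \equiv h(\mathrm{risk}_{(1)}(s_1), \mathrm{risk}_{(0)}(s_1)),$$

where risk_{(Z)}(*s*_{1}) ≡ Pr(*Y*(*Z*) = 1 | *S*(1) = *s*_{1}). Similarly *m*CEP* ^{R}*(υ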

If *S*_{i}(0) is constant across subjects, then the CEP surface (trivially) equals the marginal CEP curve. We refer to this special case as case CB:

*Constant Biomarkers*: *S*_{i}(0) = *c* for all *i*, for some constant *c*.

HIV vaccine trials fit case CB, with (almost) all subjects having no immune response under placebo (*Z* = 0). This occurs because *S* is an HIV-specific immune response, so that vaccine antigens must be presented to the immune system to induce a response (Gilbert et al., 2005). The dissociative effect can be measured by CEP^{risk}(*c*, *c*) and the associative effects by CEP^{risk}(*s*_{1}, *c*) for *s*_{1} ≠ *c*. For example, with *c* = *L* the lower bound of *S*, the nearer CEP^{risk}(*c*, *c*) is to zero and the greater the increase of |CEP^{risk}(*s*_{1}, *c*)| with *s*_{1} > *c*, the greater the surrogate value (Figure 2).

For placebo-controlled trials for which case CB fails yet *S _{i}*(0) has much less variability than

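We suggest functionals of the CEP surface that summarize the surrogate value of a biomarker. We again consider the one-sided setting where interest is in assessing whether *S*(1) > *S*(0) predicts clinical benefit of treatment 1 (*Y*(1) = 0 and *Y*(0) = 1). To summarize the associative and dissociative effects, we consider the *expected associative effect (EAE)* and the *expected dissociative effect (EDE)*:

$$\mathrm{EAE}^{\omega} \equiv \frac{E[\omega(S(1), S(0))\,\mathrm{CEP}^{\mathrm{risk}}(S(1), S(0)) \mid S(1) > S(0)]}{E[\omega(S(1), S(0)) \mid S(1) > S(0)]} \tag{1}$$

$$\mathrm{EDE} \equiv E[\mathrm{CEP}^{\mathrm{risk}}(S(1), S(0)) \mid S(1) = S(0)] \tag{2}$$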

where ω(·, ·) is a nonnegative weight function. For case CB with *c* = *L*, EAE^{ω} = {∫_{s1>c} ω(*s*_{1}, *c*) *dF*_{(1)}(*s*_{1})}^{−1} ∫_{s1>c} ω(*s*_{1}, *c*) CEP^{risk}(*s*_{1}, *c*) *dF*_{(1)}(*s*_{1}) and EDE = CEP^{risk}(*c*, *c*).
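
For a discrete marker in case CB, the integrals in EAE^{ω} reduce to weighted sums over the support of *S*(1). The Python sketch below works through that arithmetic; the marker levels, their probabilities under *F*_{(1)}, and the CEP values are invented for illustration, and the function names are hypothetical.

```python
def eae_case_cb(s1_values, probs, cep, c, weight):
    """EAE^omega in case CB for a discrete marker.

    Weighted average of CEP^risk(s1, c) over s1 > c, with respect to the
    distribution F_(1) of S(1) given by `probs`.
    """
    pairs = [(s1, p) for s1, p in zip(s1_values, probs) if s1 > c]
    num = sum(weight(s1, c) * cep[s1] * p for s1, p in pairs)
    den = sum(weight(s1, c) * p for s1, p in pairs)
    return num / den

def ede_case_cb(cep, c):
    """EDE in case CB: CEP^risk(c, c)."""
    return cep[c]

# Hypothetical marker levels, Pr(S(1) = s1), and CEP^risk(s1, c) values.
s1_values = [0, 1, 2, 3]
probs = [0.4, 0.3, 0.2, 0.1]
cep = {0: -0.1, 1: -0.5, 2: -1.0, 3: -2.0}
c = 0  # constant placebo-arm marker value, taken as the lower bound L

eae_flat = eae_case_cb(s1_values, probs, cep, c, lambda s1, s0: 1.0)
eae_linear = eae_case_cb(s1_values, probs, cep, c, lambda s1, s0: s1 - s0)
ede = ede_case_cb(cep, c)
print(eae_flat, eae_linear, ede)
```

With the weight ω(*s*_{1}, *s*_{0}) = *s*_{1} − *s*_{0}, strata with larger marker effects contribute more, so the summary moves toward the CEP values at high *s*_{1}.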

We also define the *proportion associative (PA) effect* by

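$$\mathrm{PAE}^{\omega} \equiv \frac{|\mathrm{EAE}^{\omega}|}{|\mathrm{EDE}| + |\mathrm{EAE}^{\omega}|} \tag{3}$$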

Values of PAE^{ω} ∈ [0, 0.5] suggest the biomarker has no surrogate value, whereas values in (0.5, 1] suggest some surrogate value.

A weight function is included in EAE^{ω} to reflect the idea that a biomarker with high surrogate value should have large |CEP^{risk}(*s*_{1}, *s*_{0})| for large *s*_{1} − *s*_{0} > 0. For example, weights ω(*s*_{1}, *s*_{0}) = *s*_{1} − *s*_{0} or *I*(*s*_{1} = *U*, *s*_{0} = *L*) may be used, where *L* (*U*) is the lower (upper) bound of *S*. With the latter weight, PAE^{ω} compares the clinical effect among groups with the maximum surrogate effect and with no surrogate effect:

If *h*(*x*, *y*) = *x* − *y*, Pr(*S*(1) > *S*(0)) = 0.5, and an additional monotonicity assumption is made (that *Y _{i}*(1) ≤

(proof in Web Appendix A). This summary measure, proposed by Taylor, Wang, and Thiebaut (2005), is interpreted as the proportion of the study population with a beneficial causal clinical effect that also has a positive causal surrogate effect. The PA depends on the underlying principal strata distribution *F*_{(1),(0)}(*s*_{1}, *s*_{0}) ≡ Pr(*S*(1) ≤ *s*_{1}, *S*(0) ≤ *s*_{0}); if Pr(*S*(1) > *S*(0)) is small (large) then the PA will tend to be small (large), irrespective of the biomarker’s surrogate value. By conditioning on (*S*(1), *S*(0)), the PAE^{ω} is designed to be robust to *F*_{(1),(0)}(·, ·); the PAE^{ω} reflects the relative magnitude of clinical effects for those with and without surrogate effects.

Biomarkers satisfying average causal necessity have EDE = 0 and thus PAE^{ω} = 1, in which case EAE^{ω} contributes no information to the PAE^{ω}. Therefore, additional measures are needed for summarizing the magnitude of associative effects. One such measure is the *associative span (AS)*, defined by AS ≡ |CEP^{risk}(*U*, *L*)| − |EDE|. Figure 2 illustrates PAE^{ω=1} and AS. Although the summary parameters may be useful, it is important to estimate the CEP estimands over the range of marker values or quantiles to provide a full picture of the associative and dissociative effects.
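
The two summaries can be computed from EAE^{ω}, EDE, and CEP^{risk}(*U*, *L*) as in the Python sketch below. The AS formula is the one given above. The PAE^{ω} formula used here, |EAE^{ω}| / (|EDE| + |EAE^{ω}|), is an assumed form: it is consistent with the stated range [0, 1], the 0.5 threshold, and the fact that EDE = 0 gives PAE^{ω} = 1, but it should be checked against the published equation. All numeric inputs are hypothetical.

```python
def pae(eae, ede):
    """ASSUMED form of the proportion associative effect:
    |EAE| / (|EDE| + |EAE|)."""
    return abs(eae) / (abs(ede) + abs(eae))

def associative_span(cep_upper_lower, ede):
    """AS = |CEP^risk(U, L)| - |EDE|, as defined in the text."""
    return abs(cep_upper_lower) - abs(ede)

# A biomarker satisfying average causal necessity (EDE = 0).
print(pae(eae=-1.0, ede=0.0))

# Equal associative and dissociative effects: no surrogate value.
print(pae(eae=-0.5, ede=-0.5))

# Hypothetical CEP^risk(U, L) = -2.0 with EDE = -0.1.
print(associative_span(cep_upper_lower=-2.0, ede=-0.1))
```

The first call shows why AS is a useful companion measure: once EDE = 0, the PAE^{ω} equals 1 regardless of how large the associative effects are.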

Below we also consider estimands defined as above except they condition on *X* and/or *W*; for example risk_{(Z)}(*s*_{1}, *s*_{0}, *x*, *w*) ≡ Pr(*Y*(*Z*) = 1 | *S*(1) = *s*_{1}, *S*(0) = *s*_{0}, *X* = *x*, *W* = *w*) and CEP^{risk}(*s*_{1}, *s*_{0}, *x*, *w*) ≡ *h*(risk_{(1)}(*s*_{1}, *s*_{0}, *x*, *w*), risk_{(0)}(*s*_{1}, *s*_{0}, *x*, *w*)). The conditional estimands reflect baseline covariate-specific surrogate value.

We consider one approach to identifying and estimating the CEP surface in the practically important special case CB. The same approach identifies and estimates the marginal CEP curve in the general case that *S _{i}*(0) has arbitrary variability. In case CB it is difficult to evaluate a statistical surrogate, because it is not possible to study the correlation of

Due to missing potential outcomes the CEP surface and marginal CEP curve are not identified without further assumptions. A1–A3 imply

demonstrating that risk_{(Z)}(*s*_{1}, *s*_{0}, *x*, *w*) would be identified if we knew the potential outcomes *S _{i}*(

Our method of inference is based on one of the augmented vaccine trial designs proposed by Follmann (2006), wherein a baseline covariate(s) *W* that is predictive of *S*(1) is measured in subjects in both treatment arms. A model predicting *S*(1) from *X* and *W* fit from arm *Z* = 1 subjects is used to predict *S*(1) for arm *Z* = 0 subjects. The predictions are unbiased because A1–A3 imply *S*(1) | *Z* = 1, *X*, *W* =* ^{d} S*(1) |
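
The idea of borrowing a prediction model for *S*(1) from arm *Z* = 1 can be sketched as follows. This Python example is a deliberate simplification: it uses a single scalar *W*, ordinary least squares, simulated data, and a plug-in point prediction, all of which are assumptions for illustration. The estimated likelihood method described in this section works with the conditional distribution of *S*(1) rather than a single predicted value.

```python
import random

def fit_simple_linear(w, s):
    """Least-squares fit of s = intercept + slope * w; returns (intercept, slope)."""
    n = len(w)
    w_bar = sum(w) / n
    s_bar = sum(s) / n
    sxx = sum((wi - w_bar) ** 2 for wi in w)
    sxy = sum((wi - w_bar) * (si - s_bar) for wi, si in zip(w, s))
    slope = sxy / sxx
    return s_bar - slope * w_bar, slope

rng = random.Random(2)

# Arm Z = 1: both W and S(1) are observed (hypothetical data-generating model).
w_vaccine = [rng.gauss(0.0, 1.0) for _ in range(2000)]
s1_vaccine = [0.5 + 0.8 * w + rng.gauss(0.0, 0.6) for w in w_vaccine]

# Arm Z = 0: W is observed, S(1) is a missing potential outcome.
w_placebo = [rng.gauss(0.0, 1.0) for _ in range(1000)]

intercept, slope = fit_simple_linear(w_vaccine, s1_vaccine)
s1_pred_placebo = [intercept + slope * w for w in w_placebo]
print(round(intercept, 3), round(slope, 3))
```

Randomization is what justifies applying the arm 1 model to arm 0: *W* is a baseline covariate, so its relationship with *S*(1) is the same in both arms.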

We observe i.i.d. data *O _{i}* (

Because CEP^{risk}(·, *c*, *X*, *W*; β) depends on β but not ν, the ν are nuisance parameters. Although profile likelihood is a natural approach to pursue, it is difficult to implement because the likelihood integrates over
, and *F ^{W|X}*. We use estimated likelihood (Pepe and Fleming, 1991), also called pseudolikelihood, wherein consistent estimates of ν are obtained based on treatment arm 1 data, and then

The estimated likelihood approach can be used for a variety of structural models for risk_{(z)}(*s*_{1}, *c*, *x*, *w*) and the nuisance parameters ν. Here we consider two types of models for case CB. The first is fully parametric, where we assume and *F ^{W|X}* have particular parametric distributions, and

for *s*_{1} ≥ *c* and some known link function *g*(·). For example, we might assume *F ^{W|X}* is normal and , is censored normal, with left-censoring of values below

Simple calculations yield EDE(*x*, *w*) = (β_{10} − β_{00}) + (β_{11} − β_{01})*L* + (β_{12} − β_{02})^{T}*x* + (β_{13} − β_{03})^{T}*w*, AS(*x*, *w*) = |(β_{10} − β_{00}) + (β_{11} − β_{01})*U* + (β_{12} − β_{02})^{T}*x* + (β_{13} − β_{03})^{T}*w*| − |EDE(*x*, *w*)|, and EAE^{ω=1}(*x*, *w*) = (β_{10} − β_{00}) + (β_{11} − β_{01})*E*[*S*(1) | *S*(1) > *c*, *x*, *w*] + (β_{12} − β_{02})^{T}*x* + (β_{13} − β_{03})^{T}*w*. For the case that *g* = Φ, Web Appendix C provides a proof, adapted from a proof by Dean Follmann, that β is identified under the untestable imposed constraint that one of the components of
equals the corresponding component of . Therefore identifiability requires assuming the absence of one interaction, but otherwise if and how the CEP curve varies with *X* and *W* can be evaluated. If no interactions between treatment and *X* or *W* are assumed, then CEP^{risk}(*s*_{1}, *c*; β) = (β_{10} − β_{00}) + (β_{11} − β_{01})*s*_{1} is interpreted as the covariate-adjusted CEP curve.
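
The expressions for EDE(*x*, *w*) and AS(*x*, *w*) above are simple linear contrasts in β and can be evaluated directly. The Python sketch below assumes scalar *x* and *w* for brevity (the text allows vectors) and uses made-up coefficient values; the function names are hypothetical.

```python
def linear_contrast(beta1, beta0, s1, x, w):
    """(b10 - b00) + (b11 - b01) * s1 + (b12 - b02) * x + (b13 - b03) * w."""
    d = [b1 - b0 for b1, b0 in zip(beta1, beta0)]
    return d[0] + d[1] * s1 + d[2] * x + d[3] * w

def ede_parametric(beta1, beta0, lower, x, w):
    """EDE(x, w): the contrast evaluated at s1 = L."""
    return linear_contrast(beta1, beta0, lower, x, w)

def as_parametric(beta1, beta0, lower, upper, x, w):
    """AS(x, w): |contrast at s1 = U| - |EDE(x, w)|."""
    return (abs(linear_contrast(beta1, beta0, upper, x, w))
            - abs(ede_parametric(beta1, beta0, lower, x, w)))

# Hypothetical coefficients (intercept, s1, x, w) for arms z = 1 and z = 0.
# The x and w coefficients are equal across arms, i.e. no interaction.
beta1 = (-1.5, -1.0, 0.2, 0.1)
beta0 = (-1.5, 0.0, 0.2, 0.1)

ede_value = ede_parametric(beta1, beta0, lower=0.0, x=0.3, w=-0.4)
as_value = as_parametric(beta1, beta0, lower=0.0, upper=1.0, x=0.3, w=-0.4)
print(ede_value, as_value)
```

With no treatment-by-covariate interactions, as in this example, the contrast no longer depends on *x* or *w* and reduces to (β_{10} − β_{00}) + (β_{11} − β_{01})*s*_{1}.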

Secondly, we consider a nonparametric approach wherein *S* and *W* are treated as categorical variables with *J* and *K* levels, which may be discretized versions of continuous measurements. Here we assume the lowest category *j* = 1 corresponds to the constant *c* in case CB. With *W* the only baseline covariate, nonparametric models are specified by ν* _{jk}* Pr(

for *j* = 1,…, *J*; *k* = 1,…,*K*; and *z* = 0, 1. The parameters
are constrained such that for all *z*, *j*, *k* and for identifiability. Under model A4-NP, *W* has the same effect on risk for the two study arms. This no-interaction assumption identifies the model, and the expanded model with
replaced with is not identified (see Web Appendix C).

In the simulations we consider the CEP curve estimand CEP^{risk}(*j*, 1; β) = log(risk_{(1)}(*j*, 1; β_{1j})/risk_{(0)}(*j*, 1; β_{0j})) based on average risks . It follows that CEP^{risk}(*j*, 1; β) = log(β_{1j}/β_{0j} ), AS = |log(β_{1J}/ β_{0J} )| − |log(β_{11}/ β_{01})|,EDE = log(β_{11}/β_{01}), and
with
and .
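
Given values of β_{zj}, the nonparametric CEP curve, EDE, and AS follow from the log-ratio expressions stated above. The Python sketch below applies them to hypothetical parameter values with *J* = 4 marker categories; the function name `cep_curve` is an assumption.

```python
import math

def cep_curve(beta1, beta0):
    """CEP^risk(j, 1; beta) = log(beta_1j / beta_0j) for j = 1, ..., J."""
    return [math.log(b1 / b0) for b1, b0 in zip(beta1, beta0)]

# Hypothetical parameters for J = 4 marker categories (index 0 is j = 1).
beta1 = [0.08, 0.06, 0.04, 0.02]
beta0 = [0.08, 0.08, 0.08, 0.08]

cep = cep_curve(beta1, beta0)
ede_value = cep[0]                          # EDE = log(beta_11 / beta_01)
as_value = abs(cep[-1]) - abs(ede_value)    # AS = |log(beta_1J / beta_0J)| - |EDE|
print([round(v, 3) for v in cep], ede_value, round(as_value, 3))
```

In this example the curve starts at 0 in the lowest category and decreases with *j*, which is the shape expected of a marker with high surrogate value under the log relative risk contrast.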

For both the parametric and nonparametric approaches, Web Appendix B describes consistent estimators $\hat{\nu}$ of ν and procedures for maximizing *L*(β, $\hat{\nu}$) in β.

Because PAE^{ω} = 0.5 supports that *S* has no surrogate value, Wald tests for any surrogate value can be based on the maximum estimated likelihood estimator (MELE) of PAE^{ω} minus 0.5, divided by its bootstrap standard error. Similarly, Wald tests of AS = 0 can be implemented based on the MELE of AS. For the nonparametric approach assuming model A4-NP, we also consider a test statistic *T* divided by its bootstrap standard error. This test evaluates *H*_{0} : CEP^{risk}(*j*, 1) = *CE* for all *j* versus the monotone alternative that CEP^{risk}(*j*, 1) increases in *j*, similar to the Breslow–Day trend test (Breslow and Day, 1980). The null and alternative hypotheses indicate that average causal sufficiency does not and does hold, respectively.
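
A Wald test with a bootstrap standard error can be sketched in Python as below. The function name `wald_test`, the use of the sample standard deviation of the bootstrap replicates as the standard error, the two-sided normal p-value, and the toy replicate values are assumptions for illustration; a one-sided p-value may be more natural for the alternative PAE^{ω} > 0.5.

```python
import math
import statistics

def wald_test(estimate, null_value, bootstrap_estimates):
    """Wald z statistic and two-sided normal p-value.

    The standard error is taken to be the sample standard deviation of
    the bootstrap replicates of the estimator.
    """
    se = statistics.stdev(bootstrap_estimates)
    z = (estimate - null_value) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# Hypothetical estimate of PAE and a handful of bootstrap replicates.
pae_hat = 0.80
pae_boot = [0.70, 0.75, 0.80, 0.85, 0.90]

z, p = wald_test(pae_hat, null_value=0.5, bootstrap_estimates=pae_boot)
print(round(z, 3), p)
```

In practice many more bootstrap replicates would be drawn, resampling subjects in a way that respects the case–cohort sampling design.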

Based on data from the first preventive HIV vaccine efficacy trial (Gilbert et al., 2005), we conducted a simulation study to evaluate performance of the MELE methods. The vaccine trial was double blind with 2:1 randomization to vaccine:placebo. A biomarker of interest *S* was the 50% neutralization titers against the HIV recombinant gp120 molecule measured from a serum sample drawn at the month 1.5 visit, and *Y* was HIV infection during the time period *t*_{0} = 1.5 months to 36 months. The lower quantification limit of the neutralization assay was 1.65, and 44 of 47 placebo recipients with *S* measured at 1.5 months had left-censored values; thus the data essentially fit case CB. The range of *S _{i}* was [1.65, 4.09], which we rescaled to [0, 1], so that

We simulated vaccine trials with the following steps. Step 1: For all 3598 (1805) subjects in the vaccine (placebo) arm, (*W _{i}*,

For each of 1000 simulated data sets the MELE of β was computed using the nonparametric approach described in Section 4.3. Then, with *h*(*x*, *y*) = log(*x/y*), the MELE of β was used to compute the MELEs of CEP^{risk}(*j*, 1), AS, and PAE^{ω} for ω(*j*, 1) = 1, *j*, and *I*(*j* = *J* = 4). Wald tests (with bootstrap standard errors) based on the MELE of PAE^{ω} minus 0.5, the MELE of AS, and *T* were used to test for any surrogate value. The MELEs of CEP^{risk}(*j*, 1), PAE^{ω} and AS performed well (Table 1 and Table 2). Bootstrap percentile confidence intervals (CIs) had approximately nominal coverage and for higher values of ρ the MELEs exhibited negligible bias. The tests for any surrogate value had approximately nominal size and showed adequate power to detect surrogate value; the nonparametric trend test had power 0.83, 0.99, and >0.99 for ρ = 0.5, 0.7, and 0.9 under scenario (ii).
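
A bootstrap percentile confidence interval of the kind referred to above can be computed from the sorted bootstrap replicates. The Python sketch below uses one simple index convention among several in common use; the function name `percentile_ci` and the artificial replicate values are assumptions.

```python
def percentile_ci(bootstrap_estimates, level=0.95):
    """Percentile interval: empirical (1 - level)/2 and 1 - (1 - level)/2
    quantiles of the bootstrap replicates (simple order-statistic rule)."""
    xs = sorted(bootstrap_estimates)
    n = len(xs)
    tail = (1.0 - level) / 2.0
    lower_index = int(round(tail * n))
    upper_index = int(round((1.0 - tail) * n)) - 1
    return xs[lower_index], xs[upper_index]

# Artificial replicates 0.000, 0.001, ..., 0.999 to make the indices visible.
replicates = [i / 1000.0 for i in range(1000)]
ci = percentile_ci(replicates, level=0.95)
print(ci)
```

Coverage in a simulation study is then the fraction of simulated data sets whose interval contains the true parameter value.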

Model A4-NP simulation results for the nonparametric MELEs $\widehat{\mathrm{CEP}}^{\mathrm{risk}}(j, 1; \hat{\beta}) = \log(\hat{\beta}_{1j}/\hat{\beta}_{0j})$ for *j* = 1, …, 4

Additional simulations were conducted to evaluate the performance of the MELE method with binned covariates when the data were generated from a continuous model. Specifically, Step 2 described above was replaced with Step 2′: For vaccine arm subjects, *Y _{i}*(1) was generated using model A4-P with

A main use of a surrogate endpoint is predicting treatment effects on a clinical endpoint. Within the principal surrogate framework, we have introduced the CEP surface and the marginal CEP curve as appropriate estimands for measuring the predictive capacity of a candidate surrogate. We developed estimation and testing methods under case–cohort sampling from a single large clinical trial (or multiple similar trials); such inferences apply for measuring surrogate predictiveness for the same or similar setting as the trial. The inferences do not form an empirical basis for bridging information about clinical efficacy to a new setting not represented in the trial(s) (e.g., to a new human population or treatment formulation); for this additional experiments (such as mechanistic studies and studies that deliberately manipulate the biomarker) and metaanalysis of heterogeneous studies are needed.

Because the definition of the CEP surface involves unobservable potential outcomes, strong untestable assumptions may be needed to identify it, possibly precluding its reliable estimation. The estimation method we developed requires A1–A3, a reasonably good model predicting *S* from baseline covariates *X* and *W* in treatment arm 1, and models for risk_{(z)}(*s*_{1}, *c*, *x*, *w*) or its marginal counterpart risk_{(z)}(*s*_{1}, *x*, *w*), for *z* = 0, 1. A1–A2 are standard in blinded randomized trials. A1 (SUTVA) is potentially dubious in the infectious disease setting where dependent happenings are possible (Halloran and Struchiner, 1995), but should approximately hold in trials with a small study population relative to the total population of at risk individuals. A3 can be assessed by testing at each of multiple fixed baseline covariate levels *x*, where rejecting for any *x* rejects A3. It is difficult to fully verify A3, however, due to the curse of dimensionality. The method is expected to be robust to violations of A3 if the vast majority of clinical events happen after the biomarker measurement time. Otherwise it will be important to extend the methods to facilitate sensitivity analyses to departures from A3.

Models for the conditional distribution of *S* given *X* and *W* can be directly checked using arm *Z* = 1 data, and under A1–A3 models for risk_{(1)}(*s*_{1}, *c*, *x*, *w*) can be tested. The model for risk_{(0)}(*s*_{1}, *c*, *x*, *w*) specified by A4-P or A4-NP is not testable. However, with extra data collection Follmann’s (2006) “close-out placebo vaccination” approach would provide one way to test it. Given the challenge in verifying this assumption, sensitivity analysis and the use of multiple surrogate evaluation approaches are warranted.

Within the principal surrogate framework considered here, internal validity of the putative surrogate can be checked by comparing the estimated overall clinical treatment effect to the CE predicted from the biomarker. Under A1–A3 and case CB, CE can be predicted by Pred(CE), which averages the predicted clinical treatment effect over the distribution of observed marker values of subjects assigned arm *Z* = 1. Furthermore, the estimated CEP curve can be used to check projective validity, that is, the utility of *S* for bridging efficacy predictions across populations. For example, suppose treatments *Z* = 1 and *Z* = 0 are compared within two subgroups of a large trial. The CEP surface can be estimated from subgroup 1 data, and Pred(CE) calculated by estimating *F*_{(1)}(·) from the observed biomarker values *S* of subgroup 2 subjects in arm *Z* = 1. Then projective validity would be supported by Pred(CE) near the estimated CE for subgroup 2.

The estimands and estimation techniques developed here for a binary clinical endpoint *Y* also apply for a quantitative clinical endpoint *Y*, with all expressions Pr(*Y* (*Z*) = 1 | ·) replaced with *E*(*Y* (*Z*) | ·). In either case the CEP estimands describe how the average or population level causal effect on *Y* depends on the causal effect on *S*.

Web Appendices referenced in Sections 2.1, 3.1, 3.2, and 4.3 are available under the Paper Information link at the *Biometrics* website http://www.biometrics.tibs.org. R code for the nonparametric method is also available at the *Biometrics* website.


The authors thank Dean Follmann, Margaret Pepe, Ross Prentice, Steve Self, and the associate editor for helpful comments. This work was supported by NIH grants 2 R01 AI54165-04 and 5 R37 AI029168-16.

- Breslow N, Day N. Statistical Methods in Cancer Research. Volume 1. Lyon, France: International Agency for Research on Cancer; 1980.
- Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029.
- Chan I, Shu L, Matthews H, Chan C, Vessey R, Sadoff J, Heyse J. Use of statistical models for evaluating antibody response as a correlate of protection against varicella. Statistics in Medicine. 2002;21:3411–3430.
- Follmann D. Augmented designs to assess immune response in vaccine trials. Biometrics. 2006;62:1161–1169.
- Frangakis C, Rubin D. Principal stratification in causal inference. Biometrics. 2002;58:21–29.
- Freedman L, Graubard B, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine. 1992;11:167–178.
- Gilbert P, Peterson M, Follmann D, Hudgens MG, Francis DP, Gurwith M, Heyward WL, Jobes DV, Popovic V, Self SG, Sinangil F, Burke D, Berman PW. Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. Journal of Infectious Diseases. 2005;191:666–677.
- Halloran M, Struchiner C. Causal inferences in infectious diseases. Epidemiology. 1995;6:142–151.
- Huang Y, Pepe M, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007;63:1181–1188.
- Pepe M, Fleming T. A non-parametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
- Prentice R. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
- Prentice R. Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine. 1989;8:431–440.
- Robins J. An analytic method for randomized trials with informative censoring: Part I. Lifetime Data Analysis. 1995;1:241–254.
- Rubin D. Statistics and causal inference: Which ifs have causal answers. Journal of the American Statistical Association. 1986;81:961–962.
- Taylor J, Wang Y, Thiebaut R. Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics. 2005;61:1102–1111.
- Weir C, Walley R. Statistical evaluation of biomarkers as surrogate endpoints: A literature review. Statistics in Medicine. 2006;25:183–203.
