HHS Public Access author manuscript; accepted for publication in a peer-reviewed journal.
Biometrics. Author manuscript; available in PMC 2009 August 13.
Published in final edited form as:
PMCID: PMC2726718

Evaluating Candidate Principal Surrogate Endpoints


Frangakis and Rubin (2002, Biometrics 58, 21–29) proposed a new definition of a surrogate endpoint (a “principal” surrogate) based on causal effects. We introduce an estimand for evaluating a principal surrogate, the causal effect predictiveness (CEP) surface, which quantifies how well causal treatment effects on the biomarker predict causal treatment effects on the clinical endpoint. Although the CEP surface is not identifiable due to missing potential outcomes, it can be identified by incorporating a baseline covariate(s) that predicts the biomarker. Given case–cohort sampling of such a baseline predictor and the biomarker in a large blinded randomized clinical trial, we develop an estimated likelihood method for estimating the CEP surface. This estimation assesses the “surrogate value” of the biomarker for reliably predicting clinical treatment effects for the same or similar setting as the trial. A CEP surface plot provides a way to compare the surrogate value of multiple biomarkers. The approach is illustrated by the problem of assessing an immune response to a vaccine as a surrogate endpoint for infection.

Keywords: Case cohort, Causal inference, Clinical trial, HIV vaccine, Postrandomization selection bias, Structural model, Prentice criteria, Principal stratification

1. Introduction

Identifying biomarkers that can be used as approximate surrogates for clinical endpoints in randomized trials is useful for many reasons including shortening studies, reducing costs, sparing study participants discomfort, and elucidating treatment effect mechanisms. As a motivating example, a central objective of placebo-controlled preventive HIV vaccine efficacy trials is the evaluation of vaccine-induced immune responses as surrogate endpoints for HIV infection. An immunological surrogate would be useful for several purposes including guiding iterative development of immunogens between basic and clinical research, informing regulatory decisions and immunization policies, and bridging efficacy of a vaccine observed in a trial to a new setting.

The surrogate evaluation field was catalyzed by Prentice’s (1989) definition of a surrogate endpoint as a replacement endpoint that provides a valid test of the null hypothesis of no treatment effect on the clinical endpoint. The two main criteria for checking this definition are: (i) the distribution of the clinical endpoint conditional on the surrogate is the same as the distribution of the clinical endpoint conditional on the surrogate and treatment (i.e., all of the clinical treatment effect is “mediated” through the surrogate); and (ii) the surrogate and clinical endpoints are correlated. Frangakis and Rubin (2002) (henceforth FR) noted that this definition is based on observable random variables, and named a biomarker satisfying criterion (i) a “statistical surrogate.” Since 1989, many surrogate-evaluation methods have been designed to check if a biomarker is a statistical surrogate, including methods for estimating the proportion of the treatment effect explained (Freedman, Graubard, and Schatzkin, 1992). Notably some approaches have not been based on (i); for example, the adjusted association estimand is designed for evaluating the correlation criterion (ii), and the relative effect estimand is based on average causal effects (Buyse and Molenberghs, 1998).

Treatment effects adjusted for a variable measured after randomization (called net effects) are susceptible to postrandomization selection bias. Because candidate surrogates are measured after randomization, criterion (i) defining a statistical surrogate is based on net effects. FR pointed out that this definition does not have a causal interpretation, and proposed a new surrogate definition based on principal causal effects. FR’s definition of a “principal surrogate” is based on the potential outcomes framework for causal inference, which Robins (1995) also considered for studying treatment effects subject to postrandomization selection bias. To date, statistical methods for evaluating principal surrogates have not been elaborated. A recent review paper noted that FR “present a convincing case for the principal surrogate definition” and called for such elaborations (Weir and Walley, 2006).

The literature on statistical methods for evaluating surrogate endpoints contains approaches based on a single large clinical trial and on metaanalysis. Here we develop an approach for evaluating a principal surrogate within the former setting. Following Follmann (2006), our approach uses a baseline covariate(s) to predict missing potential biomarker outcomes. After defining statistical and principal surrogates in Section 2, in Section 3 we introduce the causal effect predictiveness (CEP) surface and the marginal CEP curve, plus associated summary causal estimands, which quantify how well a biomarker predicts population-level causal effects of treatment. In Section 4, we develop an estimated-likelihood approach for estimating the causal estimands based on case–cohort sampling of the biomarker, and parametric or nonparametric marginal structural mean models. In Section 5, we evaluate the nonparametric method in simulations based on an HIV vaccine trial, and in Section 6 we conclude with discussion.

2. Statistical and Principal Surrogates

Throughout we consider a randomized trial with treatment assignment Z (Z = 1 or 0), a discrete or continuous biomarker S measured at fixed time t0 after treatment assignment, and a binary clinical endpoint Y (Y = 1 for disease, 0 otherwise) measured after t0. Because S must be measured prior to disease to evaluate it as a candidate surrogate, the analysis is restricted to subjects disease free at t0; denote this evaluability criterion by the indicator V = 1. The biomarker S is only measured in those with V = 1, and otherwise is undefined (denoted by S = *). We consider two phase outcome-dependent case–cohort sampling, wherein baseline covariates X are measured for everyone (phase 1) and in the second phase a baseline covariate(s) W is measured for all or almost all cases (those with Y = 1) and for a random “subcohort” of controls (those with Y = 0). Let δ indicate whether W is measured. For subjects with V = 1, S is measured for those with W measured. Case–cohort sampling is efficient when W or S is expensive (Prentice, 1986). For vaccine trials, W and S can be measured after the trial using stored specimens (Gilbert et al., 2005).

2.1 Definition of a Statistical Surrogate

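Following FR, methods for evaluating statistical surrogates are based on comparing the risk distributions

$$\mathrm{risk}(s \mid Z = 1) \equiv \Pr(Y = 1 \mid Z = 1, V = 1, S = s) \quad\text{and}\quad \mathrm{risk}(s \mid Z = 0) \equiv \Pr(Y = 1 \mid Z = 0, V = 1, S = s).$$
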
If S is continuous then these definitions abuse notation; however, to avoid the distraction of technical details the formal definitions are placed in Web Appendix A. FR defined S to be a statistical surrogate if, for all values s of S, risk(s | Z = 1) = risk(s | Z = 0).

Because S and V are measured after randomization, a comparison of risk(s | Z = 1) and risk(s | Z = 0) measures the net effect of treatment, i.e., differences due to a mixture of the causal treatment effect and any differences in characteristics between treatment 1 subjects who have response level s, {Z = 1, V = 1, S = s}, and treatment 0 subjects who have response level s, {Z = 0, V = 1, S = s}. Consequently, application of a method that evaluates a statistical surrogate may mislead about the capacity of a biomarker to reliably predict causal clinical treatment effects.
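
To make the selection-bias point concrete, the following minimal sketch works through a toy population in which treatment has no causal effect on the clinical endpoint in any subject, yet the comparison of risk(s | Z = 1) with risk(s | Z = 0) is nonzero. The two subject types, their probabilities, and their risks are hypothetical values chosen only for illustration; they are not taken from any trial.

```python
# Hypothetical population of two subject types (all numbers are assumptions).
# Each row: (probability, S under Z=1, S under Z=0, risk under Z=1, risk under Z=0).
population = [
    (0.5, 1, 0, 0.1, 0.1),  # robust subjects: biomarker responds, low risk
    (0.5, 0, 0, 0.3, 0.3),  # frail subjects: no biomarker response, high risk
]

def observed_risk(s, z):
    """risk(s | Z = z) = Pr(Y = 1 | Z = z, S = s) in a randomized trial."""
    num = den = 0.0
    for p, s1, s0, r1, r0 in population:
        s_obs, r_obs = (s1, r1) if z == 1 else (s0, r0)
        if s_obs == s:
            num += p * r_obs
            den += p
    return num / den

# Treatment changes nobody's risk, so every causal effect is zero.
causal_effects = [r1 - r0 for _, _, _, r1, r0 in population]

# The net effect at s = 0 compares frail treated subjects with all controls.
net_effect = observed_risk(0, 1) - observed_risk(0, 0)
```

In this sketch the treated subjects with S = 0 are only the frail type, while the control subjects with S = 0 are the whole population, so the net effect reflects who is selected into the stratum and not an effect of treatment.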

2.2 Definition of a Principal Surrogate Endpoint

Let Y(Z) be the potential clinical endpoint after time t0 under assignment to treatment Z. Similarly define potential outcomes S(Z) for the biomarker endpoint measured at t0, and let V(Z) be the potential indicators of whether the subject is disease free at t0, for Z = 0, 1. Note that S(Z) and Y(Z) are undefined if V(Z) = 0; in this case S(Z) = Y(Z) = *. We suppose that (Zi, Xi, δi, δiWi, Vi(1), Vi(0), Si(1), Si(0), Yi(1), Yi(0)), i = 1, …, n, are independent and identically distributed (i.i.d.), and for simplicity assume no drop-out. Further we assume A1–A3:


A1: Stable unit treatment value assumption (SUTVA)

A2: Ignorable treatment assignments (Rubin, 1986): Conditional on X, Z is independent of (W, V(1), V(0), S(1), S(0), Y(1), Y(0))

A3: Equal individual clinical risk up to time t0: V(1) = 1 if and only if V(0) = 1

A1 implies that the potential outcomes (Vi(1), Vi(0), Si(1), Si(0), Yi(1), Yi(0)) are independent of the treatment assignments of other subjects, which implies “consistency,” (Vi(Zi), Si(Zi), Yi(Zi)) = (Vi, Si, Yi). A2 holds for blinded randomized trials, where the randomization may depend on X. A3 will be needed for identifying the causal estimand based on data from subjects observed to be at risk for disease at t0. Inferences will be robust to A3 if t0 is near baseline relative to the period of follow-up for clinical events and the vast majority of subjects are at risk at t0, in which case Vi(1) = Vi(0) = 1 for almost all i.

With these preliminaries, we now define a principal surrogate endpoint. FR defined the basic principal stratification P0 with respect to the postrandomization variable S as the partition of units i = 1, … , n such that within any set of P0, all units have the same vector (Si(1), Si(0)). A principal stratification is a partition of units whose sets are unions of sets in P0. FR defined a biomarker S to be a principal surrogate endpoint if the comparison between

$$\mathrm{risk}_{(1)}(s_1, s_0) \equiv \Pr(Y(1) = 1 \mid S(1) = s_1, S(0) = s_0) \quad\text{and}\quad \mathrm{risk}_{(0)}(s_1, s_0) \equiv \Pr(Y(0) = 1 \mid S(1) = s_1, S(0) = s_0)$$

results in equality for all s1 = s0. FR did not explicitly condition on V(1) = V(0) = 1 in their definition; however, implicitly they must have, because (S(1), S(0)) is only defined if V(1) = V(0) = 1. For notational simplicity henceforth all probability statements involving S(Z) implicitly condition on V(Z) = 1. A contrast in risk(1)(s1, s0) and risk(0)(s1, s0) measures a population-level causal treatment effect on Y for subjects with {Si(1) = s1, Si(0) = s0}. Such a contrast is causal because it conditions on a principal stratification, which, by construction, is unaffected by treatment. Thus in FR’s definition, S is a principal surrogate if groups of subjects with no causal effect on the biomarker have no causal effect on the clinical endpoint. We call this property average causal necessity.

Average causal necessity

risk(1)(s1, s0) = risk(0)(s1, s0) for all s1 = s0.

Biomarkers with the greatest utility for predicting clinical treatment effects will not only be necessary for a clinical effect, but also sufficient. For example, knowing that an antibody titer > 1000 is sufficient for a vaccine to protect individuals against HIV infection is exactly the information needed to use titer as a reliable predictor of protection. We define average causal sufficiency as

Average causal sufficiency

There exists a constant C ≥ 0 such that risk(1)(s1, s0) ≠ risk(0)(s1, s0) for all |s1 − s0| > C.

For the one-sided situation where interest is in assessing if higher treatment 1 biomarker responses (S(1) > S(0)) predict clinical benefit of treatment 1 (Y(1) = 0 and Y(0) = 1) (e.g., a placebo-controlled trial), a one-sided version of average causal sufficiency may be more appropriate, defined as above with ≠ replaced with < and |s1 − s0| replaced with s1 − s0. In either case, we suggest a refined definition of a principal surrogate endpoint as a biomarker that satisfies both average causal necessity and average causal sufficiency. Henceforth we use this definition of a principal surrogate endpoint.

3. Causal Effect Predictiveness Estimands

3.1 Quantitation of Associative and Dissociative Effects

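FR suggested that the quality of a surrogate be measured by its “associative effects” relative to its “dissociative effects.” As defined in equation 5.3 and of FR, an associative effect is a comparison between the ordered sets

$$\{Y_i(1) : S_i(1) \neq S_i(0)\} \quad\text{and}\quad \{Y_i(0) : S_i(1) \neq S_i(0)\},$$

and a dissociative effect is a comparison between the ordered sets

$$\{Y_i(1) : S_i(1) = S_i(0)\} \quad\text{and}\quad \{Y_i(0) : S_i(1) = S_i(0)\}.$$
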
For the purpose of quantifying these effects, we introduce a causal effect predictiveness (CEP) surface. Let CE ≡ h(Pr(Y(1) = 1), Pr(Y(0) = 1)) be the overall causal effect of treatment on the clinical endpoint, where h(·, ·) is a known contrast function satisfying h(x, y) = 0 if and only if x = y, for example h(x, y) = x − y or log(x/y). Let

$$\mathrm{CEP}^{\mathrm{risk}}(s_1, s_0) \equiv h(\mathrm{risk}_{(1)}(s_1, s_0), \mathrm{risk}_{(0)}(s_1, s_0))$$

be this contrast conditional on {S(1) = s1, S(0) = s0}. Note that CEPrisk(s, s) = 0 for all s is equivalent to average causal necessity, whereas CEPrisk(s1, s0) ≠ 0 for all |s1 − s0| > C (or the one-sided analog) is equivalent to average causal sufficiency. Therefore, the criteria for a principal surrogate can be checked through inference on the CEP surface. Moreover, biomarkers with capacity to predict clinical treatment effects will often have |CEPrisk(s1, s0)| increasing in |s1 − s0|, reflecting the situation that on average groups of persons with a greater causal effect on the marker have a greater causal effect on the clinical endpoint. We refer to the capacity of a biomarker to reliably predict the population level causal effect of treatment on the clinical endpoint as the biomarker’s surrogate value. This value can be quantified both by the nearness of |CEPrisk(s1, s0)| to 0 for s1 near s0 and by the extent to which |CEPrisk(s1, s0)| increases with |s1 − s0|, with a greater increase reflecting greater associative effects. Note that even if one or both of average causal necessity or sufficiency fail, a biomarker can still have surrogate value if |CEPrisk(s1, s0)| increases with |s1 − s0|; Figure 2 (dashed line) will illustrate this. Moreover, two principal surrogates can have different surrogate values as reflected by different CEP surfaces.
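
As a minimal sketch of how a CEP value is assembled from the two principal-stratum risks, the snippet below evaluates h(risk(1)(s1, s0), risk(0)(s1, s0)) for the two example contrasts, the risk difference and the log relative risk. The function names and the risk values are hypothetical and serve only to illustrate the definition.

```python
import math

def h_diff(x, y):
    """Risk-difference contrast h(x, y) = x - y."""
    return x - y

def h_log_ratio(x, y):
    """Log relative-risk contrast h(x, y) = log(x / y)."""
    return math.log(x / y)

def cep(risk1, risk0, h):
    """CEP at one principal stratum: h(risk under Z=1, risk under Z=0)."""
    return h(risk1, risk0)

# Hypothetical risks keyed by stratum (s1, s0); values are assumptions.
strata = {
    (0.0, 0.0): (0.10, 0.10),  # no biomarker effect, equal risks
    (1.0, 0.0): (0.02, 0.10),  # large biomarker effect, lower risk on treatment
}
surface = {k: cep(r1, r0, h_log_ratio) for k, (r1, r0) in strata.items()}
```

In this toy surface the stratum with s1 = s0 has CEP equal to zero (consistent with average causal necessity) and the stratum with a large biomarker effect has a negative CEP, the pattern expected of a biomarker with surrogate value in the one-sided setting.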

Figure 2
For case CB with Si(0) = c for all i with c = L the lower bound of S, biomarkers S that have no (horizontal solid line), modest (dashed line), moderate (dotted line), and high (hatched line) surrogate value. Here CEPrisk(s1, c) = h(risk(1)(s1, c), risk ...

If S is continuous, then the CEP surface can alternatively be defined in terms of percentiles of the marker S. To formulate this, consider Huang, Pepe, and Feng’s (2007) proposal to judge the value of a continuous marker S for predicting disease Y by the predictiveness curve, R(υ) ≡ Pr(Y = 1 | S = F⁻¹(υ)), υ ∈ [0, 1], where F is the cumulative distribution function (cdf) of S. Note that R(υ) = risk(S = F⁻¹(υ)), i.e., R(υ) is risk as a function of the quantiles of S, which provides a common scale for comparing multiple markers. The predictiveness curve R(υ) usefully informs about both absolute risks at different marker quantiles and the frequency of these risks in the population. A predictive marker is one with R(υ) monotone (or approximately so) in υ with large |R(1) − R(0)|.

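Applying these ideas, we propose a scale-independent version of the causal effect predictiveness surface, CEPR(υ1, υ0) ≡ h(R(1)(υ1, υ0), R(0)(υ1, υ0)), where

$$R_{(Z)}(\upsilon_1, \upsilon_0) \equiv \Pr(Y(Z) = 1 \mid S(1) = F_{(1)}^{-1}(\upsilon_1), S(0) = F_{(1)}^{-1}(\upsilon_0)), \quad Z = 0, 1.$$
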
In this definition, S(1) and S(0) are standardized relative to the distribution F(1) of S(1). Figure 1 illustrates two CEP surfaces for the one-sided setting where interest is in predicting clinical benefit of treatment 1 from higher treatment 1 biomarker responses.

Figure 1
Example CEPR(υ1, υ0) = h(R(1)(υ1, υ0), R(0)(υ1, υ0)) surfaces, with h(x, y) = 1 − x/y. The surface in (i) reflects a biomarker with no surrogate value, wherein the clinical treatment effect is ...

For some studies, the marginal CEP curve is a related causal estimand of interest:

$$\mathrm{mCEP}^{\mathrm{risk}}(s_1) \equiv h(\mathrm{risk}_{(1)}(s_1), \mathrm{risk}_{(0)}(s_1)),$$

where risk(Z)(s1) ≡ Pr(Y(Z) = 1 | S(1) = s1). Similarly mCEPR(υ1) is defined as h(R(1)(υ1), R(0)(υ1)) with R(Z)(υ1) ≡ Pr(Y(Z) = 1 | S(1) = F(1)⁻¹(υ1)). With h(x, y) = x − y, if S is continuous and strictly increasing then the area between mCEPR(·) and the zero-line equals CE = Pr(Y(1) = 1) − Pr(Y(0) = 1) (proof in Web Appendix A).
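
One way to see why the area result is plausible is sketched below. This is an informal outline under the stated conditions (h(x, y) = x − y, and a continuous, strictly increasing cdf of S(1), written here as F_(1)), not the proof given in Web Appendix A. The area is read as a signed area, and the last step uses the law of total probability.

```latex
\begin{aligned}
\int_0^1 \mathrm{mCEP}^{R}(\upsilon_1)\, d\upsilon_1
  &= \int_0^1 \{ R_{(1)}(\upsilon_1) - R_{(0)}(\upsilon_1) \}\, d\upsilon_1 \\
  &= \int \{ \mathrm{risk}_{(1)}(s_1) - \mathrm{risk}_{(0)}(s_1) \}\, dF_{(1)}(s_1)
     \qquad \text{(substituting } s_1 = F_{(1)}^{-1}(\upsilon_1)\text{)} \\
  &= \Pr(Y(1) = 1) - \Pr(Y(0) = 1) = \mathrm{CE}.
\end{aligned}
```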

If Si(0) is constant across subjects, then the CEP surface (trivially) equals the marginal CEP curve. We refer to this special case as case CB:

Case CB

Constant Biomarkers: Si(0) = c for all i for some constant c

HIV vaccine trials fit case CB, with (almost) all subjects having no immune response under placebo (Z = 0). This occurs because S is an HIV-specific immune response, so that vaccine antigens must be presented to the immune system to induce a response (Gilbert et al., 2005). The dissociative effect can be measured by CEPrisk(c, c) and the associative effects by CEPrisk(s1, c) for s1 ≠ c. For example, with c = L the lower bound of S, the nearer CEPrisk(c, c) is to zero and the greater the increase of |CEPrisk(s1, c)| with s1 > c, the greater the surrogate value (Figure 2).

For placebo-controlled trials for which case CB fails yet Si(0) has much less variability than Si(1), the marginal CEP curve has interpretation approximately equal to that of CEP(s1, s0). In general, however, mCEP(s1) does not measure the association between causal biomarker effects and causal clinical effects, and hence does not measure principal surrogate value. Nevertheless, under A1 and A2 mCEP(s1) has a different but useful interpretation as the population level causal treatment effect on Y for subjects with S(1) = s1, where conditioning on S(1) is equivalent to conditioning on a baseline covariate. As such, the marginal CEP curve can be used for predicting how clinical efficacy varies with the biomarker S = S(1) observed in persons attending a treatment or vaccine clinic.

3.2 Estimands for Summarizing Surrogate Value

We suggest functionals of the CEP surface that summarize the surrogate value of a biomarker. We again consider the one-sided setting where interest is in assessing whether S(1) > S(0) predicts clinical benefit of treatment 1 (Y(1) = 0 and Y(0) = 1). To summarize the associative and dissociative effects, we consider the expected associative effect (EAE) and the expected dissociative effect (EDE):

$$\mathrm{EAE}_{\omega} \equiv \frac{E[\omega(S(1), S(0))\, \mathrm{CEP}^{\mathrm{risk}}(S(1), S(0)) \mid S(1) > S(0)]}{E[\omega(S(1), S(0)) \mid S(1) > S(0)]},$$

$$\mathrm{EDE} \equiv E[\mathrm{CEP}^{\mathrm{risk}}(S(1), S(0)) \mid S(1) = S(0)],$$

where ω(·, ·) is a nonnegative weight function. For case CB with c = L, EAEω = {∫_{s1>c} ω(s1, c) dF(1)(s1)}⁻¹ ∫_{s1>c} ω(s1, c) CEPrisk(s1, c) dF(1)(s1) and EDE = CEPrisk(c, c).

We also define the proportion associative effect (PAE) by

$$\mathrm{PAE}_{\omega} \equiv \frac{|\mathrm{EAE}_{\omega}|}{|\mathrm{EDE}| + |\mathrm{EAE}_{\omega}|}.$$

Values of PAEω ∈ [0, 0.5] suggest the biomarker has no surrogate value, whereas values in (0.5, 1] suggest some surrogate value.

A weight function is included in EAEω to reflect the idea that a biomarker with high surrogate value should have large |CEPrisk(s1, s0)| for large s1 − s0 > 0. For example, weights ω(s1, s0) = s1 − s0 or I(s1 = U, s0 = L) may be used, where L (U) is the lower (upper) bound of S. With the latter weight, PAEω compares the clinical effect among groups with the maximum surrogate effect and with no surrogate effect:

$$\mathrm{PAE}_{\omega} = \frac{|\mathrm{CEP}^{\mathrm{risk}}(U, L)|}{|\mathrm{EDE}| + |\mathrm{CEP}^{\mathrm{risk}}(U, L)|}.$$

If h(x, y) = x − y, Pr(S(1) > S(0)) = 0.5, and an additional monotonicity assumption is made (that Yi(1) ≤ Yi(0) for all i, i.e., no one is harmed by treatment 1), then PAEω=1 equals the PA, defined by

$$\mathrm{PA} \equiv \Pr(S(1) > S(0), Y(1) = 0, Y(0) = 1) / \Pr(Y(1) = 0, Y(0) = 1)$$

(proof in Web Appendix A). This summary measure, proposed by Taylor, Wang, and Thiebaut (2005), is interpreted as the proportion of the study population with a beneficial causal clinical effect that also has a positive causal surrogate effect. The PA depends on the underlying principal strata distribution F(1),(0)(s1, s0) ≡ Pr(S(1) ≤ s1, S(0) ≤ s0); if Pr(S(1) > S(0)) is small (large) then the PA will tend to be small (large), irrespective of the biomarker’s surrogate value. By conditioning on (S(1), S(0)), the PAEω is designed to be robust to F(1),(0)(·, ·); the PAEω reflects the relative magnitude of clinical effects for those with and without surrogate effects.

Biomarkers satisfying average causal necessity have EDE = 0 and thus PAEω = 1, in which case EAEω contributes no information to the PAEω. Therefore, additional measures are needed for summarizing the magnitude of associative effects. One such measure is the associative span (AS), defined by AS ≡ |CEPrisk(U, L)| − |EDE|. Figure 2 illustrates PAEω=1 and AS. Although the summary parameters may be useful, it is important to estimate the CEP estimands over the range of marker values or quantiles to provide a full picture of the associative and dissociative effects.
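
The summary estimands can be computed directly once a CEP curve is available. The sketch below does this for case CB with a categorical biomarker whose lowest category plays the role of c. It assumes PAEω = |EAEω| / (|EDE| + |EAEω|), which is consistent with the thresholds described above (PAEω = 1 when EDE = 0, and 0.5 when the two effects are equal). The function name, the CEP values, and the category probabilities are illustrative assumptions.

```python
def summarize(cep, nu, weight):
    """Summaries of a case-CB CEP curve over categories j = 1..J.

    cep[j-1] : CEP value for S(1) in category j (category 1 is the constant c)
    nu[j-1]  : Pr(S(1) = category j)
    weight   : function j -> omega(j, 1), nonnegative
    """
    ede = cep[0]  # dissociative effect: no biomarker effect
    w = [weight(j) * nu[j - 1] for j in range(2, len(cep) + 1)]
    eae = sum(wj * cj for wj, cj in zip(w, cep[1:])) / sum(w)
    pae = abs(eae) / (abs(ede) + abs(eae))
    span = abs(cep[-1]) - abs(ede)  # associative span: top category vs. EDE
    return ede, eae, pae, span

# Illustrative log relative risks for four categories (assumed values).
cep_curve = [-0.22, -0.51, -0.92, -1.61]
nu = [0.25, 0.25, 0.25, 0.25]
J = len(cep_curve)

# Indicator weight: compare the top category with the no-effect category.
ede, eae, pae, span = summarize(cep_curve, nu, lambda j: 1.0 if j == J else 0.0)
```

With the indicator weight, EAE reduces to the CEP value in the top category, so PAE and AS both contrast the largest associative effect with the dissociative effect.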

Below we also consider estimands defined as above except they condition on X and/or W; for example risk(Z)(s1, s0, x, w) ≡ Pr(Y(Z) = 1 | S(1) = s1, S(0) = s0, X = x, W = w) and CEPrisk(s1, s0, x, w) ≡ h(risk(1)(s1, s0, x, w), risk(0)(s1, s0, x, w)). The conditional estimands reflect baseline covariate-specific surrogate value.

4. Estimating the CEP Surface and Marginal CEP Curve

We consider one approach to identifying and estimating the CEP surface in the practically important special case CB. The same approach identifies and estimates the marginal CEP curve in the general case that Si(0) has arbitrary variability. In case CB it is difficult to evaluate a statistical surrogate, because it is not possible to study the correlation of S with Y in arm Z = 0 subjects, and it is conceptually difficult to evaluate whether S fully mediates clinical treatment effects (Chan et al., 2002).

4.1 Identifiability of the Causal Estimands

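Due to missing potential outcomes the CEP surface and marginal CEP curve are not identified without further assumptions. A1–A3 imply

$$\mathrm{risk}_{(z)}(s_1, s_0, x, w) = \Pr(Y = 1 \mid Z = z, V = 1, S(1) = s_1, S(0) = s_0, X = x, W = w), \quad z = 0, 1,$$
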
demonstrating that risk(Z)(s1, s0, x, w) would be identified if we knew the potential outcomes Si(Z) of subjects assigned the opposite treatment 1 − Z. Estimating the CEP surface will therefore require a way to predict the missing potential biomarkers. This challenge is relatively easy in case CB, for which risk(1)(s1, c, x, w) is identified by the observed data in arm Z = 1. However, A1–A3 do not identify risk(0)(s1, c, x, w), and the remaining task to identify the CEP surface entails determining values Si(1) for arm Zi = 0 subjects.

4.2 Baseline Predictor Study Design and Likelihood

Our method of inference is based on one of the augmented vaccine trial designs proposed by Follmann (2006), wherein a baseline covariate(s) W that is predictive of S(1) is measured in subjects in both treatment arms. A model predicting S(1) from X and W fit from arm Z = 1 subjects is used to predict S(1) for arm Z = 0 subjects. The predictions are unbiased because A1–A3 imply S(1) | Z = 1, X, W =d S(1) | Z = 0, X, W, where =d denotes equality in distribution.

We observe i.i.d. data Oi ≡ (Zi, Xi, Vi, Yi, δi, δiWi, δiZiSi), i = 1, …, n. Only subjects with Vi = 1 contribute to the likelihood. Subjects with Ziδi = 1 contribute risk(1)(Si, c, Xi, Wi; β)^Yi (1 − risk(1)(Si, c, Xi, Wi; β))^(1−Yi), where risk(1)(·, c, ·, ·; β) is modeled as a function of unknown parameters β. The likelihood contribution for subjects with (1 − Zi)δi = 1 is obtained by integrating risk(0)(·, c, Xi, Wi; β) over the conditional cdf F(1)S|X,W of S(1) | X, W. The contribution for subjects with δi = 0 is obtained by integrating risk(Zi)(·, c, Xi, ·; β) over the conditional cdf of S(1), W | X, which is F(1)S|X,W × FW|X, where FW|X is the conditional cdf of W | X. Thus, with ν ≡ (F(1)S|X,W, FW|X), the conditional likelihood is L(β, ν) ≡ ∏_{i=1}^{n} f(Yi | Zi, Xi, Vi, δi, δiWi, δiZiSi)^Vi, where

$$
\begin{aligned}
f(Y \mid Z, X, V, \delta, \delta W, \delta Z S) ={}& \left\{ \mathrm{risk}_{(1)}(S, c, X, W; \beta)^{Y} \left(1 - \mathrm{risk}_{(1)}(S, c, X, W; \beta)\right)^{1-Y} \right\}^{Z\delta} \\
&\times \left\{ \left( \int \mathrm{risk}_{(0)}(s_1, c, X, W; \beta)\, dF_{(1)}^{S|X,W}(s_1 \mid X, W) \right)^{Y} \left( 1 - \int \mathrm{risk}_{(0)}(s_1, c, X, W; \beta)\, dF_{(1)}^{S|X,W}(s_1 \mid X, W) \right)^{1-Y} \right\}^{(1-Z)\delta} \\
&\times \left\{ \left( \iint \mathrm{risk}_{(Z)}(s_1, c, X, w; \beta)\, dF_{(1)}^{S|X,W}(s_1 \mid w, X)\, dF^{W|X}(w \mid X) \right)^{Y} \left( 1 - \iint \mathrm{risk}_{(Z)}(s_1, c, X, w; \beta)\, dF_{(1)}^{S|X,W}(s_1 \mid w, X)\, dF^{W|X}(w \mid X) \right)^{1-Y} \right\}^{1-\delta}.
\end{aligned}
$$

Because CEPrisk(·, c, X, W; β) depends on β but not ν, the ν are nuisance parameters. Although profile likelihood is a natural approach to pursue, it is difficult to implement because the likelihood integrates over F(1)S|X,W and FW|X. We use estimated likelihood (Pepe and Fleming, 1991), also called pseudolikelihood, wherein consistent estimates ν̂ of ν are obtained based on treatment arm 1 data, and then L(β, ν̂) is maximized in β. The bootstrap is used to estimate standard errors for β̂. A resampling approach is used because there does not appear to be an analytical expression for the asymptotic variance of β̂ that accounts for the variations in ν̂, and previously developed techniques for deriving the variance do not apply because they would assume that all subjects have a non-zero probability that S(1) is observed (e.g., Pepe and Fleming, 1991).

4.3 Models for risk(z)(·, c, ·, ·) and ν = (F(1)S|X,W, FW|X)

The estimated likelihood approach can be used for a variety of structural models for risk(z)(s1, c, x, w) and the nuisance parameters ν. Here we consider two types of models for case CB. The first is fully parametric, where we assume F(1)S|X,W and FW|X have particular parametric distributions, and S(1) is continuous subject to “limit of detection” left-censoring: S(1) ≡ max(S*(1), c), where S*(1) has a continuous cdf with Pr(S*(1) ≤ c) > 0. We also assume the risk functions have a generalized linear model form

$$\text{A4-P:}\quad \mathrm{risk}_{(z)}(s_1, c, x, w; \beta) = g(\beta_{z0} + \beta_{z1} s_1 + \beta_{z2}^{T} x + \beta_{z3}^{T} w), \quad z = 0, 1,$$

for s1 ≥ c and some known link function g(·). For example, we might assume FW|X is normal and F(1)S|X,W is censored normal, with left-censoring of values below c, and A4-P holds with g equal to the standard normal cdf Φ. This set-up extends Follmann (2006) to account for censoring. With h(x, y) = g⁻¹(x) − g⁻¹(y), A4-P then implies

$$\mathrm{CEP}^{\mathrm{risk}}(s_1, c, x, w; \beta) = (\beta_{10} - \beta_{00}) + (\beta_{11} - \beta_{01}) s_1 + (\beta_{12} - \beta_{02})^{T} x + (\beta_{13} - \beta_{03})^{T} w.$$

Simple calculations yield EDE(x, w) = (β10 − β00) + (β11 − β01)L + (β12 − β02)ᵀx + (β13 − β03)ᵀw, AS(x, w) = |(β10 − β00) + (β11 − β01)U + (β12 − β02)ᵀx + (β13 − β03)ᵀw| − |EDE(x, w)|, and EAEω=1(x, w) = (β10 − β00) + (β11 − β01)E[S(1) | S(1) > c, x, w] + (β12 − β02)ᵀx + (β13 − β03)ᵀw. For the case that g = Φ, Web Appendix C provides a proof, adapted from a proof by Dean Follmann, that β is identified under the untestable imposed constraint that one of the components of (β12ᵀ, β13ᵀ) equals the corresponding component of (β02ᵀ, β03ᵀ). Therefore identifiability requires assuming the absence of one interaction, but otherwise if and how the CEP curve varies with X and W can be evaluated. If no interactions between treatment and X or W are assumed, then CEPrisk(s1, c; β) = (β10 − β00) + (β11 − β01)s1 is interpreted as the covariate-adjusted CEP curve.

Secondly, we consider a nonparametric approach wherein S and W are treated as categorical variables with J and K levels, which may be discretized versions of continuous measurements. Here we assume the lowest category j = 1 corresponds to the constant c in case CB. With W the only baseline covariate, nonparametric models are specified by νjk ≡ Pr(S(1) = j, W = k) and

$$\text{A4-NP:}\quad \mathrm{risk}_{(z)}(j, 1, k; \beta) = \beta_{zj} + \beta_{k}$$

for j = 1, …, J; k = 1, …, K; and z = 0, 1. The parameters β ≡ (βzj, βk : z = 0, 1; j = 1, …, J; k = 1, …, K) are constrained such that 0 ≤ βzj + βk ≤ 1 for all z, j, k and ∑_{k=1}^{K} βk = 0 for identifiability. Under model A4-NP, W has the same effect on risk for the two study arms. This no-interaction assumption identifies the model, and the expanded model with βk replaced with βzk is not identified (see Web Appendix C).

In the simulations we consider the CEP curve estimand CEPrisk(j, 1; β) = log(risk(1)(j, 1; β1j)/risk(0)(j, 1; β0j)) based on average risks risk(z)(j, 1; βzj) ≡ K⁻¹ ∑_{k=1}^{K} risk(z)(j, 1, k; β) = βzj, z = 0, 1. It follows that CEPrisk(j, 1; β) = log(β1j/β0j), AS = |log(β1J/β0J)| − |log(β11/β01)|, EDE = log(β11/β01), and EAEω = ∑_{j=2}^{J} ω̃(j, 1) log(β1j/β0j) with ω̃(j, 1) = ω(j, 1)νj / ∑_{l=2}^{J} ω(l, 1)νl and νj = ∑_{k=1}^{K} νjk.
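
The three kinds of likelihood contribution can be sketched for the nonparametric model, where each risk is the additive form βzj + βk and the integrals become sums over categories. The code below is a minimal illustration, not the authors' implementation: it treats the estimate of ν as given, uses 0-based category indices, ignores X, and does not perform the constrained maximization over β. All function names and numeric values are hypothetical.

```python
import math

def risk(beta_zj, beta_k, z, j, k):
    """Additive risk beta_zj[z][j] + beta_k[k] (0-based categories)."""
    return beta_zj[z][j] + beta_k[k]

def subject_risk(rec, beta_zj, beta_k, nu):
    """Pr(Y = 1 | observed data) for one subject with V = 1.

    rec = (z, y, delta, w, s); w and s are None when not measured.
    nu[j][k] is the (estimated) Pr(S(1) = j, W = k).
    """
    z, _, delta, w, s = rec
    J, K = len(nu), len(nu[0])
    if delta == 1 and z == 1:
        # W and S(1) both observed: plug in directly.
        return risk(beta_zj, beta_k, 1, s, w)
    if delta == 1 and z == 0:
        # W observed, S(1) missing: average over Pr(S(1) = j | W = w).
        nu_w = sum(nu[j][w] for j in range(J))
        return sum(risk(beta_zj, beta_k, 0, j, w) * nu[j][w] / nu_w
                   for j in range(J))
    # delta == 0: average over the joint distribution of (S(1), W).
    return sum(risk(beta_zj, beta_k, z, j, k) * nu[j][k]
               for j in range(J) for k in range(K))

def log_lik(data, beta_zj, beta_k, nu):
    """Estimated log-likelihood in beta with nu held fixed."""
    total = 0.0
    for rec in data:
        p = subject_risk(rec, beta_zj, beta_k, nu)
        total += math.log(p) if rec[1] == 1 else math.log(1.0 - p)
    return total

# Hypothetical two-category example (J = K = 2).
nu = [[0.25, 0.25], [0.25, 0.25]]
beta_zj = [[0.20, 0.20], [0.20, 0.05]]  # beta_zj[z][j]
beta_k = [0.01, -0.01]                  # sums to zero
data = [(1, 1, 1, 0, 1), (0, 0, 1, 1, None), (1, 0, 0, None, None)]
value = log_lik(data, beta_zj, beta_k, nu)
```

In practice the maximization over β would need to respect the constraints 0 ≤ βzj + βk ≤ 1 and ∑ βk = 0, and standard errors would come from the bootstrap, as described in the text.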

For both the parametric and nonparametric approaches, Web Appendix B describes consistent estimators of ν and procedures for maximizing L(β, ν̂) in β.

4.4 Tests for Whether a Biomarker Has Any Surrogate Value

Because PAEω = 0.5 supports that S has no surrogate value, Wald tests for any surrogate value can be based on the maximum estimated likelihood estimator (MELE) of PAEω minus 0.5, divided by its bootstrap standard error. Similarly, Wald tests of AS = 0 can be implemented based on the MELE of AS. For the nonparametric approach assuming model A4-NP, we also consider a test statistic T = ∑_{j=2}^{J} (j − 1){β̂0j/(β̂0j + β̂1j) − μ̂0/(μ̂0 + μ̂1)} divided by its bootstrap standard error, where μ̂z = (1/J) ∑_{j=1}^{J} β̂zj. This test evaluates H0 : CEPrisk(j, 1) = CE for all j versus the monotone alternative that CEPrisk(j, 1) increases in j, similar to the Breslow–Day trend test (Breslow and Day, 1980). The null and alternative hypotheses indicate that average causal sufficiency does not and does hold, respectively.

5. Simulation Study

Based on data from the first preventive HIV vaccine efficacy trial (Gilbert et al., 2005), we conducted a simulation study to evaluate performance of the MELE methods. The vaccine trial was double blind with 2:1 randomization to vaccine:placebo. A biomarker of interest S was the 50% neutralization titers against the HIV recombinant gp120 molecule measured from a serum sample drawn at the month 1.5 visit, and Y was HIV infection during the time period t0 = 1.5 months to 36 months. The lower quantification limit of the neutralization assay was 1.65, and 44 of 47 placebo recipients with S measured at 1.5 months had left-censored values; thus the data essentially fit case CB. The range of Si was [1.65, 4.09], which we rescaled to [0, 1], so that c = L = 0.

We simulated vaccine trials with the following steps. Step 1: For all 3598 (1805) subjects in the vaccine (placebo) arm, (Wi, Si(1)) was generated from a bivariate normal distribution with means 0.41, standard deviation 0.55, and correlation ρ = 0.5, 0.7, or 0.9; the standard deviation was chosen such that 23% of Si(1) values were less than 0 on average. Simulated values of Si(1) and W less than 0 (greater than 1) were set equal to 0 (1). Step 2: The Wi and Si(1) were binned into quartiles. For subjects i with quartile j value of Si(1) and quartile k value of Wi, Yi(Z) was generated from a Bernoulli distribution with success probability determined by model A4-NP with values of βzj and βk set as follows. First, the βzj were set to achieve the infection rate Pr(Y(1) = 1) = 0.067 that was observed in the vaccine arm of the trial, an overall vaccine efficacy of 50% (i.e., Pr(Y(0) = 1) = 2 × Pr(Y(1) = 1)) and to reflect a biomarker with either (i) no or (ii) high surrogate value. Based on risk averaged over W described in Section 4.3, in scenario (i) CEPrisk(j, 1; β) ≡ log(risk(1)(j, 1; β1j)/risk(0)(j, 1; β0j)) = −0.69 for j = 1, …, 4 and in scenario (ii) CEPrisk(j, 1; β) = −0.22, −0.51, −0.92, −1.61 for j = 1, …, 4. With vaccine efficacy VE(j, 1) ≡ 1 − exp(CEPrisk(j, 1; β)), scenario (i) specifies constant VE(j, 1) = 0.5 and scenario (ii) specifies VE(j, 1) = 0.2, 0.4, 0.6, 0.8 for j = 1, …, 4. For both scenarios (i) and (ii), we let βk = 0.015 − 0.01(k − 1) for k = 1, …, 4. Step 3: To achieve case–cohort sampling, (Wi, Si(1)) was retained for all infected vaccine recipients and a subcohort of uninfected vaccine recipients. For the placebo arm Si(1) was set to missing for everyone and Wi was retained only for all infected placebo recipients and for a subcohort of uninfected placebo recipients. For each arm, the ratio of controls to cases was 3:1. The simulated data sets satisfied A1–A3 and A4-NP.
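
Steps 1 to 3 can be sketched in code as follows. This is a simplified illustration and not the authors' simulation program: quartiles are taken from the pooled simulated sample, and the βzj values shown are one assumed choice compatible with scenario (i), namely β1j = 0.067 and β0j = 0.134 for every j, which gives a log relative risk of about −0.69. The function names and the seed are hypothetical.

```python
import random

def clip01(v):
    """Truncate to [0, 1], as in Step 1."""
    return min(1.0, max(0.0, v))

def quartile_bins(values):
    """Quartile index 0..3 for each value, using empirical cut points."""
    ranked = sorted(values)
    n = len(ranked)
    cuts = [ranked[n // 4], ranked[n // 2], ranked[(3 * n) // 4]]
    return [sum(v >= c for c in cuts) for v in values]

def simulate_trial(rho, beta_zj, beta_k, n1=3598, n0=1805, seed=1):
    rng = random.Random(seed)
    z = [1] * n1 + [0] * n0
    # Step 1: bivariate normal (W, S(1)) with correlation rho, then truncate.
    w, s1 = [], []
    for _ in z:
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        w.append(clip01(0.41 + 0.55 * e1))
        s1.append(clip01(0.41 + 0.55 * (rho * e1 + (1 - rho ** 2) ** 0.5 * e2)))
    # Step 2: bin into quartiles and draw Y from the additive risk model.
    wq, sq = quartile_bins(w), quartile_bins(s1)
    y = [int(rng.random() < beta_zj[zi][j] + beta_k[k])
         for zi, j, k in zip(z, sq, wq)]
    # Step 3: case-cohort sampling with a 3:1 control-to-case ratio per arm.
    records = []
    for arm in (0, 1):
        idx = [i for i in range(len(z)) if z[i] == arm]
        cases = [i for i in idx if y[i] == 1]
        controls = [i for i in idx if y[i] == 0]
        sub = set(rng.sample(controls, min(len(controls), 3 * len(cases))))
        for i in idx:
            delta = int(y[i] == 1 or i in sub)
            records.append((arm, y[i], delta,
                            wq[i] if delta else None,               # W kept if sampled
                            sq[i] if delta and arm == 1 else None))  # S(1) only in vaccine arm
    return records

# Assumed scenario (i) parameters: constant risks, so no surrogate value.
beta_zj = [[0.134] * 4, [0.067] * 4]            # beta_zj[z][j]
beta_k = [0.015 - 0.01 * k for k in range(4)]   # k = 0..3 here
records = simulate_trial(0.7, beta_zj, beta_k)
```

Each record is (Z, Y, δ, W quartile, S(1) quartile), with None marking values that the case–cohort design leaves unobserved.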

For each of 1000 simulated data sets the MELE β̂ was computed using the nonparametric approach described in Section 4.3. Then, with h(x, y) = log(x/y), β̂ was used to compute the MELEs of CEPrisk(j, 1), AS, and PAEω for ω(j, 1) = 1, j, and I(j = J = 4). Wald tests (with bootstrap standard errors) based on the MELE of PAEω minus 0.5, the MELE of AS, and T were used to test for any surrogate value. The MELEs of CEPrisk(j, 1), PAEω and AS performed well (Table 1 and Table 2). Bootstrap percentile confidence intervals (CIs) had approximately nominal coverage and for higher values of ρ the MELEs exhibited negligible bias. The tests for any surrogate value had approximately nominal size and showed adequate power to detect surrogate value; the nonparametric trend test had power 0.83, 0.99, and >0.99 for ρ = 0.5, 0.7, and 0.9 under scenario (ii).

Table 1
Model A4-NP simulation results for the nonparametric MELEs CEP^risk(j, 1; β) = log(β^1j/β^0j) for j = 1, …, 4

Table 2
Model A4-NP simulation results for the nonparametric MELEs PAE^ω and AS^, with h(x, y) = log(x/y)

Additional simulations were conducted to evaluate the performance of the MELE method with binned covariates when the data were generated from a continuous model. Specifically, Step 2 described above was replaced with Step 2′: For vaccine arm subjects, Yi(1) was generated using model A4-P with g = Φ and (β10, β11, β12, β13) = (−1.21, −0.67, 0, −0.1) set to fit the real vaccine arm data with infection rate 0.067. For the placebo arm, we supposed overall vaccine efficacy of 50% and generated Yi(0) assuming model A4-P with β02 = 0, β03 = β13 and either (i) β01 = β11 or (ii) β01 = 0. For h(x, y) = g⁻¹(x) − g⁻¹(y), under (i) CEPrisk(s1, c; β) = β10 − β00 = −1.21 − (−1.1) = −0.11 is constant, so that S has no surrogate value (with AS = 0 and PAEω = 0.5 for any weight function ω); under (ii) CEPrisk(s1, c; β) = −0.11 − 0.67 s1, so that S has high surrogate value (with AS = 0.67 and PAEω = 0.88 for ω(j; 1) = I(j = J = 4)). Using h(x, y) = g⁻¹(x) − g⁻¹(y), the MELEs and CIs of PAEω and AS performed well (Table 3). Relative to Table 2, there was a slight increase in bias and in standard errors. Tests for any surrogate value had approximately nominal size, with power only slightly lower than in the previous set of simulations. These simulation studies provide a “proof of principle” that the proposed methods can reliably estimate the CEP surface and distinguish biomarkers with no or high surrogate value.
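The two probit scenarios can be checked numerically with a short Python sketch. It assumes that model A4-P has linear predictor βz0 + βz1 s1 + βz2 c + βz3 w inside g = Φ; this form is inferred from the coefficient values and the stated CEP expressions rather than quoted from the model definition. The function names and the evaluation points are hypothetical, and c is fixed at 0 because β12 = β02 = 0 removes that term.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # provides Phi (cdf) and its inverse (inv_cdf)


def risk(beta, s1, w, c=0.0):
    """Assumed A4-P form: Phi(b0 + b1*s1 + b2*c + b3*w)."""
    b0, b1, b2, b3 = beta
    return STD_NORMAL.cdf(b0 + b1 * s1 + b2 * c + b3 * w)


def cep_probit(beta_vaccine, beta_placebo, s1, w):
    """CEP with contrast h(x, y) = Phi^-1(x) - Phi^-1(y)."""
    return (STD_NORMAL.inv_cdf(risk(beta_vaccine, s1, w))
            - STD_NORMAL.inv_cdf(risk(beta_placebo, s1, w)))


beta_1 = (-1.21, -0.67, 0.0, -0.1)      # vaccine arm, from the text
beta_0_i = (-1.10, -0.67, 0.0, -0.1)    # scenario (i): beta01 = beta11
beta_0_ii = (-1.10, 0.0, 0.0, -0.1)     # scenario (ii): beta01 = 0

s1_grid = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical evaluation points
w_value = 0.4                           # arbitrary; W cancels since beta03 = beta13
cep_i = [cep_probit(beta_1, beta_0_i, s, w_value) for s in s1_grid]
cep_ii = [cep_probit(beta_1, beta_0_ii, s, w_value) for s in s1_grid]
```

Under these assumptions cep_i is flat at −0.11, while cep_ii falls linearly from −0.11 at s1 = 0 to −0.78 at s1 = 1, a change of 0.67 that is consistent with the reported AS = 0.67.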

Table 3
Model A4-P (probit) simulation results for the nonparametric MELEs PAE^ω and AS^, with h(x, y) = Φ⁻¹(x) − Φ⁻¹(y)

6. Discussion

A main use of a surrogate endpoint is predicting treatment effects on a clinical endpoint. Within the principal surrogate framework, we have introduced the CEP surface and the marginal CEP curve as appropriate estimands for measuring the predictive capacity of a candidate surrogate. We developed estimation and testing methods under case–cohort sampling from a single large clinical trial (or multiple similar trials); such inferences apply for measuring surrogate predictiveness for the same or similar setting as the trial. The inferences do not form an empirical basis for bridging information about clinical efficacy to a new setting not represented in the trial(s) (e.g., to a new human population or treatment formulation); for this purpose, additional experiments (such as mechanistic studies and studies that deliberately manipulate the biomarker) and meta-analysis of heterogeneous studies are needed.

Because the definition of the CEP surface involves unobservable potential outcomes, strong untestable assumptions may be needed to identify it, possibly precluding its reliable estimation. The estimation method we developed requires A1–A3, a reasonably good model predicting S from baseline covariates X and W in treatment arm 1, and models for risk(z)(s1, c, x, w) or its marginal counterpart risk(z)(s1, x, w), for z = 0, 1. A1–A2 are standard in blinded randomized trials. A1 (SUTVA) is potentially dubious in the infectious disease setting where dependent happenings are possible (Halloran and Struchiner, 1995), but should approximately hold in trials with a small study population relative to the total population of at-risk individuals. A3 can be assessed by testing H0x: Pr(V = 1 | Z = 1, X = x) = Pr(V = 1 | Z = 0, X = x) at each of multiple fixed baseline covariate levels x, where rejecting H0x for any x rejects A3. It is difficult to fully verify A3, however, due to the curse of dimensionality. The method is expected to be robust to violations of A3 if the vast majority of clinical events happen after the biomarker measurement time. Otherwise it will be important to extend the methods to facilitate sensitivity analyses to departures from A3.
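One way to carry out the stratum-by-stratum check of A3 is sketched below in Python. The text does not specify a test, so the pooled two-proportion z-test and the Bonferroni adjustment across covariate levels are assumptions, as are the function names and the counts, which are hypothetical.

```python
import math


def two_proportion_p_value(v1, n1, v0, n0):
    """Two-sided pooled z-test of Pr(V=1 | Z=1, X=x) = Pr(V=1 | Z=0, X=x)."""
    p1, p0 = v1 / n1, v0 / n0
    pooled = (v1 + v0) / (n1 + n0)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n0))
    if se == 0.0:
        return 1.0  # no variation in V, so no evidence against H0x
    z = (p1 - p0) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def assess_a3(strata, alpha=0.05):
    """Reject A3 if H0x is rejected at any level x (Bonferroni-adjusted)."""
    threshold = alpha / len(strata)
    p_values = {x: two_proportion_p_value(*counts)
                for x, counts in strata.items()}
    rejected = any(p < threshold for p in p_values.values())
    return rejected, p_values


# Hypothetical counts per covariate level x:
# (V=1 count in arm 1, arm 1 size, V=1 count in arm 0, arm 0 size).
balanced = {"x=low": (950, 1000, 475, 500), "x=high": (900, 1000, 452, 500)}
unbalanced = {"x=low": (950, 1000, 475, 500), "x=high": (900, 1000, 350, 500)}

a3_rejected_balanced, p_balanced = assess_a3(balanced)
a3_rejected_unbalanced, p_unbalanced = assess_a3(unbalanced)
```

Failing to reject in every stratum does not verify A3, which is the point made in the text about the curse of dimensionality.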

Models for the conditional distribution of S given X and W can be directly checked using arm Z = 1 data, and under A1–A3 models for risk(1)(s1, c, x, w) can be tested. The model for risk(0)(s1, c, x, w) specified by A4-P or A4-NP is not testable. However, with extra data collection, Follmann’s (2006) “close-out placebo vaccination” approach would provide one way to test it. Given the challenge in verifying this assumption, sensitivity analysis and the use of multiple surrogate evaluation approaches are warranted.

Within the principal surrogate framework considered here, internal validity of the putative surrogate can be checked by comparing the estimated overall clinical treatment effect, CE^ = h(Pr^(Y(1) = 1), Pr^(Y(0) = 1)), to the CE predicted from the biomarker. Under A1–A3 and case CB, CE can be predicted by Pred(CE) = ∫ CEP^risk(s1, c) dF^(1)(s1), which averages the predicted clinical treatment effect over the distribution of observed marker values of subjects assigned to arm Z = 1. Furthermore, the estimated CEP curve can be used to check projective validity, that is, the utility of S for bridging efficacy predictions across populations. For example, suppose treatments Z = 1 and Z = 0 are compared within two subgroups of a large trial. The CEP surface can be estimated from subgroup 1 data, and Pred(CE) calculated by estimating F(1)(·) from the observed biomarker values S of subgroup 2 subjects in arm Z = 1. Then projective validity would be supported by Pred(CE) near CE^ for subgroup 2.
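When F(1) is estimated by the empirical distribution of observed marker values, the integral defining Pred(CE) reduces to a (possibly weighted) sample average of the estimated CEP curve. The Python sketch below illustrates this under assumptions: the curve is the linear probit-scale example −0.11 − 0.67 s1 from the simulations, the marker values are hypothetical, and the optional weights argument is an invented hook for sampling weights rather than part of the described method.

```python
import numpy as np


def predicted_ce(cep_curve, s1_observed, weights=None):
    """Pred(CE): average of cep_curve over the empirical distribution of S(1)."""
    s1_observed = np.asarray(s1_observed, dtype=float)
    values = np.array([cep_curve(s) for s in s1_observed])
    return float(np.average(values, weights=weights))


def example_cep_curve(s1):
    """Illustrative curve only: CEP(s1) = -0.11 - 0.67 * s1."""
    return -0.11 - 0.67 * s1


# Hypothetical arm Z = 1 marker values, e.g. from subgroup 2.
s1_subgroup2 = [0.0, 0.5, 1.0]
pred_ce = predicted_ce(example_cep_curve, s1_subgroup2)
```

For these three marker values Pred(CE) equals −0.445; the projective validity check would then compare this number with CE^ computed on the same scale for subgroup 2.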

The estimands and estimation techniques developed here for a binary clinical endpoint Y also apply for a quantitative clinical endpoint Y, with all expressions Pr(Y (Z) = 1 | ·) replaced with E(Y (Z) | ·). In either case the CEP estimands describe how the average or population level causal effect on Y depends on the causal effect on S.

Supplementary Material



Web Appendices referenced in Sections 2.1, 3.1, 3.2, and 4.3 are available under the Paper Information link at the Biometrics website. R code for the nonparametric method is also available at the Biometrics website.


Acknowledgments

The authors thank Dean Follmann, Margaret Pepe, Ross Prentice, Steve Self, and the associate editor for helpful comments. This work was supported by NIH grants 2 R01 AI54165-04 and 5 R37 AI029168-16.


References

  • Breslow N, Day N. Statistical Methods in Cancer Research. Volume 1. Lyon, France: International Agency for Research on Cancer; 1980.
  • Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029.
  • Chan I, Shu L, Matthews H, Chan C, Vessey R, Sadoff J, Heyse J. Use of statistical models for evaluating antibody response as a correlate of protection against varicella. Statistics in Medicine. 2002;21:3411–3430.
  • Follmann D. Augmented designs to assess immune response in vaccine trials. Biometrics. 2006;62:1161–1169.
  • Frangakis C, Rubin D. Principal stratification in causal inference. Biometrics. 2002;58:21–29.
  • Freedman L, Graubard B, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine. 1992;11:167–178.
  • Gilbert P, Peterson M, Follmann D, Hudgens MG, Francis DP, Gurwith M, Heyward WL, Jobes DV, Popovic V, Self SG, Sinangil F, Burke D, Berman PW. Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. Journal of Infectious Diseases. 2005;191:666–677.
  • Halloran M, Struchiner C. Causal inferences in infectious diseases. Epidemiology. 1995;6:142–151.
  • Huang Y, Pepe M, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007;63:1181–1188.
  • Pepe M, Fleming T. A non-parametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
  • Prentice R. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
  • Prentice R. Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine. 1989;8:431–440.
  • Robins J. An analytic method for randomized trials with informative censoring: Part I. Lifetime Data Analysis. 1995;1:241–254.
  • Rubin D. Statistics and causal inference: Which ifs have causal answers. Journal of the American Statistical Association. 1986;81:961–962.
  • Taylor J, Wang Y, Thiebaut R. Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics. 2005;61:1102–1111.
  • Weir C, Walley R. Statistical evaluation of biomarkers as surrogate endpoints: A literature review. Statistics in Medicine. 2006;25:183–203.