Article sections

- Summary
- 1. Introduction
- 2. Estimands
- 3. Identifiability of Joint and Marginal Risks
- 4. Estimation
- 5. Sensitivity Analysis
- 6. Simulations
- 7. Discussion
- Supplementary Material
- References

Biometrics. Author manuscript; available in PMC 2013 March 14.

PMCID: PMC3597127

NIHMSID: NIHMS444131

Julian Wolfson, Department of Biostatistics, University of Washington, F-600 Health Sciences Building, Box 357232, Seattle, WA 98195-7232, Email: julianw/at/u.washington.edu;

The publisher's final edited version of this article is available at Biometrics

Given a randomized treatment *Z*, a clinical outcome *Y*, and a biomarker *S* measured some fixed time after *Z* is administered, we may be interested in addressing the surrogate endpoint problem by evaluating whether *S* can be used to reliably predict the effect of *Z* on *Y*. Several recent proposals for the statistical evaluation of surrogate value have been based on the framework of principal stratification. In this paper, we consider two principal stratification estimands: *joint risks* and *marginal risks*. Joint risks measure causal associations of treatment effects on *S* and *Y*, providing insight into the surrogate value of the biomarker, but are not statistically identifiable from vaccine trial data. While marginal risks do not measure causal associations of treatment effects, they nevertheless provide guidance for future research, and we describe a data collection scheme and assumptions under which the marginal risks are statistically identifiable. We show how different sets of assumptions affect the identifiability of these estimands; in particular, we depart from previous work by considering the consequences of relaxing the assumption of no individual treatment effects on *Y* before *S* is measured. Based on algebraic relationships between joint and marginal risks, we propose a sensitivity analysis approach for assessment of surrogate value, and show that in many cases the surrogate value of a biomarker may be hard to establish, even when the sample size is large.

The identification and evaluation of surrogate endpoints is a major goal of many clinical studies. In vaccine trials, for example, pinpointing a biomarker which reliably predicts protection from infection could provide an invaluable biological target for vaccine development and allow researchers to predict vaccine efficacy in new populations without the need to conduct a full-fledged trial. Several statistical approaches to surrogate endpoint assessment are summarized in Weir and Walley (2006). While the common scientific goal of these approaches is to predict the effect of treatment on the outcome in a future setting given information about the effect of treatment on the biomarker, Joffe and Greene (2009) (henceforth JG) distinguish between two paradigms based on the quantities used to make these predictions. In the “causal effects” (CE) paradigm, treatment effects on the outcome are predicted by combining the effect of treatment on the biomarker with knowledge of the effect of the biomarker on the outcome. The “causal association” (CA) paradigm predicts future treatment effects on the outcome based on the (previously observed) association between treatment effects on the biomarker and treatment effects on the outcome. While prototypical CA methods (e.g., Buyse et al. (2000); Gail et al. (2000)) are based on meta-analysis of multiple trials, we focus on the case where the data arise from a single trial. We consider the counterfactual-based principal stratification approach proposed by Frangakis and Rubin (2002) (henceforth FR) and extended by Gilbert and Hudgens (2008) (henceforth GH), which allows the CA paradigm to be applied in the single-trial setting.

A major aim of this paper is to illustrate the inherent statistical difficulty of the surrogate endpoint assessment problem. We begin with a brief introduction to the problem and describe the vaccine trial setup and basic assumptions under which we will operate. Next, we introduce joint and marginal risks, the two estimands central to the paper. In Section 3, we describe the degree to which joint and marginal risks are statistically identifiable under a variety of scenarios. Section 4 outlines an estimated likelihood procedure for estimation of joint and marginal risks. Section 5 presents a sensitivity analysis approach to assessing surrogate value wherein we assume known functions for the unidentified distributions, indexed by fixed sensitivity parameters. Section 6 presents simulation results. We conclude with some final thoughts on the challenges of surrogate endpoint assessment.

In what follows, we consider a vaccine trial where subjects *i* = 1,…, *n* are randomly assigned at time 0 to either vaccine (*Z*_{i} = 1) or placebo (*Z*_{i} = 0).

Prentice (1989) defined a surrogate as a replacement endpoint such that a test of the null hypothesis of no treatment effect on this endpoint *(S)* provides a valid test of the corresponding null hypothesis for the clinical endpoint *(Y)*. Mathematically, this definition can be formulated as

(1)

According to Prentice, the two major criteria which *S* must satisfy to qualify as a surrogate for *Y*, conditional on *W*, are 1) *S* has some predictive value for *Y*, and 2) conditional on *S*, *Y* is statistically independent of *Z*. Approaches to surrogate endpoint assessment which aim to verify these criteria belong to JG's CE paradigm. Figure 1 of their paper illustrates the well-known result from directed acyclic graph theory (see, e.g., Pearl (2000)) that conditioning on the post-randomization event *S* may induce statistical dependence between *Z* and *Y* if the set of simultaneous predictors of *S* and *Y* (say *U*) is not accounted for. A further complication arises when the outcome of interest may occur before *S* is measured, a case not considered by JG. Our Figure 1 illustrates this situation: *Y*^{Τ}, the early infection indicator, is affected by treatment and affects both

Since the determinants of *S, Y ^{Τ}*, and

(2)

(3)

where are the potential values of *(Y, Y ^{Τ}, S)* under treatment assignment

GH proposed to judge the value of *S* as a surrogate by the degree to which it satisfies the following two criteria, conditional on *W:*

**Average causal necessity (ACN):** *R*_{1}(*S*_{0}, *S*_{1}, *W*) = *R*_{0}(*S*_{0}, *S*_{1}, *W*) for all *S*_{1} = *S*_{0}.

**Average causal sufficiency (ACS):** There exists a constant *C* ≥ 0 such that *R*_{1}(*S*_{0}, *S*_{1}, *W*) ≠ *R*_{0}(*S*_{0}, *S*_{1}, *W*) for all |*S*_{0} − *S*_{1}| > *C*.

ACN and ACS are checked by estimating contrasts of *Y*_{1} and *Y*_{0} within a principal stratum. Since the stratum to which an individual belongs is unaffected by treatment, these contrasts represent causal effects and are not subject to post-randomization selection bias arising from unmeasured confounding (see FR, page 23 and Lauritzen (2004)). We note that ACN is equivalent to the definition given by Vanderweele (2008) of “no average principal strata direct effects”, while ACS implies the existence of some “average principal strata indirect effects” in the sense of Vanderweele.
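As a rough illustration of how the two criteria could be checked once the joint risks are available, the sketch below stores *R*_{1} and *R*_{0} as lookup tables keyed by the stratum (*S*_{0}, *S*_{1}) at a fixed covariate level *W*. The function names, the tolerance, and the numerical risks are assumptions made for this example; they do not come from GH or from this paper, and in practice the joint risks would be estimated with sampling error rather than known exactly.

```python
from typing import Dict, Tuple

Stratum = Tuple[int, int]  # (s0, s1): biomarker under placebo, under vaccine
Risk = Dict[Stratum, float]


def acn_holds(r1: Risk, r0: Risk, tol: float = 1e-9) -> bool:
    """ACN: R1 equals R0 in every stratum where S1 = S0."""
    return all(abs(r1[k] - r0[k]) <= tol for k in r1 if k[0] == k[1])


def acs_holds(r1: Risk, r0: Risk, c: float, tol: float = 1e-9) -> bool:
    """ACS for a given C: R1 differs from R0 whenever |S0 - S1| > C.

    Returns True vacuously if no stratum has |S0 - S1| > C.
    """
    return all(abs(r1[k] - r0[k]) > tol for k in r1 if abs(k[0] - k[1]) > c)


# Hypothetical risks for a two-level biomarker (illustrative numbers only).
r0_example: Risk = {(1, 1): 0.07, (1, 2): 0.07, (2, 2): 0.07}
r1_example: Risk = {(1, 1): 0.07, (1, 2): 0.03, (2, 2): 0.07}

print(acn_holds(r1_example, r0_example))
print(acs_holds(r1_example, r0_example, c=0))
```

Note that `acs_holds` checks the criterion for one supplied value of *C*, whereas ACS only asks that some such constant exists, so a fuller check would scan candidate values of *C*.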

While the principal stratification estimands *R*_{1} and *R*_{0} have appealing properties, we pay a price for their “robustness” to unmeasured confounding in terms of statistical non-identifiability. In what follows, we investigate how different assumptions affect the identifiability of (2), (3) and related counterfactual estimands.

We will henceforth refer to the quantities *R*_{1}(*S*_{0}, *S*_{1},*W*) and *R*_{0}(*S*_{0}, *S*_{1},*W*) as *joint risks*, to indicate that they depend on the joint values (*S*_{0}, *S*_{1}). We focus on identifiability and estimation in the special case of “constant biomarkers” considered in GH:

- [CB] **Constant biomarkers under placebo:** Provided *Y*^{Τ} = 0, *S*_{0} = *c* for some constant *c*.

Though condition [CB] may not be satisfied in all situations, it often applies in vaccine trials where subjects have not previously been infected by the pathogen targeted by the vaccine under study. In HIV vaccine trials, for example, enrollment criteria generally require that subjects be HIV-uninfected at the start of the trial. Subjects who are not vaccinated, therefore, are not “primed” to recognize the HIV epitopes present in the vaccine, and hence are unlikely to show a measurable immune response when exposed to these epitopes via an immunological assay (Gilbert et al. (2005)). In this case, *c* might represent the lower limit of detection of the relevant assay. We note that, unlike some of the untestable assumptions introduced later, [CB] can be directly verified in the lab by measuring the biomarker values of placebo recipients who remain uninfected at time Τ.

When [CB] holds, we have (*S*_{0}, *S*_{1}) = (*c*, *S*_{1}), and hence for ease of notation we can drop the dependence of *R*_{1} and *R*_{0} on *S*_{0}, yielding

(4)

(5)

(the reader should recall, however, that these quantities remain “joint” risks since they are equivalent to *R*_{1}(*c*, *S*_{1}, *W*) and *R*_{0}(*c*, *S*_{1}, *W*)). The surrogate value of *S* can then be assessed, conditional on *W*, by considering the magnitude of contrasts of *R*_{0} and *R*_{1} for a variety of values of *S*_{1}. We refer the reader to GH's Figure 2, which plots *R*_{0}(*S*_{1}, *W*) − *R*_{1}(*S*_{1}, *W*) as a function of *S*_{1} for biomarkers having no, low, moderate, and high surrogate value.

Four assumptions allowed GH to identify the joint risks under [CB]:

- [A1] **Stable Unit Treatment Value Assumption (SUTVA):** **[No interference]** The potential outcomes for one subject are independent of the treatment assignments of other subjects, i.e. there is no “interference” between experimental units. **[Consistency]** For an individual receiving treatment *Z* = *z* and with observed outcome *Y*, we have *Y* = *Y*_{z}, i.e. the observed outcome is equal to the potential outcome under the treatment actually received.

- [A2] **Ignorable Treatment Assignments:** Conditional on *W*, *Z* is independent of the potential values of (*Y*, *Y*^{Τ}, *S*).

- [A3] **Equal Individual Clinical Risk prior to time Τ:** *Y*^{Τ}_{1} = 1 if and only if *Y*^{Τ}_{0} = 1, i.e. the vaccine has no causal effect on the risk of infection before the immune response is measured at Τ.

- [A4] **Risk restriction:** For the case where *S*_{1} has *J* distinct levels 1,…,*J* and *W* has *K* distinct levels , for *z* ∈ {0, 1}.

Assumptions [A1] and [A2] are standard in most work on causal inference. The validity of the “No interference” part of [A1] may be questioned when the disease of interest is infectious, but is defensible if the vaccine trial enrolls a small fraction of the at-risk population. Recent work by Hudgens and Halloran (2008) discusses relaxation of SUTVA. [A2] will hold in a randomized trial where blinding is maintained.

The validity of [A3], however, is doubtful in many contexts. Since most current vaccine trials are designed to measure peak immune response, an effective vaccine may offer at least partial protection to vaccine recipients before their immune responses are measured, thereby violating [A3]. [A4] requires that the effect of *S* on the clinical endpoint not vary within levels of *W*. Both [A3] and [A4] are untestable, although [A3] has the testable implication that the population vaccine effect up to time Τ is zero. A major goal of this paper is to explore the consequences of relaxing [A3] and [A4].

In the Identifiability section below, we show that if assumptions [A3] and [A4] are relaxed, then joint risks are statistically non-identifiable. We might therefore ask whether any useful information can be obtained from the family of estimands which *are* identifiable from the available data when these assumptions fail.

We begin by noting that [A3] implies the equalities

(6)

(7)

The left-hand sides of (6) and (7) correspond to the expressions for *R*_{1} and *R*_{0}, and from the expressions on the right-hand sides we define the *marginal risks*

(8)

(9)

Since and do not condition on
, *S*_{0} may be undefined and hence *S*_{1} does not specify the basic principal stratum to which an individual belongs: an individual with *S*_{1} = *s* and may belong to either the stratum {*S*_{1} = *s*, *S*_{0} = *} or {*S*_{1} = *s*, *S*_{0} = *c*}. As a result, without [A3] marginal risks do not allow an assessment of how causal clinical effects vary with contrasts of *S*_{1} and *S*_{0}, and hence cannot be used to check ACN and ACS. Nevertheless, contrasts of the marginal risks are causal effects and may be used to predict treatment effects in a new setting. In particular,

(10)

where is the overall causal treatment effect for level-*W* subjects uninfected at Τ under active treatment, which is a useful summary of treatment efficacy. Suppose in a new setting (e.g., for a refined version of the treatment to be tested in a follow-up efficacy trial in the same study population), is distributed according to *F ^{new}(s, W)*. Assuming the marginal risk functions are the same for the efficacy trial and the new setting (an untestable assumption whose plausibility may be defended based on biological and mechanistic knowledge), the predicted

In this section, we explore the identifiability of joint and marginal risks in a variety of scenarios. We restrict attention to the case where *S* has *J* discrete levels, 1,…,*J*, and omit baseline covariates *W* (they are assumed fixed and do not affect the identifiability arguments). With this setup, the joint and marginal risks are considered as functions of *S*_{1} only, and have *J* possible values. The propositions below describe how many of these *J* values are identifiable from observed data under different sets of assumptions. Proofs of these propositions appear in Web Appendix A.

Proposition 1 (Identifiability under [CB], [A1]–[A2] (Base case)): When [CB] and [A1]–[A2] hold, is identifiable from the available data but *R*_{1}, *R*_{0}, and are non-identifiable and each requires *J* parameters to be specified for their *J* possible values.

Since *R*_{1}, *R*_{0}, and are non-identifiable under [CB] and [A1]–[A2], it is natural to consider the question: Are there additional plausible assumptions which can help to identify these estimands? One possible additional assumption is a weakened version of [A3]:

[A3′] **Pre-Τ monotonicity**

The pre-Τ monotonicity assumption asserts that a participant's clinical outcome prior to Τ is no worse under assignment to treatment than under assignment to placebo. [A3′] is implied by [A3], but unlike [A3] it allows for a pre-Τ treatment effect. In the context of blinded vaccine trials where the treatment is generally safe, [A3′] will often hold, but it must be considered carefully: data from the Step trial (which was stopped prematurely in late 2007 because the vaccine product did not appear to be effective) suggested the possibility of vaccine enhancement, whereby subjects who received active vaccine were more likely to be HIV-infected than those who received the placebo (Buchbinder et al. (2008)). Beyond the vaccination context, [A3′] may fail in trials comparing two active treatments, or when a treatment is highly toxic to some but of benefit to others.

When [A3′] holds in addition to [CB] and [A1]–[A2], we have the following result:

Proposition 2 (Identifiability under [CB], [A1]–[A3′] (Monotonicity)): Under [CB] and [A1]–[A3′], the number of parameters required to describe *R*_{0} is reduced by 1, to *J*–1. Identification of *R*_{1} and still requires *J* parameters each.

It can be shown (see the proof of Proposition 3, Web Appendix A) that would be identifiable from the observed data if we could identify the distribution of *S*_{1} among subjects who would have remained uninfected over the entire trial if they had received the placebo. An idea due to Follmann (2006) suggests a way to identify this quantity: an augmented study design called *closeout vaccination*, where placebo recipients who remain uninfected at the end of the trial are given the active vaccine (“closed out”) and have their immune response measured Τ time-units later. Denote this closeout measurement by
, and the indicator that an individual is infected during the duration-Τ closeout period by . Under [A1]–[A3′], closeout vaccination yields information about . The following two assumptions, taken together, guarantee that this distribution is equivalent to the distribution of interest, :

- [A5] **Time constancy of the immune response distribution:** For subjects with almost surely

- [A6] **No infections during the closeout period:**

The time constancy assumption implies that the immune responses measured from closed out subjects would not have been different if these subjects had been assigned to the vaccine at the beginning of the trial and had their immune response measured at time Τ. [A5] may be evaluated from knowledge of the immune system; recent research (e.g., Goronzy and Weyand (2005), Lacroix-Desmazes et al. (1999)) suggests that several key markers of T cell production/diversity and antibody reactivity remain relatively constant in adults under age 65, and hence [A5] may reasonably apply to these biomarkers in the context of vaccine trials of relatively short duration. If a substantial proportion of study subjects are co-infected with another pathogen during the study period, however, the validity of [A5] may be questioned. When [A5] fails, closeout vaccination does not provide the information necessary to identify the marginal risk .

Unlike [A5], [A6] is directly testable; one need only check whether any infections occurred during the closeout period. Since placebo recipients who remain infection-free for the duration of the trial (and hence are candidates to be closed out) are likely to have a lower risk of infection than the general placebo population, [A6] may be reasonable in certain circumstances. If there is evidence showing that [A6] is violated, however, the relationship between is non-identifiable. In the Sensitivity Analysis section, we describe this relationship explicitly in terms of non-identifiable sensitivity parameters.

The following proposition summarizes the identifiability of the joint and marginal risks when [A5] and [A6] additionally hold:

Proposition 3 (Identifiability under [CB], [A1]–[A3′], [A5]–[A6] (Monotonicity + Closeout)): Under [CB], [A1]–[A3′] and [A5]–[A6], the marginal risks defined in (8) and (9) are identifiable from the observed data. *R*_{1} remains non-identifiable and one of the *J* possible values of *R*_{0} is identified.

Table 1 summarizes the results derived in this section, along with some others which follow using similar arguments. Note the dramatic simplification achieved when [A3] replaces [A3′]: the joint risks *R*_{0} and *R*_{1} are equivalent to the (identifiable) marginal risks and hence ACN and ACS can be verified non-parametrically, allowing surrogate value to be assessed from the observed data.

Following the notation presented above, the iid observed data are
, where we reintroduce the vector *W _{i}* of baseline covariates assumed to be available for all study subjects. Note that in the definition of

Only subjects with contribute to the conditional likelihood, which takes the form

(11)

See Web Appendix B for details of the individual terms which make up the likelihood. *L* is a function of parameters of interest Θ, which specify and , and nuisance parameters Γ. We propose to make inference about Θ based on the estimated likelihood method described in Pepe and Fleming (1991), which consists of plugging in estimates of the elements of Γ into *L* and maximizing over Θ. In this application, we suggest using the bootstrap to obtain standard errors; previously derived asymptotic variance results for estimated likelihood rely on the assumption that each subject has a non-zero probability of having *S*_{1} observed, which does not apply in our situation since infected placebo recipients cannot have *S*_{1} measured.

Since the joint risks are not statistically identifiable unless certain potentially implausible assumptions hold, we suggest a sensitivity analysis approach to assessing surrogate value. Our strategy relies on the fact that one can re-express the joint risks in terms of marginal risks and non-identifiable quantities π and π* as

(12)

where (a proof of (12) is given in Web Appendix C). Different values of π and π* correspond to different degrees of post-randomization selection bias induced by the unmeasured simultaneous predictors (*U, V*) from Figure 1. Sensitivity analysis proceeds by estimating and and then varying the values of π(*j*) and π*(*j*) to observe their effect on contrasts of *R*_{1} and *R*_{0}. Certain features of the observed data, along with the requirement that probabilities lie in [0,1], restrict the range of allowed sensitivity parameter values. These constraints are described in Web Appendix D; they may be further tightened based on additional assumptions and/or subject matter knowledge.

If [A6] fails and some subjects are infected during the closeout period, is no longer identifiable from observed data. In this case, we suggest introducing additional nonidentifiable parameters η*(j)* defined by so that . η can be viewed as an analog of π for the closeout period. For the sake of clarity, in this paper we assume that [A6] holds throughout so that η*(j)* = 1 for all *j*.

We simulate a vaccine trial where *n* participants are randomized to vaccine or placebo in a 1:1 ratio. *S*_{1} is generated as 1 + *Bernoulli*(0.5), so that *S*_{1} ∈ {1, 2}; we do not consider covariates *W*. [CB] is assumed to hold with *S*_{0} = 1 for all subjects with , so that ACN can be assessed by checking whether Δ_{1} = *R*_{1}(1) – *R*_{0}(1) = 0 and ACS assessed by checking whether Δ_{2} = *R*_{1}(2) – *R*_{0}(2) ≠ 0.
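The design step of this simulation can be sketched in a few lines of Python. This is a minimal sketch covering only the 1:1 randomization and the draw of *S*_{1} as 1 + Bernoulli(0.5); generation of the infection outcomes is left out because it depends on parameter values reported in Web Table B. The function name `simulate_design` and the seed handling are assumptions made for the example.

```python
import random
from typing import List, Tuple


def simulate_design(n: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Return (z, s1) pairs: 1:1 randomization and S1 = 1 + Bernoulli(0.5)."""
    rng = random.Random(seed)
    # Exactly half of the participants (rounded down) receive vaccine (z = 1).
    z = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(z)
    # S1 is the potential biomarker value under vaccine, drawn for every subject.
    s1 = [1 + (1 if rng.random() < 0.5 else 0) for _ in range(n)]
    return list(zip(z, s1))


trial = simulate_design(10_000)
print(sum(z for z, _ in trial), "vaccine recipients out of", len(trial))
```

Drawing *S*_{1} for every subject, including placebo recipients, mirrors the potential-outcomes setup: the value exists for everyone but would only be observed for some subjects in a real trial.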

For all simulations, data are generated under [CB], [A1]–[A3′] and [A5]–[A6], with the restrictions on π, π* described in Web Appendix D. We consider five scenarios representing different amounts of surrogate value: In the first scenario (referred to in Tables 2 and 3 as No_{1}), Δ_{1} = Δ_{2} = 0 (ACS fails); in the second (No_{2}), Δ_{1} = Δ_{2} < 0 (ACN is violated), so that the biomarker has no surrogate value. In the third scenario (Low), we take Δ_{1} ≈ 0 and Δ_{2} < 0; though ACN fails, Δ_{1} is close to zero and ACS is satisfied, so the biomarker may be said to have some surrogate value. Scenarios 4 and 5 (High_{1} and High_{2}) both have Δ_{1} = 0 and Δ_{2} < 0, with Δ_{2} larger in magnitude for the latter scenario; since ACN and ACS are satisfied, in both cases the biomarker is a good surrogate.
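To make the mapping from contrasts to criteria concrete, the following sketch evaluates ACN (Δ_{1} = 0) and ACS (Δ_{2} ≠ 0) for five scenarios of the kind described above. The numerical values of Δ_{1} and Δ_{2} are hypothetical placeholders chosen only to match the qualitative descriptions; the values actually used in the simulations appear in Web Table B and are not reproduced here.

```python
from typing import Dict


def criteria_from_contrasts(delta1: float, delta2: float,
                            tol: float = 1e-12) -> Dict[str, bool]:
    """ACN holds when delta1 is zero; ACS holds when delta2 is non-zero."""
    return {"ACN": abs(delta1) <= tol, "ACS": abs(delta2) > tol}


# Hypothetical (delta1, delta2) pairs, one per surrogate value scenario.
scenarios = {
    "No1": (0.0, 0.0),
    "No2": (-0.03, -0.03),
    "Low": (-0.005, -0.03),
    "High1": (0.0, -0.03),
    "High2": (0.0, -0.045),
}

for name, (d1, d2) in scenarios.items():
    print(name, criteria_from_contrasts(d1, d2))
```

The Low scenario shows why the binary criteria are only part of the picture: ACN fails strictly, yet Δ_{1} is small relative to Δ_{2}, which is why the text describes the biomarker as having some surrogate value.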

Table 2. Bias, power and coverage for the estimators of Δ_{1} and Δ_{2}, estimated via maximum estimated likelihood. n is the total number of subjects in the simulated trial, divided equally between vaccine and placebo recipients. SV = surrogate value **...**

Table 3. Operating characteristics of PTE, RE, and sensitivity analyses in simulated trials of size n = 10,000 when . SV = surrogate value scenario, SB = amount of post-randomization selection bias. For PTE and RE, we report and , the median point estimates **...**

Table 2 displays the performance of the maximum estimated likelihood estimators (MELEs) of Δ_{1} and Δ_{2} when the sensitivity parameters are correctly specified to match the values used to generate the data (π(*j*) = π*(*j*) = 0.995 for *j* = 1,2) and [CB], [A1]–[A3′], and [A5]–[A6] are assumed. The pre-Τ infection rates are set to 1.4% and 1.9% for vaccine and placebo recipients, respectively. The post-Τ infection rate is set at 7% for placebo recipients; the rate for vaccine recipients depends on Δ_{1} and Δ_{2}, and ranges from 2.8% to 6.5% across the five surrogate value scenarios. These rates are consistent with what might reasonably be observed in an HIV vaccine trial enrolling a high-risk population.

We focus on the case where higher values of the biomarker are associated with better clinical outcomes, and calculate power based on Wald tests for the hypotheses *H*_{0} : Δ_{j} ≥ 0 vs. *H*_{1} : Δ_{j} < 0, *j* = 1, 2. In each of 250 simulations, percentile-based 95% confidence intervals are computed from 100 bootstrap replicates. Standard errors for Wald tests are based on empirical estimates from the set of 250 simulations. The proposed method yields unbiased estimates and correct coverage probabilities. Tests have approximately correct nominal size, and power increases to 1 with the sample size. We note, however, that large trial sizes are required to achieve adequate power to detect even large effects.
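The two inferential tools named here, a percentile bootstrap interval and a one-sided Wald test, can be sketched generically as follows. This is an illustration under simplifying assumptions rather than the procedure used for the tables: a plain risk difference stands in for the maximum estimated likelihood estimator, the nearest-rank rule for picking percentiles is one of several reasonable conventions, and all function names are made up for the example.

```python
import math
import random
from statistics import NormalDist


def risk_difference(sample):
    """Infection risk among vaccine recipients minus risk among placebo recipients."""
    vaccine = [y for z, y in sample if z == 1]
    placebo = [y for z, y in sample if z == 0]
    return sum(vaccine) / len(vaccine) - sum(placebo) / len(placebo)


def percentile_ci(sample, estimator, n_boot=100, level=0.95, seed=0):
    """Percentile bootstrap interval from n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(
        estimator([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lower = reps[math.floor(alpha / 2 * (n_boot - 1))]
    upper = reps[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


def wald_p_value_less(estimate, se):
    """One-sided p-value for H0: delta >= 0 versus H1: delta < 0."""
    return NormalDist().cdf(estimate / se)


# Toy data as (z, y) pairs: 10/200 infections on vaccine, 30/200 on placebo.
sample = [(1, 1)] * 10 + [(1, 0)] * 190 + [(0, 1)] * 30 + [(0, 0)] * 170
print(risk_difference(sample))
print(percentile_ci(sample, risk_difference))
print(wald_p_value_less(-0.10, 0.03))
```

The standard error is passed to `wald_p_value_less` as an argument because, in the simulations described above, it is taken from the empirical spread of estimates across simulated trials instead of being computed within a single dataset.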

In Web Table A, we present an analog of Table 2 suggested by a reviewer, where data are generated without enforcing the monotonicity assumption [A3′]. When the data generating process does not obey monotonicity but we assume that it does, the estimates of Δ_{1} and Δ_{2} are somewhat biased, the size of tests is not correct, and confidence interval coverage is degraded. Though the effect of incorrectly assuming [A3′] is relatively small for these simulation settings, additional simulations (results not shown) confirm that the impact of the monotonicity assumption is much more pronounced when the pre-Τ infection rate is higher.

We now return to the problem of surrogate endpoint assessment in the presence of pre-Τ treatment effects (i.e. when [A3] does not hold). We consider two approaches to quantifying the surrogate value of a biomarker:

- **Proportion of Treatment Effect Explained (*PTE*) and Relative Effect (*RE*) [“statistical surrogate” approach]:** These summary measures are computed by fitting regression models from observed data. The motivation for and properties of *PTE* and *RE* are discussed at length in Burzykowski et al. (2005). We fit the logistic models for binary outcomes suggested in Buyse and Molenberghs (1998); details appear in Web Appendix E.
- **Sensitivity analysis [“principal surrogate” approach]:** This method consists of calculating the marginal risks via estimated likelihood, then computing Δ_{1} and Δ_{2} across the entire range of sensitivity parameters consistent with the restrictions described in Web Appendix D. For each simulated dataset, we consider 64 possible sensitivity parameter settings, with each of π(2), π*(2), π*(1) taking on 4 equally spaced values between their minimum and maximum allowed values (note that π(1) is determined by π(2)).
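The 64-point grid used in the sensitivity analysis is a Cartesian product of three axes with four equally spaced values each, which can be sketched as below. The bounds shown are hypothetical placeholders: the actual minimum and maximum allowed values depend on the data through the constraints in Web Appendix D, and the rule that determines π(1) from π(2) is not reproduced here. The parameter names `pi_2`, `pi_star_2`, and `pi_star_1` are labels chosen for this example.

```python
from itertools import product
from typing import Dict, List, Tuple


def equally_spaced(lo: float, hi: float, k: int = 4) -> List[float]:
    """k equally spaced values from lo to hi, endpoints included."""
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def sensitivity_grid(bounds: Dict[str, Tuple[float, float]],
                     k: int = 4) -> List[Dict[str, float]]:
    """All combinations of k values per sensitivity parameter."""
    names = list(bounds)
    axes = [equally_spaced(*bounds[name], k) for name in names]
    return [dict(zip(names, combo)) for combo in product(*axes)]


# Hypothetical allowed ranges for pi(2), pi*(2), and pi*(1).
bounds = {
    "pi_2": (0.98, 1.00),
    "pi_star_2": (0.97, 1.00),
    "pi_star_1": (0.97, 1.00),
}
grid = sensitivity_grid(bounds)
print(len(grid))
```

Each dictionary in `grid` corresponds to one setting at which Δ_{1} and Δ_{2} would be recomputed from the estimated marginal risks.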

Table 3 summarizes the performance (over 500 simulated trials of size *n* =10,000) of these surrogate value assessment methods under a variety of settings. We fix the pre-Τ infection rate under assignment to vaccine at . In addition to the five different surrogate value scenarios, we consider four different degrees of post-randomization selection bias: None
, Low , Moderate (, small departure of π* from π), and High ( far from π). The parameter values for all the scenarios appear in Web Table B. Web Table C presents a version of Table 3 showing the performance of these methods when .

Since they are based on different paradigms, it is difficult to compare the performance of *PTE/RE* and the sensitivity analysis. *PTE* is undefined in scenario No_{1} (ACN holds, ACS fails), leading to unstable point estimates and large standard errors, but performs well when there is no pre-Τ treatment effect (i.e.
), with values near zero in scenarios with no surrogate value (No_{2}), values near 1 in high surrogate value scenarios High_{1} and High_{2}, and intermediate values in the Low surrogate value scenario. As expected, the performance of *PTE* degrades when there is post-randomization selection bias (i.e.
), taking on values near 1 in the Low surrogate value scenario and greater than 1 in the two High surrogate value scenarios. The *RE* performs poorly, as it is unable to distinguish between scenarios No_{2} and High_{2}. We note that Buyse and Molenberghs recommend considering the *RE* together with the *adjusted association*, but the odds ratio which they propose is undefined in case [CB].

Given estimates of marginal risks, the constraints imposed by Web Appendix D limit the sensitivity parameters to a narrow range of values under most of the simulation settings in Table 3, and hence a “typical” sensitivity analysis yields a narrow range of plausible values for Δ_{1} and Δ_{2}. However, the uncertainty in estimating Δ_{1} and Δ_{2} due to possible post-randomization selection bias is dwarfed by the uncertainty in estimating the marginal risks: the sensitivity intervals (unions of the 64 Wald confidence intervals constructed around the point estimates from the sensitivity analysis) are very wide in comparison to the effect size, indicating that even trials of size 10,000 may have low power to detect surrogate value. Power increases with the infection rate: the sensitivity intervals are much narrower in relation to the effect size for the simulations presented in Web Table C, where the pre-Τ and overall infection rates are larger (approximately 6% and 16%, respectively). However, as shown by the wider ranges for the estimates of Δ_{1} and Δ_{2} in Web Table C, a higher pre-Τ infection rate carries the potential for greater post-randomization selection bias. Hence, in assessing surrogate value there is a trade-off between the ability to estimate identifiable quantities and the ability to restrict nonidentifiable quantities to a sufficiently narrow range.
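A sensitivity interval of the kind described above can be sketched by building one Wald interval per sensitivity parameter setting and then combining them. The sketch below returns the envelope (smallest lower limit, largest upper limit), which coincides with the union whenever the individual intervals overlap; that overlap is an assumption of this example, as are the function names and the toy estimates.

```python
from statistics import NormalDist
from typing import Sequence, Tuple


def wald_ci(estimate: float, se: float, level: float = 0.95) -> Tuple[float, float]:
    """Two-sided Wald interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def sensitivity_interval(estimates: Sequence[float], ses: Sequence[float],
                         level: float = 0.95) -> Tuple[float, float]:
    """Envelope of the Wald intervals across sensitivity parameter settings."""
    cis = [wald_ci(e, s, level) for e, s in zip(estimates, ses)]
    return min(lo for lo, _ in cis), max(hi for _, hi in cis)


# Toy point estimates of a contrast at three settings, with a common SE.
print(sensitivity_interval([-0.030, -0.025, -0.020], [0.01, 0.01, 0.01]))
```

With these toy inputs the spread of the point estimates across settings (0.01) is much smaller than the width of any single Wald interval (about 0.04), which is the pattern the paragraph above describes: sampling uncertainty dominates the uncertainty due to the sensitivity parameters.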

In this article we described how several sets of assumptions contribute (or not) to the identifiability of the joint and marginal risk causal estimands of interest. In addition, to relax the untestable assumptions [A3] and [A4] made in previous work for evaluating a candidate surrogate endpoint, we developed methods that combine assumptions that may be more plausible, a creative study design, and a sensitivity analysis to assess the surrogate value of a biomarker.

The use of counterfactuals has been criticized by some (e.g., Dawid (2000)) because of the underlying assumption of determinism (or “fatalism”) required to make certain counterfactual expressions well-defined. Dawid argues that this is a fundamental flaw in the potential outcomes framework, and that the use of counterfactuals should therefore be restricted to a small class of problems where no such definitional problems arise. However, we subscribe to the point of view expressed by the majority of the respondents to Dawid's article: potential outcomes provide a natural way of formulating and answering important scientific questions which are difficult to express without counterfactuals.

Jin and Rubin (2008), among others, employed the principal stratification framework to estimate treatment effects in the presence of partial noncompliance by partitioning individuals into compliance strata (e.g., “always-taker”, “never-taker”, “complier”, “defier”). The counterfactual values defining these strata are generally assumed to be well-defined for all individuals, but this assumption may be violated if the outcome can occur before compliance is (fully) measured: for example, consider a clinical trial to assess the survival of late-stage cancer patients where the compliance measure is average weekly pill count over three months. In such setups, the methods we have described for quantifying surrogate value when [A3] is violated can be applied to evaluate compliance-adjusted treatment effects. Of course, certain simplifying assumptions such as Constant Biomarkers and Monotonicity may not apply in some studies where noncompliance is of interest.

To our knowledge, no previous work on surrogate endpoint assessment has addressed the consequences of treatment effects prior to measurement of the biomarker, a situation which is common in practice. In such cases, available data from a closeout vaccine trial are insufficient to identify joint risks, even in the simple case where there is no variability in the biomarker values of placebo recipients. Non-identifiability of joint risks leads naturally to a sensitivity analysis framework, but estimating surrogate value requires the ability to estimate marginal risks precisely and restrict the sensitivity parameter values to a suitably narrow range without relying on implausible assumptions. When non-identifiability makes the assessment of surrogate value intractable, it may be more useful to focus on the information which can be obtained from alternate estimands such as contrasts of marginal risks, which provide clinically relevant information and are statistically identifiable under weaker assumptions.

The authors wish to thank the reviewers for their comments, which helped clarify our thinking about these difficult issues and resulted in a much improved manuscript. This work was partially supported by NIH grant R01 AI54165-06, as well as Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT).

Supplementary Materials

Web Appendices and Tables referenced in Sections 3–6 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

Julian Wolfson, Department of Biostatistics, University of Washington, F-600 Health Sciences Building, Box 357232, Seattle, WA 98195-7232, Email: julianw/at/u.washington.edu.

Peter Gilbert, Fred Hutchinson Cancer Research Center, Statistical Center for HIV/AIDS Research and Prevention, 1100 Fairview Avenue North, M2-C801, PO Box 19024, Seattle, WA 98109, Email: pgilbert/at/scharp.org.

- Buchbinder S, Mehrotra D, Duerr A, Fitzgerald D, Mogg R, Li D, Gilbert P, Lama J, Marmor M, Delrio C. Efficacy assessment of a cell-mediated immunity HIV-1 vaccine (the Step Study): a double-blind, randomised, placebo-controlled, test-of-concept trial. The Lancet. 2008;372:1881–1893.
- Burzykowski T, Molenberghs G, Buyse M. The Evaluation of Surrogate Endpoints. Springer; 2005.
- Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029.
- Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–67.
- Dawid AP. Causal inference without counterfactuals. Journal of the American Statistical Association. 2000;95:407–424.
- Follmann D. Augmented designs to assess immune response in vaccine trials. Biometrics. 2006;62:1161–1169.
- Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58:21–29.
- Gail MH, Pfeiffer R, Van Houwelingen HC, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics. 2000;1:231–246.
- Gilbert PB, Hudgens MG. Evaluating candidate principal surrogate endpoints. Biometrics. 2008;64:1146–1154.
- Gilbert PB, Peterson ML, Follmann D, Hudgens MG, Francis DP, Gurwith M, Heyward WL, Jobes DV, Popovic V, Self SG, Sinangil F, Burke D, Berman PW. Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. The Journal of Infectious Diseases. 2005;191:666–677.
- Gilbert PB, Qin L, Self SG. Evaluating a surrogate endpoint at three levels, with application to vaccine development. Statistics in Medicine. 2008;27:4758–4778.
- Goronzy JJ, Weyand CM. T cell development and receptor diversity during aging. Current Opinion in Immunology. 2005;17:468–475.
- Hudgens MG, Halloran ME. Toward causal inference with interference. Journal of the American Statistical Association. 2008;103:832–842.
- Jin H, Rubin DB. Principal stratification for causal inference with extended partial compliance. Journal of the American Statistical Association. 2008;103:101–111.
- Joffe MM, Greene T. Related causal frameworks for surrogate outcomes. Biometrics. 2009;65:530–538.
- Lacroix-Desmazes S, Mouthon L, Kaveri SV, Kazatchkine MD, Weksler ME. Stability of natural self-reactive antibody repertoires during aging. Journal of Clinical Immunology. 1999;19:26–34.
- Lauritzen SL. Discussion on causality. Scandinavian Journal of Statistics. 2004;31:189–193.
- Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press; 2000.
- Pepe MS, Fleming TR. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108.
- Prentice RL. Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine. 1989;8:431–440.
- VanderWeele T. Simple relations between principal stratification and direct and indirect effects. Statistics & Probability Letters. 2008;78:2957–2962.
- Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Statistics in Medicine. 2006;25:183–203.
