We consider a two-arm randomized trial. Let *Z* be the binary treatment indicator, 0 for placebo and 1 for active treatment (vaccination). Let *W* be baseline covariates such as demographics and laboratory measurements. We focus on discrete *W* in this manuscript, but note that the methods we describe can be generalized to accommodate continuous *W* by incorporating nonparametric smoothing techniques. Let *S* be the candidate surrogate of interest measured on the continuous scale at fixed time *τ* after randomization. Here we consider a univariate marker, but the estimation method we propose could be easily extended to allow for more than one marker. Let *Y* denote the binary clinical endpoint of interest, 0 for non-diseased and 1 for diseased. Acknowledging the possibility that *Y* occurs before *S* is measured, let *Y*^{τ} be the indicator of whether disease develops before *τ. S* is only measurable if *Y*^{τ} = 0; if *Y*^{τ} = 1, then *S* is undefined. We further incorporate the potential outcomes framework. Let *S*(*z*), *Y*^{τ}(*z*), *Y*(*z*) be the corresponding potential outcomes under treatment assignment *z*, for *z* = 0, 1. If *Y*^{τ}(*z*) = 1, *S*(*z*) is undefined and we set *S*(*z*) = *. We also consider a possible CPV component. At the end of the trial followup period, some fraction of placebo recipients who are uninfected at study closeout are vaccinated and the immune biomarker *S*_{c} at time *τ* after vaccination is measured; the proportion of the uninfected placebo recipients selected for closeout vaccination can range from 0 to 1.

Following the notation in

Follmann (2006), we call the design with no CPV the BIP-only design (the design with baseline predictors

*W* only), and the design with non-zero CPV component the BIP + CPV design. The setting we consider is a two-phase sampling design. In the first phase, information about

*Y*,

*Z*, and

*W* are collected for every trial participant. In the second phase,

*S*(1) or

*S*_{c} is measured in a subcohort of study participants selected according to a random mechanism. We let

*δ* to indicate the availability of

*S*(1) or

*S*_{c}.

Frangakis and Rubin (2002) proposed characterizing the principal surrogate effect of a marker based on comparison between the risk of

*Y*(1) and

*Y*(0) conditional on

*S*(1) and

*S*(0). In HIV vaccine trials, only subjects without previous infection with the pathogen under study are enrolled such that

*S*(0) = 0; the characterization of surrogate value simplifies to comparison between

*risk*_{(0)}{

*S*(1)} =

*P*{

*Y*(0) = 1|

*S*(1),

*Y*^{τ}(0) =

*Y*^{τ}(1) = 0} and

*risk*_{(1)}{

*S*(1)} =

*P*{

*Y*(1) = 1|

*S*(1),

*Y*^{τ}(0) =

*Y*^{τ}(1) = 0}, namely the marginal causal effect predictiveness curve (

*CEP*) as proposed in

Gilbert and Hudgens (2008) with

*CEP*{

*S*(1)} =

*h* [

*risk*_{(1)}{

*S*(1)},

*risk*_{(0)}{

*S*(1)}] for a pre-specified contrast function

*h*.

For a rare disease like HIV, one natural choice of

*CEP* function is the vaccine efficacy (VE) as a function of

*S*(1):

the percent reduction in infection rate for the subgroup of vaccine recipients with immune response

*S*(1) compared to if they had not been vaccinated. The vaccine efficacy curve (curve of

*VE*(

*s*) versus

*s*) tells us the range of vaccine efficacies we can achieve with respect to HIV infection corresponding to varying levels of vaccine-induced immune response. For a desired vaccine efficacy level, the corresponding immune response level helps set the target for refining the vaccine in follow-up phase I/II studies. A useful surrogate will have strong effect modification in the sense of large variability in

*VE*{

*S*(1)} and thus there is potential to achieve a large vaccine efficacy by increasing the immune responses. Examples of vaccine efficacy curves for two biomarkers with the same

*S*(1) distribution are displayed in with the steeper curve (marker 1) corresponding to a more useful surrogate. In general, the x-coordinates of these curves can be brought to the same scale through a cumulative distribution function (CDF) transformation to facilitate comparison between curves.

After we refine a vaccine to achieve certain immune response levels in phase I/II trials, the next step is to determine whether the refined vaccine has large enough predicted vaccine efficacy in a future licensure trial, based on the change in immune responses we observe in phase I/II studies through the refinement. The quality of this prediction depends on the surrogacy of the biomarker identified in the current trial as well as the ‘bridging’ assumption regarding the relationship between the vaccine effect on immune response and the vaccine effect on infection rate. For example, suppose the risk of HIV infection given

*S*(1) and

*Z* in the current trial can be modeled with

*risk*_{(1)}{

*S*(1)} = Φ{

*β*_{0} +

*β*_{1}*Z* +

*β*_{2}*S*(1) +

*β*_{3}*ZS*(1)} with Φ the CDF of N(0,1), and the refined vaccine leads to a location-shift Δ in immune response distribution relative to the original vaccine. For a valid bridging surrogate,

Follmann (2006) models the HIV infection rate conditional on

*S*(1), the immune response induced by the original vaccine, and the treatment assignment

*Z*^{new} in the future trial with

*risk*_{(zNew)}{

*S*(1)} = Φ [

*β*_{0} +

*β*_{1}*Z*^{new} +

*β*_{2}*S*(1) +

*β*_{3}{

*S*(1) + Δ}

*Z*^{new}]. Then the predicted overall efficacy of the refined vaccine on

*Y* is

with

*F*{

*S*(1)} the distribution of

*S*(1). This model will be used later in our simulation studies and study design. In , we show the curves of

*VE*^{new}(Δ) as a function of Δ corresponding to the same two markers whose vaccine efficacy curves are displayed in . Note that the same location shift in the better surrogate marker (marker 1) corresponds to a higher predicted overall vaccine efficacy.

We next consider the estimation of *VE*{*S*(1)} and *VE*^{New}(Δ), by first estimating the disease risk conditional on *S*(1) and *Z*. We make the following assumptions.

2.1 Identifiability Assumptions

(A1)

SUTVA and Consistency: {*S*(1), *S*(0), *Y*^{τ}(1), *Y*^{τ}(0), *Y*(1), *Y*(0)} of one subject is independent of the treatment assignments of other subjects, and given the treatment a subject actually received, a subject’s potential outcomes equal the observed outcomes.(A2)

Ignorable Treatment Assignments: *Z* *W*, *S*(1), *S*(0), *Y*^{τ}(1), *Y*^{τ}(0), *Y*(1), *Y*(0).(A3)

Equal Early Clinical Risk : *Y*^{τ}(1) = *Y*^{τ}(0) for all subjects.Assumptions (A1)–(A3) have been made in earlier literature (

Gilbert and Hudgens, 2008;

Hudgens and Gilbert, 2009;

Huang and Gilbert, 2011). Basically, (A1) is plausible in trials where participants do not interact with one another and (A2) is ensured by randomization. As discussed in

Wolfson and Gilbert (2010), (A3) is plausible if relatively few clinical events happen before the biomarker is measured. (A3) implies that the risk of

*Y* conditional on

*Z* =

*z*,

*W*,

*S*(1),

*S*(0) and

*Y*^{τ}(0) =

*Y*^{τ}(1) = 0 can be identified based on the subset of subjects assigned

*Z* =

*z* who are observed to have the marker measured at time

*τ* (i.e.,

*Y*^{τ} = 0), with additional identifiability assumptions needed as given below. Henceforth we simplify the notation and drop the conditioning of all probabilities on

*Y*^{τ}(1) =

*Y*^{τ}(0) =

*Y *^{τ} = 0.

Motivated by the design of HIV vaccine trials where

*S*(0) = 0 for all subjects, we focus on risk models conditional on

*S*(1) only, which is the one most relevant to vaccine development. In general when

*S*(0) varies, risk conditional on

*S*(1) has the interpretation of risk conditional on

*S*(1) and

*S*(0) averaged over the conditional distribution of

*S*(0) and is still useful for vaccine development (

Wolfson and Gilbert, 2010). Next, in assumption (A4) we posit generalized linear models for risk conditional on

*Z*,

*S*(1), and

*W*.

(A4)

The risk of *Y* conditional on *Z*, *S*(1) and *W* can be modeled with a parametric function: *risk*_{(}_{z}_{)} {*S*(1), *W*} *P*{*Y*(*z*) = 1|*S*(1), *W*} = *g* {*β; S*(1), *Z*, *W*}, with *g* a pre-specified link function and *β* a finite-dimensional parameter.Based on the standard trial design, (A1)–(A4) and the observed data identify

*risk*_{(0)} and

*risk*_{(1)}. But since

*S*(1) is unobserved for all subjects in the placebo arm (

*Z* = 0), one cannot fully test the appropriateness of the model assumption (A4) as pointed out in

Gilbert and Hudgens (2008). This issue is resolved with the addition of the CPV component, together with the assumptions (A5) and (A6) below.

(A5)

Time-constancy of immune response: For uninfected placebo recipients, *S*(1) = *S*^{true} + *U*_{1}, and *S*_{c} = *S *^{true} + *U*_{2}, for some underlying *S *^{true} and i.i.d. measurement error *U*_{1}, *U*_{2}.(A6)

No placebo subjects uninfected at closeout have an infection over the next *τ* time-units. Under (A5) and (A6), *S*_{c} can be used to substitute *S*(1) for these subjects sampled in CPV. The addition of the CPV component makes (A4) fully testable by allowing non-parametric estimation of the risk model, as sketched in Web Supplementary Appendix A. *Henceforth we simplify the notation and use S to indicate a measurement of vaccine-induced immune response which can be obtained either during standard trial period or during CPV*. Let *N* be the number of trial participants. The observed data are *N* iid copies *O*_{i} = (*Z*_{i}, *W*_{i}, *δ*_{i}, *δ*_{i}S_{i}, *Y*_{i})′, *i* = 1, ···, *N*. Finally, we state two assumptions about the sampling probability of S (either *S*(1) or *S*_{c}) required for validity of the pseudo-score estimators described later in Section 2.3.(A7)

∫ *P*(*δ* = 1|*y*, *Z*, *W*)*dy* > 0 for every *Z*, *W* level.(A8)

∫ *P*(*δ* = 1|*y*, *z*, *W* )*dydz* > 0 for every *W* level.

2.2 Existing Methods and the Motivating Example

Under the two-phase sampling design described above, the *vaccine-induced immune response S* is missing at random (MAR) because it is determined completely by design. The MAR assumption allows identification of the risk model in (A4) based on observed likelihood

where

*F*(

*S*|

*Z*,

*W*) is the CDF of

*S* conditional on

*Z*,

*W*.

Earlier work for identifying risk model parameters in evaluating principal surrogate markers was based on an estimated likelihood approach (

Pepe and Fleming, 1991) that maximizes an estimated version of the likelihood (

3). Specifically, estimation is performed in two steps. In the first step,

*F*(

*S*|

*Z*,

*W*) is estimated; and then in the second step, its estimator

(

*S*|

*Z*,

*W*) is substituted into (

3) and

*β* is estimated as the maximizer of the resulting estimated likelihood. Approaches to estimating

*F*(

*S*|

*Z*,

*W*) vary along the spectrum from nonparametric to parametric (

Gilbert and Hudgens, 2008;

Qin et al., 2008;

Hudgens and Gilbert, 2009;

Huang and Gilbert, 2011). These methods work for both the BIP-only and the BIP+CPV designs. For both designs, the estimation in the first step is achieved using vaccine recipients with

*S* measured given that

*F*(

*S*|

*Z* = 1,

*W*) =

*F*(

*S*|

*Z* = 0,

*W*) =

*F*(

*S*|

*W*) as ensured by the randomization assumption (A2). When sampling of

*S* depends on other phase-I variables such as the response

*Y*, inverse probability weighting (IPW) (

Horvitz and Thompson, 1952) can be implemented to correct for biased sampling (

Gilbert et al., 2011a;

Huang and Gilbert, 2011). Note that in a BIP+CPV design, even if all CPV samples contribute a full conditional likelihood term

*P*(

*Y*|

*Z*,

*W*,

*S*) to the estimated likelihood, they cannot be used for estimating

*F*(

*S*|

*W*). The fact that all infected placebo recipients have zero sampling probability for

*S* prevents the application of IPW to the whole

*S* sample in estimating

*F*(

*S*|

*W*).

In our motivating design of the South Africa HIV vaccine trial,

Gilbert et al. (2011a) considered incorporating the CPV component into the trial design and examined power for detecting principal surrogates using a parametric estimated likelihood approach. They examined two-phase case-control sampling using either a BIP-only or a BIP+CPV design where cases and controls were sampled at 1:5 ratio within the vaccine arm and controls ten times that of the number of cases in placebo arm were included in CPV. A surprising finding was that in some scenarios where

*W* had a strong correlation with

*S*, the BIP-only design was more powerful than the BIP+CPV design for testing an interaction effect between

*S* and

*Z* (Table 7 of

Gilbert et al. (2011a)).

Here we investigate in further detail this seemingly counter-intuitive result of decreased efficiency caused by adding the CPV component. We compare variances of the risk model parameter estimators between the BIP-only design and the BIP+CPV design with varying ratios of CPV sampling. As shown in

Web Supplementary Figure 1, the efficiency loss of the BIP+CPV design relative to the BIP-only design becomes more severe as the proportion of uninfected placebo recipients selected for closeout vaccination increases. In contrast, as also shown in

Web Supplementary Figure 1, if we enter the ‘true’

*F*(

*S*|

*W*) into the observed likelihood (

3), then the BIP+CPV design is more efficient than the BIP-only design and the efficiency gain increases in general with a higher CPV sampling fraction, as expected. These results suggest that the decreased efficiency caused by CPV sampling is due to the fact that two different sets of “validation data” are used in the two steps of the estimated likelihood procedure: the CPV component is included in the validation set in maximizing the likelihood but not in the estimation of the conditional distribution of

*S*. This will be further demonstrated in Section 3.

To use the CPV component more efficiently, we need an estimation method that removes the incompatibility in the use of validation sets as present in the estimated likelihood methods. In the next section, we propose a pseudo-score type estimator as a solution, building upon the original work by

Chatterjee et al. (2003).

2.3 The Pseudo-Score Estimator for Principal Surrogates Evaluation

The score equation of the observed likelihood (

3) is

with

*U*_{β}(

*Y*|

*S*,

*Z*,

*W*) =

log

*P*(

*Y*|

*S*,

*Z*,

*W*)/

*β*.

Equation (4) can be further written into the following parsimonious form incorporating the randomization assumption (A2)

According to Bayes’ theorem, we have

Substituting the right hand side of (

6) for its left hand side into (

5) we arrive at a pseudo-score

We propose to estimate the pseudo-score (

7) by first estimating the distribution of

*S* conditional on

*W* based on

*S* measured in the second phase sample, and then estimating the sampling probability of

*S* conditional on

*S* and

*W*. The latter can be estimated as the sampling probability of

*S* conditional on all covariates and

*Y* together averaged over the joint distribution of

*Y* and

*Z* conditional on

*S* and

*W*. That is,

*P*(

*δ* = 1|

*S*,

*W*) = ∫∫

*P*(

*δ* = 1|

*y*,

*z*,

*S*,

*W*)

*P*(

*y*,

*z*|

*S*,

*W*)

*dydz* = ∫∫

*P*(

*δ* = 1|

*y*,

*z*,

*W*)

*P*(

*y*|

*S*,

*z*,

*W*)

*P*(

*z*)

*dydz*. The corresponding pseudo-score estimator is defined as the solution to (

7). Note this proposed estimator is an extended version of an original pseudo-score estimator proposed by

Chatterjee et al. (2003). We call the original pseudo-score estimator the PSO estimator and the proposed new estimator the PSN estimator. Both estimators transforms the task of estimating the conditional distribution of

*S* in the population into the task of estimating the conditional distribution of

*S* in the sample; PSN requires estimation of

*F*(

*S*|

*W*,

*δ* = 1) while PSO requires estimation of

*F*(

*S*|

*W*,

*Z*,

*δ* = 1) (details provided in

Web Supplementary Appendix B). Note that both PSN and PSO allow incorporation of the CPV component into estimation of the distribution of

*S* conditional on

*W* or

*Z* and

*W*, and can be applied to a design with non-zero CPV component. The PSO estimator does not, however, apply to a BIP-only design since

*F*(

*S*|

*Z* = 0,

*W*,

*δ* = 1) is undefined. In other words, given that

*risk*_{(}_{z}_{)}(

*S*,

*W*) > 0 almost surely, validity of PSO requires the sampling probability of

*S* being greater than zero for every

*Z*,

*S*,

*W* level (assumption (A7)); in contrast, the PSN relaxes this requirement to the weaker requirement (A8) that the sampling probability of

*S* exceeds zero for every

*S*,

*W* level and hence is applicable to both the BIP-only and the BIP+CPV designs.

To obtain the PSN estimator, for each unique value *w* of *W*, we estimate *F*(*s*|*w*, *δ* = 1) empirically with Σ_{δi=1}*I*(*S*_{i} ≤ *s*, *W*_{i} = *w*)/Σ_{δi=1}*I*(*W*_{i} = *w*). An Expectation-Maximization (EM) algorithm can be employed to estimate the risk model parameters *β*:

- Start with an initial value of
*β*; - For a subject
*i* with *δ*_{i} = 1, use its observed data. For a subject with *δ*_{i} = 0, construct a set of filled-in data with length equal to the number of observations in *V*_{Wi}, where *V*_{Wi} is the set of validation subjects with *δ* = 1 and *W* = *W*_{i}. Specifically, for each *j* *V*_{Wi}, we construct a new observation {*Y*_{i}, *S*_{j}, *Z*_{i}, *W*_{i}}. - For each filled-in observation {
*Y*_{i}, *S*_{j}, *Z*_{i}, *W*_{i}}, *j* *V*_{Wi}, calculate an associated weight,
which is an estimate of the density of

*S*_{j} conditional on

*Y*_{i},

*Z*_{i},

*W*_{i}, where

, with

(

*δ* = 1|

*y*,

*z*,

*W*_{i}) a consistent estimate of

*P*(

*δ* = 1|

*y*,

*z*,

*W*_{i}) and

(

*y*|

*S*_{j},

*z*,

*W*_{i}) obtained based on the current

*β* estimate.

- Fit a weighted GLM to the augmented dataset and obtain a new estimate of
*β*. - Repeat steps (II) to (IV) until convergence.

Suppose the sampling probability of

*S* conditional on

*Y*,

*Z*, and

*W* can be modeled with

*P*(

*δ* = 1|

*Y*,

*Z*,

*W*) =

*π*(

*Y*,

*Z*,

*W; α*) for some parameter

*α*. We substitute

*α* with its maximum likelihood estimator (MLE)

to obtain

(

*δ* = 1|

*y*,

*z*,

*w*) for computing the pseudo-score (

7). For example, in the simulation studies described next where the sampling probability of

*S* depends on

*Y* and

*Z* only, we apply a saturated model for the sampling probability of

*S* with

*π* = {

*π*(

*Y*,

*Z*)} = {

*π*(0, 0),

*π*(0, 1),

*π*(1, 0),

*π*(1, 1)}, such that MLE of

*π*(

*y*,

*z*) equals the observed sampling fractions in the category defined by

*Y* =

*y* and

*Z* =

*z*. Under regularity conditions specified in

Web Supplementary Appendix C, the PSN estimator

can be shown to be consistent and asymptotically normally distributed.

Theorem 1 in Web Supplementary Appendix C describes the asymptotic distribution of

with a proof sketched. In our simulation and design studies, we consider a risk model

*P*{

*Y* = 1|

*S*(1),

*Z*,

*W*} = Φ{

*β*_{0} +

*β*_{1}*Z* +

*β*_{2}*S*(1) +

*β*_{3}*S*(1)

*Z*}. Based on risk model parameter estimators

_{0},

_{1},

_{2},

_{3}, we estimate

*VE*{

*S*(1)} with

, and estimate

*VE*^{New}(Δ) with

with pre-specified

*F*(

*s*_{1}). Asymptotic normality of

and

follow from Theorem 1. Their asymptotic variances can be derived based on the Delta method:

and

.