PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of biostsLink to Publisher's site
 
Biostatistics. Jan 2010; 11(1): 34–47.
Published online Oct 8, 2009. doi:  10.1093/biostatistics/kxp034
PMCID: PMC2800161
Semiparametric estimation of the average causal effect of treatment on an outcome measured after a postrandomization event, with missing outcome data
Peter B. Gilbert*
Department of Biostatistics, University of Washington, Seattle, WA 98105, USA and Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA ; pgilbert/at/scharp.org
Yuying Jin
Department of Biostatistics, University of Washington, Seattle, WA 98105, USA
*To whom correspondence should be addressed.
Received December 30, 2008; Revised July 29, 2009; Accepted July 30, 2009.
In the past decade, several principal stratification–based statistical methods have been developed for testing and estimation of a treatment effect on an outcome measured after a postrandomization event. Two examples are the evaluation of the effect of a cancer treatment on quality of life in subjects who remain alive and the evaluation of the effect of an HIV vaccine on viral load in subjects who acquire HIV infection. However, in general the developed methods have not addressed the issue of missing outcome data, and hence their validity relies on a missing completely at random (MCAR) assumption. Because in many applications the MCAR assumption is untenable, while a missing at random (MAR) assumption is defensible, we extend the semiparametric likelihood sensitivity analysis approach of Gilbert and others (2003) and Jemiai and Rotnitzky (2005) to allow the outcome to be MAR. We combine these methods with the robust likelihood–based method of Little and An (2004) for handling MAR data to provide semiparametric estimation of the average causal effect of treatment on the outcome. The new method, which does not require a monotonicity assumption, is evaluated in a simulation study and is applied to data from the first HIV vaccine efficacy trial.
Keywords: Causal inference, HIV vaccine trial, Missing at random, Posttreatment selection bias, Principal stratification, Sensitivity analysis
For randomized placebo-controlled efficacy trials of an HIV vaccine evaluated in HIV-uninfected volunteers, a primary objective is to evaluate the effect of vaccination on the incidence of HIV infection. Another objective, which was secondary in 2 trials of antibody-based vaccines (Flynn and others, 2005; Pitisittithum and others, 2006) and co-primary in 2 trials of T cell–based vaccines (Buchbinder and others, 2008), is to evaluate the effect of vaccination on HIV viral load measured after HIV infection. Two causal approaches of interest for assessing the latter objective are intent-to-treat (ITT) methods that assess the burden of illness (a composite end point that combines the infection end point and the postinfection end point) (Chang and others, 1994) and conditional methods that assess the postinfection end point in the principal stratum of subjects who would be HIV-infected under either treatment assignment (the so-called “always infected” stratum). As discussed by Gilbert and others (2003) [henceforth GBH] and others, the 2 approaches address different substantive questions. For example, in registrational/licensure trials, the ITT approach may be most appropriate for the primary analysis and the conditional approach would be used in secondary analyses of the mechanistic vaccine effect on the postinfection outcome, whereas in preregistrational test-of-concept trials, the conditional approach might be used for the primary analysis (Mehrotra and others, 2006).
Here, we develop methodology for the conditional approach to evaluate the causal vaccine effect on the postinfection end point. For concreteness, we refer to the first postrandomization event as infection and the outcome of interest measured after this event as viral load. However, the method has general application to evaluating causal treatment effects on outcomes measured after a postrandomization event, including studies of quality of life (Rubin, 2000), prostate cancer severity (Shepherd and others, 2008), kidney disease, and cancer screening (Joffe and others, 2007).
Because the set of trial participants who are in the always infected stratum is unknown, causal treatment effects for this group are not identified from randomized trial data. Two approaches for addressing the nonidentifiability have been to derive sharp bounds for causal effects (Jemiai and Rotnitzky, 2003; Hudgens and others, 2003; Zhang and Rubin, 2003) and to estimate causal effects under an additional set of identifiability assumptions that include models describing the nature and degree of possible selection bias, with a sensitivity analysis to explore how the inferences vary over a range of the selection models (GBH; Hayden and others, 2005; Hudgens and Halloran, 2006; Jemiai and Rotnitzky, 2005 (henceforth JR); Jemiai and others, 2007; Shepherd, Gilbert, Jemiai and others, 2006; Shepherd and others, 2007, 2008). All these methods assume that viral load is observed for all infected subjects. Therefore, the validity of their inferences depends on a missing completely at random assumption (MCAR). While the MCAR assumption is often untenable, trials may collect sufficient data on participant characteristics to make a missing at random (MAR) assumption plausible. This article extends the approach of GBH and JR to accommodate MAR missingness of the viral load end point.
To illustrate the new method, we focus on the first efficacy trial (Flynn and others, 2005) and define the end point Yp to be the pre-antiretroviral therapy (pre-ART) log10 viral load at the month 12 visit post HIV infection diagnosis. Only subjects who have not started ART by the month 12 visit contribute a value Yp. We exclude viral load values measured after ART initiation because ART strongly effects viral load levels (Gilbert and others, 2003). Inferences about Yp apply to a population where ART is not prescribed during the first 12 months after infection diagnosis.
Of the 368 subjects who became HIV-infected during the trial, 121 had Yp observed, 138 had missing data because they initiated ART prior to the month 12 study visit, and 109 had missing data because they dropped out prior to initiating ART and prior to the visit. Figure 1 shows the VaxGen data on Yp = Y3 in relationship to 2 key covariates: Y1 is early pre-ART square root CD4 cell count, and Y2 is early pre-ART log10 viral load; these variables average all pre-ART values measured within 2 months of infection diagnosis. Viral load early and at 12 months were positively correlated (Figures 1(e) and (f)), with Spearman rank correlation 0.68 (0.55) for the vaccine (placebo) group. There is evidence of a negative correlation between early CD4 cell count and viral load at 12 months (Figures 1(c) and (d)) for the vaccine group (Spearman correlation = − 0.33) but not for the placebo group (Spearman correlation = 0.16).
Fig. 1.
Fig. 1.
VaxGen trial data: jittered pre-antiretroviral (pre-ART) viral loads at the month 12 postinfection diagnosis visit (Y3) ((a) and (b)); pre-ART viral loads at the month 12 postinfection diagnosis visit versus square root CD4 cell counts early after infection (more ...)
In addition, lower levels of Y1 and higher levels of Y2 were highly predictive of ART initiation (P values  < 0.001 in a multivariate Cox model; Gilbert and others, 2005), showing that MCAR was badly violated. MAR may be plausible, however, because physicians base decisions to prescribe ART on the monitoring of CD4 cell counts, viral loads, HIV-related clinical events, and comorbidities (Hammer and others, 2008). In fact, MAR missingness due to ART initiation may approximately hold “systematically” in settings where ART is offered to all infected participants when their biomarkers are observed to cross prespecified thresholds or they present with prespecified symptomatic illnesses.
Two popular approaches to making valid inferences under MAR are inverse probability weighted (IPW)–based methods (e.g. Robins and others, 1995) and likelihood-based methods (e.g. Little and Rubin, 2002). For some HIV vaccine trials, the former methods are expected to be relatively inefficient and possibly unstable because the estimated weights for some subjects are expected to be near zero. Specifically, for a subject with Yp observed, the weight in the denominator of the estimating equation equals the estimated probability that the subject did not drop out by the month 12 visit multiplied by the estimated probability that the subject did not start ART by the visit conditional on not dropping out. These estimates can be computed as fitted values in regression models based on the subject's covariates. The latter conditional probability may be near zero for subjects whose CD4 cell counts drop below the prespecified threshold at which ART initiation is recommended (currently the recommended threshold is between 200 and 500 cells/mm3 depending on the country). In fact, for ethical reasons any efficacy trial will offer ART to all infected participants who meet treatment criteria, such that the more successful the treatment coverage the closer some estimated weights in the denominator may be to zero. Therefore, for some HIV vaccine trials, IPW methods are expected to provide relatively imprecise inferences.
While likelihood methods are less subject to the instability problem, they are susceptible to misspecification of the model relating Yp to covariates. To partially ameliorate this problem, we use the robust likelihood–based method of Little and An (2004) (henceforth LA) that is based on penalized splines of the propensity score. Our approach only handles monotone missing data (i.e. dropout); an alternative approach would handle nonmonotone missingness using parametric multiple imputation and Monte Carlo Markov chain, for example, by extending the method of Mogg and Mehrotra (2007) for analyzing viral load data to address postrandomization selection bias. The article is organized as follows. Section 2 describes the causal estimand of interest and identifiability assumptions. Section 3 shows how to combine the methodologies of GBH/JR and LA into a procedure for consistently estimating the causal estimand under an MAR assumption. Section 4 evaluates the new method in a simulation study. Section 5 applies the method to the first HIV vaccine efficacy trial, and Section 6 offers concluding remarks.
2.1. Notation and estimand
Let Z be treatment assignment, and let X be a q-vector of baseline covariates fully observed for everyone. Let S be the indicator of the postrandomization event. Subjects experiencing S = 1 are subsequently evaluated at V visits, where variables Y1,…, Yn1 are collected at visit 1, variables Yn1 + 1,…, Yn1 + n2 are collected at visit 2, and so on, with variables Yi = 1V − 1ni + 1,…, Yp collected at visit V, where p = ∑i = 1Vni. The entire collection of p variables measured after S = 1 is Y [equivalent] (Y1,…, Yp), where Yp is the outcome variable of interest. For j = 1,…, p, let Mj be the indicator of whether Yj is missing and set M = (M1,…, Mp). The variables Y are only meaningful if S = 1; thus, Y and M are undefined if S = 0 and we denote this by Y = M = *. For HIV vaccine trials, Z is vaccination assignment (Z = 1, vaccine; Z = 0, placebo) and S is HIV infection diagnosis during the trial.
Each participant has potential infection outcome S(1) if assigned vaccine and S(0) if assigned placebo. For Z = 0,1, the potential outcomes Y(Z) [equivalent] (Y1(Z),Y2(Z),…, Yp(Z)) and M(Z) [equivalent] (M1(Z),M2(Z),…, Mp(Z)) are defined if S(Z) = 1; otherwise Y(Z) [equivalent] * and M(Z) [equivalent] *. With μz [equivalent] E(Yp(z)|S(0) = S(1) = 1) for z = 0,1, the “average causal effect (ACE)” estimand of interest is ACE [equivalent] μ1μ0. Our goal is to estimate the ACE based on assumptions and the observed i.i.d. data (Zi,Xi,Si,Mi,Yi), i = 1,…, N.
2.2. MAR assumption
For subjects with S = 1, let Yobs denote the components of Y that are observed and Ymis denote the components of Y that are missing. Let f be the conditional cumulative distribution function (cdf) of M given Y and S = 1, f(M|Y,S = 1,ν), where ν denotes unknown parameters. MAR states that missingness depends only on the observed values Yobs, that is,
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx1_ht.jpg
For simplicity, we develop the methods for a setting where Y1,…, Yp − 1 are fully observed in infected subjects (with S = 1) and only Yp has missing values. In Section 6, we describe how to extend the method to the case of monotone missing data.
2.3. Identifiability assumptions
Throughout, we make the following 3 assumptions.
  • A1: Stable unit treatment values assumption (Rubin, 1978).
  • A2: The treatment assignment Z is independent of (X, S(0), S(1), M(0), M(1), Y(0), Y(1)).
  • A3: For infected subjects (with S = 1), the missing data mechanism for Y is MAR.
All the papers for studying the causal vaccine effect in the always infected stratum cited in Section 1 assume A1 and A2; we refer the reader to these articles for discussion about their justification. As discussed in Section 1, the MAR assumption A3 may be quite plausible when missingness is mainly due to ART initiation and ART guidelines are used. If the missing data mechanism is believed to be non-MAR, then the methodology here may mislead and should be used with caution. An advantageous feature of the MAR assumption is that investigators can design clinical trials to collect the covariate data that make it plausible.
We next postulate additional assumptions that identify the ACE and are indexed by fixed sensitivity parameters. Following JR, we suppose 3 models, which we refer to collectively as A4:
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx2_ht.jpg
(2.1)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx3_ht.jpg
(2.2)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx4_ht.jpg
(2.3)
where g0 and g1 are known invertible link functions whose inverses are continuous in α0 and α1, α0 and α1 are unknown parameters to be estimated, and β0, β1, and [var phi] are known sensitivity parameters that are varied over plausible ranges, where subject matter experts can help define the plausible ranges. With logit links g0 and g1, β0 is interpreted as the difference in the log odds of infection in the vaccine group given infection in the placebo group with y versus y − 1 viral load, and β1 is interpreted similarly reversing the role of vaccine and placebo. The parameter [var phi] is interpreted as the probability that a subject infected in the vaccine group would also be infected in the placebo group. Except for the method of JR, all of the previously developed methods cited above assume [var phi] = 1 (i.e. monotonicity, that the vaccine does not increase the risk of infection for any subject), in which case the selection model (2.2) is superfluous and only model (2.1) is used, and the only sensitivity parameter is β0. In this case, the methods of GBH and JR are equivalent. For greater applicability of the method, here, we allow for nonmonotonic settings by considering the trio of models (2.1–2.3).
3.1. Estimation of the ACE for complete data
For the case that Yp is measured from all subjects with S = 1, JR proved that A1, A2, and A4 identify the ACE and developed unbiased estimating equations that can be solved to obtain a consistent asymptotically normal estimator of the ACE. Shepherd and others (2008) summarized this result together with a computational procedure for solving the equation and for obtaining the consistent sandwich variance estimator. With θ [equivalent] (p0,p1,α0,α1,μ0,μ1), JR's estimating equation ∑i = 1NUi(θ) = 0 is detailed using our notation in Appendix A ([A.1]–[A.6]) in the supplementary material available at Biostatistics online.
JR's estimating equation was derived based on the following relationships, which form the basis for integrating the LA and JR methods:
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx5_ht.jpg
(3.1)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx6_ht.jpg
(3.2)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx7_ht.jpg
(3.3)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx8_ht.jpg
(3.4)
3.2. Estimation of the ACE for MAR data
We show how to augment JR's estimating equations to handle missing data via LA's method. Following Section 5 of LA, we suppose q + p ≥ 3 so that there are at least 2 covariates. For infected subjects, separately for groups Z = 0 and Z = 1, define the logit of the propensity score for Yp to be observed, given the covariates X,Y1,…, Yp − 1:
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx9_ht.jpg
(3.5)
LA observed that, conditional on the propensity score and assuming MAR, the missingness of Yp does not depend on Z, X, Y1,…, Yp − 1 (Rosenbaum and Rubin, 1983). Consequently, each of the expectations in (3.1–3.4) can be written as
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx10_ht.jpg
(3.6)
where h(Yp) denotes a function of Yp. Therefore, JR's estimating function Ui(θ) can be modified to accommodate MAR missingness, where the modified function UiM(θ) uses (3.6) instead of (3.1)–(3.4). The estimating function UiM(θ) is detailed in Appendix A ([A.7]–[A.12]) in the supplementary material available at Biostatistics online. The function UiM(θ) includes predicted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx13_ht.jpg[h(Yp)] for h(Yp)= gz−1(Yp; α0, β0) or Ypgz − 1(Yp;α0,β0), for z = 0,1, and UiM(θ) becomes fully specified once LA's technique is applied to obtain the predicted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx13_ht.jpg[h(Yp)].
3.3. Calculation of the fitted values
We use LA's (Section 5) penalized spline propensity prediction method to predict Yp for infected subjects with missing data. First, a logistic regression model is fit relating Mp to (X1,…, Xq,Y1,…, Yp − 1), which yields estimated propensities An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx14_ht.jpg for all subjects with S = 1. Second, a spline regression model of Yp on An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx14_ht.jpg is fit using subjects for whom Yp is observed. Beyond An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx14_ht.jpg, additional covariates that predict Yp may be entered into the regression model parametrically. From this fitted model, the value E[Yp|Y1*,X2,…, Xq,Y1,…, Yp − 1,S = 1,Z = z] is predicted for each subject with SMp = 1 using his or her estimated propensity An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx14_ht.jpg and other covariates. Third, as described below, fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx18_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx20_ht.jpg are computed using these regression fits and either numerical integration (for a general link function) or a closed-form equation (for an inverse probit link function). These steps may be done in the vaccine and placebo groups separately.
Applied to our setting, LA's specific propensity spline prediction model replaces one of the predictor variables, say X1, by Y1* and supposes
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx11_ht.jpg
(3.7)
An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx12_ht.jpg
for z = 0,1, where for j = 2,…, q, szjX(Y1*) = E(Xj|Y1*,S = 1,Z = z) is a spline for the regression of Xj on Y1*, where Xj* = XjszjX(Y1*); and for j = 1,…, p − 1, szjY(Y1*) = E(Yj|Y1*,S = 1,Z = z) is a spline for the regression of Yj on Y1*, where Yj* = YjszjY(Y1*). Furthermore, for z = 0,1, rz is a parametric function with unknown parameter vector γz that satisfies rz(Y1*,0,…, 0,γz) = 0 for all γz. For subject i in group z with SiMpi = 1, the predicted value of Yp is obtained as An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx15_ht.jpg, where An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx16_ht.jpg denotes the sample estimate of the spline szjX, An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx17_ht.jpg denotes the sample estimate of the spline szjY, xij is the realization of Xij, and yij is the realization of Yij. The model (3.7) enters Xj* and Yj* as covariates; in follow-up work, Zhang and Little (2005) showed that the LA method—and hence our sensitivity analysis method—remains valid if the Xj*,Yj* are replaced with the Xj,Yj.
Next, using (3.7), for a general link function, we compute the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx18_ht.jpg as An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx19_ht.jpg, where Φ is the cdf of the standard normal distribution. The predicted value An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx20_ht.jpg is also computed using numerical integration. For the case that gz is the inverse probit function, numerical integration is not needed and the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx18_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx20_ht.jpg can instead be computed as simple functions of An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx13_ht.jpg[Yp]. This procedure is described in Appendix A in the supplementary material available at Biostatistics online.
3.4. Computational algorithm for estimating the ACE
The estimate An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx21_ht.jpg is computed by solving ∑i = 1NUiM(θ) = 0 (UiM(θ) is defined in Appendix A [A.7]–[A.12] in the supplementary material available at Biostatistics online) with the following steps.
  • Step 1: Estimate p0 [equivalent] Pr(S(0) = 1) by solving ∑i = 1NU1iM(p0) = 0, and estimate p1 [equivalent] Pr(S(1) = 1) by solving ∑i = 1NU2iM(p1) = 0.
  • Step 2: Plug An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx22_ht.jpg and the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx23_ht.jpg into U3iM(θ) and solve for α0 in ∑i = 1NU3iM(α0) = 0 with a 1D line search. Similarly, plug An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx24_ht.jpg and the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx25_ht.jpg into U4iM(θ) and solve for α1.
  • Step 3: Plug the estimates of p0 and α0 and the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx26_ht.jpg into U5iM(θ) and solve ∑i = 1NU5iM(μ0) = 0 for μ0. Similarly, solve ∑i = 1NU6iM(μ1) = 0 for μ1.
Under the monotonicity assumption (i.e. [var phi] = 1), the computational algorithm is the same except that the second part of Step 2 is omitted (α1 is no longer relevant), and U6iM(θ) simplifies (given in expression (A.13) of Appendix A in the supplementary material available at Biostatistics online). If gz is inverse probit, then Steps 2 and 3 use closed-form formulas (A.14) and (A.15) in Appendix A in the supplementary material available at Biostatistics online; otherwise they use numerical integration as described in Section 3.3.
3.5. Standard errors and CIs
Deriving asymptotic-based standard error estimators for An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx27_ht.jpg and confidence intervals (CIs) for ACE is difficult, given the smoothing and tuning parameter selection involved with the penalized splines. We develop a bootstrap approach. As LA did not develop CIs nor standard error estimators, the performance of this approach is of interest even for the setting without a postrandomization event.
Within each treatment group Z = z separately, B bootstrap data sets are constructed by sampling with replacement Nz realizations of (Xi,Si,Mi,Yi)|Zi = z, z = 0,1. The estimation procedure described in Section 3.4 is carried out for each bootstrap data set. Then, standard errors for An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx28_ht.jpg, An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx29_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx27_ht.jpg are estimated by the sample standard deviations of the bootstrap estimates, and (1 − α) × 100% CIs are obtained as the α/2 and 1 − α/2 percentiles of the bootstrap estimates.
3.6. Robustness of the ACE estimator
LA's Theorem 1 states a double robustness–type property of the propensity spline prediction procedure, see Kang and Schafer (2007) and comments within for further discussion. In our setting, this robustness property can be stated as follows.
Theorem 3.1
Assume A1–A4, and let An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx28_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx29_ht.jpg be the estimators of μ0 and μ1, respectively, obtained by solving ∑i = 1NUiM(θ) = 0. Then, for z = 0,1, An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx30_ht.jpg is a consistent estimator of μz if either (i) the mean of Yp conditional on (Y1*,X2,…, Xq,Y1,…, Yp − 1,S = 1,Z) in (3.7) is correctly specified or (ii) the propensity Y1* is correctly specified, and (iii) E[Xj*|Y1*,S = 1,Z = z] = szjX(Y1*) for j = 2,…, q and E[Yj*|Y1*,S = 1,Z = z] = szjY(Y1*) for j = 1,…, p.
LA comment that the robustness property of (iii) is that rz does not need to be correctly specified and suggest that (b2) is a weak assumption because of the flexibility of the spline regression models.
We briefly describe the simulation study of an HIV vaccine trial, with expanded details in Appendix B in the supplementary material available at Biostatistics online.
4.1. Design of simulation study
Four simulation experiments were carried out such that A1–A4 hold with [var phi] = 1 (i.e. monotonicity holds) and g0 − 1(y;α0,β0) = Φ(α0 + β0y). We created moderate or extreme selection bias specified by β0 = 1 or β0 = 3. Based on the VaxGen trial, we consider the 3 variables Y1, Y2, and Y3 defined in Section 1 (the outcome Y3 is month 12 pre-ART log10 viral load; Y1 is early pre-ART square root CD4 cell count; Y2 is early pre-ART log10 viral load).
Each simulation generated data with an even chance of assignment to vaccine or placebo, an overall placebo infection rate of 25%, and Y(0) = (Y1(0),Y2(0),Y3(0)) multivariate normal, with means and variances equal to the sample estimates obtained in the VaxGen trial (Gilbert and others, 2005). The correlations were set to cor(Y1(0),Y2(0)) = cor(Y1(0),Y3(0)) = − 0.5 and cor(Y2(0),Y3(0)) = 0.8. For infected vaccine group subjects, Y(1) was set to Y(0) + (0,0, − 0.5), which specifies the true ACE = − 0.5.
We assumed complete data for Y1 and Y2 in infected subjects and created missing data for Y3 that was either MCAR or MAR. In either case, Y3 is missing in about 50% of infected subjects. The MAR scenario generated missingness indicators from a logistic regression model of M3 on Y1 and Y2 with coefficients chosen from a fit to the VaxGen data.
Four methods were used to estimate the ACE with a 95% bootstrap percentile CI. The first was ordinary least squares complete case (OLSCC), which simply compares sample averages of Y3 between groups in all infected subjects that have Y3 measured. The next 3 methods are causal: before deletion (BD), which applies GBH/JR to Y3 using the complete data (an unknowable gold standard for comparison); CC, a complete case analysis that applies GBH/JR; SPPL, the new method with penalized spline propensity prediction based on a regression of Y3 on the spline of Y1*, where Y1* is the linear predictor of the estimated propensity to observe Y3 from a linear logistic regression of M3 on Y1 and Y2. This SPPL method uses 15 equally spaced fixed knots and a truncated linear basis and uses model (3.7) with a linear additive model for Y2*:rz(Y1*,Y2*,γz) = γz1Y1* + γz2Y2*, for z = 0,1. Note that the OLSCC method is equivalent to the GBH/JR method for the special case that [var phi] = 1 and β0 = β1 = 0.
4.2. Simulation results
Table 1 shows bias and root mean squared error (RMSE) of the estimated ACE and the coverage probability of the nonparametric bootstrap CIs. In all cases, the OLSCC method is badly biased because it ignores the postrandomization selection bias. As expected, the CC method is unbiased under MCAR missingness but biased under MAR missingness, whereas the new SPPL method is unbiased under both MCAR and MAR data. Corresponding to this, the CIs about the ACE for the CC method have too-low coverage probability under MAR data, whereas the CIs for the SPPL method are always near the correct level. The RMSEs for the SPPL method are 21.1% lower on average than those of the CC method, showing that the new method provides substantially better estimators than the existing methods.
Table 1.
Table 1.
Bias, RMSE and 95% coverage probability for the OLSCC, BD, CC, and SPPL methods (true ACE = – 0.5)
Figure 2 shows distributions of the CI lengths, demonstrating that the SPPL method provides much more precise estimates than the CC method. Furthermore, under MAR missingness the bootstrap variance estimates of An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx27_ht.jpg had a similar distribution as the Monte Carlo sample variances of the An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx27_ht.jpg′s for the SPPL method but not for the CC method, indicating that the bootstrap variance estimator for the SPPL method is unbiased (not shown). Additional simulations were conducted under the above design except that Y(1) was set to Y(0) + (0, − 0.5, − 0.5). The results were similar, demonstrating that the new method performs well when there is an average causal effect on a postinfection covariate used to predict the outcome of interest.
Fig. 2.
Fig. 2.
Simulation study: boxplots of nonparametric bootstrap percentile 95% CI lengths for the ACE.
We now apply the newly developed SPPL method to the VaxGen data. The same method as evaluated in the simulations was applied, except with 3000 bootstrap replications. Our outcome of interest Y3 and covariates Y1 and Y2 are the same as considered above. Of the 368 subjects who became HIV-infected, there is nearly complete data for Y1 (n = 334) and Y2 (n = 341), and, as discussed above, one-third of the infected subjects have Y3 observed (n = 121) (Figure 1). The R2 value of a linear regression of Y3 on Y1 and Y2 is 0.40, suggesting that accounting for the predictive information in Y1 and Y2 with the new method is expected to provide greater efficiency than the CC method that does not leverage this positive correlation.
Because the estimated incidence of HIV infection was slightly lower in the vaccine group than the placebo group (Flynn and others, 2005), the data support (but of course do not prove) the monotonicity assumption that [var phi] = 1. Accordingly, we first conduct a sensitivity analysis assuming [var phi] = 1. For the sensitivity parameter β0 ranging from − 2 to 2, which in our opinion covers the plausible range for potential selection bias, Figure 3 shows point and 95% bootstrap CI estimates of the ACE by the CC method and by the new method. Both analyses support no causal vaccine effect on viral load, and the new method provides slightly narrower CIs.
Fig. 3.
Fig. 3.
Estimated ACE and 95% CIs with [var phi] = 1. Solid lines are for the SPPL method and dashed lines are for the CC method.
Acknowledging that monotonocity may not hold, we next performed a sensitivity analysis setting [var phi] = 0.8. This value was chosen based on the fact that the upper 95% confidence limit for the hazard ratio (vaccine/placebo) was 1.17. A value [var phi] = 0.8 assumes that a vaccinated subject who becomes infected would have a 20% chance of avoiding infection had he/she been assigned placebo, which approximately specifies a 17% plausible elevation of infection risk in the vaccine group. The shaded regions in Figure 4 show values of (β0, β1) under which a 95% CI for the ACE excludes 0. Figure 4 shows that only under conditions of large selection bias in certain directions are the data consistent with a beneficial or harmful effect. We conclude that the data are consistent with the null hypothesis of ACE = 0.
Fig. 4.
Fig. 4.
(a) Estimated ACE, (b) lower 95% confidence limits, and (c) upper 95% confidence limits with [var phi] = 0.8. Solid lines are for the SPPL method and dashed lines are for the CC method.
We conducted the example using an inverse probit link function gz. With this link, interpreting the sensitivity parameters βz requires thinking in the “Z-metric” (normal quantitle metric), where a one unit increase in Yp leads to an increase in the probit score Φ − 1(Pr(S(1) = 1|S(0) = 1,Yp(0) = y) by βz standard deviations. For those uncomfortable thinking in the Z-metric, it may be preferable to use an alternative link function that they find more interpretable, for example, logit. Furthermore, it is good practice in sensitivity analysis to repeat the analysis for multiple link functions.
Recently developed methods for estimating causal treatment effects on an outcome measured after a postrandomization event, including the methods of GBH and JR, are valid under an MCAR assumption. We have extended the GBH and JR methods to allow for MAR missingness, by augmenting it with LA's robust likelihood approach that uses penalized spline propensity prediction. The simulation study showed that the new method corrects for the bias of the existing methods and improves precision by incorporating covariates that predict the outcome.
We have focused on accommodating missing data on the outcome of interest Yp, and as such have supposed complete data for the postrandomization event S. The method could be extended to accommodate MAR missingness of S via IPW approach of Shepherd and others, 2008 to address this issue.
If the missing outcome data follow a monotone pattern, such that Yj + 1,…, Yp are missing for all subjects for whom Yj is missing, then the method can be extended using the approach described in Section 7 of LA. Specifically, the propensity spline model (3.7) can be applied sequentially to each block of missing values. With this approach, the fitted values An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx18_ht.jpg and An external file that holds a picture, illustration, etc.
Object name is biostskxp034fx20_ht.jpg appearing in the estimating function UiM(θ) are computed by replacing missing covariate values by predicted values in sequential regression models.
Extensions of interest include accommodating nonmonotone missing data and expanding the sensitivity analysis to evaluate the impact on inferences of deviations from the MAR assumption.
FUNDING
National Institutes of Health (RO1 AI054165-06) to P.B.
Supplementary Material
[Supplementary Material]
Acknowledgments
Conflict of Interest: None declared.
  • Buchbinder SP, Mehrotra DV, Duerr A, Fitzgerald DW, Mogg R, Li D, Gilbert PB, Lama JR, Marmor M, Del Rio C. and others. The Step study: the first test-of-concept efficacy trial of a cell-mediated immunity HIV vaccine. Lancet. 2008;372:1881–1893. [PMC free article] [PubMed]
  • Chang MN, Guess HA, Heyse JF. Reduction in the burden of illness: a new efficacy measure for prevention trials. Statistics in Medicine. 1994;13:1807–1814. [PubMed]
  • Flynn NM, Forthal DN, Harro CD, Judson FN, Mayer KH, Para MF. The rgp120 HIV Vaccine Study Group. Placebo-controlled phase 3 trial of recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. The Journal of Infectious Diseases. 2005;191:654–665. [PubMed]
  • Gilbert PB, Ackers ML, Berman PW, Francis DP, Popovic V, Hu DJ, Heyward WL, Sinangil F, Shepherd B, Gurwith M. HIV-1 virologic and immunologic progression and antiretroviral therapy initiation among HIV-1-infected participants in an efficacy trial of recombinant glycoprotein 120 vaccine. The Journal of Infectious Diseases. 2005;192:974–983. [PubMed]
  • Gilbert PB, Bosch R, Hudgens MG. Sensitivity analysis for the assessment of causal vaccine effects on viral load in HIV vaccine trials. Biometrics. 2003;59:531–541. [PubMed]
  • Hammer SM, Eron JJ, Jr, Reiss P, Schooley RT, Thompson MA, Walmsley S, Cahn P, Fischl MA, Gatell JM, Hirsch MS. and others. Antiretroviral treatment of adult HIV infection: 2008 recommendations of the International AIDS Society—USA panel. Journal of the American Medicial Association. 2008;300:555–570. [PubMed]
  • Hayden D, Pauler DK, Schoenfeld D. An estimator for treatment comparisons among survivors in randomized trials. Biometrics. 2005;61:305–310. [PubMed]
  • Hudgens MG, Halloran ME. Causal vaccine effects on binary postinfection outcomes. Journal of the American Statistical Association. 2006;101:51–64. [PMC free article] [PubMed]
  • Hudgens MG, Hoering A, Self SG. On the analysis of viral load endpoints in HIV vaccine trials. Statistics in Medicine. 2003;22:2281–2298. [PubMed]
  • Jemiai Y, Rotnitzky A. Semiparametric methods for inferring treatment effects on outcomes defined only if a post-randomization event occurs, [Unpublished PhD. Dissertation]. Harvard School of Public Health, Department of Biostatistics. 2005
  • Jemiai Y, Rotnitzky A, Shepherd B, Gilbert PB. Semiparametric estimation of treatment effects given base-line covariates on an outcome measured after a post-randomization event occurs. Journal of the Royal Statistical Society, Series B. 2007;69:879–902. [PMC free article] [PubMed]
  • Joffe MM, Small D, Hsu C-Y. Defining and estimating intervention effects for groups that will develop an auxiliary outcome. Statistical Science. 2007;22:74–97.
  • Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007;22:523–539. [PMC free article] [PubMed]
  • Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica. 2004;14:949–968.
  • Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 2002.
  • Little R, Zhang G. Extensions of the penalized spline of propensity prediction method of imputation. Biometrics. 2008 doi: 10.1111/j.1541-0420.2008.01155.x. [PubMed]
  • Mehrotra DV, Li X, Gilbert PB. A comparison of eight methods for the dual-endpoint evaluation of efficacy in a proof-of-concept HIV vaccine trial. Biometrics. 2006;62:893–900. [PubMed]
  • Mogg R, Mehrotra DV. Analysis of antiretroviral immunotherapy trials with potentially non-normal and incomplete longitudinal data. Statistics in Medicine. 2007;26:484–497. [PubMed]
  • Pitisuttithum P, Gilbert PB, Gurwith M, Heyward W, Martin M, Van Griensven F, Hu D, Tappero JW, Choopanya K. The Bangkok Vaccine Evaluation Group. Randomized, double-blind, placebo-controlled efficacy trial of a bivalent recombinant glycoprotein 120 HIV-1 vaccine among injection drug users in Bangkok, Thailand. The Journal of Infectious Diseases. 2006;194:1661–1671. [PubMed]
  • Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.
  • Rosenbaum PR, Rubin DB. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. The Journal of the Royal Statistical Society, Series B. 1983;45:212–218.
  • Rubin DB. Bayesian inference for causal effects. Annals of Statistics. 1978;6:34–58.
  • Rubin DB. Comment on “Causal inference without counterfactuals” by A.P. Dawid. Journal of the American Statistical Association. 2000;95:435–437.
  • Shepherd B, Gilbert PB, Jemiai Y, Rotnitzky A. Sensitivity analyses comparing outcomes only existing in a subset selected post-randomization, conditional on covariates, with application to HIV vaccine trials. Biometrics. 2006;62:332–342. [PubMed]
  • Shepherd B, Gilbert PB, Lumley T. Sensitivity analyses comparing time-to-event outcomes only existing in a subset selected post-randomization, conditional on covariates, with application to HIV vaccine trials. Journal of the American Statistical Association. 2007;102:573–582. [PMC free article] [PubMed]
  • Shepherd B, Gilbert PB, Mehrotra DV. Eliciting a counterfactual sensitivity parameter. The American Statistician. 2006;61:56–63.
  • Shepherd BE, Redman MW, Ankerst DP. Does finasteride affect the severity of prostate cancer? A causal sensitivity analysis. Journal of the American Statistical Association. 2008;103:1392–1404. [PMC free article] [PubMed]
  • Zhang JL, Rubin DB. Estimation of causal effects via principal stratification when some outcomes are truncated by “death. Journal of Educational and Behavioral Statistics. 2003;28:353–368.
Articles from Biostatistics (Oxford, England) are provided here courtesy of
Oxford University Press