Home | About | Journals | Submit | Contact Us | Français |

**|**Int J Biostat**|**PMC3126668

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Collaborative targeted maximum likelihood estimation
- 3. Current methods for estimating marginal causal treatment effects
- 4. C-TMLE implementation
- 5. Simulation studies
- 6. Comparison of C-TMLE and TMLE
- 7. Data analysis
- 8. Discussion
- Supplementary Material
- References

Authors

Related links

Int J Biostat. 2010 January 1; 6(1): Article 18.

Published online 2010 May 17. doi: 10.2202/1557-4679.1182

PMCID: PMC3126668

Susan Gruber, University of California, Berkeley;

Copyright © 2010 The Berkeley Electronic Press. All rights reserved

This article has been cited by other articles in PMC.

A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.

Targeted maximum likelihood estimation (TMLE) is a double robust method for estimating causal effects and their non-causal analogs, variable importance measures (van der Laan and Rubin, 2006). Observed data can be viewed as realizations of random variables arising from some true underlying data-generating distribution. An association or causal effect corresponds with some particular parameter of this unknown underlying distribution. TMLE is a two stage semi-parametric estimation methodology that minimizes bias in the estimate of this parameter of interest. Stage one estimates the density of the data-generating distribution. In stage two this initial density estimate is fluctuated in a manner specifically designed to provide the maximal change in the estimate of the target parameter. This targeted fluctuation is based on the efficient influence curve of the parameter, and involves estimating a nuisance parameter that can include a treatment mechanism, a missingness mechanism, and/or a censoring mechanism.

Collaborative targeted maximum likelihood estimation (C-TMLE) is an extension of TMLE that pursues an optimal strategy for nuisance parameter estimation. Theory advanced in van der Laan and Gruber (2010) provides the key insight that only the portion of the nuisance parameter that is not adequately accounted for in the first stage needs to be incorporated into the second stage fluctuation. Stage two of the C-TMLE approach exploits this collaborative double-robustness finding by creating a sequence of nuisance parameter estimates that grow increasingly larger, i.e., more and more non-parametric. Construction of the nuisance parameter estimates is guided data-adaptively based on the goodness-of-fit of the overall density estimate and its effect on the mean squared error of the estimate of the target parameter. In other words, each nuisance parameter estimate in the sequence is carefully constructed to provide the next in a series of fluctuations of the initial density estimate, and each fluctuation is carried out to create a series of candidate TMLE estimators. Likelihood-based cross-validation (possibly using the penalized likelihood) selects the best candidate for the given stage one estimator. This procedure can be carried out for multiple stage one estimators. The C-TMLE estimator is defined as the best among the set of candidate TMLE estimators indexed by stage one and stage two candidates, as chosen by likelihood-based cross-validation. In finite samples the C-TMLE estimator will often be more efficient than the standard TMLE estimator that utilizes an estimate of the entire nuisance parameter in the targeting step. C-TMLE enjoys all the properties of the standard TMLE estimator, namely, it is double robust and asymptotically efficient under appropriate regularity conditions..

C-TMLE is a general methodology that can be applied to a variety of estimators in many settings, including survival analysis, gene association studies, and longitudinal data structures. In this article we focus on one specific application of C-TMLE, non-parametric estimation of a point treatment effect. For simplicity, we limit the discussion of C-TMLE methodology to the simplest case in which there is only one stage one estimate. The extension to multiple stage one candidate estimators is straightforward. The performance of collaborative targeted maximum likelihood estimation methodology is compared with that of established estimation methods: G-computation, inverse probability of treatment weighting (IPTW), propensity score-based causal effect estimation, and unadjusted regression estimation of a point treatment effect. Results from three simulation studies designed to mimic commonly-encountered scenarios demonstrate the versatility of the CTMLE estimator. A fourth simulation contrasts C-TMLE with standard targeted maximum likelihood estimation (TMLE) introduced in van der Laan and Rubin (2006).

The paper is organized as follows. Section 2 describes TMLE and defines a new collaborative double robustness property that motivates the development of CTMLE. Other common estimators of causal effects and variable importance parameters found in the literature (G-computation, propensity score-based, unadjusted, and inverse probability of treatment weighted estimators) are reviewed in Section 3. A particular C-TMLE implementation and an influence curve-based method for obtaining inference is described in Section 4. Section 5 presents three simulations designed to offer a performance comparison across a variety of scenarios common to many analyses of real-world data. In Section 6 we compare the performance of the new C-TMLE estimator with standard TMLE. Variable importance estimates of the effect of HIV mutations on resistance to the anti-retroviral drug lopinavir are presented in Section 7. The paper concludes with a discussion in Section 8. An appendix provides a note on modification to the TMLE procedure that maintains its properties but also enforces Ψ(*Q _{n}*) as an imputation estimator. In addition, R source code that implements and demonstrates applying C-TMLE to analyze simulated data is provided as supplemental materials.

We review targeted maximum likelihood estimation of the additive treatment effect before defining the collaborative targeted maximum likelihood approach. Suppose we have a data set containing *n* independent and identically distributed observations, *O*_{1}, . . . , *O _{n}*, of a random variable

The probability distribution/density of *O* can be factored as *P*_{0}(*O*) = *Q*_{0}(*O*)*g*_{0}(*A | W*), where *Q*_{0}(*O*) = *Q _{Y}*

$$\psi ({Q}_{n})\mathrm{=\mathrm{\frac{1}{n}\mathrm{\sum _{i=1}^{n}({Q}_{n}(1,\mathrm{{W}_{i}})\mathrm{-\mathrm{{Q}_{n}(0,\mathrm{{W}_{i})),}}}}}}$$

where *Q _{n}*(

Targeted maximum likelihood estimation involves obtaining an initial estimate of the true conditional mean of *Y* given *A* and *W*, and subsequently fluctuating this estimate in a manner designed to reduce bias in the estimate of the parameter of interest. Let
${Q}_{n}^{0}(A,\mathrm{W)}$ be the initial estimate of the true conditional mean *Q*_{0}(*A*, *W*). For example, if *Y* is binary, then one constructs a parametric (least favorable) model
$\text{logit}({Q}_{n}^{0}(\mathrm{)\mathrm{(A,\mathrm{W))\mathrm{=\mathrm{\text{logit}({Q}_{n}^{0}\mathrm{+\mathrm{\mathrm{h)}}}}}}}}$, fluctuating the initial estimate
${Q}_{n}^{0}$, where is the fluctuation parameter. The function *h*(*A*, *W*), known as the “clever covariate”, depends on the treatment assignment mechanism *g*_{0}, and is given by

$$h(A,\mathrm{W)\mathrm{=\mathrm{\frac{I(A\mathrm{=\mathrm{1)}{g}_{0}(1|W)}\mathrm{-\mathrm{\frac{I(A=0)}{{g}_{0}(0\mathrm{|\mathrm{W)}}.}}}}{}}}}$$

(1)

The theoretical basis for this choice of clever covariate is given in van der Laan and Rubin (2006). In particular, it has the bias-reduction property that if one estimates with the parametric maximum likelihood estimator, and one sets
${Q}_{n}^{1}$ equal to the resulting update, then the resulting substitution estimator
$\Psi ({Q}_{n}^{1})$ is asymptotically unbiased, even if the initial estimator
${Q}_{n}^{0}$ is inconsistent. These results indicate that estimating *g*_{0} is crucial for reducing bias. However, the choice of an estimator *g _{n}* should be evaluated by how it affects the mean squared error of the resulting targeted maximum likelihood estimator
$\Psi ({Q}_{n}^{1})$, making it a harder and different problem than estimating

TMLE has been shown to be double robust, i.e. the estimate is consistent if either the limits of
${Q}_{n}^{1}$ or *g _{n}* are correctly specified. When both are correct, the estimator is efficient (van der Laan and Rubin, 2006). Recent theoretical advances show that TMLE is also

A particular method for construction of a collaborative estimator *g _{n}* involves building candidate treatment mechanism estimators that grow towards an unbiased estimator of the fully adjusted

Clever covariates based on these candidates give rise to a sequence of updated estimates,
${Q}_{n}^{1}({Q}_{n}^{0},\mathrm{{g}_{n}^{1}),\mathrm{\dots ,\mathrm{{Q}_{n}^{K}\mathrm{({Q}_{n}^{0},\mathrm{{g}_{n}^{1},\mathrm{\dots ,\mathrm{{g}_{n}^{K})}}}}}}}$, each of which provides a candidate TMLE estimate of *ψ*_{0}. The C-TMLE estimate is the best among these candidates, as determined by V-fold *Q*_{0}-log-likelihood-based cross-validation.

Current methods for estimating the marginal causal effect of a treatment *A* on outcome *Y* are compared with C-TMLE on simulated data below. The estimators under consideration are the G-computation estimator (Robins, 1986), the IPTW estimator (Hernan et al. (2000), Robins (2000b)), a double robust IPTW estimator (DR-IPTW), (Robins and Rotnitzky (2001); Robins et al. (2000); Robins (2000a)), a propensity score estimator (Rosenbaum and Rubin, 1983) that calculates the marginal treatment effect as the mean across strata defined by the conditional probability of receiving treatment, and an extension to propensity score estimators implemented in *Matching*, a publicly available R package (Sekhon (2008)).

Recall that our parameter of interest is given by: *ψ*_{0} = *E _{W}*[

$${\psi}_{n}^{\mathit{\text{Gcomp}}}\mathrm{=\mathrm{\frac{1}{n}\sum _{i=1}^{n}({Q}_{n}^{0}(1,\mathrm{{W}_{i})\mathrm{-\mathrm{{Q}_{n}^{0}(0,\mathrm{{W}_{i}))}}}}}}$$

$${\psi}_{n}^{\mathit{\text{IPTW}}}\mathrm{=\mathrm{\frac{1}{n}\sum _{i=1}^{n}[I({A}_{i}\mathrm{=\mathrm{1)\mathrm{-\mathrm{I({A}_{i}\mathrm{\mathrm{=\mathrm{0)]\mathrm{\frac{{Y}_{i}}{{g}_{n}\mathrm{({A}_{i},\mathrm{{W}_{i})}}}}}}}}}}}}}$$

$$\begin{array}{ccc}{\psi}_{n}^{\mathit{\text{DR\u2013IPTW}}}& =& \frac{1}{n}\sum _{i=1}^{n}\frac{[I({A}_{i}\mathrm{=\mathrm{1)\mathrm{-\mathrm{I({A}_{i}\mathrm{=\mathrm{0)]}{g}_{n}({A}_{i}|{W}_{i})}}\mathrm{({Y}_{i}\mathrm{-\mathrm{{Q}_{n}^{0}({W}_{i},\mathrm{{A}_{i}))}}}& & +\frac{1}{n}\mathrm{\sum _{i=1}^{n}({Q}_{n}^{0}(1,\mathrm{{W}_{i})\mathrm{-\mathrm{{Q}_{n}^{0}(0,\mathrm{{W}_{i}))}}}}}}}}}}{}\end{array}$$

$${\psi}_{n}^{\mathit{\text{C\u2013TMLE}}}\mathrm{=\mathrm{\frac{1}{n}\sum _{i=1}^{n}({Q}_{n}^{*}(1,\mathrm{{W}_{i})\mathrm{-\mathrm{{Q}_{n}^{*}(0,\mathrm{{W}_{i}))}}}}}}$$

$${\psi}_{n}^{\mathit{\text{PropScore}}}\mathrm{=\mathrm{\frac{1}{n}\sum _{i=1}^{n}({Q}_{n}^{0}(1,\mathrm{{s}_{i})\mathrm{-\mathrm{{Q}_{n}^{0}(0,\mathrm{{s}_{i}))}}}}}}$$

$${\psi}_{n}^{\mathit{\text{Matching}}}\mathrm{=\mathrm{\frac{1}{n}\sum _{i=1}^{n}({Q}_{n}^{0}(1,\mathrm{{m}_{i})\mathrm{-\mathrm{{Q}_{n}^{0}(0,\mathrm{{m}_{i}))}}}}}}$$

where
${Q}_{n}^{0}$ refers to an initial estimate of *Q*_{0}(*A*, *W*),
${Q}_{n}^{*}$ refers to an updated targeted estimate of *Q*_{0}(*A*, *W*), described in detail in the next section. For the propensity score method, *s _{i}* indicates a stratum of the propensity score of covariate vector

Regarding asymptotic properties of the estimators, the G-computation estimator relies on consistent estimation of *Q*_{0}, the IPTW estimator relies on consistent estimation of *g*_{0}, while the DR-IPTW estimator yields consistent estimates if one or both nuisance parameters are estimated consistently.

Notice that
${\psi}_{n}^{\mathit{\text{C\u2013TMLE}}}$ is a G-computation estimate. However, unlike G-computation, which is consistent only when *Q _{n}* is a consistent estimator for

$$\begin{array}{lll}{\psi}_{n}^{\mathit{\text{C\u2013TMLE}}}\hfill & =\hfill & \frac{1}{n}\sum _{i=1}^{n}\frac{[I\mathrm{({A}_{i}\mathrm{=\mathrm{1)\mathrm{-\mathrm{I({A}_{i}\mathrm{=\mathrm{0)]}{g}_{n}^{*}\mathrm{({A}_{i}\mathrm{|\mathrm{{W}_{i})}}({Y}_{i}\mathrm{-\mathrm{{Q}_{n}^{*}({W}_{i},\mathrm{{A}_{i}))}}}}\hfill & \hfill & +\frac{1}{n}\sum _{i=1}^{n}({Q}_{n}^{*}(1,\mathrm{{W}_{i})\mathrm{-\mathrm{{Q}_{n}^{*}(0,\mathrm{{W}_{i}))}}}}\hfill}}}}}}}{}\hfill \end{array}$$

The propensity score method implemented uses the *Deletion/Substitution/Addition* (DSA) algorithm (Sinisi and van der Laan, 2004) to model conditional treatment probabilities given covariates *W*. This data-adaptive algorithm searches over a large space of polynomial models by adding, subtracting, or substituting terms, starting with a base user-specified regression model. The final model, selected by cross-validation with the L2 loss function, was used to estimate a propensity score for each observation. Observations were then divided into five strata based on the quantiles of these propensity scores. Regression of *Y* on *A* and strata indicator variables using the full model enabled the calculation of stratum-specific treatment effects, which were averaged to obtain the marginal effect. The Matching estimator generalizes the propensity score approach by carefully matching observations in the treatment and control groups in such a way that potential confounders are evenly distributed, across the matches.

The Matching procedure relies on the genetic algorithm (Holland and Reitman (1977)) to achieve this goal. This is a non-parametric approach for selecting weights on covariates that are in turn are used to determine which observations are matched. Candidate sets of matches are evaluated based on a loss function and a distance metric specified at run-time, and are used to generate successive sets of candidates that achieve good balance Sekhon (2008). The marginal treatment effect is the average effect across strata defined by the matches.

Propensity score methods are especially effective when overall match quality is a function of true confounders. Estimates can suffer even when overall match quality is high if a small subset of covariates responsible for introducing the most bias into the estimate is unevenly distributed between treatment and control groups. Because matches are made without regard to the outcome variable, these methods do not exploit all information available in the data and are known to be less than fully efficient (Abadie and Imbens, 2006). A violation of the experimental treatment assignment assumption, also called the positivity assumption, is known to reduce the quality of the match and introduce bias into the estimate, and can be detected once the matches have been specified. The lack of identifiability as measured by such an assumption results in potential bias for each method, but the augmented IPTW, targeted MLE, and G-computation method allow reliance on extrapolation.

The general C-TMLE procedure is to create several stage 1 (non-targeted) density estimators and carry out stage 2 procedures for each of these. Penalized cross-validation is used to choose among the final candidate estimators that are indexed by stage 1 and stage 2 candidates. The implementation presented here is based on only one initial stage 1 estimate for simplicitly. We describe specific choices that were employed for the simulations and data analysis presented in the following sections, occasionally noting other implementation options.

- Step 1: Obtain an estimate ${Q}_{n}^{0}$ of
*Q*_{0}(*A, W*). A data-adaptive machine learning approach to obtaining this initial estimate is recommended. The super learner (SL) is a prediction algorithm that creates a weighted combination of predictions of many individual prediction algorithms, with weights selected using V-fold cross-validation (van der Laan et al., 2007). In practice, it is important to include algorithms in the SL library of predictors that cover different model spaces, e.g. support vector machines, splines, neural nets, etc., since the true best estimation method is unknown. If SL is not used, any particular data adaptive machine learning algorithm providing a consistent estimate is acceptable. For the simulations described in the next section, the DSA algorithm was used to provide the initial estimate of the the true regression of*Y*on treatment*A*and confounders*W*. - Step 2: Generate candidate second stage estimators ${Q}_{n}^{k}$. In the simulations forward selection was used to build a sequence of updates for
*g*_{0}that are increasing in size.Though not required, a sensible approach is to use the intercept model for*g*to construct a first clever covariate,*h*_{1}, used to create the first targeted maximum likelihood candidate, ${Q}_{n}^{1}$.This results in the first candidate second stage estimator ${Q}_{n}^{1}\mathrm{=\mathrm{{Q}_{n}^{0}\mathrm{+\mathrm{{1}_{{h}_{1}}}}}}$, where$$\begin{array}{c}{g}_{n}^{1}(1|W)\mathrm{=\mathrm{P(A\mathrm{=\mathrm{1),\mathrm{{g}_{n}^{1}(0|W)\mathrm{=\mathrm{P(A\mathrm{=\mathrm{0)}}}{h}_{1}\mathrm{=\mathrm{(\frac{I[A\mathrm{=\mathrm{1]}{g}_{n}^{1}(1|W)}\mathrm{-\mathrm{\frac{I[A\mathrm{=\mathrm{0]}{g}_{n}^{1}(0|W)}}{)}.}}}{}}}}}}}}}\end{array}$$_{1}is fitted by least-squares regression of*Y*on*h*_{1}with offset ${Q}_{n}^{0}$. Next we create an updated model for*g*by adding a main term to the intercept, and the resulting targeted MLE using the corresponding clever covariate is evaluated. The best main term is selected based on a penalized log-likelihood criterion for the targeted MLE fit. Additional terms are incorporated in the*g*-fit as long as they increase the overall penalized log-likelihood for the resulting*Q*_{0}-targeted MLE fit. Thus the penalized likelihood is defined as the empirical sum of squared residuals at the resulting*Q*_{0}-fit plus a penalty term proportional to the estimated variance of the target parameter, the empirical variance of*D**, the main component of the efficient influence curve (see below), at the resulting*Q*_{0}-fit and the candidate*g*-fit. In the event that no terms in the model for*g*increase the penalized likelihood of the resulting*Q*_{0}-fit, the targeted MLE update is carried out with the clever covariate that provided the best penalized log-likelihood, and the above process is iterated with this new initial estimator and next clever covariate indexed by*g*fits that are still building on last*g*-fit. - As an example suppose that in addition to the intercept term,
*m*terms, ordered 1, . . . ,*m*, are incorporated into the model for*g*, at which point no further increase of the penalized log likelihood is possible. We define candidate estimators ${Q}_{n}^{2}$ through ${Q}_{n}^{m+1}$ as:where the corresponding models ${g}_{n}^{i+1}$ contains all the terms in the model for ${g}_{n}^{i}$ plus one additional term,$$\begin{array}{ccc}{Q}_{n}^{2}& =& {Q}_{n}^{1}\mathrm{+\mathrm{{2}_{{h}_{2}}}}{Q}_{n}^{3}& =& {Q}_{n}^{1}\mathrm{+\mathrm{{3}_{{h}_{3}}}}& & {Q}_{n}^{m+1}& =& {Q}_{n}^{1}\mathrm{+\mathrm{{m+1}_{\mathrm{{h}_{m+1}}}}}\end{array}$$*i*= 2, . . . ,*m*. At this point ${Q}_{n}^{m+1}$ is considered as a new “initial” estimate of the true regression, and the entire process starts over in order to build a second clever covariate augmenting the previous fit ${g}_{n}^{m+1}$ used in*h*_{m}_{+1}. To continue the example, ${Q}_{n}^{m+2}\mathrm{=\mathrm{{Q}_{n}^{m+1}\mathrm{+\mathrm{{m+2}_{{h}_{m+2}}}}}}$. This process is iterated until all terms are incorporated into the final model for*g*. If the maximal number of terms that can be added is given by*K*, then this results in*K*candidate estimators ${Q}_{n}^{k}$,*k*= 1, . . . ,*K*, corresponding with treatment mechanism estimators ${g}_{n}^{k}$,*k*= 1, . . . ,*K*. Note that the number of clever covariates in ${Q}_{n}^{k}$ that are added to the initial estimator ${Q}_{n}^{0}$ cannot be predicted, and depends on how many covariates can be added to the treatment mechanism estimator in each iteration before reaching the local maximum (not allowing a further increase of the penalized log-likelihood).Note that the model for*g*is not restricted to main terms only. For example, variables can be created that correspond to higher-order terms. In addition, a categorical or continuous covariate can be split into many binary covariates, thereby allowing for more nonparametric modeling of the effect of a single covariate. When there are many covariates it might be desirable in practice to terminate the procedure before all covariates have been incorporated into the model for*g*, though care must be taken to ensure that none of the candidates thereby excluded from the subsequent selection process potentially maximize the penalized log-likelihood criterion. SL can be integrated into the second stage as well. A series of increasingly non-parametric propensity score SL estimates can be obtained based on different adjustment sets. These SL fits are used as the main terms for the stage 2 forward selection to build candidate*ĝ*estimators.The presented algorithm illustrates that the number of clever covariates used to update the initial estimator ${Q}_{n}^{0}$ depends entirely on the likelihood and cannot be pre-determined. Terms are incorporated into the model for*g*for a single clever covariate until there is a decrease in the likelihood. At that point the estimate is updated from ${Q}_{n}^{m}\mathrm{\to \mathrm{{Q}_{n}^{(m+1)}}}$ and the process iterates until all candidate TMLEs have been constructed.We also note that we can represent these estimators ${Q}_{n}^{k}$ and corresponding treatment mechanism estimators ${g}_{n}^{k}$ as mappingsand^{k}*ĝ*applied to the empirical distribution^{k}*P*: ${Q}_{n}^{k}\mathrm{=\mathrm{{\widehat{Q}}^{k}\mathrm{({P}_{n})}}}$, ${g}_{n}^{k}\mathrm{=\mathrm{{\widehat{g}}^{k}\mathrm{({P}_{n})}}}$,_{n}*k*= 1*, . . . , K*. These mappings*P*_{n}*→*(^{k}*P*) represent our candidate estimators of the true regression_{n}*Q*_{0}, and in the next step we use cross-validation to select among these candidate algorithms. - Step 3: Select the estimator that maximizes the V-fold cross-validated penalized likelihood, where V was set to 5. Maximizing the penalized likelihood is equivalent to minimizing the residual sum of squares (RSS) plus a penalty term corresponding to the mean squared error (MSE), which can be decomposed into variance and bias terms:These terms are defined as follows:$${k}^{*}\mathrm{=\mathrm{\underset{k}{\text{argmin}}\mathrm{cv\mathrm{{RSS}_{k}\mathrm{+\mathrm{cvVa{r}_{k}\mathrm{+\mathrm{n\mathrm{*\mathrm{cv{\mathit{\text{Bias}}}_{k}^{2}.}}}}}}}}}}$$$$cv\mathrm{{RSS}_{k}\mathrm{=\mathrm{\sum _{v=1}^{V}\sum _{i\mathit{\text{Val}}(v)}}}}$$$$cv{\mathit{\text{Var}}}_{k}\mathrm{=\mathrm{\sum _{v=1}^{V}\sum _{i\mathit{\text{Val}}(v)}}}$$$$cv\mathrm{{\mathit{\text{Bias}}}_{k}\mathrm{=\mathrm{\frac{1}{V}\mathrm{\sum _{v=1}^{V}\Psi ({\widehat{Q}}^{k}\mathrm{({P}_{nv}^{0}))\mathrm{-\mathrm{\Psi ({\widehat{Q}}^{k}\mathrm{({P}_{n}))}}}}}}}}$$where$$\begin{array}{l}\begin{array}{c}{D}^{*}\mathrm{(Q,\mathrm{g,\mathrm{\Psi \mathrm{(Q))\mathrm{(O)}}=\frac{I\mathrm{[A\mathrm{=\mathrm{1]\mathrm{-\mathrm{I[A\mathrm{=\mathrm{0]}g(A|W)}(Y\mathrm{-\mathrm{Q(A,\mathrm{W))}}}& & +\mathrm{\frac{1}{n}\mathrm{\sum _{i=1}^{n}Q(1,\mathrm{W)\mathrm{-\mathrm{Q(0,\mathrm{W)\mathrm{-\mathrm{\Psi (Q)}}}}}}}}}}}}}}{}}}}\end{array}\end{array}$$
*v*ranging from 1 to V indexes the validation set*Val*(*v*) for the*v*th fold, Ψ(*Q*) is a mapping from*Q*to the parameter of interest, and ${\widehat{Q}}^{k}\mathrm{({P}_{nv}^{0})}$ denotes the*k*-th C-TMLE applied to the corresponding training sample ${P}_{nv}^{0}$, containing*n*(1*− p*) observations, with*p*= 1*/V*.

There are many variations for obtaining
${\psi}_{n}^{\mathit{\text{C\u2013TMLE}}}$. For example, given an a priori set of candidate nuisance parameter estimators, *ĝ _{j}*, that includes highly nonparametric candidates we could construct clever covariates

Note that in contrast with the algorithm described above, in which previous coefficients are used as fixed offsets in the regression, coefficients in front of each term are estimated by least squares, thereby solving the efficient influence equation corresponding to each *ĝ _{j}*, in particular the most non-parametric of these. Because these covariates are highly correlated, refitting all coefficients in front of clever covariates at each step in the forward selection algorithm is likely to result in highly variable coefficient estimates, and therefore less stability in the estimate of the parameter of interest.

Another alternative approach is to define
${\psi}_{n}^{\mathit{\text{C\u2013TMLE}}}\mathrm{=\mathrm{\psi ({Q}_{n}^{1})}}$, where
${Q}_{n}^{1}={Q}_{n}^{0}\mathrm{+\mathrm{{n}_{h}}}$ is the targeted MLE updating the initial estimator with the final selected clever covariate defined by carrying out the *k** moves in the above forward selection algorithm to obtain a *g*-fit, where *k** is the optimal number of moves selected by likelihood-based cross-validation (exactly as above). This variation did not improve performance in simulation studies not presented in this article. We mention these alternatives only to underscore the fact that C-TMLE methodology can be implemented in a variety of ways, and is not limited to the specific implementation presented here.

The variance of the influence curve (IC) of the C-TMLE provides suitable inference, under certain regularity conditions, and assuming that the collaborative estimator *g _{n}* converges to a

$$IC\mathrm{({P}_{0})\mathrm{=\mathrm{{D}^{*}\mathrm{(Q,\mathrm{{g}_{0},\mathrm{{\psi}_{0})\mathrm{+\mathrm{I{C}_{{g}_{0}},}}}}}}}}$$

where *IC*_{g0} denotes the influence curve of the linearization of *P*_{0}*D** (*Q, g _{n}*,

In our application of C-TMLE, *g _{n}* is a data adaptively selected logistic regression model fitted with maximum likelihood estimation. Thus, if we define {

$${IC}_{g0}\mathrm{(O)\mathrm{=\mathrm{-{\alpha}_{0}\mathrm{\mathrm{{IC}_{\alpha}\mathrm{(O)}}}}}}$$

where

$$\begin{array}{ccc}\hfill {a}_{0}& =\hfill & {P}_{0}(Y\mathrm{-\mathrm{Q(A,\mathrm{W))\overrightarrow{W}{h}_{\alpha}(A,\mathrm{W),}}}\hfill {h}_{\alpha}\mathrm{(A,\mathrm{W)}}=\hfill & [\frac{A{g}_{\alpha}(0|W)}{{g}_{\alpha}(1|W)}\mathrm{-\mathrm{\frac{(1\mathrm{-\mathrm{A){g}_{\alpha}(1|W)}{g}_{\alpha}(0|W)}}{]},}}\hfill I{C}_{\alpha}\mathrm{(O)}=\hfill & {P}_{0}\mathrm{{[\overrightarrow{W}{\overrightarrow{W}}^{T}\mathrm{{g}_{\alpha}(1|W){g}_{\alpha}(0|W)}]-1}^{\mathrm{(A\mathrm{-\mathrm{{g}_{\alpha}\mathrm{(1|W))\overrightarrow{W}.}}}}}}\hfill \hfill}\hfill \end{array}$$

The notation *$\stackrel{\u20d7}{W}$* is used to denote the vector of main terms that is included in the logistic regression model *g _{αn}*. Note that

This influence curve is estimated by its empirical analog, given by:

$${\widehat{IC}}_{g0}\mathrm{(O)\mathrm{=\mathrm{-{a}_{n}\mathrm{\mathrm{{\widehat{IC}}_{\alpha}\mathrm{(O)}}}}}}$$

where

$$\begin{array}{ccc}{a}_{n}& =& \frac{1}{n}\mathrm{\sum _{i=1}^{n}({Y}_{i}\mathrm{-\mathrm{\widehat{Q}({A}_{i},\mathrm{{W}_{i})){\overrightarrow{W}}_{i}{h}_{{\alpha}_{n}}\mathrm{({A}_{i},\mathrm{{W}_{i}),}}}}{h}_{{\alpha}_{n}}\mathrm{({A}_{i},\mathrm{{W}_{i})}}=& [\frac{{A}_{i}{g}_{{\alpha}_{n}}\mathrm{(0|{W}_{i})}{g}_{{\alpha}_{n}}\mathrm{(1|{W}_{i})}\mathrm{-\mathrm{\frac{(1\mathrm{-\mathrm{{A}_{i}){g}_{{\alpha}_{n}}\mathrm{(1|{W}_{i})}{g}_{{\alpha}_{n}}\mathrm{(0|{W}_{i})}}]},}{}}{\widehat{IC}}_{\alpha}(O)& =& {[\frac{1}{n}\mathrm{\sum _{i=1}^{n}{\overrightarrow{W}}_{i}{\overrightarrow{W}}_{i}^{T}{g}_{{a}_{n}}(1|{W}_{i}){g}_{{\alpha}_{n}}(0|{W}_{i})}]-1}^{\mathrm{(A\mathrm{-\mathrm{{g}_{{\alpha}_{n}}(1\mathrm{|W))\overrightarrow{W}.}}}}}}}{}}}\end{array}$$

The standard error of the C-TMLE is now estimated as
$\text{SE}({\psi}_{n})\mathrm{=\mathrm{\sqrt{\mathit{\text{var}}(\mathit{\text{IC}})/n}}}$, where
$\text{var}(\mathit{\text{IC}})\mathrm{=\mathrm{1/n\mathrm{{\sum}_{i}\mathrm{{\widehat{IC}}_{i}^{2}}}}}$ is the sample variance of the estimated influence curve. A 95% confidence interval (CI) is constructed as *ψ _{n}* ± 1.96SE(

We remark that it is good practice to incorporate the additional term *IC*_{g0} in the influence curve, thereby targeting the true influence curve of the estimator. We can provide the following qualitative understanding of the contribution of *IC*_{g0} to the influence curve of the estimator. If
${Q}_{n}^{0}$ converges to the true *Q*_{0}, then the term *IC*_{g0} equals zero, and if
${Q}_{n}^{0}$ is inconsistent, and *g _{n}* converges to the fully adjusted

For each simulation we have a data structure *O* = (*W, A, Y*), where *W* = (*W*_{1}, . . . , *W*_{6}) is a set of potential confounders of the relationship between binary treatment variable *A* and continuous outcome *Y*. Our parameter of interest is the marginal causal effect of treatment on the outcome: *ψ* = *E _{W}*[

These simulations are designed specifically to illustrate features of the C-TMLE estimator, and there are other simulations for which the relative performance of the estimators would differ. For example, when a correct model for the underlying data generating distribution is known, a parametric regression approach would be optimal. When the outcome is rare, as is often the case in safety analysis, we would not expect the initial fit,
${Q}_{n}^{0}$, to have much predictive power. In this case, the fully adjusted *g*_{0} is very likely needed for full bias reduction, so creating and evaluating intermediate candidates with C-TMLE may be needlessly computationally expensive. Standard TMLE might be a better approach. Adjusting for many confounders may lead to violations of the ETA assumption when *n* is small relative to the number of confounders or if the confounders are very strongly predictive of treatment. There are two ways to deal with this. First, one could extrapolate based on model assumptions to arrive at an estimate of the desired parameter. Secondly, one could acknowledge that the parameter of interest is not identifiable from the data, and choose an adjustment set that provides bias reduction without yielding an estimate with variance so large that it is essentially meaningless. This latter approach is taken by second stage of the C-TMLE procedure, by basing the selection of confounders based on the penalized log-likelihood, while the extrapolation approach is still present through the initial first stage estimator.

Covariates *W*_{1}, . . . *W*_{5} were generated as independent normal random variables. *W*_{6} is a binary variable.

$$\begin{array}{c}{W}_{1},\mathrm{{W}_{2},\mathrm{{W}_{3},\mathrm{{W}_{4},\mathrm{{W}_{5}}}~& N(0,\mathrm{1)}& \mathit{\text{logit}}(P({W}_{6}\mathrm{=\mathrm{1|{W}_{1},\mathrm{{W}_{2},\mathrm{{W}_{3},\mathrm{{W}_{4},\mathrm{{W}_{5}))}}=& .3{W}_{1}\mathrm{+\mathrm{.2{W}_{2}\mathrm{-3{W}_{3}}}}}}}}}}\end{array}$$

Two treatment mechanisms were defined:

$$\begin{array}{l}\mathit{\text{logit}}({g}_{1,0})=\mathit{\text{logit}}(P(A=1|{W}_{1},\mathrm{{W}_{2},\mathrm{{W}_{3},\mathrm{{W}_{4},\mathrm{{W}_{5},\mathrm{{W}_{6}))=\mathrm{.3{W}_{1}\mathrm{+\mathrm{.2{W}_{2}\mathrm{-\mathrm{3{W}_{3}}}}\mathit{\text{logit}}({g}_{2,0})=\mathit{\text{logit}}(P(A=1|{W}_{1},\mathrm{{W}_{2},\mathrm{{W}_{3},\mathrm{{W}_{4},\mathrm{{W}_{5},\mathrm{{W}_{6}))=\mathrm{.15(.3{W}_{1}\mathrm{+\mathrm{.2{W}_{2}\mathrm{-\mathrm{3{W}_{3})}}}}}}}}}}\hfill}}}}}}}\hfill \end{array}$$

The observed outcome *Y* was generated as

$$Y\mathrm{=\mathrm{{Q}_{i,0}(A,\mathrm{W)\mathrm{+\mathrm{\mathrm{,\mathrm{\mathrm{\mathrm{~\mathrm{N(0,\mathrm{1)}}}}}}}}}}}$$

with corresponding regression equations:

$${Q}_{1,0}(A,\mathrm{W)\mathrm{=\mathrm{A\mathrm{+\mathrm{.5{W}_{1}\mathrm{-\mathrm{8{W}_{2}\mathrm{+\mathrm{{W}_{3}\mathrm{+\mathrm{8{W}_{3}\mathrm{-\mathrm{2{W}_{5}}}}}}}}}}}}}}$$

$${Q}_{2,0}(A,\mathrm{W)\mathrm{=\mathrm{A\mathrm{+\mathrm{.5{W}_{1}\mathrm{-\mathrm{8{W}_{2}\mathrm{+\mathrm{{W}_{3}\mathrm{+\mathrm{8{W}_{3}^{2}\mathrm{-\mathrm{2{W}_{5}}}}}}}}}}}}}}$$

We consider three different data-generating distributions, (*Q*_{1,0}, *g*_{1,0}) in simulation 1, (*Q*_{2,0}, *g*_{1,0}) in simulation 2, and (*Q*_{2,0}, *g*_{2,0}) in simulation 3. Note that *W*_{6} is strongly correlated with treatment mechanism *A* in simulations 1 and 2 (corr=0.54), but is not an actual confounder of the relationship between *A* and *Y*. *W*_{1}, *W*_{2}, and *W*_{3} are confounders. The linear nature of the confounding due to*W*_{3} in simulation 1 differs from that in simulations 2 and 3, where the true functional form is quadratic. In this way simulations 2 and 3 mimic realistic data analysis scenarios in which the unknown underlying functional form is seldom entirely captured by the regression model used in the analysis. Finally, the treatment mechanism in simulations 1 and 2 leads to ETA violations (*p*(*A* = *a | W*) ranges between 9 × 10^{−7} and 0.9999978, approximately one-third of the probabilites are outside the range (0.05, 0.95)). In simulation 3 there are no ETA violations (0.11 < *p*(*A* = *a | W*) < 0.88). In each simulation the true value of the parameter of interest is 1.

1000 samples of size *n* = 1000 were drawn from each data generating distribution. Marginal treatment effect estimates were calculated based on the unadjusted regression of *Y* on *A*, Gcomp, IPTW, DR-IPTW, propensity score and C-TMLE methods.

A main-effects model for Gcomp and DR-IPTW, , was obtained using the DSA algorithm with the maximum model size set to seven. A model for the treatment mechanism *ĝ* used in IPTW, DR-IPTW, propensity score, and Matching estimation was also selected by DSA, again restricted to main terms. The Matching function considered this treatment mechanism model as merely one additional covariate, indistinguishable from the other potential confounders, *W*. The procedure was run using default settings, except population size for each generation was increased to 200. In contrast, the C-TMLE algorithm includes an aggressive search through a larger space of models to obtain an initial estimate of the density. As a proxy for the super-learner algorithm we used the DSA algorithm to select a model for containing at most six terms, allowing quadratic terms and two-way interactions.

We expect to see that the estimators that rely on consistent estimation of *Q*_{0} are unbiased in simulation 1, (Gcomp, DR-IPTW, C-TMLE), while estimators relying on consistent estimation of *g*_{0} are unbiased in simulation 3 (IPTW, DR-IPTW, propScore, Matching, C-TMLE).

Mean estimates of the treatment effect and standard errors are shown in Table 1 for each simulation. Mean estimates and (0.025, 0.975) quantiles of the probability distribution of each estimator are plotted in Figures 1 and and22.

Mean estimates and (0.025,0.975) quantiles for each estimation method, simulations 1 and 2. Dashed line is at true parameter value.

Mean estimates and (0.025,0.975) quantiles for each estimation method, simulation 3. Dashed line is at true parameter value.

Mean estimate and standard error (SE) for each estimator based on 1000 iterations with sample size *n* = 1000. *ψ*_{0} = 1.

Figures 1 and and22 illustrate each estimator’s behavior. As expected, estimators relying on consistent estimation of *Q*_{0} are unbiased in simulation 1, estimators relying on consistent estimation of *g*_{0} are unbiased in simulation 3.

- The unadjusted estimator yields biased results in all three simulations due to its failure to adjust for confounders.
- The G-computation estimator performs well in simulation 1 when the model is correctly specified. We understand that misspecification (simulations 2 and 3) will often, though not always, lead to bias in the estimates. However the plots highlight another phenomenon that is easy to overlook. the inability of the misspecified model to adequately account for the variance in the outcome often leads to large residual variance of the estimator, and in practice would have low power to reject a null hypothesis.
- Truncation bias due to ETA violations causes the IPTW estimator using truncated weights to fail in simulations 1 and 2. The estimate is not biased in simulation 3, but the variance is so large that even in this setting where we’d expect IPTW to be reliable it would fail to produce a significant result.
- DR-IPTW estimates are unbiased and have low variance when the functional form is correctly modeled by the regression equation (simulation 1). Though we see little bias in the other two simulations, the variance is large due to misspecification of the treatment mechanism. Because
*W*_{6}is a strong predictor of*A*and is indistinguishable from a true confounder of the relationship between*Y*and*A*it is always included in the treatment mechanism, behavior that does not help achieve an accurate estimate of the true treatment effect. - Propensity score estimators are known to perform poorly when there are ETA violations, e.g. simulations 1 and 2 (Sekhon (2008)). Researchers constructing the propensity score could observe this and choose an alternate propensity score model, but without using information about the outcome this choice would likely be made based on the predictive power of the model, not the potential bias reduction. Both propensity score-based methods do a reasonable job in simulation 3. Abadie and Imbens (2006) shows that matching estimators will not obtain the semi-parametric efficiency bound. This theory is borne out in simulation 3, where neither matching methodology confidence interval is as tight as that of the collaborative targeted maximum likelihood estimator.

The double robust property of the targeted maximum likelihood estimator obviates the need for accurate estimation of both *Q*_{0} and *g*_{0} since correct specification of either one leads to consistent estimates of the parameter of interest. However, accurate estimates of both are needed to achieve the Cramer-Rao efficiency bound. Implementations of the standard targeted maximum likelihood estimator (TMLE) therefore strive for ideal estimates of both *Q*_{0} and *g*_{0}.

In contrast, the collaborative nature of the second stage of the C-TMLE estimation algorithm leads to selection of an estimator, *g _{n}*, that targets that portion of the treatment mechanism needed to reduce bias not already adequately addressed by the first stage estimator for

Covariates *W*_{1}, *W*2, and *W*_{3} were generated as independent random uniform variables over the interval [0, 1]. *W*_{4} and *W*_{5} are independent normally distributed random variables.

$$\begin{array}{c}\hfill {W}_{1},\mathrm{{W}_{2},\mathrm{{W}_{3}}}\hfill ~& \hfill U(0,\mathrm{1)}& \hfill {W}_{4},\mathrm{{W}_{5}}\hfill ~& \hfill N(0,\mathrm{1)}\end{array}$$

Treatment mechanism *g*_{0} was designed so that *W*_{3} is highly predictive of treatment:

$$\mathit{\text{logit}}({g}_{0})\mathrm{=\mathrm{\mathit{\text{logit}}(P(A\mathrm{=\mathrm{1|W))\mathrm{=\mathrm{2{W}_{1}\mathrm{+\mathrm{{W}_{2}\mathrm{-\mathrm{5{W}_{3}\mathrm{+\mathrm{{W}_{5}}}}}}}}}}}}}$$

The observed outcome *Y* was generated as

$$Y\mathrm{=\mathrm{{Q}_{0}(A,\mathrm{W)\mathrm{+\mathrm{\mathrm{,\mathrm{\mathrm{\mathrm{~\mathrm{N(0,\mathrm{1)}}}}}}}}}}}$$

with corresponding regression equation:

$${Q}_{0}\mathrm{(A,\mathrm{W)\mathrm{=\mathrm{A\mathrm{+\mathrm{4{W}_{1}\mathrm{-\mathrm{5{W}_{2}\mathrm{+\mathrm{5{W}_{4}{W}_{5}}}}}}}}}}}$$

C-TMLE and TMLE estimates of the parameter of interest, again defined as *ψ* = *E _{W}*[

For each iteration an initial regression,
${Q}_{n}^{0}$, was obtained by fitting the DSA-selected model, *Y* = *A* + *W*_{1} + *W*_{2}, on *n* observations in the sample. We expect that any estimate of *ψ* based solely on this model is likely to be incorrect because the model fails to take into account the effect on the outcome of the missing interaction term, and also fails to adjust for the confounding effect of *W*_{5}. The targeting step for both targeted maximum likelihood estimators reduces this bias.

In order to construct the covariate used to target the parameter of interest in the updating step of the TMLE algorithm we obtain an estimate *g _{n}* of

Results of the simulation are shown in Table 2. A small number of aberrant TMLE estimates were major contributers to the variance of that estimator. The three highest TMLE estimates of the treatment effect were (771.91, 37.22, 9.52). It is likely that these high values arise from atypical samples containing observations that presented unusually strong ETA issues. In contrast, all C-TMLE estimates calculated from those same samples range between 0.307 and 1.698. Both estimators’ average treatment effect estimates are not far from the true value, *ψ*_{0} = 1. As expected, the variance of the TMLE estimator is many times larger than that of the C-TMLE estimator.

Comparison of C-TMLE and TMLE estimators at different levels of truncation. Mean estimate and variance based on 1000 iterations.

Not surprisingly, *W*_{3}, the strong predictor of treatment that is not a true confounder of the relationship between treatment and outcome, is included in every one of the 1000 models for *g _{n}* selected by the DSA algorithm, but it is included in only 35 of the models constructed in the second stage of the C-TMLE algorithm. At the same time, the interaction term

This clearly demonstrates the differences between TMLE’s reliance on an external estimate of *g*_{0} and the collaborative approach to estimating the treatment mechanism used by C-TMLE. However, we note that the degradation of TMLE performance under sparsity is due to the unboundedness of the fluctuation function, and can be mitigated by employing an alternative fluctuation function that respects known bounds on the data model. Though a full discussion is beyond the scope of this paper, details may be found in Gruber and van der Laan (2010).

The variance of the influence curve provides the basis for calculation of a 95% confidence interval for the C-TMLE estimate.

$$95\%CI\mathrm{=\mathrm{{\psi}^{\mathit{\text{C\u2013TMLE}}}\mathrm{\pm \mathrm{1.96\mathrm{\sqrt{\mathit{\text{var}}(\mathit{\text{IC}})/n)}}}}}}$$

Two sets of confidence intervals were constructed for each of the 1000 iterations in simulation 4, with
${Q}_{n}^{0}$ misspecified by a main-terms only regression model. As described above, one set of CIs is based on *D** (*Q, g*), the first term of the IC. The second set is based on the variance of *D** (*Q, g*) + *IC _{g}*, which includes the contribution from the estimation of

Empirical coverage of 1000 confidence intervals constructed at a nominal 95% level. SE calculated as
$\sqrt{\mathit{\text{var}}(\mathit{\text{IC}})/n}$, where the *IC* was estimated with and without *IC*_{g}.

Confidence intervals were also created for an additional 1000 samples from the same data generating distribution that were analyzed using a correct model for
${Q}_{n}^{0}$. Coverage rates for these confidence intervals are given in Table 3. When
${Q}_{n}^{0}$ is correctly specified we observe little difference in the coverage rate whether or not we take the contribution from *IC _{g}* into account, indicating zero contribution to the variance from the estimate of

We apply the C-TMLE estimator to an observational dataset previously analyzed with the goal of identifying HIV mutations that affect response to the antiretroviral drug lopinavir. (Bembom et al., 2009, 2008) The data includes observations on *O* = (*W*, *A*, *Y*), where the outcome, *Y*, is the change in log_{10} viral load measured at baseline and at follow-up after treatment has been initiated. If follow-up viral load was beneath the limit of detection *Y* was set to the maximal change seen in the population. *A* {0, 1} is an indicator of the presence or absence of a mutation of interest, taking on the appropriate value for each of the 26 candidate mutations in 26 separate analyses. W consists of 51 covariates including treatment history, baseline characteristics, and indicators of the presence of additional HIV mutations. Practical ETA violations stemming from high correlations among some of the covariates and/or low probability of observing a given mutation of interest make it difficult to obtain stable low variance estimates of the association between *A* and *Y*. Bembom used a targeted maximum likelihood estimation approach incorporating data-adaptive selection of an adjustment set that relies on setting a limit on the maximum allowable truncation bias introduced by truncating treatment probabilities less than *α* to some specified lower limit. Covariates whose inclusion in the adjustment set introduces an unacceptable amount of bias are not selected. That study’s findings showed good greement with Stanford HIVdb mutation scores, values on a scale of 0 to 20 (http://hivdb.stanford.edu, as of September, 2007, subsequently modified), where 20 indicates evidence exists that the mutation strongly inhibits response to drug treatment and 0 signifies that the mutation confers no resistance. Because the C-TMLE method includes covariates in the treatment mechanism only if they improve the targeting of the parameter of interest without having too adverse an effect on the MSE, we expect similar performance without having to specify truncation levels or an acceptable maximum amount of bias.

The dataset consists of 401 observations on 372 subjects. Correlations due to the few subjects contributing more than one observation were ignored. Separate analyses was carried out for each mutation. In each, an initial density estimate,
${Q}_{n}^{0}$, was obtained using DSA restricted to addition moves only to select a main-terms model containing at most 20 terms, where candidate terms in *W* include pre-computed interactions detailed in Bembom et al. *A* was forced into the model. An estimate of the effect on change in viral load was recorded for each mutation. Influence curve-based variance estimates incorporating the contribution from estimating *g* given by the *IC _{g}* term, was used to construct 95% confidence intervals.

Table 4 lists the Stanford mutation score associated with each of the HIV mutations under consideration, as well as the C-TMLE estimate of the adjusted effect of mutation on lopinavir resistance. 95% confidence intervals were constructed based on the variance of the IC. Confidence intervals entirely above zero indicate a mutation increases resistance to lopinavir. Eight of the twelve mutations having a mutation score of 10 or greater fall into this category. Point estimates for the remaining four mutations were positive, but the variance was too large to produce a significant result. Only one of the six mutations thought to confer slight resistance to lopinavir was flagged by the procedure, though with the exception of p10FIRVY point estimates were positive. Stanford mutation scores of 0 for four of the five mutations found to have a significantly negative effect on drug resistance support the conclusion that these mutations do not increase resistance, but are not designed to offer confirmation that a mutation can decrease drug resistance. However, Bembom et al. report that there is some clinical evidence that two of these mutations, 30N and 88S, do indeed decrease lopinavir resistance.

Stanford score (2007), C-TMLE estimate and 95% confidence interval for each mutation. Starred confidence intervals do not include 0.

Our findings are quite consistent with the Stanford mutation scores and with the results from the previous analysis using the data-adaptively selected adjustment set targeted maximum likelihood estimation approach. The C-TMLE method was able to achieve these results without relying on ad hoc or user-specified tuning parameters.

Simulation studies demonstrate the collaborative double robustness and efficiency of C-TMLE methodology, which allows for consistent efficient estimation in situations when other estimators can fail to perform adequately. In practice these failures may lead to biased estimates and to confidence intervals that fail to attain the correct coverage, as suggested by the IPTW results in simulations 1 and 2, where weights depend on a variable highly predictive of treatment that is not a true confounder of the relationship between Y and A. It is worth noting that the unadjusted estimator applied to data from a randomized controlled trial in which randomization fails to evenly distribute confounders across treatment arms will also yield (finite sample) biased results, as we saw in simulations 1,2, and 3.

As simulations 2 and 3 demonstrate, a misspecied parametric model not only results in biased estimates, but can also easily fail to adequately explain the variance in the outcome. Therefore estimates of the parameter of interest will have a larger variance than the semiparametric information bound achieved by an efficient estimator, such as C-TMLE. Such misspecified parametric models can easily result in the construction of a confidence interval that contains 0, and therefore a failure to reject a false null hypothesis, even when the point estimate is close to the true value of the parameter of interest. Since misspecied parametric models are the rule rather than the exception, in the analysis of data from an unknown data-generating distribution, using C-TMLE combined with super learning for the initial estimator, is a prudent course of action.

Estimators that rely on nuisance parameter estimation (IPTW, DR-IPTW, TMLE, propensity score-based estimation) break down when there are ETA violations, failing to reduce bias, or even increasing bias, while incurring high variance that renders estimates meaningless (no statistical signicance). An effort to reduce variance through truncation introduces bias into the estimate, and requires a careful trade-off. C-TMLE addresses these issues, in the sense that it is able to utilize the covariates for effective bias reduction, avoiding harmful bias reduction efforts. As a targeted-MLE, the bias-variance tradeoff is targeted towards the estimation of the parameter of interest, not the estimate of the entire density. The collaborative nature of the estimation of the treatment mechanism in the C-TMLE confers three advantages:

- The treatment mechanism model will exclude covariates that are highly predictive of treatment but do not truly confound the relationship between treatment and the outcome.
- The treatment mechanism model will include only covariates that help adjust for residual bias remaining after stage 1 adjustment.
- Cross-validation based on a penalized log-likelihood will not select a treatment mechanism model that includes a term that leads to violations of the ETA assumption and thereby large variance of the corresponding targeted MLE without the benefit of a meaningful bias reduction.

Influence-curve based inference is theoretically sound, and achieves the desired coverage rate across a wide range of simulations, in addition to the ones presented. An implementation of C-TMLE methodology written for the R statistical environment that includes a demonstration based on the data-generating scheme for simulation 2 is available as supplemental materials.

Consider the goal of estimating *EY*_{1}. It is desirable for the estimator *Q _{n}* to satisfy 0 = ∑

Details of this derivation follow. Let *O* = (*W*, *A*, *Y*), with *Y* binary. Consider the score *D*(*Q*)(*O*) = *E _{Q}*(

$${E}_{Q}({Y}_{1}|O)\mathrm{=\mathrm{I(A\mathrm{=\mathrm{1)Y\mathrm{+\mathrm{I(A\mathrm{=\mathrm{0){E}_{Q}(Y|A\mathrm{=\mathrm{1,\mathrm{W).}}}}}}}}}}}$$

We wish to find the component of *D*(*Q*) that is in the tangent space of the conditional distribution of *Y*, given *A, W*. We have

$$E(D(Q)|A,\mathrm{W)\mathrm{=\mathrm{E(Y|A\mathrm{=\mathrm{1,\mathrm{W).}}}}}}$$

We decompose

$$\begin{array}{lll}D(Q)\hfill & =\hfill & {D}_{1}(Q)\mathrm{+\mathrm{{D}_{2}(Q)}}\hfill & =\hfill & \{D(Q)\mathrm{-\mathrm{E(Y|A\mathrm{=\mathrm{1,\mathrm{W)\}\mathrm{+\mathrm{\{E(Y|A\mathrm{=\mathrm{1,\mathrm{W)\mathrm{-{E}_{Q}{Y}_{1}\}.}}}}}}}}}}}\hfill \hfill \end{array}$$

The first component can be written as

$$\{E({D}_{1}(Q)|Y\mathrm{=\mathrm{1,\mathrm{A,\mathrm{W)\mathrm{-\mathrm{E({D}_{1}(Q)|Y\mathrm{=\mathrm{0,\mathrm{A,\mathrm{W)\}\mathrm{(Y\mathrm{-\mathrm{{E}_{Q}(Y|A,\mathrm{W)),}}}}}}}}}}}}}}$$

which reduces to *A*(*Y − E _{Q}*(

Consider now the score *D*(*Q*)(*O*) = *E _{Q}*(

$${E}_{Q}({Y}_{0}|O)\mathrm{=\mathrm{I(A\mathrm{=\mathrm{1){E}_{Q}(Y|A\mathrm{=\mathrm{0,\mathrm{W)\mathrm{+\mathrm{I(A\mathrm{=\mathrm{0)Y.}}}}}}}}}}}$$

We wish to find the component of *D*(*Q*) that is in the tangent space of the conditional distribution of *Y*, given *A*, *W*. We have

$$E(D(Q)|A,\mathrm{W)\mathrm{=\mathrm{E(Y|A\mathrm{=\mathrm{0,\mathrm{W).}}}}}}$$

We decompose

$$D(Q)\mathrm{=\mathrm{\{D(Q)\mathrm{-\mathrm{E(Y|A\mathrm{=\mathrm{0,\mathrm{W)\}\mathrm{+\mathrm{\{E(Y|A\mathrm{=\mathrm{0,\mathrm{W)\mathrm{-\mathrm{{E}_{Q}{Y}_{0}\}.}}}}}}}}}}}}}}$$

We can write this first component as

$$\{E({D}_{1}(Q)|Y\mathrm{=\mathrm{1,\mathrm{A,\mathrm{W)\mathrm{-\mathrm{E({D}_{1}(Q)|Y\mathrm{=\mathrm{0,\mathrm{A,\mathrm{W)\}\mathrm{(Y\mathrm{-\mathrm{{E}_{Q}(Y|A,\mathrm{W)),}}}}}}}}}}}}}}$$

which reduces to (1 − *A*)(*Y − E _{Q}*(

Suppose now that one wishes to solve the score equation of score *D*(*Q*) = *E _{Q}*(

$$\begin{array}{l}{E}_{Q}({Y}_{1}\mathrm{-\mathrm{{Y}_{0}|O)}}=\hfill & I(A\mathrm{=\mathrm{1)\{Y\mathrm{-\mathrm{{E}_{Q}(Y|A\mathrm{=\mathrm{0,\mathrm{W)\}}}}\hfill & \hfill & +I\mathrm{(A\mathrm{=\mathrm{0)\mathrm{\{{E}_{Q}(Y|A\mathrm{=\mathrm{1,\mathrm{W)\mathrm{-\mathrm{Y\}.}}}}}}}}}\hfill}}}}\hfill \hfill \end{array}$$

We wish to find the component of *D*(*Q*) that is in the tangent space of the conditional distribution of *Y*, given *A*, *W*. We have *E*(*D*(*Q*) | *A*, *W*) = *E _{Q}*(

$$D(Q)\mathrm{=\mathrm{{D}_{1}(Q)+{D}_{2}(Q)\mathrm{\mathrm{\{D(Q)-E(D(Q)|A,\mathrm{W)\}+\{E(D(Q)|A,\mathrm{W)\mathrm{-{E}_{Q}{Y}_{1}\}.}}}}}}}$$

We can write this first component as

$$\{E({D}_{1}(Q)|Y\mathrm{=\mathrm{1,\mathrm{A,\mathrm{W)\mathrm{-\mathrm{E({D}_{1}(Q)|Y\mathrm{=\mathrm{0,\mathrm{A,\mathrm{W)\}\mathrm{(Y\mathrm{-\mathrm{{E}_{Q}(Y|A,\mathrm{W)),}}}}}}}}}}}}}}$$

which reduces to (2*A −* 1)(*Y − E _{Q}*(

Therefore, if we want
${Q}_{n}^{*}$ to be an imputation estimator for both *EY*_{0}, *EY*_{1}, then we wish to have an estimator
${Q}_{n}^{*}$ that solves the bivariate score equation

$$0\mathrm{=\mathrm{\sum _{i}(1,\mathrm{{A}_{i})({Y}_{i}\mathrm{-\mathrm{{E}_{{Q}_{n}^{*}}(Y|{W}_{i},\mathrm{{A}_{i})).}}}}}}$$

This can be arranged by applying the targeted MLE update with the fluctuation function corresponding with covariate-extension (1, *A, h _{g}*(

Susan Gruber, University of California, Berkeley.

Mark J. van der Laan, University of California, Berkeley.

- Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74:235–67. doi: 10.1111/j.1468-0262.2006.00655.x. [Cross Ref]
- Bembom O, Fessel JW, Shafer RW, van der Laan MJ. Data-adaptive selection of the adjustment set in variable importance estimation. Technical report, DigitalCommons@Florida Atlantic University [ http://digitalcommons.fau.edu/cgi/oai2.cgi] (United States), 2008. URL http://www.bepress.com/ucbbiostat/paper231.
- Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Statistics in Medicine. 2009;28:152–72. doi: 10.1002/sim.3414. [PubMed] [Cross Ref]
- Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Technical report 265, Division of Biostatistics, University of California, Berkeley, May 2010. [PMC free article] [PubMed]
- Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [PubMed] [Cross Ref]
- Holland JH, Reitman JS. Cognitive systems based on adaptive algorithms. SIGART Bull. 1977;63:49–49. ISSN 0163-5719. http://doi.acm.org/10.1145/1045343.1045373.
- Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer” Statistica Sinica. 2001;11(4):920–936.
- Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000;450:431–435. doi: 10.2307/2669381. [Cross Ref]
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association. 2000a.
- Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [Cross Ref]
- Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer, New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133.
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [Cross Ref]
- Rubin DB, van der Laan MJ. Doubly robust ecological inference. Technical report 236, Division of Biostatistics, University of California, Berkeley, May 2008.
- Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The matching package for R. Journal of Statistical Software. 2008.
*Forthcoming*. - Sinisi S, van der Laan MJ. The Deletion/Substitution/Addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1)
- van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. The International Journal of Biostatistics. 2010. [PMC free article] [PubMed] [Cross Ref]
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [Cross Ref]
- van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25) doi: 10.2202/1544-6115.1309. ISSN 1. [PubMed] [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of **Berkeley Electronic Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |