Int J Biostat. 2010 January 1; 6(1): Article 17.

Published online 2010 May 17. doi: 10.2202/1557-4679.1181

PMCID: PMC2898626

Copyright © 2010 The Berkeley Electronic Press. All rights reserved


Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan and Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (*Q*) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, *g*. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.

In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for *Q* coupled with a succession of increasingly non-parametric estimates for *g*. In a departure from the current state of the art in nuisance parameter estimation, C-TMLE estimates of *g* are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor *Q* that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLEs of *Q*_{0} in this sequence. A penalized-likelihood loss function for *Q* is suggested when the parameter of interest is borderline-identifiable.

We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when *Q* and *g* are both mis-specified, provided that *g* solves a specified score equation implied by the difference between *Q* and the true *Q*_{0}. This marks an improvement over the current definition of double robustness in the estimating equation literature.

We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.

This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and *p*-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.

Researchers acknowledge that questions about our infinite-dimensional, semi-parametric world are not well-addressed by parametric models. More sophisticated tools are needed to wrest meaning from data. We can and should develop and utilize methods specifically designed to estimate a relatively small-dimensional precisely specified parameter within such a semiparametric model that is identifiable from the data. The ideal method would be entirely a priori specified, have desirable statistical properties, avoid reliance on ad hoc or arbitrary specifications, and be computationally feasible.

Suppose one observes a sample of independent and identically distributed observations from a particular data generating distribution *P*_{0} in a semiparametric model, and that one is concerned with estimation of a particular pathwise differentiable parameter of the data generating distribution. A parameter should be viewed as a mapping from the semiparametric model to the parameter space (e.g., real line). A parameter mapping is pathwise differentiable at *P*_{0} if it is differentiable along all smooth parametric sub-models through *P*_{0}, and its derivative is uniformly bounded as a linear mapping on the Hilbert space of all scores of these parametric submodels. Intuitively, a pathwise differentiable parameter is a parameter which has a finite generalized Cramér-Rao information lower bound, so that in principle, under enough regularity conditions, it is possible to construct an estimator which behaves like a sample mean of i.i.d. random variables. Due to the curse of dimensionality implied by the infinite dimension of semi-parametric models, standard (nonparametric) maximum likelihood estimation is often ill-defined or breaks down due to overfitting, while, on the other hand, regularized sieve-based maximum likelihood estimation results in overly biased plug-in estimators of the target parameter of interest.

The latter is due to the fact that such likelihood based estimators seek and achieve a bias-variance trade-off that is optimal for the density of the distribution of the data itself. Since the variance of an optimally smoothed density estimator is typically much larger than the variance of a smooth (pathwise-differentiable) parameter of the density estimator, the substitution estimators are often too biased relative to their variance. That is, substitution estimators based on density estimators involving optimal (e.g., likelihood-based) bias-variance trade-off (for the whole density) are not *targeted* towards the parameter of interest.

Motivated by this problem with the bias-variance trade-off of maximum likelihood estimation in semiparametric models, while still wanting to preserve the log-likelihood as the principal criterion in estimation, in van der Laan and Rubin (2006) we introduced and developed a targeted maximum likelihood estimator of the parameter of interest.

The targeted maximum likelihood estimator of the distribution of the data is obtained by fluctuating an initial estimator of the relevant part of the data generating distribution with a parametric fluctuation model whose score at the initial estimator (i.e., at zero fluctuation) equals or includes the efficient influence curve of the parameter of interest, and estimating the fluctuation parameter (i.e., amount of fluctuation) with standard parametric maximum likelihood, treating the initial estimator as offset. Iteration of this targeted maximum likelihood modification step results in a so called *k*-th step targeted maximum likelihood estimator, and its limit in *k* solves the actual efficient influence curve equation defined by setting the empirical mean of the efficient influence curve equal to zero. The latter estimator we called the targeted maximum likelihood estimator, which also results in a corresponding plug-in targeted maximum likelihood estimator of the parameter of interest by applying the parameter mapping to the targeted maximum likelihood estimator.

This targeted maximum likelihood step using the fluctuation model removes bias of the initial estimator with respect to (w.r.t.) the target parameter, while increasing the variance of the estimator to the level of the semi-parametric information bound, thereby resulting in a consistent, asymptotically linear, and semi-parametric (locally) efficient estimator.

Although in a variety of applications the fluctuation model is known, e.g., randomized controlled trials with known treatment assignment and missingness mechanism, the fluctuation model typically depends on an unknown nuisance parameter, which then needs to be estimated as well. In censored data models satisfying the so called coarsening at random (CAR) assumption this nuisance parameter typically represents the censoring mechanism, and the density of the data factors into the relevant part of the density and the censoring mechanism density (e.g., Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)).

In this case, the bias reduction obtained at the targeted maximum likelihood step depends on how and how well we estimate the nuisance parameter. Specifically, the targeted maximum likelihood estimator is a so called double robust locally efficient estimator in censored data models (including causal inference models with the full data representing a collection of treatment regimen specific counterfactuals) in which the censoring mechanism satisfies the coarsening at random assumption. This means that, under regularity conditions, it is consistent and asymptotically linear if either the initial estimator is consistent or the nuisance parameter is consistent, and it is efficient in the semiparametric model if the initial estimator is consistent. Another approach for double robust locally efficient estimation is the estimating equation methodology (see, van der Laan and Robins (2003), and review below).

An outstanding open problem that obstructs the robust practical application of double robust estimators (in particular, in nonparametric censored data or causal inference models) is the selection of a sensible model or estimator of the nuisance parameter: this is particularly true when the efficient influence curve estimating equation involves inverse probability of censoring or treatment weighting, due to the enormous sensitivity of the estimator of the parameter of interest to the estimator of the nuisance parameter. A relevant recent discussion of these issues is found in Kang and Schafer (2007a), Ridgeway and McCaffrey (2007), Robins et al. (2007), Tan (2007), Tsiatis and Davidian (2007), Kang and Schafer (2007b).

Given an initial estimator, we are concerned with constructing an estimator of the nuisance parameter that results in a better bias-variance trade-off (i.e., better MSE) for the resulting targeted maximum likelihood estimator of the target parameter than current practice. In this article we introduce a new strategy for nuisance parameter estimator selection for targeted maximum likelihood estimators that addresses this challenge by using the log-likelihood of the targeted maximum likelihood estimator (of the relevant density) indexed by the nuisance parameter estimator as the principal selection criterion. The nuisance parameter estimators needed for the targeting step are selected based on the relevant log-likelihood loss function of the resulting targeted maximum likelihood estimator, not on a loss function for the nuisance parameter itself. This approach takes into account the established fit of the initial estimator, and that the resulting estimator of the target parameter is indeed based on the relevant part of the likelihood.
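To fix ideas, the selection principle can be sketched numerically. The toy simulation below is our own illustrative construction, not the paper's algorithm in full: a single sample split stands in for V-fold cross-validation, the candidate estimators are simple cell means, and all variable names and numbers are hypothetical. Two candidate nuisance-parameter estimators are each scored by the log-likelihood of the targeted estimator of *Q*_{0} they induce, not by how well they fit the treatment mechanism itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
W = rng.binomial(1, 0.5, n)                       # binary confounder
A = rng.binomial(1, np.where(W == 1, 0.8, 0.2))   # confounded treatment
Y = rng.binomial(1, 0.15 + 0.3 * A + 0.4 * W)     # true additive effect psi0 = 0.3

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

def fit_eps(Q, g1, A, Y):
    """MLE of the fluctuation parameter along the clever covariate h_g."""
    h = A / g1 - (1 - A) / (1 - g1)
    eps, off = 0.0, logit(Q)
    for _ in range(25):                            # Newton-Raphson for eps
        Qe = expit(off + eps * h)
        eps += np.sum(h * (Y - Qe)) / np.sum(h**2 * Qe * (1 - Qe))
    return eps

def loglik(Q, Y):
    return np.sum(Y * np.log(Q) + (1 - Y) * np.log(1 - Q))

tr, va = slice(0, n // 2), slice(n // 2, n)        # split-sample stand-in for CV

# Initial (misspecified) estimator of Q0(A, W): ignores the confounder W
m1, m0 = Y[tr][A[tr] == 1].mean(), Y[tr][A[tr] == 0].mean()
Q_init = np.where(A == 1, m1, m0)

# Increasingly nonparametric candidate estimators of g0(1 | W)
g_candidates = {
    "intercept-only": np.full(n, A[tr].mean()),
    "adjusts for W": np.where(W == 1, A[tr][W[tr] == 1].mean(),
                              A[tr][W[tr] == 0].mean()),
}

# Score each candidate by the validation log-likelihood of the *targeted* Q
scores = {}
for name, g1 in g_candidates.items():
    eps = fit_eps(Q_init[tr], g1[tr], A[tr], Y[tr])
    h_va = A[va] / g1[va] - (1 - A[va]) / (1 - g1[va])
    Q_star_va = expit(logit(Q_init[va]) + eps * h_va)
    scores[name] = loglik(Q_star_va, Y[va])
selected = max(scores, key=scores.get)

# Final targeted MLE of psi0 using the selected nuisance-parameter estimator
g1 = g_candidates[selected]
eps = fit_eps(Q_init, g1, A, Y)
Q1_star = expit(logit(np.full(n, m1)) + eps / g1)
Q0_star = expit(logit(np.full(n, m0)) - eps / (1 - g1))
psi_ctmle = np.mean(Q1_star - Q0_star)
```

In this simulation the fluctuation through the W-adjusting candidate can absorb the confounding that the initial estimator missed, so that candidate attains the better validation log-likelihood and is selected, and the resulting targeted estimate recovers the true effect despite the misspecified initial fit.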

Recognizing that the selected estimator of the nuisance parameter is very much a function of the goodness of fit of the initial estimator led to the development of a new theory of collaborative double robust estimation. The asymptotic linearity theory presented below involves characterizing a true minimal nuisance parameter indexed by the initial estimator limit, *g*_{0}(*Q*), that results in an efficient influence curve that is unbiased for the target parameter. This defines the collaborative double robustness of the efficient influence curve. Given a nested sequence of increasingly non-parametric estimators, *g*_{δ}, there is a

The theory is fascinating and results in potentially super efficient estimators. The main intuition, in the context of CAR-censored data models, is that the covariates that enter the treatment mechanism and censoring mechanism estimator (i.e., nuisance parameter estimator) used to define the fluctuation model in the targeted maximum likelihood step should explain the difference between the initial estimator, *Q*_{n}, and the true relevant density, *Q*_{0}.

Such collaborative double robust estimators involve a variety of choices, including the choice of initial estimator and the choice of collaborative nuisance parameter estimator, but all solve the efficient influence curve equation and all rely on the collaborative nuisance parameter estimator being correctly specified, so that the desired unbiasedness of the efficient influence curve is achieved in the limit. We propose using cross-validation w.r.t. a targeted loss function to select among these different collaborative targeted maximum likelihood estimators of the relevant density. In addition, we suggest the square of their influence curve or the square of the efficient influence curve as a particularly suitable loss function, corresponding with selection of the estimator with minimal asymptotic variance.

The construction of efficient estimators of pathwise differentiable parameters in semi-parametric models requires utilizing the so called efficient influence curve, defined as the canonical gradient of the pathwise derivative of the parameter. A fundamental result of the efficiency theory is that a regular estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve. We refer to Bickel et al. (1997), and Andersen et al. (1993). There are two distinct approaches for construction of efficient (or locally efficient) estimators: the estimating equation approach that uses the efficient influence curve as an estimating equation (e.g., one-step estimators based on the Newton-Raphson algorithm in Bickel et al. (1997)), and the targeted MLE that uses the efficient influence curve to define a targeted fluctuation function, and maximizes the likelihood in that targeted direction.

The construction of locally efficient estimators in censored data models in which the censoring mechanism satisfies the so called coarsening at random assumption (Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)) has been a particular focus area. This also includes the theory for locally efficient estimation of causal effects under the sequential randomization assumption (SRA), since the causal inference data structure can be viewed as a missing data structure on the intervention-specific counterfactuals, and SRA implies the coarsening at random assumption on the missingness mechanism, while not implying any restriction on the data generating distribution.

Gill and Robins (2001) present an implicit construction of counterfactuals as a mapping from the observed data distribution, such that the observed data structure augmented with the counterfactuals satisfies the consistency assumption and the SRA. Yu and van der Laan (2002) provide a particular explicit construction of counterfactuals from the observed data structure in terms of quantile-quantile functions, satisfying the consistency assumption and SRA. These results show that, without loss of generality, one can view causal inference as a missing data structure estimation problem. Causal graphs make explicit the real assumptions needed to claim that these counterfactuals are actually the counterfactuals of interest.

Inverse probability of censoring weighted (IPCW) estimators were originally developed to correct for confounding-induced bias in causal effect estimation. Theory for IPCW estimation and augmented locally efficient IPCW-estimators based on estimating functions defined in terms of the orthogonal complement of the nuisance tangent space in CAR-censored data models (including the optimal estimating function implied by the efficient influence curve) was originally developed in Robins (1993) and Robins and Rotnitzky (1992). Many papers build on this framework (see van der Laan and Robins (2003) for a unified treatment of this estimating equation methodology, and references). In particular, double robust locally efficient augmented IPCW-estimators have been developed (Robins and Rotnitzky (2001b), Robins and Rotnitzky (2001a), Robins et al. (2000b), Robins (2000a), van der Laan and Robins (2003), Neugebauer and van der Laan (2005), Yu and van der Laan (2003)).

Causal inference for multiple time-point interventions under sequential randomization was first addressed by Robins in the eighties: e.g. Robins (1986), Robins (1989).

The popular propensity score methods to assess causal effects of single time point interventions (e.g., Rosenbaum and Rubin (1983), Sekhon (2008), Rubin (2006)) have no natural generalization to multiple time-point interventions and may be inefficient (and less robust) estimators for single time point interventions, relative to the locally efficient double robust estimators such as the augmented IPCW and the targeted MLE. One crucial ingredient of these proposed methods is propensity score estimation in the absence of any knowledge of the outcomes.

Structural nested models and marginal structural models for single and multiple time point static treatment regimens were proposed by Robins as well: Robins (1997b), Robins (1997a), Robins (2000b). Many application papers on marginal structural models exist, involving the application of estimating equation methodology (IPCW and DR-IPCW): e.g., Hernan et al. (2000), Robins et al. (2000a), Bryan et al. (2003), Yu and van der Laan (2003). In van der Laan et al. (2005) history adjusted marginal structural models were proposed as a natural extension of marginal structural models, and it was shown that the latter also imply an individualized treatment rule of interest (a so called history adjusted statically optimal treatment regimen): see Petersen et al. (2005) for an application to the “when to switch” question in HIV research.

Murphy et al. (2001) present a nonparametric estimator for a mean under a dynamic treatment in an observational study. Structural nested models for modeling and estimating an optimal dynamic treatment were proposed by Murphy (2003), Robins (2003), Robins (2005a), Robins (2005b). Marginal structural models for a user-supplied set of dynamic treatment regimens were developed and proposed in van der Laan (2006), van der Laan and Petersen (2007) and, simultaneously and independently, in a technical report authored by Rotnitzky and co-workers (2006), and Robins et al. (2008). van der Laan and Petersen (2007) also includes a data analysis application of these models to assess the mean outcome under a rule that switches treatment when CD4-count drops below a cut-off, and the optimal cut-off is estimated as well. Another practical illustration in sequentially randomized trials of these marginal structural models for realistic individualized treatment rules is presented in Bembom and van der Laan (2007).

Unified loss-based learning based on cross-validation was developed in van der Laan and Dudoit (2003), including construction of adaptive minimax estimators for infinite dimensional parameters of the full data distribution in CAR-censored data and causal inference models: see also van der Laan et al. (2006), van der Vaart et al. (2006), van der Laan et al. (2004), Dudoit and van der Laan (2005), Keleş et al. (2002), Sinisi and van der Laan (2004).

The oracle results for the cross-validation selector inspired a unified super learning methodology mapping a library of candidate estimators into a weighted combination with optimal cross-validated risk, thereby resulting in an estimator which either achieves the best possible parametric model rate of convergence up to a log-*n* factor, or is asymptotically equivalent with the oracle-selected estimator that selects the best set of weights for the given data set. These results rely on the assumption that the loss function is uniformly bounded and that the number of candidates in the library is polynomial in sample size (van der Laan et al. (2007), Polley and van der Laan (2009)).
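As a deliberately simple illustration of the convex-combination idea (not the paper's super learner implementation: two polynomial candidate learners, squared-error risk, and a coarse grid over a single weight, all chosen for brevity), cross-validated predictions can be combined as follows:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + rng.normal(0, 0.3, n)              # nonlinear truth plus noise

folds = np.arange(n) % 5                           # 5-fold cross-validation

def poly_candidate(deg):
    # fit/predict routine for a polynomial learner of the given degree
    def fit_predict(Xtr, Ytr, Xte):
        return np.polyval(np.polyfit(Xtr, Ytr, deg), Xte)
    return fit_predict

candidates = [poly_candidate(1), poly_candidate(3)]  # linear vs. cubic learner

# Cross-validated predictions Z[i, j]: candidate j's prediction for point i,
# trained only on the folds that do not contain i
Z = np.zeros((n, len(candidates)))
for v in range(5):
    tr, te = folds != v, folds == v
    for j, cand in enumerate(candidates):
        Z[te, j] = cand(X[tr], Y[tr], X[te])

# Choose the convex combination a*linear + (1-a)*cubic with the smallest
# cross-validated squared-error risk (coarse grid over the weight a)
grid = np.linspace(0, 1, 101)
cv_risk = [np.mean((Y - (a * Z[:, 0] + (1 - a) * Z[:, 1])) ** 2) for a in grid]
a_star = grid[int(np.argmin(cv_risk))]             # weight on the linear learner
```

Because the cubic learner tracks the sinusoidal truth far better on this range, the cross-validated risk pushes most of the weight onto it; the oracle results cited above say the selected combination performs asymptotically as well as the best weighting for the given data.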

The super learning methodology applied to a loss function for the *G*-computation formula factor, *Q*_{0}, in causal inference, or the full-data distribution factor, *Q*_{0}, of the observed data distribution in CAR-censored data models, provides substitution estimators of the target parameter *ψ*_{0}. However, although these super learners of *Q*_{0} are optimal w.r.t. the dissimilarity with *Q*_{0} implied by the loss function, the corresponding substitution estimators will be overly biased for a smooth parameter mapping Ψ. This is due to the fact that cross-validation makes optimal choices w.r.t. the (global) loss-function specific dissimilarity, but the variance of Ψ(*Q*_{n}) is of smaller order than the variance of *Q*_{n} itself.

van der Laan and Rubin (2006) integrates the loss-based learning of *Q*_{0} into the locally efficient estimation of pathwise differentiable parameters, by enforcing the restriction in the loss-based learning that each candidate estimator of *Q*_{0} needs to be a targeted maximum likelihood estimator (thereby, in particular, enforcing each candidate estimator of *Q*_{0} to solve the efficient influence curve estimating equation). Another way to think about this is that each loss function *L*(*Q*) for *Q*_{0} has a corresponding targeted loss function *L*(*Q**), with *Q** the targeted MLE algorithm applied to initial *Q*, and we apply the loss-based learning to the latter targeted version of the loss function *L*(*Q*). Rubin and van der Laan (2008) propose the square of efficient influence curve as a valid and sensible loss function *L*(*Q*) for selection and estimation of *Q*_{0} in models in which *g*_{0} can be estimated consistently, such as in randomized controlled trials.

The implications of this targeted loss-based learning are that *Q*_{0} is estimated optimally (maximally adaptive to the true *Q*_{0}) w.r.t. the targeted loss function *L*(*Q**) using the super learning methodology, and, due to the targeted MLE step, the resulting substitution estimator of *ψ*_{0} is now asymptotically linear as well, provided the targeted fluctuation function is estimated at a good enough rate (requiring adjustment only for confounders not yet accounted for by the initial estimator: see collaborative targeted MLE). Either way, bias reduction will occur as long as the censoring/treatment mechanism is estimated consistently. Targeted MLEs have been applied in a variety of estimation problems: Bembom et al. (2008), Bembom et al. (2009) (physical activity), Tuglus and van der Laan (2008) (biomarker analysis), Rosenblum et al. (2009) (AIDS), van der Laan (2008a) (case control studies), Rose and van der Laan (2008) (case control studies), Rose and van der Laan (2009) (matched case control studies), Moore and van der Laan (2009) (causal effect on time till event, allowing for right-censoring), van der Laan (2008b) (adaptive designs, and multiple time point interventions), Moore and van der Laan (2007) (randomized trials with binary outcome). We refer to van der Laan et al. (September, 2009) for collective readings on targeted maximum likelihood estimation.

Even though the augmented IPCW-estimator is also double robust, targeted maximum likelihood estimation has the following important advantages relative to estimating equation methods such as the augmented-IPCW estimator: 1) the TMLE is a substitution estimator and thereby respects global constraints of the model, such as that one might be estimating a probability in [0, 1] or a (monotone) survival function at a finite set of points; 2) since, given an initial estimator, the targeted MLE step involves maximizing the likelihood along a smooth parametric targeted fluctuation model, it does *not* suffer from *multiple solutions* of a (possibly *non-smooth* in the parameter) estimating equation; 3) the TMLE does *not* require that the efficient influence curve can be represented as an estimating function in the target parameter, and thereby applies to *all* pathwise differentiable parameters; and 4) it can use the cross-validated log-likelihood (of the targeted maximum likelihood estimator), or any other cross-validated risk of an appropriate loss function for the relevant factor *Q*_{0} of the density of the data, as the principal criterion to select among different targeted maximum likelihood estimators indexed by different initial estimators or targeted maximum likelihood steps.

The latter allows fine-tuning of the initial estimator of *Q*_{0} as well as fine-tuning of the estimation of the unknowns (e.g., censoring/treatment mechanism *g*_{0}) of the fluctuation function applied in the targeted maximum likelihood step, thereby utilizing the excellent theoretical and practical properties of the loss-function specific cross-validation selector. In particular, this property results in a collaborative double robust, and possibly super efficient, TMLE, as introduced and studied in this article, thereby adding theoretical and practical properties that go beyond double robustness and efficiency. In contrast, the augmented-IPCW estimator cannot be evaluated based on a loss function for *Q*_{0} alone: the augmented-IPCW estimator is not a substitution estimator $\Psi ({Q}_{n}^{*})$ for some estimator ${Q}_{n}^{*}$ of *Q*_{0}, as is the TMLE. Instead, the augmented-IPCW estimator *ψ*_{n} is a certain function of an initial estimator of *Q*_{0} and an estimator of *g*_{0}.

In Section 2 we present a description of the two stage collaborative targeted maximum likelihood methodology, the first stage representing the initial estimator, and the second stage representing the construction of a sequence of targeted maximum likelihood estimators indexed by increasingly nonparametric nuisance parameter estimators, and log-likelihood based cross-validation to select among the TMLEs and thereby select the nuisance parameter estimator. The second stage of the C-DR-TMLE can be viewed as a mapping from an initial estimator of the relevant density into a particular estimator of the nuisance parameter needed in the fluctuation function, and corresponding targeted maximum likelihood estimator using this nuisance parameter estimator in the targeted maximum likelihood step. We also provide the rationale for the consistency of this C-DR-TML estimator under the collaborative double robustness assumption, relying on the earlier established oracle property of the log-likelihood-based cross-validation selector, which itself relies on the assumption that the log-likelihood loss function is uniformly bounded.

In Section 3 we define and study collaborative double robustness of the efficient influence curve. In particular, we define true nuisance parameters depending on a choice of relevant density (i.e., limit of initial estimator), which make the efficient influence curve an unbiased function for the target parameter. A collaborative targeted maximum likelihood estimator solves the efficient influence curve equation and relies on the nuisance parameter estimator to consistently estimate this true initial estimator-specific nuisance parameter or more nonparametric nuisance parameter. We also discuss alternative collaborative nuisance parameter estimators that can be used in the targeted MLE or in estimating equation methodology.

In Section 4 we prove an asymptotic linearity theorem for such collaborative double robust estimators, such as the collaborative double robust targeted maximum likelihood estimator, and discuss the conditions and implications of this theorem. In particular, this theorem provides us with influence curve based confidence intervals and tests of null hypotheses. A study of the influence curve teaches us that the C-DR-TMLE can be super efficient.

In Section 5 we consider targeted loss functions that can be used to select among different C-DR-TMLEs indexed by different initial estimators and choices of nuisance parameter estimator. These targeted loss functions can also be used to build the candidate nuisance parameter estimators within a C-DR-TMLE estimator, and thereby to construct the sequence of corresponding candidate targeted maximum likelihood estimators in the collaborative targeted maximum likelihood algorithm. Even though we enforce the use of a log-likelihood-based cross-validation selector to select among these candidate targeted maximum likelihood estimators in the C-DR-TMLE algorithm, we propose a penalized log-likelihood loss function that is more targeted towards the target parameter in the case that the target parameter is borderline identifiable. This penalty is particularly important to robustify the estimation procedure in situations in which the variance of the efficient influence curve easily blows up to infinity for certain realizations of the nuisance parameter estimator (e.g., inverse probability weights with denominators close to zero).

In Section 6 we consider estimation of a causal effect in a marginal structural model, and define the collaborative double robust targeted penalized maximum likelihood estimator of the unknown parameters of the marginal structural model. In the companion paper we present a simulation study and data analysis for the C-DR-TMLE of the causal effect *E*_{0}[*Y*(1) − *Y*(0)] of a binary treatment *A*, adjusting for baseline confounders *W*, based on observing *n* i.i.d. copies of a time-ordered data structure (*W*, *A*, *Y* = *Y*(*A*)).

We end this article with a discussion in Section 7, which provides a global overview.

Although the methodology is completely general, throughout the paper we ground the discussion by referring to the following example: estimation of the additive causal effect of a binary treatment on an outcome. This example is rich enough to illustrate the ideas and methods, and has been used intensively in the causal inference literature. In this subsection we provide the notation and objects required to define the C-DR-TMLE.

Let *O* = (*W*, *A*, *Y* = *Y*(*A*)) ~ *P*_{0} be an observed missing data structure on the full data structure *X* = (*W*, *Y*(0), *Y*(1)) with binary missingness variable *A* ∈ {0, 1}. For concreteness, we consider the case that *Y* is binary. Suppose the model for *P*_{0} is nonparametric, that the missingness mechanism *g*_{0}(1 | *X*) = *P*_{0}(*A* = 1 | *X*) = *P*_{0}(*A* = 1 | *W*) satisfies the coarsening at random assumption, and that our target parameter is the causal additive risk

$$\begin{array}{l}\Psi ({P}_{0})={\Psi}^{F}({Q}_{0})={E}_{0}[Y(1)-Y(0)]\\ ={E}_{0}\{{E}_{0}(Y|A=1,W)-{E}_{0}(Y|A=0,W)\},\end{array}$$

where *Q*_{0} = (*Q*_{01}, *Q*_{02}) denotes the marginal distribution of *W* and conditional distribution of *Y*, given *A*, *W*, respectively. For notational convenience, we will suppress the *F* from “Full Data Parameter” in Ψ^{F}. We note that
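As a concrete illustration of this parameter mapping, the following sketch simulates data from a hypothetical distribution with one binary confounder and evaluates the substitution (G-computation) estimator implied by Ψ^{F}; all numbers are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
W = rng.binomial(1, 0.5, n)                       # baseline covariate
A = rng.binomial(1, np.where(W == 1, 0.8, 0.2))   # confounded treatment
Y = rng.binomial(1, 0.15 + 0.3 * A + 0.4 * W)     # true E0(Y | A, W); psi0 = 0.3

# Nonparametric estimate of Q0(a, w) = E0(Y | A = a, W = w): cell means
Qbar = {(a, w): Y[(A == a) & (W == w)].mean() for a in (0, 1) for w in (0, 1)}

# Substitution (G-computation) estimator: average Q(1, W) - Q(0, W) over W,
# i.e., the empirical analogue of E0{E0(Y|A=1,W) - E0(Y|A=0,W)}
psi_n = np.mean([Qbar[(1, w)] - Qbar[(0, w)] for w in W])
```

With a single discrete confounder the saturated cell-mean fit makes the substitution estimator unbiased; the bias issues discussed above arise once *Q*_{0} must be smoothed or model-selected.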

The efficient influence curve of Ψ at *dP*_{0} = *Q*_{0}*g*_{0} is given by

$$D*({Q}_{0},{g}_{0})={h}_{{g}_{0}}(A,W)(Y-{Q}_{0}(A,W))+{Q}_{0}(1,W)-{Q}_{0}(0,W)-\Psi ({Q}_{0}),$$

where *Q*_{0}(*A*, *W*) = *E*_{Q0}(*Y* | *A*, *W*), *h*_{g0}(*A*, *W*) = *A*/*g*_{0}(1 | *W*) − (1 − *A*)/*g*_{0}(0 | *W*). We note that *h*_{g0} also plays the role of the clever covariate in the targeted maximum likelihood fluctuation of the conditional distribution of *Y*, given *A*, *W*: log *Q*(*ε*)/(1 − *Q*(*ε*)) = log *Q*/(1 − *Q*) + *εh*_{g0}.
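A minimal numerical sketch of this fluctuation step (our own illustration under an assumed simulation with a binary confounder and a deliberately misspecified initial estimator; the Newton-Raphson iteration for *ε* is an implementation choice, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
W = rng.binomial(1, 0.5, n)
g1W = np.where(W == 1, 0.8, 0.2)                  # treatment mechanism g0(1 | W)
A = rng.binomial(1, g1W)
Y = rng.binomial(1, 0.15 + 0.3 * A + 0.4 * W)     # true additive effect psi0 = 0.3

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

# Misspecified initial estimator of Q0(A, W): ignores the confounder W
m1, m0 = Y[A == 1].mean(), Y[A == 0].mean()
Q_init = np.where(A == 1, m1, m0)

# Clever covariate h_g(A, W), and its values at A = 1 and A = 0
h = A / g1W - (1 - A) / (1 - g1W)
h1, h0 = 1 / g1W, -1 / (1 - g1W)

# MLE of the fluctuation parameter eps: logistic regression of Y on h with
# offset logit(Q_init), fitted here by Newton-Raphson on the score equation
eps, off = 0.0, logit(Q_init)
for _ in range(25):
    Qe = expit(off + eps * h)
    eps += np.sum(h * (Y - Qe)) / np.sum(h**2 * Qe * (1 - Qe))

# Targeted update of the counterfactual predictions, and the TMLE of psi0
Q1_star = expit(logit(np.full(n, m1)) + eps * h1)
Q0_star = expit(logit(np.full(n, m0)) + eps * h0)
psi_tmle = np.mean(Q1_star - Q0_star)
```

At the fitted *ε* the empirical mean of *h*_{g}(*Y* − *Q**) is zero, so the substitution estimator solves the efficient influence curve equation; with a correctly specified *g*, the update removes the confounding bias of the poor initial fit, illustrating double robustness.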

We also note that an alternative representation of the efficient influence curve is given by the augmented IPCW-representation:

$$\begin{array}{l}D*({Q}_{0},{g}_{0})=\left(\frac{A}{{g}_{0}(1|X)}-\frac{1-A}{{g}_{0}(0|X)}\right)Y-\Psi ({Q}_{0})\\ -\left(\frac{A}{{g}_{0}(1|X)}-1\right){Q}_{0}(1,W)+\left(\frac{1-A}{{g}_{0}(0|X)}-1\right){Q}_{0}(0,W)\\ ={D}_{IPCW}({g}_{0},{\psi}_{0})-{D}_{CAR}({Q}_{0},{g}_{0}),\end{array}$$

where *D _{IPCW}*(*g*_{0}, *ψ*_{0}) = (*A*/*g*_{0}(1 | *W*) – (1 – *A*)/*g*_{0}(0 | *W*))*Y* – *ψ*_{0} is the IPCW estimating function, and *D _{CAR}*(*Q*_{0}, *g*_{0}) = (*A*/*g*_{0}(1 | *W*) – 1)*Q*_{0}(1, *W*) – ((1 – *A*)/*g*_{0}(0 | *W*) – 1)*Q*_{0}(0, *W*) is an element of the tangent space of the missingness mechanism.

We will describe the proposed collaborative double robust targeted maximum likelihood estimators in the context of censored data models, but the generalization to general semi-parametric models is immediate. We first review targeted maximum likelihood estimation and loss-based cross-validation in order to provide a foundation for the explanation of C-DR-TMLE.

Let *O* = Φ(*C*, *X*) be a censored data structure on a full data random variable *X*, where *C* denotes the censoring variable. We assume coarsening at random so that the observed data structure *O* ~ *P*_{0} has a probability distribution whose density w.r.t. an appropriate dominating measure factors as *dP*_{0}(*O*) = *Q*_{0}(*O*)*g*_{0}(*O* | *X*), where *Q*_{0} is the part of the distribution of *X* that is identifiable, and *g*_{0} denotes the conditional probability distribution of *O*, given *X*, which we often refer to as the censoring mechanism. By CAR, we have *g*_{0}(*O* | *X*) = *h*(*O*) for some measurable function *h*. If *C* is observed itself, then *g*_{0} denotes the conditional distribution of *C*, given *X*.

A semiparametric model
$\mathcal{M}$ for the probability distribution *P*_{0} of the observed data structure *O* is implied by a model
$\mathcal{Q}$ for the full-data distribution factor *Q*_{0}, and a model
$\mathcal{G}$ for the censoring mechanism *g*_{0}. The conditional distribution of *O*, given *X*, is identified by the conditional distribution of *C*, given *X*. For notational convenience, we will denote both with *g*_{0}. Let *O*_{1}, . . . , *O _{n}* be *n* i.i.d. observations of *O* ~ *P*_{0}, and let *P _{n}* denote their empirical probability distribution.

Let Ψ :
$\mathcal{M}$ → IR* ^{d}* be a *d*-dimensional pathwise differentiable parameter of interest: for each *P* ∈ $\mathcal{M}$ and each parametric submodel {*P*(*δ*) : *δ*} through *P* with score *S* at *δ* = 0,

$$\frac{d}{d\delta}\Psi (P(\delta )){|}_{\delta =0}={E}_{P}D*(P)S.$$

Because *D*^{*}(*P*) is an element of the closed linear span in
${L}_{0}^{2}(P)$ of all scores *S* of these parametric submodels (the so-called tangent space), it is the canonical gradient, also called the efficient influence curve at *P*. Any *D*(*P*) such that *E _{P}D*(*P*)*S* equals this pathwise derivative for all scores *S* is called a gradient of the pathwise derivative at *P*.

**The Targeted Maximum Likelihood estimator indexed by initial (Q, g):** Given any (*Q*, *g*), let {*Q _{g}*(*ε*) : *ε*} be a parametric fluctuation submodel through *Q* at *ε* = 0 whose score at *ε* = 0 equals the efficient influence curve *D*^{*}(*Q*, *g*).

At a given (*Q*, *g*), one can now define a *k*-th step targeted maximum likelihood version
${Q}_{g}^{k}({P}_{n})$ of *Q*_{0} as follows. Let *L*(*Q*) = − log *Q* be the log-likelihood loss. Firstly, let
${Q}_{g}^{1}({P}_{n})={Q}_{g}({\epsilon}_{n}^{1})$, where

$${\epsilon}_{n}^{1}=\text{arg}\underset{\epsilon}{\text{min}}{P}_{n}L({Q}_{g}(\epsilon )).$$

Here we use the notation *Pf* = ∫ *f*(*o*)*dP*(*o*). In general,
${Q}_{gn}^{k}={Q}_{g}^{k}({P}_{n})={Q}_{g}^{k-1}({P}_{n})({\epsilon}_{n}^{k})$, where

$${\epsilon}_{n}^{k}=\text{arg}\underset{\epsilon}{\text{min}}{P}_{n}L({Q}_{g}^{k-1}({P}_{n})(\epsilon )),\phantom{\rule{1em}{0ex}}k=1,2,\dots $$

One iterates this updating until
${\epsilon}_{n}^{k}$ equals zero within a user-supplied precision. The final update is referred to as the (iterative) targeted maximum likelihood estimator
${Q}_{gn}^{*}={Q}_{g}^{*}({P}_{n})$, indexed by the initial starting point (*Q*, *g*).

**The Targeted Maximum Likelihood estimator indexed by initial estimator and estimator of nuisance parameter:** The above procedure, applied to an initial estimator
${Q}_{n}^{0}$, and an estimator *g _{n}* of *g*_{0}, yields a targeted maximum likelihood estimator ${Q}_{n}^{*}={Q}_{{g}_{n}}^{*}({P}_{n})$ that solves the efficient influence curve equation

$$0={P}_{n}D*({Q}_{n}^{*},{g}_{n}).$$
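This equation can be verified empirically. The following self-contained sketch (the simulated data-generating distribution and the deliberately crude constant initial estimator are our own illustrative choices, not from the paper) carries out the fluctuation for the additive-effect example with binary *Y*, solving for *ε* by Newton's method, and checks that the empirical mean of the efficient influence curve is zero at the update.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

# Simulated data: confounder W, binary treatment A, binary outcome Y.
W = rng.normal(size=n)
g0 = expit(0.5 * W)                  # treatment mechanism g_0(1 | W), known here
A = rng.binomial(1, g0)
Y = rng.binomial(1, expit(A + W))

# Deliberately crude initial estimate of Q_0(A, W) = E(Y | A, W).
Q_init = np.clip(np.full(n, Y.mean()), 0.01, 0.99)

# Clever covariate h_g(A, W) = A/g(1|W) - (1-A)/g(0|W).
h = A / g0 - (1 - A) / (1 - g0)
h1, h0 = 1 / g0, -1 / (1 - g0)       # its values at A = 1 and A = 0

# Solve the score equation P_n h (Y - Q(eps)) = 0 by Newton's method,
# where logit Q(eps) = logit Q + eps * h.
eps = 0.0
for _ in range(50):
    Qe = expit(logit(Q_init) + eps * h)
    score = np.mean(h * (Y - Qe))
    hess = -np.mean(h**2 * Qe * (1 - Qe))
    eps -= score / hess

Q1_star = expit(logit(Q_init) + eps * h1)   # updated Q*(1, W)
Q0_star = expit(logit(Q_init) + eps * h0)   # updated Q*(0, W)
Q_star = np.where(A == 1, Q1_star, Q0_star)
psi = np.mean(Q1_star - Q0_star)            # plug-in TMLE of psi_0

# The update solves the efficient influence curve equation P_n D* = 0.
D_star = h * (Y - Q_star) + Q1_star - Q0_star - psi
print(abs(np.mean(D_star)))
```

Because the fluctuation's score is the residual component of *D*^{*} and the plug-in evaluation makes the remaining component have empirical mean zero by construction, the printed value is zero up to Newton tolerance.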

**Remark: Cross-validated initial estimator in the targeted MLE.** If the initial estimator is an over-fit, then the bias reduction of the targeted MLE algorithm is not as effective. To protect against such cases one can use a cross-validated initial estimator. Specifically, let *B _{n}* ∈ {0, 1}^{n} denote a random split of the sample into a training sample {*i* : *B _{n}*(*i*) = 0} and a validation sample {*i* : *B _{n}*(*i*) = 1}, with empirical distributions ${P}_{n,{B}_{n}}^{0}$ and ${P}_{n,{B}_{n}}^{1}$, respectively, and define

$${\epsilon}_{n}^{k}=\text{arg}\underset{\epsilon}{\text{min}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}L({Q}_{g}^{k-1}({P}_{n,{B}_{n}}^{0})(\epsilon )),\phantom{\rule{1em}{0ex}}k=1,2,\dots $$

Consider a loss function *L**(*Q*) for *Q*_{0} that satisfies

$${Q}_{0}=\text{arg}\underset{Q}{\text{min}}{P}_{0}L*(Q).$$

Or, more precisely, we only require that Ψ (arg min_{Q}*P*_{0}*L**(*Q*)) = Ψ(*Q*_{0}). An example of such a loss function is the log-likelihood *L**(*Q*)(*O*) = *L*(*Q*) = − log *Q*(*O*). Each loss function has a corresponding dissimilarity *d*(*Q*, *Q*_{0}) = *P*_{0}{*L**(*Q*) – *L**(*Q*_{0})}.

Given different targeted maximum likelihood estimators,
${P}_{n}\to {\widehat{Q}}_{k}^{*}({P}_{n})$, of *Q*_{0}, for example, indexed by different initial estimators, we can use a preferred loss-function based cross-validation to select among them. Specifically, let *B _{n}* ∈ {0, 1}^{n} denote a random split into a training sample {*i* : *B _{n}*(*i*) = 0} and a validation sample {*i* : *B _{n}*(*i*) = 1}, and define the cross-validation selector

$$\widehat{k}({P}_{n})=\text{arg}\underset{k}{\text{min}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}L*({\widehat{Q}}_{k}^{*}({P}_{n,{B}_{n}}^{0})).$$

The resulting targeted maximum likelihood estimator is then given by

$${\widehat{Q}}_{n}^{*}={\widehat{Q}}_{\widehat{k}({P}_{n})}^{*}({P}_{n}).$$
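As a stand-alone illustration of this selector (our own toy example: polynomial regression candidates stand in for the candidate targeted maximum likelihood estimators, and squared-error loss stands in for *L*^{*}), V-fold cross-validated risk is computed for each candidate and the minimizer is selected.

```python
import numpy as np

rng = np.random.default_rng(1)
n, V = 200, 5
X = rng.uniform(-2, 2, size=n)
Y = np.sin(X) + rng.normal(scale=0.3, size=n)

# Candidate estimators Q_k: polynomial fits of degree k.
degrees = [0, 1, 3, 9]

def fit(x, y, k):
    return np.polyfit(x, y, k)

def loss(coef, x, y):                     # squared-error loss L*(Q)
    return np.mean((y - np.polyval(coef, x)) ** 2)

folds = np.arange(n) % V                  # the split indicator B_n
cv_risk = []
for k in degrees:
    risks = []
    for v in range(V):
        train, val = folds != v, folds == v
        # Fit on the training sample, evaluate the loss on the validation sample.
        risks.append(loss(fit(X[train], Y[train], k), X[val], Y[val]))
    cv_risk.append(np.mean(risks))

k_hat = degrees[int(np.argmin(cv_risk))]  # the cross-validation selector
print(k_hat)
```

The constant fit (degree 0) is heavily penalized for bias, while cross-validation trades off the variance of the most flexible candidates, exactly the bias/variance tradeoff the oracle inequality below quantifies.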

**Cross-validation selector:** Consider a preferred loss function that satisfies

$$\underset{Q}{\text{sup}}\frac{{\text{VAR}}_{{P}_{0}}\{L*(Q)-L*({Q}_{0})\}}{{P}_{0}\{L*(Q)-L*({Q}_{0})\}}\le {M}_{2},$$

(1)

and that is uniformly bounded

$$\underset{O,Q}{\text{sup}}|L*(Q)-L*({Q}_{0})|(O)<{M}_{1}<\infty ,$$

where the supremum is over the support of *P*_{0}, and over all possible candidate estimators of *Q*_{0} that will ever be considered. Property (1) holds for the log-likelihood loss function and any weighted squared residual loss function, among others; it is in essence equivalent with stating that the loss function implies a quadratic dissimilarity, i.e., that *d*(*Q*, *Q*_{0}) = *P*_{0}{*L*^{*}(*Q*) – *L*^{*}(*Q*_{0})} behaves as the square of a distance between *Q* and *Q*_{0} (see van der Laan and Dudoit (2003)). If this property does not hold for the loss function, the rates 1/*n* for the second order terms in the oracle inequality stated below reduce to the rate
$1/\sqrt{n}$.

For such loss functions, the cross-validation selector satisfies the following (so called) oracle inequality: for any *δ* > 0,

$$\begin{array}{l}{E}_{{B}_{n}}{P}_{0}\{L({\widehat{Q}}_{\widehat{k}}({P}_{n,{B}_{n}}^{0}))-L({Q}_{0})\}\le (1+2\delta ){E}_{{B}_{n}}\underset{k}{\text{min}}{P}_{0}\{L({\widehat{Q}}_{k}({P}_{n,{B}_{n}}^{0}))-L({Q}_{0})\}\\ +2C({M}_{1},{M}_{2},\delta )\frac{1+\text{log}K(n)}{np},\end{array}$$

where the constant *C*(*M*_{1}, *M*_{2}, *δ*) = 2(1 + *δ*)^{2}(*M*_{1}/3 + *M*_{2}/*δ*), and *p* denotes the proportion of observations in the validation sample (see page 25 of van der Laan and Dudoit (2003)). This result proves (see van der Laan and Dudoit (2003) for the precise statement of these implications) that, if the number of candidates *K*(*n*) is polynomial in sample size, then the cross-validation selector is either asymptotically equivalent with the oracle selector (based on samples of the size of the training samples, as defined on the right-hand side of the above inequality), or it achieves the near-parametric rate log *n*/*n* of convergence w.r.t. *d*(*Q*, *Q*_{0}) = *P*_{0}{*L*(*Q*) – *L*(*Q*_{0})}. So in most realistic scenarios, in which none of the candidate estimators achieves the rate of convergence one would have with an a priori correctly specified parametric model, the cross-validation selector performs asymptotically exactly as well (up to the constant) as the oracle selector. These oracle results are generalized for estimated loss functions
${L}_{n}^{*}(Q)$ that approximate a fixed loss function *L*^{*}(*Q*). If arg min* _{Q}*
${P}_{0}{L}_{n}^{*}(Q)\ne {Q}_{0}$, then the oracle inequality also presents second order terms due to the estimation of the loss function.

This preferred loss function based cross-validation can now be used to select among different candidate targeted maximum likelihood estimators indexed by different initial estimators, and possibly different censoring mechanism estimators. Specifically, we will use a preferred targeted loss function to select among different collaborative targeted maximum likelihood estimators, which are just special targeted maximum likelihood estimators in the sense that *g _{n}* is estimated in collaboration with the initial estimator *Q _{n}*.

For a given loss function *L*(*Q*) and an estimator $\widehat{Q}$(*P _{n}*), we will refer to the criterion −*P _{n}L*($\widehat{Q}$(*P _{n}*)) as the entropy of the fit of the estimator; for the log-likelihood loss this is simply the log-likelihood fit, so that a larger entropy corresponds with a better empirical fit.

A C-TMLE estimator is constructed by building a family of candidate estimators, then choosing the best among them, using cross-validation to drive the choice to *Q*_{0}. However, we also rely upon a loss function when building each candidate nuisance parameter (e.g. censoring mechanism) estimator, and it is not necessary that these two loss functions be the same. In fact, as part of building a collaborative nuisance parameter estimator in the collaborative T-MLE procedure, we couple an increase in the log-likelihood entropy of the targeted maximum likelihood estimator with an increase in the *g*_{0}-loss function specific entropy of the corresponding nuisance parameter estimator. In this manner, we arrange that, for increasing sample size, the cross-validation selector will be driven towards the selection of a targeted maximum likelihood estimator with an initial estimator closer to *Q*_{0} *and simultaneously* a more and more nonparametric estimator of *g*_{0} (thereby achieving the full desired bias reduction in the limit).

That is, given a collection of candidate estimators of *g*_{0}, ordered by empirical fit w.r.t. a loss function for *g*_{0} such as the log-likelihood, we will build a sequence of targeted maximum likelihood estimators of *Q*_{0} ordered by log-likelihood entropy and indexed by increasingly nonparametric estimators of *g*_{0}, where the extent of being nonparametric is measured by the *L*_{1}-entropy. Subsequently, we use the cross-validated log-likelihood for *Q*_{0} to choose among these candidate targeted maximum likelihood estimators.

There are many possible approaches that construct such an ordered sequence of targeted maximum likelihood estimators in which a next element in the sequence has both a higher entropy for the *Q*_{0}-loss as well as a higher *g*_{0}-loss entropy for its corresponding censoring estimator. Of course, the strict ordering is not what drives the properties of the resulting estimator, but the sequence should represent an approximately monotone function in the log-likelihood entropy of *Q*_{0} and *L*_{1}-entropy of *g*_{0}.

This procedure represents one particular approach for constructing a targeted maximum likelihood estimator that uses a collaboratively estimated nuisance parameter. We refer to any algorithm that maps into a targeted maximum likelihood estimator that uses a collaborative nuisance parameter estimator (relative to the *Q*-estimator), as a collaborative targeted maximum likelihood estimator.

We present the following template providing a class of collaborative targeted maximum likelihood estimators.

**Initial estimator of** *Q*_{0}: Build an estimator *Q _{n}* of *Q*_{0}, for example, with loss-based (super) learning.

**Preferred loss function for** *Q*_{0}: Let *L*^{*}(*Q*) be a (targeted) loss function for *Q*_{0}. We note that the loss function can also be data dependent, and, in particular, the choice of loss function can depend on an initial estimator *Q _{n}* of *Q*_{0}.

**Loss function for** *g*_{0}: Let *L*_{1}(*g*) be a loss function for *g*_{0}.

**Candidate estimators of censoring mechanism/nuisance parameter:** For each *δ* in an index set, let *g _{nδ}* be a candidate estimator of *g*_{0}, and let its *L*_{1}-entropy −*P _{n}L*_{1}(*g _{nδ}*) measure how nonparametric (data-adaptive) the candidate is.

Select a sequence *d*^{0} *> d*^{1} *> . . . > d ^{K}*.

**Select initial targeted maximum likelihood estimator:** We start out with a
${g}_{n}^{0}$ with entropy larger than *d*^{0} and a corresponding targeted maximum likelihood estimator
${Q}_{n}^{*0}={Q}_{n{g}_{n}^{0}}^{*}$ applied to initial estimator *Q _{n}*. We refer to the pair (
${g}_{n}^{0}$,
${Q}_{n}^{*0}$) as the initial targeted maximum likelihood estimator in the sequence of targeted maximum likelihood estimators that will be constructed below.

We are given a current initial estimator
${Q}_{n}^{k}$, a current targeted maximum likelihood estimator (
${g}_{n}^{k}$,
${Q}_{n}^{k*}$) in our sequence of targeted maximum likelihood estimators, with
${Q}_{n}^{k*}$ being the targeted maximum likelihood estimator applied to current initial estimator
${Q}_{n}^{k}$ and nuisance parameter estimator
${g}_{n}^{k}$. The current nuisance parameter estimator
${g}_{n}^{k}$ has entropy larger than *d ^{k}*. We are also given the next entropy benchmark *d*^{k+1} in the sequence *d*^{0} *>* . . . *> d ^{K}*.

Consider an algorithm that searches, among a specified set of candidate estimators *g _{nδ}* with entropy larger than *d*^{k+1}, for the minimizer *δ _{n}* of the empirical risk

$$\delta \to {P}_{n}L*({Q}_{n\delta}^{k*}).$$

(2)

Recall that
${Q}_{n\delta}^{k*}$ denotes the targeted maximum likelihood estimator that uses the optimal fluctuation model identified by censoring mechanism *g _{nδ}*, applied to initial estimator
${Q}_{n}^{k}$. Let *δ _{n}* denote the index selected by this algorithm. If

$${P}_{n}L*({Q}_{n{\delta}_{n}}^{k*})<{P}_{n}L*({Q}_{n}^{k*}),$$

or the above holds for the log-likelihood loss function *L*(*Q*) = − log *Q* on which the targeted maximum likelihood algorithm operates, then we accept *δ _{n}*, and thereby the next targeted maximum likelihood estimator,
${g}_{n}^{k+1}={g}_{n{\delta}_{n}}$,
${Q}_{n}^{k+1*}={Q}_{n{\delta}_{n}}^{k*}$, in the sequence we are constructing. The algorithm has now delivered its next targeted maximum likelihood estimator, and we proceed to step *k* + 1.

If this monotonicity condition fails to hold for both the log-likelihood fit and the preferred loss function fit, then we reject this *δ _{n}*, and update the initial estimator
${Q}_{n}^{k}$ by setting it equal to the current targeted maximum likelihood estimator
${Q}_{n}^{k*}$. We now rerun the above procedure with initial
${Q}_{n}^{k}={Q}_{n}^{k*}$, and the same set of candidate estimators *g _{nδ}*.

The above algorithm maps a current initial estimator and a current targeted MLE (
${g}_{n}^{k}$,
${Q}_{n}^{*k}$) (the most recently constructed element of the current sequence) into a new targeted MLE (
${g}_{n}^{k+1}$,
${Q}_{n}^{*k+1}$), and a possibly updated current initial estimator. We start this algorithm with *k* = 0, and iterate it. This now defines the *k*-th step collaborative targeted maximum likelihood estimator (
${g}_{n}^{k}$,
${Q}_{n}^{*k}$), *k* = 0, 1, 2, . . . , *K*.

We are guaranteed that the fit of
${Q}_{n}^{*k}$ is either increasing w.r.t. the preferred loss function (most likely, since that is the criterion we minimize at each step), or it is increasing w.r.t. the log-likelihood loss used to define the targeted maximum likelihood step, relative to the previous targeted maximum likelihood estimator
${Q}_{n}^{*k-1}$. In addition, the corresponding
${g}_{n}^{k}$ has an *L*_{1}-fit that is larger than the *L*_{1}-fit of
${g}_{n}^{k-1}$. At every step in which the initial estimator is updated, we also know that the log-likelihood fit is increasing.
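The template above can be sketched in code. The following is a deliberately simplified illustration (our own simulated data; logistic regressions of *A* on growing covariate subsets play the role of the increasingly nonparametric candidates *g _{nδ}*, greedy forward selection drives the ordering, and the fall-back update of the initial estimator as well as the final cross-validation step are omitted).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 3
expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda q: np.log(q / (1 - q))

# Simulated data: W1, W2 are confounders; W3 affects neither A nor Y.
W = rng.normal(size=(n, p))
A = rng.binomial(1, expit(W[:, 0] - 0.5 * W[:, 1]))
Y = rng.binomial(1, expit(0.5 * A + W[:, 0] + W[:, 1]))

def fit_g(cols):
    """Logistic fit of A on W[:, cols] by Newton/IRLS; intercept-only if cols is empty."""
    Z = np.column_stack([np.ones(n)] + [W[:, j] for j in cols])
    b = np.zeros(Z.shape[1])
    for _ in range(25):
        pr = expit(Z @ b)
        wt = pr * (1 - pr) + 1e-9
        b += np.linalg.solve(Z.T @ (Z * wt[:, None]), Z.T @ (A - pr))
    return np.clip(expit(Z @ b), 0.025, 0.975)

def tmle_update(Q, g1):
    """One targeted update of Q(A, W) along the clever covariate implied by g."""
    h = A / g1 - (1 - A) / (1 - g1)
    eps = 0.0
    for _ in range(50):
        Qe = expit(logit(Q) + eps * h)
        eps += np.mean(h * (Y - Qe)) / np.mean(h**2 * Qe * (1 - Qe))
    return expit(logit(Q) + eps * h)

def nll(Q):
    """Empirical negative log-likelihood (smaller = better fit = higher entropy)."""
    return -np.mean(Y * np.log(Q) + (1 - Y) * np.log(1 - Q))

# Greedy construction of increasingly nonparametric g-estimators: at each step,
# add the covariate whose corresponding TMLE update of the initial Q most
# improves the empirical log-likelihood fit of Q*.
Q_init = np.clip(np.full(n, Y.mean()), 0.01, 0.99)
used, sequence = [], []
while len(used) < p:
    best = min((j for j in range(p) if j not in used),
               key=lambda j: nll(tmle_update(Q_init, fit_g(used + [j]))))
    used.append(best)
    sequence.append((list(used), tmle_update(Q_init, fit_g(used))))
print([cols for cols, _ in sequence])
```

Each element of `sequence` pairs a more nonparametric *g*-estimator with its targeted update of the initial estimator; the full template would then apply likelihood-based cross-validation to select among these candidates.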

Given this sequence of *k*-th step collaborative targeted maximum likelihood estimators
${P}_{n}\to ({Q}_{n}^{k*}=){\widehat{Q}}^{k*}({P}_{n})$, using estimator
${g}_{n}^{k}$, it remains to select *k*, *k* = 0, 1, . . . , *K*.

We select *k* based on the cross-validated log-likelihood, i.e., the minimal cross-validated loss:

$${k}_{n}=\underset{k}{\text{argmin}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}L({\widehat{Q}}^{k*}({P}_{n,{B}_{n}}^{0})),$$

where the random vector *B _{n}* ∈ {0, 1}^{n} denotes a random split into a training sample {*i* : *B _{n}*(*i*) = 0} and a validation sample {*i* : *B _{n}*(*i*) = 1}.

This finalizes the mapping from the initial estimator *Q _{n}*, and the data, into a collaborative estimator of the censoring mechanism,
${g}_{n}={g}_{n}^{{k}_{n}}$. We refer to
${Q}_{n}^{*}={Q}_{n}^{{k}_{n}*}$, paired with this collaborative estimator *g _{n}*, as the collaborative targeted maximum likelihood estimator of *Q*_{0}.

**The Collaborative (Double Robust) Targeted Maximum Likelihood Estimator:** The corresponding targeted maximum likelihood estimator of *ψ*_{0} = Ψ* ^{F}*(*Q*_{0}) is the plug-in estimator

$$\Psi ({Q}_{n}^{*})=\Psi ({Q}_{n}^{{k}_{n}*})=\Psi ({\widehat{Q}}^{{k}_{n}*}({P}_{n})).$$

We refer to this estimator as the collaborative (double robust) targeted maximum likelihood estimator (C-DR-TMLE or C-TMLE) of *ψ*_{0}, and we recall that it is paired with a collaborative estimator *g _{n}*.

**C-TMLE solves an efficient influence curve equation:** Since the C-TMLE is a targeted maximum likelihood estimator
${Q}_{n}^{{k}_{n}*}$, applying the fluctuation function with censoring mechanism estimator
${g}_{n}={g}_{n}^{{k}_{n}}$ to the estimator
${Q}_{n}^{{k}_{n}}$, it solves the efficient influence curve equation:

$$0={P}_{n}D*({Q}_{n}^{*},{g}_{n}).$$

This is a fundamental property of the collaborative targeted MLEs, driving the targeted bias reduction w.r.t. the target parameter of interest, *ψ*_{0}.

**Selection among candidate C-TMLEs:** The collaborative targeted maximum likelihood estimator depends on a choice of initial estimator
${Q}_{n}^{0}$, and choices that concern the second stage. As a consequence, one might have a set of collaborative targeted maximum likelihood estimators (
${Q}_{nj}^{*}$, *g _{nj}*) indexed by such choices, *j* = 1, . . . , *J*, and one will wish to select among them.

**Selection based on empirical efficiency maximization:** Since, under regularity conditions of our asymptotic linearity theorem, each *j*-specific C-TMLE is asymptotically linear with influence curve *IC _{j}*(
${Q}_{j}^{*}$, *g _{j}*) = *D*^{*}(${Q}_{j}^{*}$, *g _{j}*), one can select the index minimizing the cross-validated estimate of the variance of the influence curve:

$${j}_{n}=\text{arg}\underset{j}{\text{min}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{\{D*({\widehat{Q}}_{j}^{*}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}_{j}({P}_{n,{B}_{n}}^{0}))\}}^{2}.$$

**Generalization.** The above C-TMLE can also be called the collaborative minimum loss estimator. The loss function *L*(*Q*) needs to satisfy that the derivative of *ε* → *L*(*Q _{g}*(*ε*)) at *ε* = 0 spans the efficient influence curve *D*^{*}(*Q*, *g*), so that the minimum loss update still solves the efficient influence curve equation.

The C-TMLE procedure starts with an initial estimator *Q _{n}* of *Q*_{0}, and constructs a sequence of targeted maximum likelihood estimators, indexed by increasingly nonparametric estimators of *g*_{0}, from which the cross-validated log-likelihood selects.

For simplicity, we also assume that the *k*-th targeted maximum likelihood estimator in the sequence is obtained by applying the targeted maximum likelihood algorithm to the previous targeted maximum likelihood estimator in sequence. This is not necessary, since we can apply the argument to the subsequence for which that is true (the elements in the sequence at which the targeted maximum likelihood update is actually carried out), but it simplifies the presentation.

Consider the limits
${Q}_{{kg}_{k}}^{*}$ of the targeted maximum likelihood estimators
${Q}_{n}^{*k}$ in our sequence, where *g _{k}* is the limit of
${g}_{n}^{k}$, *k* = 0, . . . , *K*, where we assume that the most nonparametric censoring mechanism estimator is consistent, so that *g _{K}* = *g*_{0}. Suppose that, for some $\tilde{k}$, the log-likelihood fits of these limits no longer increase from $\tilde{k}$ on:

$${P}_{0}\text{log}{Q}_{\tilde{k}}^{*}={P}_{0}\text{log}{Q}_{\tilde{k}+1}^{*}=\dots ={P}_{0}\text{log}{Q}_{K}^{*}.$$

We know that *Q*^{*k+1} is a T-MLE with *Q*^{*k} as initial. So the above equalities are only possible if *Q*^{*k+1} = *Q*^{*k} for *k* = $\tilde{k}$, . . . , *K* – 1. Thus
${Q}_{K}^{*}={Q}_{\tilde{k}}^{*}$. Since
${Q}_{K}^{*}$ is a targeted MLE at nuisance parameter *g*_{0}, it follows that
$\epsilon \to {P}_{0}\text{log}{Q}_{K,{g}_{0}}^{*}(\epsilon )$ is maximized at *ε* = 0: compare with the fact that
$\epsilon \to {P}_{n}\text{log}{Q}_{nK}^{*}(\epsilon )$ is maximized at *ε* = 0 by definition of the T-MLE algorithm. Since we just showed that
${Q}_{K}^{*}={Q}_{\tilde{k}}^{*}$, it now also follows that

$$\epsilon \to {P}_{0}\text{log}{Q}_{\tilde{k},{g}_{0}}^{*}(\epsilon )$$

is maximized at *ε* = 0. In particular, this means that the derivative at *ε* = 0 equals zero, giving us:

$$0={P}_{0}D*({Q}_{\tilde{k}}^{*},{g}_{0}).$$

However, the efficient influence curve typically satisfies that *P*_{0}*D*^{*}(*Q*, *g*_{0}) = 0 implies Ψ(*Q*) = *ψ*_{0}, which then implies
$\Psi ({Q}_{\tilde{k}}^{*})={\psi}_{0}$. Thus
$\Psi ({Q}_{n\tilde{k}}^{*})$ is consistent, and thereby
$\Psi ({Q}_{n{k}_{n}}^{*})$ is consistent.

Figure 1 illustrates the collaborative nature of the construction of a sequence of increasingly data-adaptive nuisance parameter estimators,
$\{{g}_{n}^{1},\dots ,{g}_{n}^{K}\}$, and its relation to the performance of the initial estimator. We generated 5000 observations of *O* = (*W*, *A*, *Y*) from the data generating distribution *dP*_{0} = *Q*_{0}*g*_{0} defined as:

$$\begin{array}{l}\text{logit}({g}_{0}(1|W))=.15{W}_{1}+.1{W}_{2}+{W}_{3}-{W}_{4}\\ {Q}_{0}(A,W)=A+3{W}_{1}-6{W}_{2}+4{W}_{3}-5{W}_{5}+3{W}_{4},\end{array}$$

where *W*_{1} through *W*_{5} are independent random variables ~ *N*(0, 1), *Y* = *Q*_{0}(*A*, *W*) + *ε*, *ε* ~ *N*(0, 1), and *g*_{0} is the conditional density of *A* given confounding variables *W* = {*W*_{1}, *W*_{2}, *W*_{3}, *W*_{4}, *W*_{5}}. We applied the C-TMLE to estimate the effect of binary treatment *A* on outcome *Y*, adjusting for *W*, defined as *ψ*_{0} = *E _{W}*(*Q*_{0}(1, *W*) – *Q*_{0}(0, *W*)).
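This data-generating distribution can be reproduced as follows (the random seed is ours). Note that *A* enters *Q*_{0} with coefficient 1, so the true effect *Q*_{0}(1, *W*) – *Q*_{0}(0, *W*) equals 1 for every *W*, and hence *ψ*_{0} = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
expit = lambda x: 1 / (1 + np.exp(-x))

# Draw O = (W, A, Y) from dP_0 = Q_0 g_0 as specified above.
W = rng.normal(size=(n, 5))                    # W1..W5 ~ N(0, 1), independent
W1, W2, W3, W4, W5 = W.T
g0 = expit(0.15 * W1 + 0.1 * W2 + W3 - W4)     # g_0(1 | W)
A = rng.binomial(1, g0)
Q0 = A + 3 * W1 - 6 * W2 + 4 * W3 - 5 * W5 + 3 * W4
Y = Q0 + rng.normal(size=n)                    # Y = Q_0(A, W) + N(0, 1) error

# Q_0(1, W) - Q_0(0, W) = 1 for every observation, so psi_0 = 1.
print((Q0 + (1 - A)).mean() - (Q0 - A).mean())
```

Note that *W*_{5} affects *Y* but not *A*, so the true treatment mechanism does not adjust for it; this is what makes the example informative about which covariates the collaborative *g*-estimators choose to include.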

Construction of a sequence of nuisance parameter estimators based on a poor initial fit of the density (top) and a good initial fit for the density (bottom). Kernel estimates of true densities *Q*_{0} and *g*_{0} are shown in gray.

A kernel density estimator was applied to *Y* and to the predicted values of two initial estimators of *Q*_{0} = *E*_{0}(*Y* | *A*, *W*), which we denote with
${\widehat{Q}}_{n,\mathit{\text{poor}}}^{0}$, and
${\widehat{Q}}_{n,\mathit{\text{good}}}^{0}$, respectively. These estimators were obtained with the D/S/A algorithm (Sinisi and van der Laan, 2004), a data-adaptive machine learning approach to model selection that was set to search over all second degree polynomials of size six. The kernel density estimates are displayed in plots on the left hand side of the figure.

In addition, we plotted the kernel density estimates of the predicted values of each set of the collaboratively-constructed candidate *ĝ* estimators, and we can compare them with the density of the true predictions *g*_{0}(1 | *W*) = *P*_{0}(*A* = 1|*W*). These are plotted on the right hand side of the figure, overlaid with the density estimator applied to the true values *g*_{0}(1 | *W*). When the initial fit of *Q*_{0} is poor, the nuisance parameter estimator
${g}_{n}^{k}$ converges quickly to *g*_{0} in *k*, and the selected candidate estimator closely approximates *g*_{0}. Plots in the bottom half of the figure show the behavior of the C-TMLE procedure when
${Q}_{n}^{0}$ is a good estimate of *Q*_{0}. When the initial fit of *Q*_{0} is good, the nuisance parameter estimator grows slowly towards *g*_{0}, and the selected candidate estimator adjusts for fewer covariates than the true treatment mechanism *g*_{0} that was used to generate the data.

Recall that the targeted maximum likelihood estimator applied to an estimator, *Q _{n}*, of *Q*_{0}, and an estimator, *g _{n}*, of *g*_{0}, solves the efficient influence curve equation 0 = *P _{n}D*^{*}(${Q}_{n}^{*}$, *g _{n}*).

The collaborative targeted maximum likelihood estimation procedure starts with computing
${Q}_{n}^{0}$, an initial estimator of *P*(*Y* = 1 | *A*, *W*) using super learning, and then collaboratively generating a sequence of targeted maximum likelihood estimators. These use increasingly nonparametric estimators of *g*_{0}, applied to subsequent targeted maximum likelihood updates of the initial estimator (as needed to guarantee the monotonicity in fit). In this way the sequence of constructed targeted maximum likelihood estimators has increasing log-likelihood fit. The selection of the sequence of increasingly nonparametric treatment mechanism estimators was based on maximizing the fit of the corresponding targeted maximum likelihood estimators of *P*(*Y* = 1 | *A*, *W*), as outlined in our template, thus very much driven by the outcome data. Likelihood based cross-validation selects the wished targeted maximum likelihood estimator, with its paired treatment mechanism estimator, from this sequence. It is assumed that the resulting selection of the estimator of *g*_{0} is nonparametric enough so that the collaborative double robustness of the efficient influence curve as presented in next section is utilized, and, thereby, that our asymptotic linearity theorem in later section can indeed be applied.

A collaborative targeted maximum likelihood estimator constructed in this manner has made every effort to make the estimator of the additive causal effect as unbiased as possible. If we now construct a set of such collaborative targeted maximum likelihood estimators, possibly indexed by different initial estimators, and different ways of constructing the sequence of targeted maximum likelihood estimators, we can then select among these estimators the estimator with minimal estimated variance (based on the influence curve). To obtain an honest estimate of the variance of the resulting estimator, just as one obtained honest cross-validated risk of an estimator that internally uses cross-validation, one uses the honest cross-validated variance of the influence curve of the complete estimator, including cross-validating this final selection step that involves minimizing the variance.
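The influence-curve-based variance estimate mentioned here takes a simple form. A minimal sketch (the point estimate and the influence curve values are placeholders; in practice the latter would be the evaluations *D*^{*}(${Q}_{n}^{*}$, *g _{n}*)(*O _{i}*), cross-validated as described above):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical point estimate and per-observation influence curve values.
psi_n = 0.21
IC = rng.normal(scale=1.4, size=800)   # stand-in for D*(Q_n*, g_n)(O_i)

# For an asymptotically linear estimator, var(psi_n) ~ Var(IC)/n,
# yielding a Wald-type 95% confidence interval.
se = IC.std(ddof=1) / np.sqrt(len(IC))
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)
print(se, ci)
```

Selecting among candidate C-TMLEs by minimizing this variance estimate, and then reporting this same variance, would be optimistic; hence the honest, fully cross-validated variance discussed above.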

In this section we establish a new kind of collaborative robustness of the class of estimating functions in CAR-censored data models, where, as in van der Laan and Robins (2003), the class of estimating functions is implied by the orthogonal complement of the nuisance tangent space of the target parameter Ψ :
$\mathcal{M}$ → IR* ^{d}*. This orthogonal complement of the nuisance tangent space equals the space spanned by the gradients of the pathwise derivative of Ψ, and thus includes the canonical gradient/efficient influence curve. The collaborative robustness result teaches us that the censoring mechanism required to obtain an unbiased estimating function at a mis-specified *Q* need only condition on a reduction of *X* that is rich enough to capture the difference *Q* – *Q*_{0}.

The new form of double robustness we wish to establish is understood as follows. Consider an estimating function *D*(Ψ(*Q*), *G*, *Q*) for the parameter of interest *ψ*_{0} that is indexed by nuisance parameters (*G*_{0}, *Q*_{0}), and which is already known to satisfy the classical double robustness property: for any *G* under which *ψ*_{0} is identifiable from *P*_{Q0,}* _{G}*, we have *P*_{0}*D*(*ψ*_{0}, *G*, *Q*_{0}) = 0.

Firstly, we note that *P*_{0}*D*(*ψ*_{0}, *G*, *Q*) = *P*_{0}{*D*(*ψ*_{0}, *G*, *Q*) – *D*(*ψ*_{0}, *G*, *Q*_{0})} + *P*_{0}*D*(*ψ*_{0}, *G*, *Q*_{0}), and the latter term is zero under any *G* that allows identifiability of *ψ*_{0}. Thus, it remains to determine for what *G*_{0δ} we will have *P*_{0}{*D*(*ψ*_{0}, *G*_{0δ}, *Q*) – *D*(*ψ*_{0}, *G*_{0δ}, *Q*_{0})} = 0. This choice of *G*_{0δ} (e.g., it includes *G*_{0} itself) is not unique, but will depend on the difference *Q* – *Q*_{0} in the sense that *X*(*δ*) has to be rich enough so that it captures the difference *Q* – *Q*_{0}.

By the general representation theorem for estimating functions that are orthogonal to the nuisance tangent space of the target parameter (Theorem 1.6, van der Laan and Robins (2003)), one can typically represent an estimating function *D*(*ψ*_{0}, *G*, *Q*) as an Inverse Probability of Censoring Weighted estimating function *D _{IPCW}*(*ψ*_{0}, *G*) minus a CAR-term *D _{CAR}*(*Q*, *G*) that is an element of the tangent space *T _{CAR}*(*G*) of the censoring mechanism, where only the latter term depends on *Q*.

This teaches us that *P*_{0}{*D*(*ψ*_{0}, *G*, *Q*) – *D*(*ψ*_{0}, *G*, *Q*_{0})} = *P*_{0}{*D _{CAR}*(*Q*_{0}, *G*) – *D _{CAR}*(*Q*, *G*)}, and, by linearity of *Q* → *D _{CAR}*(*Q*, *G*), this equals *P*_{0}*D _{CAR}*(*Q*_{0} – *Q*, *G*). The question thus becomes: for which reductions *X*(*δ*) does *E*_{0}*D _{CAR}*(*Q*_{0} – *Q*, *G*_{0δ}) = 0?

This is best illustrated with a concrete censored data structure. For example, consider the right-censored data structure *O* = (*C*, $\overline{X}$(*C*)), where *X*(*t*) is a time-dependent process, *X* = (*X*(*t*) : *t*) represents the full data structure, and $\overline{X}$(*t*) = {*X*(*s*) : *s* ≤ *t*} represents the sample path up till time *t*. For this censored data structure, one can represent the projection of *D _{IPCW}* onto *T _{CAR}*(*G*) as ∫ *H _{Q,G}*(*u*, $\overline{X}$(*u*–))*dM _{G}*(*u*), where

$$\begin{array}{l}{H}_{Q,G}(u,\overline{X}(u-))={E}_{Q,G}({D}_{IPCW,G}|C=u,\overline{X}(u))-{E}_{Q,G}({D}_{IPCW,G}|C\ge u,\overline{X}(u)),\\ d{M}_{G}(u)=I(C=u)-I(C\ge u)d{\Lambda}_{C|X}(u|X),\end{array}$$

and Λ_{C|X} is the cumulative hazard of *C*, given *X*. For details, we refer to chapter 3 in van der Laan and Robins (2003). Here *dM _{G}*(*u*) is the martingale increment of the censoring process under *G*: it has conditional mean zero, given $\overline{X}$(*u*) and *C* ≥ *u*, when *G* is the true censoring mechanism.

The following much simpler (but in essence making the same point) example helps to further illustrate the general collaborative double robustness property of the efficient influence curve. Suppose the observed censored data structure is *O* = (*W*, Δ, Δ*Y*) and *X* = (*W*, *Y*) is the full data random variable, where Δ is the censoring variable. Suppose one wishes to estimate *ψ*_{0} = *E*_{0}*Y*. The efficient influence curve is given by

$$D({\psi}_{0},{\Pi}_{0},{Q}_{0})={D}_{IPCW}({\psi}_{0},{\Pi}_{0})+{D}_{CAR}({Q}_{0},{\Pi}_{0}),$$

where

$$\begin{array}{l}{D}_{IPCW}({\psi}_{0},{\Pi}_{0})=Y\frac{\Delta}{{\Pi}_{0}(W)}-{\psi}_{0}\\ {D}_{CAR}({Q}_{0},{\Pi}_{0})=-E(Y|\Delta =1,W)\left(\frac{\Delta}{{\Pi}_{0}(W)}-1\right),\end{array}$$

where Π_{0}(*W*) = *P*_{0}(Δ = 1 | *W*) and *Q*_{0}(*W*) = *E*_{0}(*Y* | *W*, Δ = 1). Consider a mis-specified *Q*. We are interested in the question under what conditional distribution Π_{0δ} of Δ, given a reduction *W*(*δ*) of *W*, we will still have *P*_{0}*D*(*ψ*_{0}, Π_{0δ}, *Q*) = 0, and thereby that *D* is an unbiased estimating function for *ψ*_{0} at this mis-specified *Q*. Firstly, we note that *P*_{0}*D*(*ψ*_{0}, Π, *Q*) = *P*_{0}{*D*(*ψ*_{0}, Π, *Q*) – *D*(*ψ*_{0}, Π, *Q*_{0})} + *P*_{0}*D*(*ψ*_{0}, Π, *Q*_{0}), and the latter term is zero under any Π for which *P*_{0}(Π(*W*) *>* 0) = 1. Thus, it remains to determine for what Π_{0δ} we have *P*_{0}{*D*(*ψ*_{0}, Π_{0δ}, *Q*) – *D*(*ψ*_{0}, Π_{0δ}, *Q*_{0})} = 0.

This teaches us that *P*_{0}{*D*(*ψ*_{0}, Π_{0δ}, *Q*) – *D*(*ψ*_{0}, Π_{0δ}, *Q*_{0})} = *P*_{0}{*D _{CAR}*(*Q*, Π_{0δ}) – *D _{CAR}*(*Q*_{0}, Π_{0δ})} = *P*_{0}*D _{CAR}*(*Q* – *Q*_{0}, Π_{0δ}), which equals

$${P}_{0}\{D({\psi}_{0},{\Pi}_{0\delta},Q)-D({\psi}_{0},{\Pi}_{0\delta},{Q}_{0})\}=-{E}_{0}\frac{(Q-{Q}_{0})(W)}{{\Pi}_{0\delta}(W(\delta ))}\left({\Pi}_{0}(W)-{\Pi}_{0\delta}(W(\delta ))\right).$$

Note that we used here that *Q* → *D _{CAR}*(*Q*, Π) is linear, so that *D _{CAR}*(*Q*, Π_{0δ}) – *D _{CAR}*(*Q*_{0}, Π_{0δ}) = *D _{CAR}*(*Q* – *Q*_{0}, Π_{0δ}). From the displayed expression it follows that this term equals zero as soon as (*Q* – *Q*_{0})(*W*) depends on *W* only through *W*(*δ*).

The proof that the conditional mean of *D _{CAR}*(*Q* – *Q*_{0}, Π_{0δ}), given *W*(*δ*), equals zero when *Q* – *Q*_{0} is only a function of *W*(*δ*) follows from noting that, in that case,

$${E}_{0}\left\{\frac{(Q-{Q}_{0})(W(\delta ))}{{\Pi}_{0\delta}(W(\delta ))}\left(\Delta -{\Pi}_{0\delta}(W(\delta ))\right)|W(\delta )\right\}=\frac{(Q-{Q}_{0})(W(\delta ))}{{\Pi}_{0\delta}(W(\delta ))}\left({E}_{0}(\Delta |W(\delta ))-{\Pi}_{0\delta}(W(\delta ))\right)=0.$$

In particular, we have that the conditional mean of *D _{CAR}*(*Q* – *Q*_{0}, Π_{0}), given *W*, equals zero, which corresponds with the classical double robustness of *D* at the true missingness mechanism Π_{0}.
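This collaborative robustness can be checked by Monte Carlo. In the sketch below (our own illustrative choices: Π_{0}, the reduction *W*(*δ*) = *W*_{1}, and the mis-specification error *Q* – *Q*_{0}), the bias term vanishes when *Q* – *Q*_{0} depends on *W* only through *W*_{1}, while an error depending on *W*_{2} is not removed by the reduced mechanism.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
expit = lambda x: 1 / (1 + np.exp(-x))

W1 = rng.normal(size=n)
W2 = rng.normal(size=n)
Pi0 = expit(W1 + 0.5 * W2)            # true missingness mechanism Pi_0(W)
Delta = rng.binomial(1, Pi0)

# Reduced mechanism Pi_0delta(W1) = P(Delta = 1 | W1), computed by
# Gauss-Hermite integration over W2 ~ N(0, 1).
x, w = np.polynomial.hermite.hermgauss(50)
Pi0d = (expit(W1[:, None] + 0.5 * np.sqrt(2) * x) * w).sum(axis=1) / np.sqrt(np.pi)

# An error Q - Q_0 that depends on W only through W1.
diff = np.sin(W1)
# D_CAR(Q - Q_0, Pi_0delta) = -(Q - Q_0)(W) (Delta / Pi_0delta(W1) - 1).
m1 = np.mean(-diff * (Delta / Pi0d - 1))
print(m1)                             # ~ 0: the reduced mechanism suffices

# Contrast: an error depending on W2 is not removed by Pi_0delta.
diff2 = np.sin(W2)
m2 = np.mean(-diff2 * (Delta / Pi0d - 1))
print(m2)                             # bounded away from 0
```

The first mean is zero up to Monte Carlo error because *E*(Δ | *W*_{1}) = Π_{0δ}(*W*_{1}); the second is not, because *W*_{1} is not rich enough to capture an error living on *W*_{2}.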

**Summary:** Consider the efficient influence curve *D*(*Q*, *G*). Suppose we already know that *P*_{0}*D*(*Q*_{0}, *G*) = 0 for all *G* under which *ψ*_{0} is identifiable. Given a *Q* with Ψ(*Q*) = *ψ*_{0}, characterize the set of *G*_{0}(*Q*)’s for which *P*_{0}{*D*(*Q*_{0}, *G*_{0}(*Q*)) – *D*(*Q*, *G*_{0}(*Q*))} = 0. For such *Q* and corresponding *G*_{0}(*Q*)’s we have *P*_{0}*D*(*Q*, *G*_{0}(*Q*)) = 0. Given the representation theorem for estimating functions derived from the orthogonal complement of the nuisance tangent space, it appears that we need to determine the conditional distributions *G*_{0δ} of *C*, given a reduction *X*(*δ*) of *X*, for which *E*_{0}*D _{CAR}*(*G*_{0δ}, *Q* – *Q*_{0}) = 0.

We will now present the general result which can be applied to any CAR-censored data model as defined and studied in van der Laan and Robins (2003).

**CAR-censored data model:** *Let O =* Φ(*C, X*) *~ P*_{0} *be a censored data structure with full data random variable X ~ P _{X0}, and assume the censoring mechanism g*_{0} *satisfies coarsening at random, so that dP*_{0} *= Q*_{0}*g*_{0}*, with model* $\mathcal{M}$ *implied by models* $\mathcal{Q}$ *and* $\mathcal{G}$ *as above.*

**Parameter of interest:** *Let* Ψ: *$\mathcal{M}$ →* IR^{d} *be a pathwise differentiable parameter of interest, and assume that* Ψ(*P*_{0}) *=* Ψ* ^{F}*(*Q*_{0}) *for some parameter* Ψ* ^{F}* *of the full-data distribution factor Q*_{0}*.*

We make the following assumptions:

**Augmented PCW-representation of efficient influence curve** *(PCW stands for Probability of Censoring Weighted)*: *For each Q* ∈
$\mathcal{Q}$, *G* ∈ *$\mathcal{G}$,*

$$D*(G,Q)={D}_{h(G,Q)}(G,Q)={D}_{h(G,Q),PCW}(G,\Gamma (Q))+{D}_{h(G,Q),CAR}(G,{Q}^{\prime}),$$

*for mappings* (*G, Q*) *→ h*(*G, Q*) ∈ $\mathcal{H}$*,* (*h, G,* Γ) *→ D _{h,PCW}*(*G,* Γ)*, and* (*h, G, Q′*) *→ D _{h,CAR}*(*G, Q′*)*, where Q′ = Q′*(*G, Q*) *is a parameter of* (*G, Q*)*.*

*(We refer to Theorem 1.3 in van der Laan and Robins (2003) for such a general representation of the efficient influence curve and, more generally, the orthogonal complement of the nuisance tangent space, where the CAR-components are elements of the tangent space T _{CAR}*(*G*) *of the censoring mechanism.)*

**Linearity of CAR-component:** *For each h, G, the mapping Q′ → D _{h,CAR}*(*G, Q′*) *is linear:*

$${D}_{h,CAR}(G,{{Q}^{\prime}}_{1})-{D}_{h,CAR}(G,{{Q}^{\prime}}_{2})={D}_{h,CAR}(G,{{Q}^{\prime}}_{1}-{{Q}^{\prime}}_{2}).$$

**Robustness for mis-specified censoring mechanism:** *For all Q*_{0} ∈ *
$\mathcal{Q}$ and G* ∈ *
$\mathcal{G}$*(*Q*_{0})*,*

$${E}_{0}{D}_{h}(G,{Q}_{0})=0\phantom{\rule{0.5em}{0ex}}\mathit{\text{for all}}\phantom{\rule{0.5em}{0ex}}h\in \mathcal{H}.$$

**Robustness of CAR-component:** *For a reduction X*(*δ*) *of X (i.e., X*(*δ*) *= f*(*X, δ*) *for some function f), let G*_{0δ} *be the conditional distribution of C, given X*(*δ*).

*Let*
${\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$ *be a set within*
${\overline{\mathcal{Q}}}^{\prime}$ *for which for each*
${\overline{Q}}^{\prime}\in {\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$*,*

$${E}_{0}{D}_{h,CAR}({G}_{0\delta},{\overline{Q}}^{\prime})=0.$$

*(Typically, one can select*
${\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$ *as all functions in*
${\overline{\mathcal{Q}}}^{\prime}$ *that are only functions of X through X*(*δ*)*.)*

*Let* Γ(*Q*) *=* Γ(*Q*_{0}) *(typically implying* Ψ(*Q*) *= ψ*_{0}*), G*_{0δ} ∈ *
$\mathcal{G}$*(*Q*_{0})*, and assume*
${Q}^{\prime}-{{Q}^{\prime}}_{0}\in {\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$*, where Q′ = Q′*(*G*_{0δ}*, Q*) *and Q′*_{0} *= Q′*(*G*_{0δ}*, Q*_{0})*. Then*

$${E}_{0}D*({G}_{0\delta},Q)=0.$$

*We also have for all G ∈*
$\mathcal{G}$*(*Q*_{0})

$${E}_{0}D*(G,{Q}_{0})=0.$$

**Proof.** Suppose Γ(*Q*) = Γ(*Q*_{0}) and
${Q}^{\prime}-{{Q}^{\prime}}_{0}\in {\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$. Let
${G}_{0}^{*}={G}_{0\delta}$ be the conditional distribution of *C*, given *X*(*δ*), and assume it is an element of
$\mathcal{G}$(*Q*_{0}).

By the “Augmented ‘PCW’-representation of efficient influence curve” assumption, we have

$${E}_{0}D*({G}_{0}^{*},Q)={E}_{0}{D}_{h}({G}_{0}^{*},Q)$$

for some *h* ∈
$\mathcal{H}$. Thus,

$$\begin{array}{rl}{E}_{0}D*({G}_{0}^{*},Q)&={E}_{0}{D}_{h}({G}_{0}^{*},Q)\\ &={E}_{0}\{{D}_{h}({G}_{0}^{*},Q)-{D}_{h}({G}_{0}^{*},{Q}_{0})\}+{E}_{0}{D}_{h}({G}_{0}^{*},{Q}_{0}).\end{array}$$

By the assumption that ${G}_{0}^{*}\in \mathcal{G}({Q}_{0})$, it follows that the last term ${E}_{0}{D}_{h}({G}_{0}^{*},{Q}_{0})=0$.

By the “PCW-representation” assumption we have

$$\begin{array}{rl}{E}_{0}\{{D}_{h}({G}_{0}^{*},Q)-{D}_{h}({G}_{0}^{*},{Q}_{0})\}&={E}_{0}\{{D}_{h,PCW}({G}_{0}^{*},\Gamma (Q))-{D}_{h,PCW}({G}_{0}^{*},\Gamma ({Q}_{0}))\}\\ &+{E}_{0}\{{D}_{h,CAR}({G}_{0}^{*},{Q}^{\prime}(Q,{G}_{0}^{*}))-{D}_{h,CAR}({G}_{0}^{*},{Q}^{\prime}({Q}_{0},{G}_{0}^{*}))\}.\end{array}$$

By the assumption that Γ(*Q*) = Γ(*Q*_{0}), the first term equals zero. By the “linearity of CAR-component”-assumption we have that the last term equals:

$${E}_{0}\{{D}_{h,CAR}({G}_{0}^{*},{Q}^{\prime})-{D}_{h,CAR}({G}_{0}^{*},{{Q}^{\prime}}_{0})\}={E}_{0}{D}_{h,CAR}({G}_{0}^{*},{Q}^{\prime}-{{Q}^{\prime}}_{0}),$$

where ${Q}^{\prime}={Q}^{\prime}({G}_{0}^{*},Q)$ and ${{Q}^{\prime}}_{0}={Q}^{\prime}({G}_{0}^{*},{Q}_{0})$.

We assumed that ${Q}^{\prime}-{{Q}^{\prime}}_{0}\in {\overline{{\mathcal{Q}}_{\delta}}}^{\prime}$. Thus, by the “Robustness of CAR-component”-assumption we have that

$${E}_{0}{D}_{h,CAR}({G}_{0}^{*},{Q}^{\prime}-{{Q}^{\prime}}_{0})=0.$$

This proves ${E}_{0}{D}^{*}({G}_{0}^{*},Q)=0$. □

For the sake of illustration, we will now explicitly establish the collaborative double robustness of the efficient influence curve estimating function in two additional examples. These results are also corollaries of the above general Theorem 1.

We have the following double robustness result for our additive causal effect example.

**Theorem 2** *Let dP*_{0} *= Q*_{0}*dG*_{0} *be the distribution of O =* (*W, A, Y*) *and let the model for P*_{0} *be nonparametric.*

*Let* Ψ(*Q*_{0}) *= E*_{Q01}{*E*_{Q02}(*Y | A =* 1*, W*) *– E*_{Q02}(*Y | A =* 0*, W*)} *be the parameter on this model, where it is assumed that it is identifiable from P*_{0}*. Here Q*_{01} *denotes the marginal distribution of W and Q*_{02} *the conditional distribution of Y, given A, W. The efficient influence curve of* Ψ *at P =* (*Q, G*) *is given by*

$$D*(Q,G)(O)=h(G)(A,W)(Y-{Q}_{2}(A,W))+{Q}_{2}(1,W)-{Q}_{2}(0,W)-\Psi (Q),$$

*where Q*_{2}(*A, W*) *= E _{Q}*(*Y | A, W*) *and h*(*G*)(*A, W*) *=* (2*A –* 1)*/G*(*A | W*)*.*

*Assume*

$$({Q}_{02}-{Q}_{2})(A,W)={E}_{{Q}_{0}}(Y-{Q}_{2}(A,W)|A,W)={f}_{0}(A,W(Q))$$

*is only a function of A, W*(*Q*) *for a W*(*Q*) *=* Φ(*Q*_{2}*, W*) *for some mapping* Φ*: i.e., W*(*Q*) *denotes a reduction or subset of the full vector random variable W indexed by Q.*

*Let dG*_{0}(*Q*) *be the conditional distribution of A, given W*(*Q*)*. If* Ψ(*Q*) *=* Ψ(*Q*_{0})*, then*

$${E}_{{P}_{0}}D*(Q,{G}_{0}(Q))=0.$$

*Or, equivalently, if we represent D ^{*}*(*Q, G*) *as an estimating function D ^{*}*(*ψ, Q, G*) *in ψ, then*

$${E}_{{P}_{0}}D*({\psi}_{0},Q,{G}_{0}(Q))=0.$$

*We also have: If Pr*(0 *< P _{G}*(*A =* 1 *| W*) *<* 1) *=* 1 *(positivity), then*

$${E}_{{P}_{0}}D*({Q}_{0},G)=0,$$

*or equivalently,*

$${E}_{{P}_{0}}D*({\psi}_{0},{Q}_{0},G)=0.$$

**Proof.** The last statement is easy and well known (e.g., van der Laan and Robins (2003)). The first statement needs to be proved, or can be derived as a corollary of Theorem 1. Note, if Ψ(*Q*) = *ψ*_{0}, then

$${E}_{0}D*(Q,{G}_{0}(Q))={E}_{0}h({G}_{0})(A,W(Q))(Y-Q(A,W))+Q(1,W)-Q(0,W)-{\psi}_{0}.$$

If *E*_{0}(*Y* –*Q*(*A*, *W*) | *A*, *W*) = *f*_{0}(*A*, *W*(*Q*)) is only a function of *A*, *W*(*Q*), then it follows by first taking the conditional mean, given *A*, *W*, and then taking the mean of *A*, given *W*(*Q*),

$$\begin{array}{rl}{E}_{0}D*(Q,{G}_{0}(Q))&={E}_{0}h({G}_{0})(A,W(Q)){f}_{0}(A,W(Q))+Q(1,W)-Q(0,W)-{\psi}_{0}\\ &={E}_{0}\{{f}_{0}(1,W(Q))-{f}_{0}(0,W(Q))+Q(1,W)-Q(0,W)\}-{\psi}_{0}.\end{array}$$

Now, note that *f*_{0}(*A*, *W*(*Q*)) = *Q*_{0}(*A*, *W*) – *Q*(*A*, *W*), which proves that the latter quantity equals zero.

□

The implication of this result is that, given an estimate *Q* of *Q*_{0}, we only need to estimate *G*_{0}(*Q*), conditioning on *W*(*Q*), or any conditional distribution that conditions on more than *W*(*Q*). Thus, if *Q* already succeeds in explaining most of the true regression *E*_{0}(*Y* | *A*, *W*), then only little inverse weighting with *G*_{0}(*Q*) = *P*(*A* = · | *W*(*Q*)) remains to be done. That is, the amount and manner of inverse weighting required to obtain a consistent estimator of the causal effect *ψ*_{0} can be adapted to the approximation error of *Q* relative to the true regression.
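The collaborative double robustness of Theorem 2 can be checked numerically by exact enumeration in a small discrete toy example. The particular distributions below are illustrative, not from the paper: *Q* is misspecified so that *Q*_{0} – *Q* depends on *W* = (*W*_{1}, *W*_{2}) only through *W*_{1}, and adjusting the treatment mechanism for *W*_{1} alone already removes all bias, while no adjustment at all does not.

```python
import itertools
from math import exp

def expit(x):
    return 1.0 / (1.0 + exp(-x))

# W = (W1, W2) independent Bernoulli covariates, binary treatment A.
pW1, pW2 = 0.4, 0.6

def pW(w1, w2):
    return (pW1 if w1 else 1 - pW1) * (pW2 if w2 else 1 - pW2)

def g0(a, w1, w2):  # true treatment mechanism P(A = a | W)
    p1 = expit(w1 + 0.5 * w2 - 0.3)
    return p1 if a else 1 - p1

def Q0(a, w1, w2):  # true outcome regression E0(Y | A, W)
    return 1 + a + 0.5 * w1 + 0.8 * w2 + 0.3 * a * w1

def Q(a, w1, w2):   # misspecified Q omitting all A-terms, so that
    return 1 + 0.5 * w1 + 0.8 * w2  # Q0 - Q = a(1 + 0.3 w1): W(Q) = W1

grid2 = list(itertools.product([0, 1], repeat=2))
psi0 = sum(pW(w1, w2) * (Q0(1, w1, w2) - Q0(0, w1, w2)) for w1, w2 in grid2)

def G0Q(a, w1):     # reduced mechanism P(A = a | W1) (W1 and W2 independent)
    p1 = sum((pW2 if w2 else 1 - pW2) * g0(1, w1, w2) for w2 in (0, 1))
    return p1 if a else 1 - p1

def mean_Dstar(Gred):
    """E0 D*(Q, G), with h(G)(A, W) = (2A - 1)/G(A | W), by exact enumeration
    (Y is integrated out against the true regression Q0)."""
    total = -psi0
    for w1, w2 in grid2:
        total += pW(w1, w2) * (Q(1, w1, w2) - Q(0, w1, w2))
        for a in (0, 1):
            ipw = (2 * a - 1) / Gred(a, w1) * (Q0(a, w1, w2) - Q(a, w1, w2))
            total += pW(w1, w2) * g0(a, w1, w2) * ipw
    return total

pA1 = sum(pW(w1, w2) * g0(1, w1, w2) for w1, w2 in grid2)
print(abs(mean_Dstar(G0Q)))                                   # ~0: W1-adjustment suffices
print(abs(mean_Dstar(lambda a, w1: pA1 if a else 1 - pA1)))   # non-zero: no adjustment fails
```

The first quantity is zero to machine precision, exactly tracing the proof: averaging *g*_{0}(*a* | *W*)/*G*_{0}(*Q*)(*a* | *W*_{1}) over *W*_{2} given *W*_{1} yields one.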

Let *O* = (*W*, *A*, *Y*) ~ *P*_{0}. Assume the model *E*_{0}(*Y* | *A*, *W*) – *E*_{0}(*Y* | *A* = 0, *W*) = *Aβ*_{0}*V* for some *V* ⊂ *W*. If the variance of *Y*, given *A*, *W*, only depends on *W*, then the efficient score of *β*_{0} at *P*_{0} can be represented as

$$D*({\overline{g}}_{0},{\theta}_{0},{\beta}_{0})=(A-{\overline{g}}_{0}(W))(Y-A{\beta}_{0}V-{\theta}_{0}(W)),$$

where ${\overline{g}}_{0}(W)$ = *E*_{0}(*A* | *W*), and *θ*_{0}(*W*) = *E*_{0}(*Y* | *A* = 0, *W*). For the sake of illustration we will use this simpler representation, but the same double robustness applies to the general efficient influence curve representation as (e.g.) presented in van der Laan and Robins (2003).

**Theorem 3** *Suppose E*_{0}(*Y – Aβ*_{0}*V – θ*(*W*) *| A, W*) *= f*_{0}(*W*(*θ*)) *for some function f*_{0} *of W*(*θ*)*, where W*(*θ*) *=* Φ(*W, θ*) *is a function of W and θ. Note that this states that θ*_{0}(*W*) *– θ*(*W*) *= f*_{0}(*W*(*θ*)) *is only a function of a reduction W*(*θ*) *of W. Let* ${\overline{g}}_{0}(\theta )(W)={E}_{0}(A|W(\theta ))$*. Then*

$${E}_{0}D*({\overline{g}}_{0}(\theta ),\theta ,{\beta}_{0})=0.$$

*We also have*

$${E}_{0}D*(\overline{g},{\theta}_{0},{\beta}_{0})=0\text{ for all }\overline{g}.$$

**Proof.** Only the first robustness result needs to be proved. First take the conditional mean, given *A*, *W*, which results in the term ${E}_{0}(A-{\overline{g}}_{0}(\theta )(W(\theta ))){f}_{0}(W(\theta ))$. Subsequently, we take the conditional mean, given *W*(*θ*), which proves it equals zero. □

By using a collaborative estimator *g _{n}*(

Other methods for construction of collaborative estimators *g _{n}*(

Given an initial estimator *Q _{n}*, another idea of interest for construction of a collaborative estimator of the censoring mechanism is to select, among candidate estimators ${\widehat{g}}_{j}$, the one whose cross-validated empirical mean of the efficient influence curve is closest to zero:

$${j}_{n}=\text{arg}\underset{j}{\text{min}}{\Vert {E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}D*(\widehat{Q}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}_{j}({P}_{n}))\Vert}^{2},$$

where *B _{n}* denotes a random variable in {0, 1}^{n} that splits the sample into a training sample {*i* : *B _{n}*(*i*) = 0} and a validation sample {*i* : *B _{n}*(*i*) = 1}, and ${P}_{n,{B}_{n}}^{0}$, ${P}_{n,{B}_{n}}^{1}$ denote the empirical distributions of the training and validation sample, respectively.

Let *H*(*g*_{0}, *Q* – *Q*_{0}) be the component that needs to be adjusted for in *g*_{0} = *g*_{0}(*Q*). One could estimate this component from the data using appropriate methodology. If, given an arbitrary initial fit *g*, one would add *H*(*g*, *Q* – *Q*_{0}) as main term in a fluctuation model of *g*, and the fluctuation function is chosen so that the score of the coefficient of this main term at zero equals *D _{CAR}*(

The sufficient covariate *H*(*g*, *Q*–*Q*_{0}), that is needed to update *g*, depends on *g* itself as well, so that, even given an estimate of *Q* – *Q*_{0}, the above maximum likelihood update of an initial *g* does not work. There are two approaches that can be used to deal with this self-dependence of the minimal adjustment covariate for *g*.

Firstly, one can extract the few components of only *Q* – *Q*_{0}, and enforce nonparametric adjustment by these covariates in a fit of *g*_{0}(*Q*) of the censoring mechanism. In this manner, the resulting censoring mechanism estimator will estimate a true conditional distribution that conditions on covariates that imply the value of *H*(*g*_{0}(*Q*), *Q* – *Q*_{0}). Secondly, one can also only adjust for *H*(*g*^{0}, *Q*–*Q*_{0}) as a main term, given an estimate *g*^{0}, and iterate this updating process of *g* till convergence. In the latter case, as mentioned above, it is assumed that the score of the fluctuation of *g* implied by this main term extension *H*(*g*, *Q*–*Q*_{0}) at zero equals *D _{CAR}*(

For example, in the additive causal effect example with *O* = (*W*, *A*, *Y*), *ψ*_{0} = *E*{*Y* (1) – *Y* (0)} and *Y* continuous, we have
$H({g}_{0},Q-{Q}_{0})=\frac{1}{{g}_{0}(1|W)}E(Y-Q|A=1,W)+\frac{1}{{g}_{0}(0|W)}E(Y-Q|A=0,W)$, and the score of *ε*, at *ε* = 0, of logistic regression model *g _{ε}*(1 |

One can estimate *E*(*Y* – *Q*|*A* = 1, *W*) and *E*(*Y* – *Q*|*A* = 0, *W*) with a machine learning algorithm, treating an initial estimate *Q* as offset (possibly cross-validated to make the offset independent of *Y _{i}*). If

Alternatively, given an initial estimator
${g}_{n}^{0}$, one could obtain an estimate
${H}_{n}^{0}$ by plugging in an estimate of *Q*_{0} – *Q*, and this initial
${g}_{n}^{0}$. One could now force in this
${H}_{n}^{0}$ as a main term in
${g}_{n}^{0}$, resulting in an updated
${g}_{n}^{1}$. This process is iterated till convergence. In the limit we have that *g _{n}* solves
$0={P}_{n}H({g}_{n},\widehat{Q-{Q}_{0}})(A-{g}_{n}(1|W))$. Here
${g}_{n}^{0}$ would already be a collaborative estimator of

We do not advise starting the above iterative algorithm at a purposely misspecified estimator
${g}_{n}^{0}$. Instead we want to apply the above iterative algorithm at a collaborative estimator
${g}_{n}^{0}$, such as the one presented in our template of the C-TMLE in Section 2. For example, after having run the C-TMLE in Section 2, we would carry out a subsequent update of the resulting collaborative estimator *g _{n}* by applying the above iterative updating algorithm, starting at
${g}_{n}^{0}={g}_{n}$, and using an estimator of

In the above additive causal risk example, we estimate *H*(*g*, *Q* – *Q*_{0}) by plugging in an initial estimate *g*^{0} and *Q* – *Q*_{0}, and the iterative adjustment succeeds in its goal as long as the estimate of *Q* – *Q*_{0} is correct, even if (the initial) *g* is misspecified. One could also use a representation such as *H*(*g*_{0}, *Q*–*Q*_{0}) = *E*_{g0},_{Q0}(*D _{IPCW}*(
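A minimal numerical sketch of this iterative updating of *g* in the additive causal effect example. Everything below — the data-generating choices, the analytically plugged-in residual regressions *E*(*Y* – *Q* | *A* = *a*, *W*), and the plain Newton-Raphson logistic fitter — is illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
W = rng.uniform(-1, 1, n)
expit = lambda x: 1 / (1 + np.exp(-x))
A = rng.binomial(1, expit(0.8 * W - 0.2))   # true g0(1 | W)

# Hypothetical truth: Q0(A, W) = A(1 + 0.5 W^2) + W; misspecified fit Q(W) = 0.5 W.
# The residual regressions E(Y - Q | A = a, W) are then known analytically:
d1 = lambda w: 1 + 0.5 * w + 0.5 * w**2     # Q0(1, W) - Q(W)
d0 = lambda w: 0.5 * w                      # Q0(0, W) - Q(W)

def fit_logistic(X, y, steps=25):
    """Plain Newton-Raphson logistic regression; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = expit(X @ beta)
        w = p * (1 - p)
        beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

# Repeatedly refit g with the current clever covariate H(g, Q - Q0) as a main term.
g1 = np.full(n, A.mean())                   # crude starting value for g(1 | W)
for _ in range(50):
    H = d1(W) / g1 + d0(W) / (1 - g1)
    X = np.column_stack([np.ones(n), W, H])
    g1 = expit(X @ fit_logistic(X, A))

# At the fixed point, g_n solves the score equation 0 = P_n H(g_n)(A - g_n(1 | W)).
H = d1(W) / g1 + d0(W) / (1 - g1)
print(abs(np.mean(H * (A - g1))))
```

At convergence the covariate *H* recomputed from the final *g _{n}* agrees with the one used in the final fit, so the displayed score equation holds up to numerical tolerance.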

The collaborative targeted maximum likelihood estimator
${Q}_{n}^{*}$ equals a *k _{n}*-th step collaborative targeted maximum likelihood estimator, and thereby equals a targeted maximum likelihood estimator with a starting estimator
${Q}_{n}^{k}$ (e.g., the

Thus, just like the targeted maximum likelihood estimator, the collaborative targeted maximum likelihood estimator
${\psi}_{n}=\Psi ({Q}_{n}^{*})$ of *ψ*_{0} solves the efficient influence curve estimating equation

$$0={P}_{n}D*({Q}_{n}^{*},{g}_{n},{\psi}_{n}).$$

For simplicity, we will make the assumption that the efficient influence curve at a *P _{Q,g}* can be represented as an estimating function (*Q, g, ψ*) → *D*^{*}(*Q, g, ψ*) in *ψ*.

It is a reasonable assumption that
${Q}_{n}^{*}$ converges to some element *Q*^{*} in the model for *Q*_{0}, where *Q*^{*} is not necessarily equal to the true *Q*_{0}. In addition, let’s assume that, for each *δ*, the *δ*-specific censoring mechanism estimator *g _{nδ}* converges to some *g*_{0δ}.

For a given *Q*, we define *δ*(*Q*) as the index *δ* with entropy *d*(*δ*) minimal and so that

$${P}_{0}D*(Q,{g}_{0\delta (Q)},{\psi}_{0})=0.$$

In other words, given the family of adjustments indexed by *δ*, *δ* (*Q*) represents the minimal adjustment necessary in the censoring mechanism to obtain the collaborative double robustness/unbiased estimating function for *ψ*_{0}. It is then a natural assumption that

$${P}_{0}D*(Q,{g}_{0\delta},{\psi}_{0})=0\text{ for each }\delta \text{ with }d(\delta )\ge d(\delta (Q)).$$

In other words, if one uses a more nonparametric estimator of the censoring mechanism than needed (i.e., more nonparametric than *δ*(*Q*)), then one certainly obtains the wished unbiasedness.

We will assume that, as *n* converges to infinity, then the selected censoring mechanism estimator *g _{n}* =

It is assumed that *d*(*δ*_{0}) ≥ *d*(*δ*(*Q*^{*})) so that

$$0={P}_{0}D*(Q*,{g}_{0},{\psi}_{0}),$$

which will be the fundamental assumption for asymptotic normality of the CTMLE. In other words, it is assumed that our collaborative C-TMLE procedure selects a nonparametric enough estimator *g _{n}* for the censoring mechanism (in collaboration with
${Q}_{n}^{*}$) so that the required unbiasedness of the efficient influence curve estimating function is achieved.

To derive the influence curve of $\Psi ({Q}_{n}^{*})$, the asymptotic linearity theorem below assumes also that the limit of the selected censoring mechanism estimator satisfies

$${P}_{0}D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})={o}_{P}(1/\sqrt{n}).$$

(3)

As a consequence of this assumption (3), the influence curve does not involve a contribution requiring the analysis of a function of
${Q}_{n}^{*}$. This important simplification of the influence curve allows straightforward calculation of standard errors for the C-TMLE. The assumption (3) requires the limit *g*_{0} to be nonparametric enough w.r.t. the actual estimator
${Q}_{n}^{*}$ so that enough orthogonality is achieved to make the contribution
${P}_{0}{D}^{*}({Q}_{n}^{*},{g}_{0},{\psi}_{0})$ second order.

**Why assumption (3) holds for C-TMLE:** We now explain why this assumption is reasonable for the C-TMLE.

Define
${g}_{0}({Q}_{n}^{*})$ as
${g}_{0{\delta}_{n}^{*}}$ with
${\delta}_{n}^{*}=\text{min}\{\delta :d(\delta )\ge d({\delta}_{0}),{P}_{0}{D}^{*}({Q}_{n}^{*},{g}_{0\delta},{\psi}_{0})=0\}$. In other words,
${g}_{0}({Q}_{n}^{*})$ corresponds with the limit of the least nonparametric estimator (among all estimators more nonparametric than the one identified by *δ*_{0}) that still yields the wished unbiasedness of the estimating function at
${Q}_{n}^{*}$, and it is as close as possible to *g*_{0} = *g*_{0δ0}.

We note that

$${P}_{0}\{D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\}={P}_{0}\{D*({Q}_{n}^{*},{g}_{0}({Q}_{n}^{*}),{\psi}_{0})-D*(Q*,{g}_{0}({Q}_{n}^{*}),{\psi}_{0})\}+{R}_{n},$$

where *R _{n}* is a second order term (like

By definition of ${g}_{0}({Q}_{n}^{*})$, we do not only have

$${P}_{0}D*({Q}_{n}^{*},{g}_{0}({Q}_{n}^{*}),{\psi}_{0})=0,$$

but also that
${g}_{0}({Q}_{n}^{*})$ is equally or more nonparametric than *g*_{0}(*Q*^{*}) so that

$${P}_{0}D*(Q*,{g}_{0}({Q}_{n}^{*}),{\psi}_{0})=0.$$

This implies now that indeed

$${P}_{0}D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})={o}_{P}(1/\sqrt{n}).$$

Finally, we note that the next theorem can be applied to any collaborative double robust estimator, as discussed in previous section, not only the collaborative double robust targeted maximum likelihood estimator.

**Theorem 4** *Let* (*Q, g, ψ*) → *D*^{*}(*Q, g, ψ*) *be a well defined function that maps any possible* (*Q, g,* Ψ(*Q*)) *into a function of O. Let O*_{1}, . . . , *O*_{n} ~ *P*_{0} *be i.i.d., and let P _{n}* *denote the corresponding empirical probability distribution.*

*Let Q ^{*}*

Assume

**Efficient Influence Curve Estimating Equation:**
$0={P}_{n}{D}^{*}({Q}_{n}^{*},{g}_{n},{\psi}_{n})$, *where*
${\psi}_{n}=\Psi ({Q}_{n}^{*})$.

**Censoring Mechanism Estimator is Nonparametric Enough:**

$${P}_{0}D*(Q*,{g}_{0},{\psi}_{0})=0.$$

$${P}_{0}D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})={o}_{P}(1/\sqrt{n}).$$

*(Above we show why the latter is indeed a second order term for the C-TMLE.)*

**Consistency:**

$${P}_{0}{(D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{0}))}^{2}\to 0\text{ in probability},$$

*as n → ∞. The same is assumed to hold if one or two of the components of the triplet*
$({Q}_{n}^{*},{g}_{n},{\psi}_{n})$ *are replaced by their limits* (*Q*^{*}*, g*_{0}*, ψ*_{0}).

**Identifiability/Invertibility:** *c*_{0} = –*d/dψ*_{0}*P*_{0}*D*^{*}(*Q*^{*}, *g*_{0}, *ψ*_{0}) *exists and is invertible.*

**Donsker Class:** {*D*^{*}(*Q, g,* Ψ(*Q*)) : *Q, g*} *is P*_{0}*-Donsker, where* (*Q, g*) *vary over sets that contain* (${Q}_{n}^{*}$*, g _{n}*) *and* (*Q*^{*}*, g*_{0}) *with probability tending to one.*

**Contribution due to Censoring Mechanism Estimation:** *Define the mapping g* → Φ(*g*) *=* *P*_{0}*D*^{*}(*Q*^{*}, *g*, *ψ*_{0})*. Assume*
$\Phi ({g}_{n})-\Phi ({g}_{0})=({P}_{n}-{P}_{0})I{C}_{{g}_{0}}+{o}_{P}(1/\sqrt{n})$ *for some mean zero function*
${IC}_{{g}_{0}}\in {L}_{0}^{2}({P}_{0})$.

**Second order terms:** *Define second order term*

$$\begin{array}{rl}{R}_{n1}&={P}_{0}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}\\ &-{P}_{0}\{D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\},\end{array}$$

*and assume*
${R}_{n1}={o}_{P}(1/\sqrt{n})$*. Note R _{n}*

*Define second order term*

$$\begin{array}{rl}{R}_{n2}&={P}_{0}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}\\ &-{P}_{0}\{D*(Q*,{g}_{n},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\},\end{array}$$

*and assume*
${R}_{n2}={o}_{P}(1/\sqrt{n})$*. Note R*_{n2} *is a second order term involving differences g _{n}*

*Then, ψ _{n}* *is a consistent and asymptotically linear estimator of ψ*_{0} *with influence curve*

$$IC({P}_{0})={c}_{0}^{-1}\{D*(Q*,{g}_{0},{\psi}_{0})+I{C}_{{g}_{0}}\}.$$

*That is,*

$${\psi}_{n}-{\psi}_{0}=({P}_{n}-{P}_{0})IC({P}_{0})+{o}_{P}(1/\sqrt{n}).$$

*In particular,*
$\sqrt{n}({\psi}_{n}-{\psi}_{0})$ *converges in distribution to a multivariate normal distribution with mean zero and covariance matrix* ${\Sigma}_{0}={E}_{0}IC({P}_{0})IC{({P}_{0})}^{\top}$.

**Proof:** The principal equations are
$0={P}_{n}{D}^{*}({Q}_{n}^{*},{g}_{n},{\psi}_{n})$ and *P*_{0}*D*^{*}(*Q*^{*}, *g*_{0}, *ψ*_{0}) = 0. So, we have

$${P}_{0}\{D*(Q*,{g}_{0},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{0})\}=-\{{P}_{n}D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-{P}_{0}D*(Q*,{g}_{0},{\psi}_{n})\}.$$

Let ${c}_{0}=-\frac{d}{d{\psi}_{0}}{P}_{0}{D}^{*}({Q}^{*},{g}_{0},{\psi}_{0})$. Then,

$$\begin{array}{rl}{c}_{0}({\psi}_{n}-{\psi}_{0})+o(|{\psi}_{n}-{\psi}_{0}|)&=({P}_{n}-{P}_{0})D*(Q*,{g}_{0},{\psi}_{n})\\ &+{P}_{n}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}\\ &+{P}_{n}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}.\end{array}$$

We denote the three terms on the right with I, II and III, and deal with them separately below.

**I:** By the Donsker condition, and consistency condition, we have

$$({P}_{n}-{P}_{0})\{D*(Q*,{g}_{0},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{0})\}={o}_{P}(1/\sqrt{n}).$$

Thus, we obtain $({P}_{n}-{P}_{0}){D}^{*}({Q}^{*},{g}_{0},{\psi}_{0})+{o}_{P}(1/\sqrt{n})$ as first term approximation. We refer to van der Vaart and Wellner (1996) for this empirical process theorem.

**II:** We have

$$\begin{array}{rl}{P}_{n}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}&=({P}_{n}-{P}_{0})\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}\\ &+{P}_{0}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}.\end{array}$$

The first term is ${o}_{P}(1/\sqrt{n})$ by our Donsker class condition, and consistency condition at $({Q}_{n}^{*},{g}_{n},{\psi}_{n})$. We also have

$${P}_{0}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})\}={P}_{0}\{D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\}+{R}_{n1},$$

where

$$\begin{array}{rl}{R}_{n1}&={P}_{0}\{D*({Q}_{n}^{*},{g}_{n},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{n})-D*({Q}_{n}^{*},{g}_{0},{\psi}_{0})+D*(Q*,{g}_{0},{\psi}_{0})\}\\ &={o}_{P}(1/\sqrt{n}),\end{array}$$

by assumption.

*R _{n}*

**III:** We have

$$\begin{array}{rl}{P}_{n}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}&=({P}_{n}-{P}_{0})\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}\\ &+{P}_{0}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}.\end{array}$$

The first term is
${o}_{P}(1/\sqrt{n})$ by the Donsker class condition, and consistency condition at $(Q*,{g}_{n},{\psi}_{n})$. We also have

$${P}_{0}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})\}={P}_{0}\{D*(Q*,{g}_{n},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\}+{R}_{n2},$$

where

$$\begin{array}{rl}{R}_{n2}&={P}_{0}\{D*(Q*,{g}_{n},{\psi}_{n})-D*(Q*,{g}_{0},{\psi}_{n})-D*(Q*,{g}_{n},{\psi}_{0})+D*(Q*,{g}_{0},{\psi}_{0})\}\\ &={o}_{P}(1/\sqrt{n}),\end{array}$$

by assumption. Thus the third term equals ${P}_{0}\{D*(Q*,{g}_{n},{\psi}_{0})-D*(Q*,{g}_{0},{\psi}_{0})\}+{o}_{P}(1/\sqrt{n})=\Phi ({g}_{n})-\Phi ({g}_{0})+{o}_{P}(1/\sqrt{n})$, which equals $({P}_{n}-{P}_{0})I{C}_{{g}_{0}}+{o}_{P}(1/\sqrt{n})$ by the asymptotic linearity assumption on Φ(*g _{n}*).

We can thus conclude that

$${\psi}_{n}-{\psi}_{0}=({P}_{n}-{P}_{0}){c}_{0}^{-1}\{D*(Q*,{g}_{0},{\psi}_{0})+I{C}_{{g}_{0}}\}+{o}_{P}(|{\psi}_{n}-{\psi}_{0}|)+{o}_{P}(1/\sqrt{n}).$$

This implies $|{\psi}_{n}-{\psi}_{0}|={O}_{P}(1/\sqrt{n})$, and thereby the stated asymptotic linearity. □

If *Q*^{*} = *Q*_{0}, then *IC*_{g0} = 0, so that the influence curve reduces to the efficient influence curve *D*^{*}(*Q*_{0}, *g*_{0}, *ψ*_{0}) at a possibly weakly adjusted *g*_{0}. If *g _{n}* converges to the fully adjusted conditional distribution, given

One can estimate the covariance matrix $\Sigma ={E}_{0}IC{IC}^{\top}$ of the influence curve with the empirical covariance matrix
${\Sigma}_{n}=1/n{\sum}_{i=1}^{n}\hat{IC}({O}_{i})\hat{IC}{({O}_{i})}^{\top}$, and statistical inference can be based on the corresponding mean zero multivariate normal distribution, as usual.
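A sketch of this Wald-type inference. The function name and interface are hypothetical; the array `ic` stands for the estimated influence-curve values $\hat{IC}({O}_{i})$, which in practice come from the estimated efficient influence curve plus the *IC*_{g0} contribution:

```python
import numpy as np
from statistics import NormalDist

def ic_inference(ic, psi_n, level=0.95):
    """Standard errors and Wald confidence intervals from estimated
    influence-curve values IC-hat(O_i); one entry per coordinate of psi_n."""
    ic = np.asarray(ic, dtype=float).reshape(len(ic), -1)
    n = ic.shape[0]
    sigma_n = ic.T @ ic / n                  # Sigma_n = (1/n) sum IC IC^T
    se = np.sqrt(np.diag(sigma_n) / n)       # SE of the estimator psi_n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    psi_n = np.atleast_1d(np.asarray(psi_n, dtype=float))
    return se, psi_n - z * se, psi_n + z * se

# Illustration with synthetic mean-zero influence-curve values:
rng = np.random.default_rng(1)
ic = rng.normal(0, 2.0, size=1000)
se, lo, hi = ic_inference(ic, psi_n=0.7)
```

Note that, as in the display above, the second moment is used directly without centering, since the influence curve has mean zero under *P*_{0}.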

Suppose that we have a set of candidate collaborative targeted maximum likelihood estimators
$({\widehat{Q}}_{k}^{*}({P}_{n}),{\widehat{g}}_{k}({P}_{n}))$, *k* = 1, . . . , *K*. Suppose that each of these estimators satisfies the conditions of the theorem. For example, these might be collaborative targeted maximum likelihood estimators as defined in our template, using different initial estimators indexed by *k*, but the same collaborative estimator for the censoring mechanism *as a function of the data and the initial estimator* (thus still resulting in different realizations if the initial estimators are different). Then
$\Psi ({\widehat{Q}}_{k}^{*}({P}_{n}))$ is asymptotically linear with influence curve
${D}^{*}({Q}_{k}^{*},{g}_{0k},{\psi}_{0})$, *k* = 1, . . . , *K*. We can now select among these candidate C-DR-TMLEs by maximizing the estimated efficiency, as in Rubin and van der Laan (2008).

Specifically, let Ψ be a one-dimensional parameter. We now select the *k* that minimizes the cross-validated variance of the influence curve:

$${k}_{n}=\text{arg}\underset{k}{\text{min}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}D{*}^{2}({\widehat{Q}}_{k}^{*}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}_{k}({P}_{n,{B}_{n}}),{\psi}_{n}).$$

Thus, we would use the estimator
${\psi}_{n}=\Psi ({\widehat{Q}}_{{k}_{n}}^{*}({P}_{n}))$. If Ψ is multidimensional, then one needs to agree on a real valued criterion applied to the covariance matrix of the influence curve, such as the sum of the variances along the diagonal, and minimize over *k* the criterion of the cross-validated covariance matrix of the *k*-specific influence curve.
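A sketch of this selector for a one-dimensional Ψ. The interface is illustrative, not from the paper: each candidate is represented by a `fit` callable that trains on $P_{n,B_n}^{0}$ and returns a function evaluating its estimated influence curve on the validation rows of $P_{n,B_n}^{1}$:

```python
import numpy as np

def select_candidate(data, candidates, n_splits=5, seed=0):
    """Pick the candidate whose influence curve has the smallest
    cross-validated second moment (one-dimensional target parameter).

    candidates: list of fit(train) -> ic_fn, where ic_fn(valid) returns the
    estimated influence-curve values on the validation rows."""
    n = len(data)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_splits          # V-fold split B_n
    scores = []
    for fit in candidates:
        sq = []
        for v in range(n_splits):
            ic_fn = fit(data[folds != v])          # train on P^0_{n,B_n}
            sq.append(np.mean(ic_fn(data[folds == v]) ** 2))  # validate on P^1_{n,B_n}
        scores.append(float(np.mean(sq)))
    return int(np.argmin(scores)), scores

# Toy illustration: two candidate influence curves for a univariate mean;
# the second inflates the efficient IC by a factor 2 and should lose.
data = np.random.default_rng(2).normal(1.0, 1.0, 300)
candidates = [lambda tr: (lambda va: va - tr.mean()),
              lambda tr: (lambda va: 2.0 * (va - tr.mean()))]
best, scores = select_candidate(data, candidates)
print(best)   # 0
```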

If *g _{n}* converges to the fully adjusted

Due to the particular way *g _{n}* is constructed in response to

The super efficiency may have very attractive features in practice. For example, there might be a covariate that is very predictive of censoring/treatment, but has no relation to the outcome. The C-TMLE will now decide not to adjust for this covariate at all in the selected censoring mechanism, and as a consequence, it might achieve the efficiency bound for the data structure excluding this covariate, but still assuming CAR, so that the C-TMLE will have smaller asymptotic variance than the efficiency bound. The resulting super efficient estimator not only shows improved precision, but also yields more reliable confidence intervals, by avoiding heavily non-robust (and harmful) operations. In most practical scenarios, such a covariate will still have a weak link with the outcome. In this case, for very large sample sizes, the C-TMLE will adjust for this covariate and thereby only be asymptotically efficient, but it will still behave as a super efficient estimator for practical sample sizes, by not adjusting for this covariate. That is, it invests in effective bias reduction, focusing on covariates that are still predictive of the outcome, taking into account the already included initial estimator. This behavior is completely compatible with an estimator that aims to minimize the mean squared error of the estimator of the target parameter, and certainly avoids steps that increase both bias and variance.

Finally, we remark that in simulations in which *Q _{n}* converges fast to the true

The template of the collaborative targeted maximum likelihood estimators is based on 1) a log-likelihood loss function (i.e., same loss function that is maximized at targeted maximum likelihood step) to select among candidate targeted maximum likelihood estimators indexed by increasingly nonparametric estimators of censoring mechanism, and 2) a preferred loss function to compare targeted maximum likelihood estimators using different censoring mechanism estimators, in order to build these candidate censoring mechanism estimators. One can also use a preferred loss function to select among different candidate collaborative targeted maximum likelihood estimators (e.g., indexed by different initial estimators).

In this section we propose targeted loss functions implied by the efficient influence curve of Ψ. Firstly, the log-likelihood can be replaced by a penalized log-likelihood that is sensitive to sparse data bias w.r.t. the target parameter, as defined in the next subsection. This penalized log-likelihood can also play the role of the preferred loss function. In the second subsection we propose as preferred loss function the cross-validated variance of the efficient influence curve, relying on an overall collaborative estimator of the censoring mechanism w.r.t. an initial estimator, or a candidate-specific collaborative estimator. In the last subsection, we utilize the mean of the efficient influence curve as a criterion to generate a targeted loss function for *Q* that incorporates a sequence of increasingly nonparametric estimators of the censoring mechanism.

In the C-DR-TMLE we applied log-likelihood based cross-validation to select among different targeted maximum likelihood estimators, indexed by different censoring mechanism estimators. We propose here a penalized log-likelihood criterion that results in robust estimators in the context of sparse data w.r.t. the parameter of interest.

Consider candidate (e.g., collaborative) targeted maximum likelihood estimators ${P}_{n}\to {\widehat{Q}}_{\delta}^{*}({P}_{n})$ of the true *Q*_{0} in the model $\mathcal{M}$, targeting a parameter *ψ*_{0} = Ψ(*Q*_{0}), indexed by *δ*. Our proposed criterion for selecting *δ* is

$${\delta}_{n}=\underset{\delta}{\text{argmax}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}\text{log}{\widehat{Q}}_{\delta}^{*}({P}_{n,{B}_{n}}^{0})-MSE({P}_{n})(\delta ),$$

where the first term is the cross-validated log-likelihood of the candidate estimator ${\widehat{Q}}_{\delta}^{*}({P}_{n})$, and *MSE*(*P _{n}*)(*δ*) is an estimate of the mean squared error of the corresponding parameter estimate, as defined below.
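To make the selection rule concrete, a small sketch follows; the Gaussian working log-likelihood, the two toy candidate fitters, and the fixed (zero) penalties are illustrative assumptions rather than the paper's estimators.

```python
import numpy as np

rng = np.random.default_rng(1)

def cv_loglik(y, x, fit, n_folds=5):
    """Cross-validated Gaussian log-likelihood (up to constants) of a candidate fit."""
    idx = rng.permutation(len(y))
    scores = []
    for val in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, val)
        predict = fit(x[train], y[train])          # train on the training split
        scores.append(-np.mean((y[val] - predict(x[val])) ** 2))
    return float(np.mean(scores))

def select_delta(y, x, candidates, penalties):
    """delta_n = argmax over candidates of CV log-likelihood minus MSE penalty."""
    crit = [cv_loglik(y, x, f) - p for f, p in zip(candidates, penalties)]
    return int(np.argmax(crit))

# Toy candidates: a simple linear fit versus a constant-zero fit.
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
linear = lambda xt, yt: (lambda xv: np.polyval(np.polyfit(xt, yt, 1), xv))
zero = lambda xt, yt: (lambda xv: np.zeros_like(xv))
best = select_delta(y, x, [linear, zero], penalties=[0.0, 0.0])
```

With equal penalties the candidate with the better cross-validated fit wins; in practice the penalty term differs per candidate and can reverse the ordering for borderline-identifiable parameters.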

If the target parameter cannot be reasonably identified from the data, the log-likelihood of the targeted maximum likelihood estimator is not sensitive enough to such a singularity: in fact, on many occasions this just means that the targeted maximum likelihood algorithm becomes ineffective (i.e., the maximum likelihood fluctuations get too noisy), so that in essence the log-likelihood of the initial estimator drives the selection.

Therefore it is crucial that the log-likelihood terms are penalized by a term that blows up (in the negative direction) for *δ*-values for which the variance (or bias, addressed in the next subsection) of the targeted maximum likelihood estimator $\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))$ relative to its limit ${\psi}_{0}(\delta )=\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{0}))$ gets large. Since we can derive the influence curve of the targeted maximum likelihood estimator $\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))$ as an estimator of *ψ*_{0}(*δ*), this variance can be estimated with the variance of this influence curve at this targeted maximum likelihood estimator ${\widehat{Q}}_{\delta}^{*}({P}_{n})$. As follows from the study of TMLE in van der Laan and Rubin (2006), one can often use as approximate influence curve the efficient influence curve ${D}^{*}({Q}_{\delta}^{*},{g}_{\delta})$ at the limit $({Q}_{\delta}^{*},{g}_{\delta})$ of the targeted maximum likelihood estimator $({\widehat{Q}}_{\delta}^{*},{\widehat{g}}_{\delta})$.

We first define the cross-validated covariance matrix

$$\frac{\Sigma ({P}_{n})(\delta )}{n}=\frac{1}{n}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}\{{D}^{*}({\widehat{P}}_{\delta}^{*}({P}_{n,{B}_{n}}^{0})){D}^{*}{({\widehat{P}}_{\delta}^{*}({P}_{n,{B}_{n}}^{0}))}^{\top}\}.$$

For example, if the target parameter is 1-dimensional (i.e., *d* = 1), then we have

$$\frac{{\sigma}^{2}({P}_{n})(\delta )}{n}=\frac{1}{n}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{\{D*({\widehat{P}}_{\delta}^{*}({P}_{n,{B}_{n}}^{0}))\}}^{2}.$$

For example, one can define the variance term of the MSE as

$${\sigma}^{2}({P}_{n})(\delta )={a}^{\top}\Sigma ({P}_{n})(\delta )a$$

for a user supplied vector *a*, so that *σ*^{2}(*P _{n}*)(*δ*)*/n* estimates the variance of the linear combination ${a}^{\top}\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))$ of the parameter estimates.

Our proposal presented in the MSE-subsection below will actually have the form

$${\sigma}^{2}({P}_{n})(\delta )=\sum _{j=1}^{d}{a}_{j}^{\top}\Sigma ({P}_{n})(\delta ){a}_{j},$$

(4)

where *a _{j}* are the row vectors of the square root of a user supplied matrix, such as the inverse of the correlation matrix of Σ(*P _{n}*)(*δ*).

We might also wish to estimate the bias of the targeted maximum likelihood estimator
$\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))$ relative to its limit *ψ*_{0}(*δ*) (even though in most applications the variance appears to drive the MSE term). For example, this could be done with the bootstrap:

$${E}_{{P}_{n}}\{\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}^{\#}))-\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))\},$$

where ${P}_{n}^{\#}$ represents the empirical distribution of a bootstrap sample ${O}_{1}^{\#},\dots ,{O}_{n}^{\#}$ from the empirical distribution *P _{n}*. However, this would be much too computer intensive in many applications in which the targeted maximum likelihood estimator involves data adaptive model or algorithm selection. By noting that a bootstrap sample corresponds on average with a sample of about 2*/*3 of the *n* observations, we can instead use the cross-validated bias estimate

$$B({P}_{n})(\delta )={E}_{{B}_{n}}\{\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n,{B}_{n}}^{0}))-\Psi ({\widehat{Q}}_{\delta}^{*}({P}_{n}))\},$$

where *B _{n}* corresponds with three-fold cross-validation, so that the training samples contain 2*/*3 of the observations.
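The cross-validated bias estimate can be sketched as follows; the sample mean stands in for a TMLE, a simplifying assumption chosen because its lack of a second-order remainder makes the bias estimate essentially zero.

```python
import numpy as np

# B(P_n) = E_{B_n} psi_hat(training sample) - psi_hat(full sample), with
# three-fold cross-validation so each training sample holds 2/3 of the data.
rng = np.random.default_rng(2)

def cv_bias(data, psi_hat, n_folds=3):
    idx = rng.permutation(len(data))
    psi_full = psi_hat(data)
    psi_trains = [psi_hat(data[np.setdiff1d(idx, val)])
                  for val in np.array_split(idx, n_folds)]
    return float(np.mean(psi_trains) - psi_full)

data = rng.normal(loc=1.0, size=300)
# For the sample mean (an asymptotically linear estimator with zero remainder),
# the bias estimate vanishes up to floating point, matching the derivation in
# the "Additional rationale" paragraph below.
bias_est = cv_bias(data, np.mean)
```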

If *d* = 1, then we will add to the variance term in the previous section the squared bias *B*(*P _{n}*)^{2}(*δ*). In general, for *d* ≥ 1, we add the squared bias term

$$b{({P}_{n})}^{2}(\delta )=\sum _{j}{({a}_{j}^{\top}B({P}_{n})(\delta ))}^{2}.$$

**Additional rationale behind bias term:** To provide further understanding of this kind of bias estimate *B*(*P _{n}*), we note the following. Let $\widehat{\Psi}({P}_{n})$ be an estimator of its limit $\widehat{\Psi}({P}_{0})$ that satisfies the expansion

$$\widehat{\Psi}({P}_{n})-\widehat{\Psi}({P}_{0})=({P}_{n}-{P}_{0})D({P}_{0})+R({P}_{n}),$$

(5)

where *D*(*P*_{0}) is the influence curve of the estimator and *R*(*P _{n}*) is the remainder. Asymptotic linearity corresponds with the assumption that $R({P}_{n})={o}_{P}(1/\sqrt{n})$.

The representation (5) of the mapping ${P}_{n}\to \widehat{\Psi}({P}_{n})$ yields the following expression for the cross-validated bias estimate:

$$\begin{array}{ll}B({P}_{n})&={E}_{{B}_{n}}\widehat{\Psi}({P}_{n,{B}_{n}}^{0})-\widehat{\Psi}({P}_{n})\\ &={E}_{{B}_{n}}\{\widehat{\Psi}({P}_{n,{B}_{n}}^{0})-\widehat{\Psi}({P}_{0})\}-\{\widehat{\Psi}({P}_{n})-\widehat{\Psi}({P}_{0})\}\\ &={E}_{{B}_{n}}\{({P}_{n,{B}_{n}}^{0}-{P}_{0})D({P}_{0})+R({P}_{n,{B}_{n}}^{0})\}-\{({P}_{n}-{P}_{0})D({P}_{0})+R({P}_{n})\}\\ &={E}_{{B}_{n}}R({P}_{n,{B}_{n}}^{0})-R({P}_{n}),\end{array}$$

where we use that ${E}_{{B}_{n}}{P}_{n,{B}_{n}}^{0}D({P}_{0})={P}_{n}D({P}_{0})$. Thus, our proposed bias estimate *B*(*P _{n}*) estimates a difference of second-order remainder terms, which captures the finite sample bias not explained by the first-order linear term.

If *d* = 1, then we define the MSE term as

$$MSE({P}_{n})(\delta )=\frac{{\sigma}^{2}({P}_{n})(\delta )}{n}+B{({P}_{n})}^{2}(\delta ).$$

If *d >* 1, then we assume that we are provided with a user-specified *d* × *d* symmetric positive definite matrix *ρ*, so that the square root of this matrix *ρ*^{1/2} exists. Our MSE term will represent the expectation of the squared Euclidean norm of $\rho^{1/2}(\widehat{\psi}-\psi )$, or equivalently, the expectation of $(\widehat{\psi}-\psi {)}^{\top}\rho (\widehat{\psi}-\psi )$. One concrete proposal is to set *ρ*^{1/2} equal to the square root of the inverse of an estimate of the correlation matrix of the asymptotic covariance matrix of $\sqrt{n}(\widehat{\Psi}-\psi )$, so that the linearly transformed vector has uncorrelated components.

Let *a _{j}*, *j* = 1, . . . , *d*, be the row vectors of *ρ*^{1/2}. We then define

$$MSE({P}_{n})(\delta )=\frac{1}{n}\sum _{j}{a}_{j}^{\top}\Sigma ({P}_{n})(\delta ){a}_{j}+\sum _{j}{\{{a}_{j}^{\top}B({P}_{n})(\delta )\}}^{2}.$$

This is equivalent to defining a variance term

$$\frac{{\sigma}^{2}({P}_{n})(\delta )}{n}=\frac{1}{n}\sum _{j}{a}_{j}^{\top}\Sigma ({P}_{n})(\delta ){a}_{j},$$

a bias term

$$b({P}_{n})(\delta )={\left(\sum _{j}{\{{a}_{j}^{\top}B({P}_{n})(\delta )\}}^{2}\right)}^{1/2},$$

and defining

$$MSE({P}_{n})(\delta )=\frac{{\sigma}^{2}({P}_{n})(\delta )}{n}+{\{b({P}_{n})(\delta )\}}^{2}.$$

Let *D*^{*}(*Q*_{0}, *g*_{0}, *ψ*_{0}) be the efficient influence curve at *dP*_{0} = *Q*_{0}*g*_{0} for the parameter Ψ : $\mathcal{M}$ → IR. Consider a set of candidate estimators ${\widehat{Q}}^{k}({P}_{n})$ of *Q*_{0}, indexed by *k*.

We can use as criterion for selection among the ${\widehat{Q}}^{k}$ the cross-validated estimated variance of ${D}^{*}({Q}_{{g}^{0}}^{k*},{g}^{0},\Psi ({Q}_{{g}^{0}}^{k*}))$, where ${Q}_{{g}^{0}}^{k*}$ denotes the limit of the targeted maximum likelihood estimator of *Q*_{0} that uses the collaborative estimator *ĝ*^{0} (with limit *g*^{0}) to carry out the fluctuation.

By our asymptotic linearity theorem, this selection among the ${\widehat{Q}}^{k}$ corresponds with minimizing the variance of the influence curve of the (collaborative) targeted maximum likelihood estimators $\Psi ({\widehat{Q}}_{{\widehat{g}}^{0}}^{k*})$ based on the initial estimator ${\widehat{Q}}^{k}$ and the collaborative estimator *ĝ*^{0}.

This criterion can now be used as the preferred loss function in our template for the collaborative targeted maximum likelihood estimator to build the candidate censoring mechanism estimators. It requires obtaining a single collaborative estimator *ĝ*^{0} after having obtained the initial estimator: for example, one could carry out a dimension reduction based on improved fits relative to the initial estimator, and estimate *g*_{0} with a super learner adjusting only for the selected variables.

We already discussed that the same criterion can be used to select among any set of candidate collaborative targeted maximum likelihood estimators $({\widehat{Q}}_{{\widehat{g}}^{k}}^{k*},{\widehat{g}}^{k})$, relying on candidate-specific collaborative estimators *ĝ ^{k}*:

$${k}_{n}=\text{arg}\underset{k}{\text{min}}{E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{D}^{*2}({\widehat{Q}}^{k*}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}^{k}({P}_{n,{B}_{n}}^{0}),\Psi ({\widehat{Q}}^{k*}({P}_{n}))).$$

However, the former criterion relies only on a single collaborative estimator *ĝ*^{0}, and is therefore particularly suitable for building the second stage within the collaborative targeted MLE template.

Consider an initial estimator $\widehat{Q}$. We wish to develop a criterion to select among candidate estimators of *Q*_{0} that are more nonparametric than $\widehat{Q}$, such as estimators using $\widehat{Q}$ as offset. A special application of the loss function presented in this subsection is that in which $\widehat{Q}$ is empty. Consider a sequence of increasingly nonparametric estimators *ĝ _{j}*, *j* = 1, . . . , *J*, of *g*_{0}, and the criterion

$$Q\to \sum _{j=1}^{J}w(j){\Vert {P}_{0}D*(Q,{\widehat{g}}_{j}({P}_{0}))\Vert}^{2},$$

for a list of weights/scalars (*w*(*j*) : *j* = 1, . . . , *J*), where ‖ · ‖ denotes a possibly weighted Euclidean norm applied to the vector of the form *P*_{0}*D*^{*}(*Q*, *g*).

Firstly, if *Q* = *Q*_{0}, then the criterion is minimized. In addition, let *j*_{0}(*Q*) be such that *P*_{0}*D*^{*}(*Q*, *ĝ _{j}*(*P*_{0})) = 0 for all *j* ≥ *j*_{0}(*Q*): the criterion thus rewards candidates *Q* for which this bias term is already removed by the less nonparametric estimators of *g*_{0}.

The cross-validated analogue of this criterion is given by

$${D}_{n}({\widehat{Q}}^{k})={E}_{{B}_{n}}\sum _{j=1}^{J}w(j){\Vert {P}_{n,{B}_{n}}^{1}{D}^{*}({\widehat{Q}}^{k}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}_{j}({P}_{n}))\Vert}^{2}.$$

If *ψ*_{0} is one dimensional, a sensible choice of weights is given by the inverse of the variance of the empirical means, so that noisy *j*-specific signals are downweighted:

$$w{(j)}^{-1}={P}_{n,{B}_{n}}^{1}{\{{D}^{*}({\widehat{Q}}^{k}({P}_{n,{B}_{n}}^{0}),{\widehat{g}}_{j}({P}_{n}))\}}^{2},$$

which makes the criterion *D _{n}*(${\widehat{Q}}^{k}$) unit free. Analogues of such weighting can be obtained for multi-dimensional *ψ*_{0}.
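A minimal numerical sketch of this weighted criterion; the (n × J) array of influence-curve values is an assumed input, since evaluating *D*^{*} requires the application-specific efficient influence curve.

```python
import numpy as np

def targeted_criterion(ic_values):
    """D_n = sum_j w(j) * (mean of column j)^2, with w(j)^{-1} the mean of
    squares of column j, so each j-specific signal is unit free."""
    means = ic_values.mean(axis=0)                 # P^1_n D*(Q, g_j), j = 1..J
    second_moments = (ic_values ** 2).mean(axis=0)
    weights = 1.0 / second_moments                 # downweight noisy signals
    return float(np.sum(weights * means ** 2))

ic = np.array([[1.0, 2.0],
               [-1.0, 2.0]])                       # two observations, J = 2
value = targeted_criterion(ic)          # first column contributes 0, second 1
rescaled = targeted_criterion(10.0 * ic)  # unit free: unchanged by rescaling
```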

One can add this criterion to a valid loss function such as the log-likelihood loss *L*(${\widehat{Q}}^{k}$), giving a more targeted loss function

$$L*({\widehat{Q}}^{k})=L({\widehat{Q}}^{k})+{D}_{n}({\widehat{Q}}^{k}).$$

The additional term *D _{n}* preserves the validity of the loss function *L*, while making the resulting selection more targeted towards the parameter of interest.

Suppose we observe *O* = (*W*, *A*, *Y* = *Y* (*A*)), where *W* are baseline covariates, *A* is a discrete treatment, and *Y* is a subsequently measured outcome. It is assumed that *A* is realized in response to the realization of *W*, and *Y* is realized in response to both *W* and *A*. The full data structure on the experimental unit is *X* = (*W*, (*Y* (*a*) : *a*)), so that *A* represents the missingness variable for the missing data structure *O* on *X*. We assume the randomization assumption: *g*_{0}(*a* | *X*) = *P*_{0}(*A* = *a* | *X*) = *P*_{0}(*A* = *a* | *W*).

Consider a marginal structural model for the full data distribution

$${E}_{0}(Y(a)|V)=m(a,V|{\beta}_{0})$$

that models the causal effect of a treatment intervention *A* = *a* on the outcome *Y*. For example, one might assume a simple linear model *m*(*a*, *V* | *β*_{0}) = ${\beta}_{0}^{\top}$(1, *a*, *V*, *aV*).

Since it is often unreasonable to assume such a parametric form, but such parametric forms can still provide very meaningful projections of the true causal curve, we consider the nonparametric extensions of the parameter *β*_{0}:

$${\Psi}_{h}({P}_{0})=\underset{\beta}{\text{argmin}}{E}_{{P}_{0}}\sum _{a}h(a,V){({Q}_{0}(a,W)-m(a,V|\beta ))}^{2},$$

where *Q*_{0}(*a*, *w*) = *E*_{0}(*Y* | *A* = *a*, *W* = *w*). We have that Ψ* _{h}*(*P*_{0}) can equivalently be represented as

$${\Psi}_{h}({P}_{0})=\underset{\beta}{\text{argmin}}{E}_{{P}_{0}}\sum _{a}h(a,V){(E(Y(a)|V)-m(a,V|\beta ))}^{2}.$$

In particular, if *E*_{0}(*Y* (*a*) | *V*) = *m*(*a*, *V* | *β*_{0}), then for each *h* we have Ψ* _{h}*(*P*_{0}) = *β*_{0}.

We note that this nonparametric extension only depends on *P*_{0} through the conditional mean of *Y*, given *A*, *W*, and the marginal distribution of *W*. For simplicity, we will also use the notation Ψ* _{h}*(*Q*_{0}), where *Q*_{0} denotes these two relevant factors of *P*_{0}.
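For intuition, the projection Ψ* _{h}*(*P*_{0}) can be computed by weighted least squares. The sketch below assumes *V* = *W*, *h* ≡ 1, binary *a* ∈ {0, 1}, the linear working model *m*(*a*, *V* | *β*) = *β*^{⊤}(1, *a*, *V*, *aV*), and a true regression that is itself exactly linear, so the projection recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = np.array([0.5, 1.0, -0.3, 0.2])

def Q0bar(a, w):
    """Assumed true outcome regression E_0(Y | A = a, W = w), exactly linear."""
    return beta_true @ np.array([1.0, a, w, a * w])

W = rng.normal(size=500)
rows, targets = [], []
for a in (0, 1):                    # sum over treatment values a
    for w in W:                     # empirical expectation over the W-distribution
        rows.append([1.0, a, w, a * w])
        targets.append(Q0bar(a, w))

# argmin_beta sum_a E_W h(a, V) (Q0bar(a, W) - m(a, V | beta))^2 with h = 1
beta_h, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
```

When *m* is misspecified, the same computation returns the *h*-weighted projection of the causal curve onto the working model, which is exactly the nonparametric extension above.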

The efficient estimating function for this nonparametric extension Ψ* _{h}* of *β*_{0} is given by

$$\begin{array}{ll}{D}_{h}({P}_{0})(O)=&\frac{{h}_{1}(A,V)}{{g}_{0}(A|W)}(Y-{Q}_{0}(A,W))\\ &+\sum _{a}{h}_{1}(a,V)({Q}_{0}(a,W)-m(a,V|{\Psi}_{h}({P}_{0}))),\end{array}$$

where ${h}_{1}(a,V)=h(a,V)\frac{d}{d\psi}m(a,V|\psi )$. We will assume that *h* is chosen so that *h*_{1} does not depend on *ψ*_{0}, which is easily arranged when *m* is linear or logistic linear in *ψ*. For example, suppose the covariate vector is (1, *a*, *V*, *aV*). If *m* is linear, then the choice *h*(*A*, *V*) = *g*(*A* | *V*) yields *h*_{1} = *d/dψ*_{0}*m*(*A*, *V* | *ψ*_{0})*g*(*A* | *V*) = (1, *A*, *V*, *AV*)*g*(*A* | *V*). If *m* is logistic linear, then the choice *h*(*A*, *V*) = *g*(*A* | *V*)*/*(*m*(1 – *m*)(*A*, *V* | *ψ*_{0})) yields *h*_{1} = (1, *A*, *V*, *AV*)*g*(*A* | *V*) again.

Let ${D}_{h}^{*}({P}_{0})=-{c}_{0}^{-1}{D}_{h}({P}_{0})$ be the corresponding efficient influence curve, obtained by standardizing the efficient estimating function by the negative of the inverse of the derivative matrix *c*_{0} = *d/dψ*_{0}*E*_{0}*D _{h}*(*P*_{0}).

If *Y* is continuous and we use a normal error regression model as a working model, then a targeted maximum likelihood estimator of *ψ _{h}*_{0} is obtained by fluctuating an initial regression estimator *Q*^{0} with the clever covariate

$${C}_{h}(g)(A,W)=\frac{{h}_{1}(A,V)}{g(A|W)},$$

for some fit *g* of *g*_{0}, and fitting *ε* with maximum likelihood (i.e., least squares estimation) using *Q*^{0} as offset. The resulting update *Q*^{1}(*A*, *W*) is a first step targeted maximum likelihood estimator. It is also the actual targeted maximum likelihood estimator, since iterating does not result in further updates (the clever covariate does not change at a targeted MLE update). One estimates the distribution of *W* with the empirical distribution. The estimate *Q*^{1} and the empirical distribution of *W* now yield a substitution estimate of the target parameter *ψ _{h}*_{0}.
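A one-dimensional sketch of this targeted step: an initial fit that ignores treatment is fluctuated with a clever covariate *C* = *h*_{1}(*A*, *V*)*/g*(*A* | *W*), fitting *ε* by least squares with the initial fit as offset; taking *h*_{1}(*A*, *V*) = *A* is a simplifying assumption rather than the paper's full choice.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
W = rng.normal(size=n)
g0 = 1.0 / (1.0 + np.exp(-W))            # treatment mechanism P(A = 1 | W)
A = rng.binomial(1, g0)
Y = 1.0 + A + 0.5 * W + rng.normal(scale=0.3, size=n)

Q0hat = 1.0 + 0.5 * W                    # initial fit, wrongly ignores A
C = A / g0                               # clever covariate with h1(A, V) = A
eps = float(C @ (Y - Q0hat) / (C @ C))   # least squares with Q0hat as offset
Q1hat = Q0hat + eps * C                  # first-step (and here final) TMLE update

score = float(np.mean(C * (Y - Q1hat)))  # the empirical score equation is solved
```

Because least squares makes the residual orthogonal to *C*, the corresponding component of the efficient influence curve has empirical mean zero after one update, which is why no further iteration is needed.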

Let $\widehat{Q}({P}_{n})$ be an initial regression estimator of *Q*_{0}, and let

$${\widehat{\Sigma}}_{CV}({P}_{n})(\widehat{g})={E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{D}_{h}^{*}{({\widehat{Q}}_{\widehat{g}}^{*}({P}_{n,{B}_{n}}^{0}),\widehat{g}({P}_{n,{B}_{n}}^{0}))}^{2}$$

be the cross-validated estimate of the covariance matrix of the efficient influence curve at the estimator ${\widehat{Q}}_{\widehat{g}}^{*}$ and *ĝ*. We also consider the empirical estimate of this covariance matrix

$$\widehat{\Sigma}({P}_{n})(\widehat{g})={P}_{n}{D}_{h}^{*}{({\widehat{Q}}_{\widehat{g}}^{*}({P}_{n}),\widehat{g}({P}_{n}))}^{2}.$$

Let

$$\widehat{B}({P}_{n})(\widehat{g})={E}_{{B}_{n}}{\widehat{\Psi}}_{\widehat{g}}({P}_{n,{B}_{n}}^{0})-{\widehat{\Psi}}_{\widehat{g}}({P}_{n})$$

be the bias estimator for the targeted maximum likelihood estimator
${\widehat{\Psi}}_{\widehat{g}}({P}_{n})={\Psi}_{h}({\widehat{Q}}_{\widehat{g}}^{*}({P}_{n}))$ obtained by plugging in
${\widehat{Q}}_{\widehat{g}}^{*}({P}_{n})$ in the parameter mapping Ψ. Here we can use three-fold cross-validation as choice for *B _{n}*.

We will penalize the log-likelihood with an estimate of the following average mean squared error:

$$\frac{1}{n}\sum _{i=1}^{n}E{\left(m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{n}))-m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{0}))\right)}^{2}.$$

This mean squared error can be decomposed as $1/n{\sum}_{i=1}^{n}\text{Var}(m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{n})))$ plus $1/n{\sum}_{i=1}^{n}{\text{Bias}}^{2}(m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{n})))$. The variance terms of this mean squared error can be estimated by

$$\frac{{\sigma}_{i}^{2}(\widehat{g})}{n}=\frac{z{({a}_{i},{v}_{i})}^{\top}\widehat{\Sigma}({P}_{n})(\widehat{g})z({a}_{i},{v}_{i})}{n},$$

where

$${z({a}_{i},{v}_{i})=\frac{d}{d\beta}m({a}_{i},{v}_{i}|\beta )|}_{\beta ={\widehat{\Psi}}_{\widehat{g}}({P}_{n})}.$$

We keep open the option that one uses either the cross-validated covariance matrix ${\widehat{\Sigma}}_{CV}({P}_{n})(\widehat{g})$ or the empirical covariance matrix $\widehat{\Sigma}({P}_{n})(\widehat{g})$ in this variance estimate.

The bias terms of this mean squared error can be estimated as

$${B}_{i}(\widehat{g})={E}_{{B}_{n}}m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{n,{B}_{n}}^{0}))-m({a}_{i},{v}_{i}|{\widehat{\Psi}}_{\widehat{g}}({P}_{n})).$$

If *m* is linear in *β*, then this reduces to

$${B}_{i}(\widehat{g})=m({a}_{i},{v}_{i}|\widehat{B}({P}_{n})(\widehat{g})).$$

Thus, we obtain the following mean squared error estimate for the targeted maximum likelihood estimator ${\widehat{\Psi}}_{\widehat{g}}({P}_{n})$:

$$\widehat{MSE}({P}_{n})(\widehat{g})=\frac{1}{n}\sum _{i=1}^{n}\left\{\frac{{\sigma}_{i}^{2}(\widehat{g})}{n}+{B}_{i}{(\widehat{g})}^{2}\right\}.$$

Alternatively, the log-likelihood could be penalized by only the variance component of the MSE. Therefore, we also define

$${\sigma}^{2}({P}_{n})(\widehat{g})=\frac{1}{n}\sum _{i=1}^{n}\frac{{\sigma}_{i}^{2}(\widehat{g})}{n}.$$

Consider now the following two penalized log-likelihood criteria for *ĝ*, given the initial estimator ${\widehat{Q}}^{0}$:

$$L(\widehat{g}|{\widehat{Q}}^{0})=\frac{1}{n}\sum _{i=1}^{n}{({Y}_{i}-{\widehat{Q}}_{\widehat{g}}^{*}({P}_{n})({W}_{i},{A}_{i}))}^{2}+\widehat{MSE}({P}_{n})(\widehat{g}),$$

or

$$L(\widehat{g}|{\widehat{Q}}^{0})=\frac{1}{n}\sum _{i=1}^{n}{({Y}_{i}-{\widehat{Q}}_{\widehat{g}}^{*}({P}_{n})({W}_{i},{A}_{i}))}^{2}+{\sigma}^{2}({P}_{n})(\widehat{g}).$$

For *Y* binary, the RSS is replaced by the log-likelihood of *Y*, given *A*, *W*.

We will use the penalized log-likelihood as loss function to build the candidate treatment mechanism estimators, according to the template for the collaborative targeted MLE in Section 2.

Given any candidate adjustment set *W*^{*} ⊂ *W*, let an estimator *ĝ*(*P _{n}*)(*W*^{*}) of the treatment mechanism adjusting only for *W*^{*} be given, and define

$$L(W*|\widehat{Q})=L(\widehat{g}({P}_{n})(W*)|\widehat{Q}).$$

Thus, one can evaluate/score any given adjustment set *W*^{*} with *L*(*W*^{*} | $\widehat{Q}$).

Given $\widehat{Q}$, one can now use this empirical criterion on adjustment sets to construct an estimator of *g*_{0} with a greedy type algorithm maximizing over a set of candidate adjustment sets. One starts with the empty adjustment set and selects the best addition move among a set of candidate addition moves based on the criterion. One iterates this process until there no longer exists an addition move that improves the criterion. More aggressive greedy algorithms can be implemented as well, as with any machine learning algorithm that is based on iterative local maximization of an empirical criterion. One could also apply this algorithm to candidate adjustment sets that have a certain size or entropy for the corresponding *ĝ*(*P _{n}*)(*W*^{*}).
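The greedy search structure can be sketched as below. The scoring function used here (residual sum of squares of regressing *A* on the candidate set, plus a flat penalty of 1.0 per covariate) is a deliberately crude stand-in for the penalized log-likelihood *L*(*W*^{*} | $\widehat{Q}$); only the search logic is the point.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = {name: rng.normal(size=n) for name in ("W1", "W2", "W3")}
A = 2.0 * X["W1"] + rng.normal(scale=0.05, size=n)   # A depends on W1 only

def score(subset):
    """Lower is better: RSS of regressing A on the subset, plus a size penalty.
    A stand-in criterion, not the paper's penalized log-likelihood."""
    D = np.column_stack([np.ones(n)] + [X[name] for name in subset])
    coef, *_ = np.linalg.lstsq(D, A, rcond=None)
    return float(np.sum((A - D @ coef) ** 2)) + 1.0 * len(subset)

selected = []
current = score(selected)
while True:
    moves = [c for c in X if c not in selected]      # candidate addition moves
    if not moves:
        break
    best = min(moves, key=lambda c: score(selected + [c]))
    if score(selected + [best]) >= current:
        break                          # no addition move improves the criterion
    selected.append(best)
    current = score(selected)
```

Here the search adds the genuinely predictive covariate and then stops, since the remaining covariates cannot buy back the per-covariate penalty.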

Alternatively, one creates a sequence of nested (increasing in size) adjustment sets ${W}_{j}^{*}$, *j* = 1, . . . , *J*; for each ${W}_{j}^{*}$ one obtains a particular estimator *ĝ _{j}*(*P _{n}*) = *ĝ*(*P _{n}*)(${W}_{j}^{*}$), and one selects among these candidate estimators with the criterion.

In our algorithm in the next subsection defining the sequence of C-TMLEs we apply this greedy algorithm to candidate estimators that are more nonparametric than the selected estimator of *g*_{0} in the previous step.

Given an initial estimator $\widehat{Q}$ of *E*(*Y* | *A*, *W*) and a corresponding estimator *ĝ*($\widehat{Q}$) defined above, sometimes denoted with *ĝ*, we define a resulting targeted maximum likelihood estimator

$${\widehat{Q}}_{g}^{*}({P}_{n})=\widehat{Q}({P}_{n})+{\epsilon}_{n}h(\widehat{g}(\widehat{Q})({P}_{n})),$$

where *ε _{n}* is the least squares estimator of the regression coefficient in front of the clever covariate *h*(*ĝ*($\widehat{Q}$)(*P _{n}*)), using $\widehat{Q}({P}_{n})$ as offset. We denote this first update with

$${\widehat{Q}}^{1}({P}_{n})=\widehat{Q}({P}_{n})+{\epsilon}_{h}^{1}h(\widehat{g}(\widehat{Q})({P}_{n})).$$

This process can now be iterated by replacing $\widehat{Q}({P}_{n})$ with this update ${\widehat{Q}}^{1}({P}_{n})$:

$${\widehat{Q}}^{2}({P}_{n})={\widehat{Q}}^{1}({P}_{n})+{\epsilon}_{h}^{2}h(\widehat{g}({\widehat{Q}}^{1})({P}_{n})),$$

where we require that the next censoring mechanism estimator *ĝ*(${\widehat{Q}}^{1}$)(*P _{n}*) is obtained with the same algorithm as presented in the above subsection, but now maximizing over candidate estimators that are more nonparametric than the previously obtained *ĝ*($\widehat{Q}$)(*P _{n}*).

In general, we define the *k*-th step of this targeted maximum likelihood estimator as

$${\widehat{Q}}^{k}({P}_{n})={\widehat{Q}}^{k-1}({P}_{n})+{\epsilon}_{h}^{k}h(\widehat{g}({\widehat{Q}}^{k-1})({P}_{n})),$$

where *ĝ*(${\widehat{Q}}^{k-1}$)(*P _{n}*) involves maximizing over more nonparametric candidate estimators than the previously selected *ĝ*(${\widehat{Q}}^{k-2}$)(*P _{n}*).

This algorithm results in a sequence of *k*-th step collaborative targeted maximum likelihood estimators Ψ(${\widehat{Q}}^{k}$(*P _{n}*)), *k* = 1, 2, . . ..

We could also have defined candidate targeted maximum likelihood estimators using a forward selection algorithm, each time finding the best next term to add in the treatment mechanism, so that *k* denotes the number of terms included in *ĝ ^{k}*. In that case, each next censoring mechanism estimator is more nonparametric than the previous one by construction.

If the initial estimator is indexed by a choice *δ*_{1} and the choice of algorithm *ĝ*() is indexed by a *δ*_{2}, then, for each *δ*_{1}, *δ*_{2}, this results in candidate *k*-th step collaborative targeted maximum likelihood estimators
${P}_{n}\to {\widehat{Q}}_{{\delta}_{1},{\delta}_{2}}^{k}({P}_{n})$, corresponding treatment mechanism estimators
${P}_{n}\to {\widehat{g}}_{{\delta}_{2}}^{k}({P}_{n})$, and corresponding targeted maximum likelihood estimators
${P}_{n}\to \Psi ({\widehat{Q}}_{{\delta}_{1},{\delta}_{2}}^{k}({P}_{n}))$ of *ψ*_{0}, indexed by *k*.

For each *δ*_{1}, *δ*_{2}, in order to select among these candidate targeted maximum likelihood estimators indexed by *k*, we use the cross-validated penalized log-likelihood defined as:

$$L(k,{\delta}_{1},{\delta}_{2})={E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}{\left(Y-{\widehat{Q}}_{{\delta}_{1},{\delta}_{2}}^{k}({P}_{n,{B}_{n}}^{0})(W,A)\right)}^{2}+{\widehat{MSE}}_{CV}({P}_{n})({\widehat{Q}}_{{\delta}_{1},{\delta}_{2}}^{k},{\widehat{g}}_{{\delta}_{2}}^{k}).$$

This now results in candidate (*δ*_{1}, *δ*_{2})-specific collaborative targeted maximum likelihood estimators ${\widehat{Q}}_{{\delta}_{1},{\delta}_{2}}^{*}$, with corresponding initial estimator ${\widehat{Q}}_{{\delta}_{1}}$ and collaborative treatment mechanism estimator ${\widehat{g}}_{{\delta}_{1},{\delta}_{2}}$ (note that the choice of *k* is now a function of *δ*_{1}, *δ*_{2}, so that the collaborative estimator *ĝ* is also indexed by these choices).

We could select *δ*_{1}, *δ*_{2} by minimizing the same cross-validated penalized log-likelihood, e.g., by simply simultaneously minimizing the above criterion over the triplets (*k*, *δ*_{1}, *δ*_{2}). Alternatively, we could employ empirical efficiency maximization for all these candidate collaborative targeted maximum likelihood estimators that are assumed to be asymptotically linear with influence curve the efficient influence curve plus a contribution from the collaborative estimator of *g*_{0}, as stated in our asymptotic linearity theorem. Thus, by ignoring this latter contribution to the influence curve, we could also select *δ*_{1}, *δ*_{2} as the minimizer of the sum of the variances of the components of the efficient influence curve of *ψ*_{0}: (with *δ* = (*δ*_{1}, *δ*_{2}))

$${\delta}_{n}=\text{arg}\underset{\delta}{\text{min}}\sum _{j=1}^{d}{P}_{n}{D}_{j}^{*}{({\widehat{Q}}_{\delta}^{*},{\widehat{g}}_{\delta})}^{2}.$$

Other criteria based on the vector-efficient influence curve could be considered as well.
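The variance-minimizing selection over (*δ*_{1}, *δ*_{2}) amounts to comparing empirical second moments of the influence-curve components; the (n × d) arrays of influence-curve values per candidate are synthetic placeholders here.

```python
import numpy as np

rng = np.random.default_rng(6)

def select_candidate(ic_list):
    """argmin over candidates of sum_j P_n D*_j^2 (summed empirical second
    moments of the d efficient-influence-curve components)."""
    crit = [float(np.sum(np.mean(ic ** 2, axis=0))) for ic in ic_list]
    return int(np.argmin(crit))

n, d = 500, 2
ic_noisy = rng.normal(scale=2.0, size=(n, d))       # high-variance candidate
ic_efficient = rng.normal(scale=0.5, size=(n, d))   # low-variance candidate
chosen = select_candidate([ic_noisy, ic_efficient])
```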

The resulting collaborative targeted maximum likelihood estimator *Q _{n}* = ${\widehat{Q}}^{*}({P}_{n})$ of *Q*_{0}, with corresponding treatment mechanism estimator *g _{n}* = *ĝ*^{*}(*P _{n}*), yields the substitution estimator Ψ(*Q _{n}*) of *ψ*_{0}, whose covariance matrix can be estimated with

$${\Sigma}_{n}={E}_{{B}_{n}}{P}_{n,{B}_{n}}^{1}D*{(\widehat{Q}*({P}_{n,{B}_{n}}^{0}),\widehat{g}*({P}_{n,{B}_{n}}^{0}))}^{2},$$

where one should include the *g _{n}*-component to obtain more accurate inference. Statistical inference would be based on the normal working model $\Psi ({Q}_{n})\approx N({\psi}_{0},{\Sigma}_{n}/n)$.
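The resulting Wald-type inference is sketched below for a one-dimensional parameter; the point estimate and the influence-curve values are synthetic stand-ins for the C-TMLE output.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
ic = rng.normal(size=n)              # influence-curve values D*(O_i) (stand-in)
psi_n = 1.3                          # C-TMLE point estimate (stand-in)

sigma2_n = float(np.mean(ic ** 2))   # estimate of Sigma_n
se = float(np.sqrt(sigma2_n / n))
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)   # 95% Wald confidence interval
```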

For most data sets little to no knowledge is available about the data generating distribution. Consequently, the true model is a large infinite dimensional semi-parametric model. In such models many data adaptive approaches can be considered for fitting the true distribution of the data, based on different approximation function spaces, different searching strategies for maximizing an empirical criterion (such as the empirical log-likelihood) over these spaces, and different methods for selecting the fine tuning parameters indexing the function spaces and search strategies. Depending on the true data generating distribution, these algorithms will have very different levels of performance in approximating the true data generating distribution. As a consequence, cross-validation based super learning should be employed to find the best weighted combination among a large user supplied set of candidate estimators of the true data generating distribution. The user can select the desired loss function. The oracle property of the cross-validation selector (van der Vaart et al. (2006), van der Laan et al. (2006)) teaches us that the super learner will asymptotically perform exactly as well, w.r.t. the loss-based (e.g., Kullback-Leibler) dissimilarity measure, as the best weighted combination of the candidate algorithms optimized for each data set. A crucial assumption is that the loss function used in this super learning methodology is uniformly bounded in all the candidate estimators. This could be viewed as a semi-parametric model assumption, and it is essential for robustness of the resulting estimator against outliers. It can be arranged by bounding the candidate estimators so that the loss function remains bounded.

The super learner fit represents a best fit for the purpose of estimation of the whole distribution of the data w.r.t. the loss-function specific dissimilarity, so that the bias-variance trade-off is not targeted (enough) w.r.t. the parameter of interest. Overall, the resulting estimate is overly biased for the smooth target parameter.

Therefore, our methodology involves a second targeted modification of the first stage super learner fit that aims to reduce the bias w.r.t. the target parameter, while simultaneously increasing the likelihood fit. A single fluctuation function is determined that would yield asymptotically optimal bias reduction, as defined by the efficient influence curve of the target parameter. This fluctuation function needs to have a score-vector at zero fluctuation whose linear span includes the efficient influence curve of the target parameter. This fluctuation function depends on an unknown nuisance parameter of the data generating distribution, such as a censoring mechanism.

In one particular embodiment of our C-TMLE, we define an iterative sequence of subsequent fluctuations, starting with the initial super learner fit, where the subsequent fluctuation functions are estimated with increasingly nonparametric estimates of the nuisance parameter, including a final fully non-parametrically estimated fluctuation function. In addition, by construction, we make sure that for each fluctuation function the nuisance parameter estimator that results in maximal increase in likelihood (or preferred loss function) fit is selected, among the candidate nuisance parameter estimators that are more nonparametric than the one selected at previous fluctuation function. In this way, we arrange that most of the targeted bias reduction occurs in the first few fluctuations. The actual number of times we carry out the subsequent update is selected with likelihood based cross-validation.

Essentially, we try to move towards the asymptotically optimal bias reduction (identified by the true nuisance parameter/censoring mechanism used in the data generating distribution) along a sequence of targeted bias reduction steps, but we stop moving towards this asymptotically optimal bias reduction when it results in a loss of likelihood fit as measured by the cross-validated log-likelihood. In general, our template for C-TMLE allows a fine sequence of nested targeted bias reduction steps (i.e., a fine sequence of candidate second stage estimators indexed by increasingly nonparametric nuisance parameter estimators) whose fits contain this set of candidate-fits as a subsequence, thereby potentially providing an additional improvement in practical performance of the resulting C-TMLE.

Theoretical results teach us that this push towards the asymptotically optimal bias reduction takes into account how well the initial estimator approximates the true distribution, by giving preference to targeted bias reduction steps that improve the log-likelihood fit using the initial estimator as offset. As a consequence, the C-TMLE is able to avoid selecting irrelevant or harmful (w.r.t. the relevant factor of the density) fits of the nuisance parameter, even though such fits might improve the overall fit of the nuisance parameter. That is, the fit of the nuisance parameter is targeted towards our primary goal, the parameter of interest.

Specifically, we establish a general asymptotic linearity theorem for collaborative targeted maximum likelihood estimators, which provides us with the influence curve of our estimator, and thereby statistical inference. Fortunately, the influence curve happens to be equal to the efficient influence curve at the collaborative nuisance parameter estimator (its limit) plus a contribution of the nuisance parameter estimator as an estimator of its limit, providing immediate variance estimates. Gains in efficiency, resulting in possible super efficiency, are obtained by estimating the nuisance parameter collaboratively, in relation to the initial estimator. An inspection of the efficient influence curve allows us to precisely define the sufficient and minimal term *H*(*g*_{0}, *Q*–*Q*_{0}) one needs to adjust for in *g*_{0} to achieve the desired full bias reduction. We propose to estimate this term and force adjustment by this term in the candidate censoring mechanism estimators within the C-TMLE procedure, without relying on its correct specification, thereby preserving and only enhancing the double robustness of the C-TMLE.

The targeted maximum likelihood estimator itself relies on maximizing an empirical mean of a loss function over a fluctuation function, and the derivative of this empirical criterion at zero fluctuation needs to be the empirical mean of the efficient influence curve at the estimator to be fluctuated. We refer to this loss as the log-likelihood loss, but it needs to be understood that this choice can already be targeted (e.g., van der Laan (2008b)) in the sense that it is typically implied by the efficient influence curve of the target parameter (e.g., one would not use a likelihood based on factors that are not needed to evaluate the target parameter). In addition, we propose to replace this log-likelihood by a penalized log-likelihood, where the penalty is scaled appropriately, has negligible contribution for nicely identifiable target parameters, but blows up for fits that result in extremely variable or biased estimators of the parameter of interest. Even though the penalty's effect on the Kullback-Leibler dissimilarity is asymptotically negligible for identifiable parameters, for parameters that are borderline identifiable, this penalty can yield dramatic additional finite sample improvements to the C-TMLE. In essence, it builds in a sensible robustness of the resulting C-TMLE as an estimator of the target parameter.

Given the initial estimator, the candidate censoring mechanism estimators, and the corresponding sequence of targeted MLEs indexed by these increasingly nonparametric estimators of the censoring mechanism, the log-likelihood or penalized log-likelihood is used for cross-validated selection of the depth of the bias reduction (i.e., for selection among these candidate targeted MLEs) in the C-TMLE algorithm. However, a preferred loss function can be used to build the initial estimator, to build the candidate censoring mechanism estimators, and to select among different C-TMLEs. In particular, one could use the penalized log-likelihood as the preferred loss function.
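The cross-validated selection of the depth of bias reduction might be sketched as follows, with the candidate targeted MLEs represented abstractly as fitting procedures; the function and argument names are hypothetical, and the loss would be the (penalized) log-likelihood described above.

```python
import numpy as np

def cv_select(candidate_fits, loss, data, v=5, seed=1):
    """Hedged sketch of the C-TMLE selection step: given a list of
    candidate targeted-MLE fitting procedures (ordered by increasingly
    deep bias reduction), pick the one minimizing V-fold
    cross-validated empirical risk.

    candidate_fits: list of functions, train_data -> fitted object
    loss: function (fitted object, validation_data) -> scalar loss
    """
    n = len(data)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, v, size=n)  # random fold labels
    cv_risk = []
    for fit in candidate_fits:
        risks = []
        for fold in range(v):
            train = data[folds != fold]
            valid = data[folds == fold]
            fitted = fit(train)
            risks.append(loss(fitted, valid))
        cv_risk.append(float(np.mean(risks)))
    return int(np.argmin(cv_risk)), cv_risk
```

The oracle results for the cross-validation selector cited in the article are what justify selecting among the candidates by this empirical criterion.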

We propose as a possibly more targeted selection criterion the cross-validated variance of the efficient influence curve, i.e., the cross-validated mean of its square (e.g., if the target parameter is one-dimensional), where a collaborative estimator is used for the nuisance parameter (censoring mechanism) in the efficient influence curve: that is, we propose as loss function for a candidate *Q* the square of the efficient influence curve at its targeted version, using a collaborative estimator of *g*_{0}. Indeed,
${E}_{0}{D}^{*2}({Q}^{*}_{{g}_{0}},{g}_{0},\Psi({Q}^{*}_{{g}_{0}})) = {E}_{0}{D}^{*2}({Q}^{*}_{{g}_{0}},{g}_{0},{\psi}_{0})$ is the variance of a collaborative double robust estimator of *ψ*_{0}, indexed by the initial estimator *Q* and using the known *g*_{0}, and is thereby a valid and sensible loss function. In order to also take into account that *g*_{0} is estimated, one could instead minimize the variance of the actual influence curve of the collaborative double robust targeted MLE, using *Q* as initial estimator and *g _{n}* as collaborative estimator. This would imply the same square of influence curve, but now the influence curve equals
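In the missing-outcome example, this squared-efficient-influence-curve loss for a targeted fit *Q** could be computed as below; an illustrative sketch with hypothetical names, not the authors' code.

```python
import numpy as np

def eic_squared_loss(y, a, q_star_pred, g_pred):
    """Hedged sketch: the square of the efficient influence curve of
    psi = E[Q(W)] in the missing-outcome example, evaluated at a
    targeted fit Q* (passed as predictions q_star_pred = Q*(W)) and a
    collaborative estimate g_pred of the missingness mechanism.
    The empirical mean of this loss estimates the variance of the
    corresponding TMLE."""
    psi = np.mean(q_star_pred)
    d_star = a / g_pred * (y - q_star_pred) + q_star_pred - psi
    return d_star ** 2  # per-observation loss; average to get risk
```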

Finally, we are also able to select a targeted loss function for *g*_{0} in the C-TMLE template, by making its loss-based dissimilarity a measure of goodness of fit of the *g*_{0}-specific fluctuation function in which the estimator of *g*_{0} is used.

The collaborative double robust targeted maximum likelihood estimator utilizes 1) loss-based super learning to obtain the best initial estimator of *Q*_{0} (grounded by theory for the cross-validation selector), 2) loss-based super learning to generate the best estimators of candidate censoring mechanisms adjusting for increasingly large adjustment sets (grounded by theory for the cross-validation selector), 3) specification of loss functions that target the *Q*_{0}, *g*_{0} needed for identification of the efficient influence curve, 4) targeted maximum likelihood estimation along a fluctuation function using such a censoring mechanism estimator to remove bias for the target parameter (grounded by theory for targeted maximum likelihood estimation), 5) *Q*_{0}-(penalized)-likelihood based cross-validation to select among such candidate censoring mechanism estimators, and thereby obtain the collaborative estimator of the censoring mechanism for the fluctuation function that removes the bias while avoiding an unnecessary increase in variance (grounded by oracle results for cross-validation, collaborative double robustness, and our asymptotic linearity theorem), 6) efficient influence curve based dimension reductions (the minimal sufficient covariate) to be included in the censoring mechanism estimators, which allow for effective bias reduction in the TMLE if correctly specified (and will not harm the collaborative double robustness if not) (grounded by theory on the efficient influence curve representation and our collaborative double robustness theorem), 7) efficient influence curve based loss functions for *Q*_{0} to build more targeted candidate censoring mechanism estimators, and 8) efficient influence curve based loss functions for *Q*_{0} to make the initial estimator more targeted, and to make the selection among candidate C-TMLEs more targeted, resulting in smaller asymptotic variance of the resulting C-TML estimator of the target parameter (grounded by empirical efficiency maximization theory, and our asymptotic linearity theorem for C-TMLE).
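The recipe enumerated above can be sketched end-to-end in the point-treatment missing-outcome example. This is a heavily simplified illustration under our own assumptions, not the authors' implementation: super learning is replaced by user-supplied candidate fits, the cross-validated selection step is collapsed to an empirical-risk comparison, and all names are hypothetical.

```python
import numpy as np

def _expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def _logit(p):
    return np.log(p / (1.0 - p))

def tmle_update(y, a, q_init, g_pred, tol=1e-8):
    """Hedged sketch of one targeted-MLE fluctuation for psi = E[Q(W)]
    with binary Y: fluctuate logit(Q) along the clever covariate
    H = A / g and solve the score equation for eps by Newton steps."""
    h = a / g_pred
    eps = 0.0
    for _ in range(100):
        q = _expit(_logit(np.clip(q_init, 1e-9, 1 - 1e-9)) + eps * h)
        score = np.mean(h * (y - q))            # EIC score component
        info = np.mean(h ** 2 * q * (1.0 - q))  # minus its derivative
        if info < tol or abs(score) < tol:
            break
        eps += score / info
    return q  # targeted predictions Q*(W)

def ctmle(y, a, w, q_init, g_candidates, loss):
    """Hedged sketch of the C-TMLE loop: for each candidate censoring
    mechanism estimator (increasingly nonparametric), form the targeted
    MLE and keep the candidate minimizing the supplied (penalized)
    log-likelihood loss; cross-validation is omitted for brevity."""
    best = None
    for g_fit in g_candidates:
        g_pred = np.clip(g_fit(w, a), 0.01, 1.0)  # bound away from 0
        q_star = tmle_update(y, a, q_init, g_pred)
        risk = loss(y, a, q_star, g_pred)
        if best is None or risk < best[0]:
            best = (risk, q_star)
    return float(np.mean(best[1]))  # plug-in estimate psi_n
```

The clipping of `g_pred` away from zero is one crude stand-in for the robustness that the penalized likelihood is designed to supply in a principled way.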

To summarize, this article provides a template for loss-based adaptive (super) efficient estimators in semiparametric models that are targeted towards a particular target feature of the distribution of the data, and for which statistical inference is available.

^{*}We would like to thank the referees for their helpful comments. This research was supported by NIH grant R01 A1074345-03.

- Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer-Verlag; New York: 1993.
- Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Statistics in Medicine. 2008. http://www3.interscience.wiley.com/journal/121422393/abstract [PubMed]
- Bembom O, van der Laan MJ, Haight T, Tager IB. Lifetime and current leisure time physical activity and all-cause mortality in an elderly cohort. Epidemiology. 2009 doi: 10.1097/EDE.0b013e31819e3f28. [PubMed] [Cross Ref]
- Bembom O, van der Laan MJ. Statistical methods for analyzing sequentially randomized trials, commentary on JNCI article “Adaptive therapy for androgen independent prostate cancer: A randomized selection trial including four regimens,” by Peter F. Thall, C. Logothetis, C. Pagliaro, S. Wen, M.A. Brown, D. Williams, R. Millikan. Journal of the National Cancer Institute. 2007;99(21):1577–1582. doi: 10.1093/jnci/djm185. [PMC free article] [PubMed] [Cross Ref]
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997.
- Bryan J, Yu Z, van der Laan MJ. Analysis of longitudinal marginal structural models. Biostatistics. 2003;5(3):361–380. doi: 10.1093/biostatistics/kxg041. [PubMed] [Cross Ref]
- Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology. 2005;2(2):131–154. doi: 10.1016/j.stamet.2005.02.003. [Cross Ref]
- Gill R, Robins JM. Causal inference in complex longitudinal studies: continuous case. Annals of Statistics. 2001;29(6)
- Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counter-examples. In: Lin DY, Fleming TR, editors. Proceedings of the First Seattle Symposium in Biostatistics. New York: Springer Verlag; 1997. pp. 255–94.
- Heitjan DF, Rubin DB. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. doi: 10.1214/aos/1176348396. [Cross Ref]
- Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [PubMed] [Cross Ref]
- Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. doi: 10.1214/aos/1176324622. [Cross Ref]
- Kang J, Schafer J. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science. 2007a;22:523–39. doi: 10.1214/07-STS227. [PMC free article] [PubMed] [Cross Ref]
- Kang J, Schafer J. Rejoinder: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007b;22:574–80. doi: 10.1214/07-STS227REJ. [PMC free article] [PubMed] [Cross Ref]
- Keleş S, van der Laan M, Dudoit S. Asymptotically optimal model selection method for regression on censored outcomes. Technical Report, Division of Biostatistics, UC Berkeley. 2002
- Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes. Technical report 215, Division of Biostatistics, University of California; Berkeley: April 2007.
- Moore KL, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009.
- Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B. 2003;65(2)
- Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic treatment regimens. Journal of the American Statistical Association. 2001;96:1410–1424. doi: 10.1198/016214501753382327. [PMC free article] [PubMed] [Cross Ref]
- Neugebauer R, van der Laan MJ. Why prefer double robust estimates. Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [Cross Ref]
- Petersen ML, Deeks SG, Martin JN, van der Laan MJ. History-adjusted marginal structural models: Time-varying effect modification and dynamic treatment regimens. Technical report 199, Division of Biostatistics, University of California; Berkeley: December 2005.
- Polley EC, van der Laan MJ. Predicting optimal treatment assignment based on prognostic factors in cancer patients. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009.
- Ridgeway G, McCaffrey D. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science. 2007;22:540–43. doi: 10.1214/07-STS227C. [PMC free article] [PubMed] [Cross Ref]
- Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi: 10.1002/sim.3301. [PubMed] [Cross Ref]
- Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer” Statistica Sinica. 2001a;11(4):920–936.
- Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000a;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [PubMed] [Cross Ref]
- Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000b;450:431–435. doi: 10.2307/2669381. [Cross Ref]
- Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science. 2007;22:544–559. doi: 10.1214/07-STS227D. [Cross Ref]
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association; 2000a.
- Robins JM. Discussion of “optimal dynamic treatment regimes” by susan a. murphy. Journal of the Royal Statistical Society: Series B. 2003;65(2):355–366.
- Robins JM. Optimal structural nested models for optimal sequential decisions. Heagerty PJ, Lin DY, editors. Proceedings of the 2nd Seattle symposium in biostatistics. 2005a;179:189–326.
- Robins JM. Optimal structural nested models for optimal sequential decisions. Technical report, Department of Biostatistics, Harvard University; 2005b.
- Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [Cross Ref]
- Robins JM. The analysis of randomized and non-randomized aids treatment trials using a new approach in causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington D.C.: 1989. pp. 113–159.
- Robins JM. Proceeding of the Biopharmaceutical section. American Statistical Association; 1993. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers; pp. 24–33.
- Robins JM. Causal inference from complex longitudinal data. In: Berkane Editor M., editor. Latent Variable Modeling and Applications to Causality. Springer Verlag; New York: 1997a. pp. 69–117.
- Robins JM. Structural nested failure time models. In: Armitage P, Colton T, Andersen PK, Keiding N, editors. The Encyclopedia of Biostatistics. John Wiley and Sons; Chichester, UK: 1997b.
- Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133.
- Robins JM, Rotnitzky A. Comment on Inference for semiparametric models: some questions and an answer, by Bickel, P.J. and Kwon, J. Statistica Sinica. 2001b;11:920–935.
- Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology, Methodological Issues. Birkhäuser; 1992.
- Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. The International Journal of Biostatistics. 2008;4(1). http://www.bepress.com/ijb/vol4/iss1/19/ doi: 10.2202/1557-4679.1115. [PMC free article] [PubMed] [Cross Ref]
- Rose S, van der Laan MJ. Why match? Investigating matched case-control study designs with causal effect estimation. The International Journal of Biostatistics. 2009;5(1). http://www.bepress.com/ijb/vol5/iss1/1/ doi: 10.2202/1557-4679.1127. [PMC free article] [PubMed] [Cross Ref]
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [Cross Ref]
- Rosenblum M, Deeks SG, van der Laan MJ, Bangsberg DR. The risk of virologic failure decreases with duration of HIV suppression, at greater than 50% adherence to antiretroviral therapy. PLoS ONE. 2009;4(9):e7196. doi: 10.1371/journal.pone.0007196. [PMC free article] [PubMed] [Cross Ref]
- Rubin DB. Matched Sampling for Causal Effects. Cambridge University Press; Cambridge, MA: 2006.
- Rubin DB, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics. 2008;4(1):Article 5. doi: 10.2202/1557-4679.1084. [PMC free article] [PubMed] [Cross Ref]
- Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, Forthcoming. 2008
- Sinisi S, van der Laan MJ. The deletion/substitution/addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1)
- Tan Z. Comment: Understanding or, ps and dr. Statistical Science. 2007;22:560–568. doi: 10.1214/07-STS227A. [Cross Ref]
- Tsiatis A, Davidian M. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science. 2007;22:569–73. doi: 10.1214/07-STS227B. [PMC free article] [PubMed] [Cross Ref]
- Tuglus C, van der Laan MJ. Targeted methods for biomarker discovery, the search for a standard. UC Berkeley Working Paper Series. 2008. http://www.bepress.com/ucbbiostat/paper233/
- van der Laan MJ. Causal effect models for intention to treat and realistic individualized treatment rules Technical report 203, Division of Biostatistics, University of California; Berkeley: 2006
- van der Laan MJ. Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics. 2008a;4(1). http://www.bepress.com/ijb/vol4/iss1/17/ doi: 10.2202/1557-4679.1114. [PubMed] [Cross Ref]
- van der Laan MJ. The construction and analysis of adaptive group sequential designs. Technical report 232, Division of Biostatistics, University of California; Berkeley: March 2008b.
- van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California; Berkeley: November 2003.
- van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics. 2007;3(1) [PMC free article] [PubMed]
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [Cross Ref]
- van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1036. [PubMed] [Cross Ref]
- van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics. 2005;1(1):10–20. doi: 10.2202/1557-4679.1003. [Cross Ref]
- van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. doi: 10.1524/stnd.2006.24.3.373. [Cross Ref]
- van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25). doi: 10.2202/1544-6115.1309. [PubMed] [Cross Ref]
- van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation. Technical report, working paper series, http://www.bepress.com/ucbbiostat/paper254/; September 2009.
- van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996.
- van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. doi: 10.1524/stnd.2006.24.3.351. [Cross Ref]
- Yu Z, van der Laan MJ. Construction of counterfactuals and the G-computation formula Technical report, Division of Biostatistics, University of California; Berkeley: 2002
- Yu Z, van der Laan MJ. Double robust estimation in longitudinal marginal structural models Technical report, Division of Biostatistics, University of California; Berkeley: 2003
